Wednesday, July 27, 2005

Volume History Issues

I have a very large shared library environment and currently have three TSM instances connecting to a 9-frame 3584. When I first configured this library, one of the three instances ran as the library manager (two of the instances are on the same server; the third is on another system). We ran into numerous problems whenever we needed to do library maintenance, because the maintenance adversely affected the server acting as library manager, which was also a production TSM server for backups. So, after a suggestion from Tivoli, we created a dedicated library manager instance. It has helped a lot with handling tape issues, but one issue we were unaware of, and had difficulty resolving, was that the volume history on the old library manager would not release volumes listed as REMOTE. The new library manager could not force the old manager to "let go" of the tapes, so they were never freed back into the scratch pool. We tried deleting all volumes of TYPE=REMOTE, but the server said it needed more parameters. Here is an example:

DELete VOLHistory TODate=TODAY Type=REMOTE FORCE=Yes
ANR2022E DELETE VOLHISTORY: One or more parameters are missing.


So no luck freeing up all the REMOTE tapes that way. I looked through the documentation on deleting volume history information and found nothing on REMOTE volumes. You'll also notice that nothing in the documentation states that individual volumes can be deleted. So we were in a serious jam. I looked up the issue on ADSM.org and found the following command posted by a contributor:

DELete VOLHistory TODate=TODAY Type=REMOTE VOLume=NT1904 FORCE=Yes

This command allowed us to delete an individual volume, which freed the tape to return to scratch status. It worked as advertised: the tapes were deleted from the old library manager's volume history and went back to scratch. Thank goodness for search.ADSM.org, or I'd never find the answer to half my problems.
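Since the command has to be issued once per volume, a small script can save some typing when many tapes are stuck. Here is a minimal sketch in Python that only builds the command strings from a list of volume names. The function name and the second volume label are made up for illustration, and how you feed the output to the server (a macro file, an admin session, etc.) is left to you.

```python
def build_delete_commands(volumes, todate="TODAY"):
    """Return one DELETE VOLHISTORY command string per volume name.

    The command text mirrors the one quoted in the post; nothing is
    sent to a TSM server here.
    """
    commands = []
    for vol in volumes:
        name = vol.strip().upper()
        if not name:
            continue  # skip blank entries in the input list
        commands.append(
            f"DELete VOLHistory TODate={todate} Type=REMOTE "
            f"VOLume={name} FORCE=Yes"
        )
    return commands


# NT1904 comes from the post; nt1905 is a hypothetical label.
for cmd in build_delete_commands(["NT1904", "nt1905", ""]):
    print(cmd)
```

Review the generated list before running it, since FORCE=Yes leaves no room for second thoughts.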

Monday, July 25, 2005

ETA Please!

Well, I just had to perform a restore for a small server, and the speed at which it ran was atrocious. I mean it was slower than rush hour in LA. Let's discuss, and THIS TIME I WANT AND REQUEST FEEDBACK! A disk went bad on a web server and the system admins requested a number of filesystem restores. The combined amount was about 22-25GB. OK! No problem! Well, that probably would have been the case if the request had come in during the day, but it came in at night and the restore was competing with the nightly backups. Over a gigabit Ethernet (fiber) connection I was able to get 1.3 MB/s and an aggregate rate of 668 KB/s. So do the math: it took a long time. It also didn't help that this was a web server with TONS of little objects. It's livable, but there were a lot of small files.

The problem was that everyone and their brother wanted an ETA. "How long? It's small! It should only take a couple hours max!" and so on. People now want some solution to this situation, but of course the challenge will be keeping it somewhat cheap. Everyone asks if we can halt the backups while we perform the restores, but we all know that's not really a viable option, so I came up with this idea. Tell me what you think.

Since major restores are few and far between, I am proposing we create a new VLAN and run a single cable to each row of servers in the server room, with enough slack to stretch to any server in the row. If a restore is required, we simply plug in the "restore" connection, set an IP, and away it rips. When finished, we put the system back on its assigned backup network, roll up the excess Ethernet cable, and place it in the rack of the server in the middle of the row. I am only thinking of this for major restores, and since I am not requesting that we buy more NICs, I think it's doable. Let me know what restore process you have in place when the network is saturated. I'd love suggestions!
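For anyone who wants the "do the math" part spelled out, here is a rough sketch in Python. It assumes the aggregate rate quoted above is in kilobytes per second and that a GB is 1024 x 1024 KB; both are my assumptions, and the function name is just for illustration.

```python
def restore_eta_hours(size_gb, rate_kb_per_s):
    """Estimate restore time in hours for a given size and sustained rate.

    Assumes binary units: 1 GB = 1024 * 1024 KB.
    """
    size_kb = size_gb * 1024 * 1024
    seconds = size_kb / rate_kb_per_s
    return seconds / 3600


# Low and high ends of the 22-25GB request at the aggregate rate.
for size in (22, 25):
    hours = restore_eta_hours(size, 668)
    print(f"{size} GB at 668 KB/s: about {hours:.1f} hours")
```

Under those assumptions the estimate lands at roughly ten hours, nowhere near the "couple hours max" people were expecting.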

Tuesday, July 19, 2005

TSM Server Upgrade Issue

Just this last week a co-worker and I were trying to get a TSM 5.3 upgrade working. The server kept saying it needed the dsmserv upgradedb command run against it. Every time we tried, the DB upgrade failed with a TIVGUID error. Since it was eating into backup time, we rolled back to 5.2.4. Unfortunately, the attempted dsmserv upgradedb caused 5.2.4 to report that the DB was at a higher level and that TSM could not work with it. This happened even though the upgrade to 5.3 had failed. Since support had not responded by that point, we restored the DB from a full+incremental backup taken just before the upgrade (WHEW! LUCKY WE HAD THAT!). When Tivoli finally responded, this is what we were told:

This is a known problem with the 5.3.0 upgrade.

From TSM Support:

The problem that you are probably experiencing is a known problem with the 5.3 upgrade. If an admin has an expired password, or if a password is too short for the 5.3 password enforcement, then the upgrade can fail. Here are some steps that usually fix the problem:

1) Disable AES using the hidden option AllowAES No in dsmserv.opt file.
2) Re-initiate the upgrade db
3) Start the server. Preferably lock out all sessions & other activity
4) Use show node to identify admin & node ids that have expired or have passwords that are too short. Fix these.
(4 Alternative) Set the minimum password length to 0
5) Halt the server
6) Remove the AllowAES option
7) Start the server -- this will upgrade the passwords to AES encryption in the background

If you follow these steps then the upgrade db will probably continue successfully.

After this, start the server in the background as usual and run a db backup.


SO ALL THIS OVER A PASSWORD!

Thursday, July 07, 2005

FYI: Using Mixed Media In An LTO Library

We recently upgraded one of our 3584s at work with some LTO2 drives. This is our first attempt at a mixed-media environment, and according to IBM and a Redbook technote, you must work out the issues with MOUNTLIMITs or all the LTO2 drives could be in use when you need them, since they can also read LTO1 media. So we set our mount limits accordingly, but it does not help that we have more LTO1 drives than LTO2. Setting the mount limit to the number of LTO1 drives didn't stop TSM from accessing the LTO2 drives. This they did not explain well, and they left one crucial piece out of the puzzle: in a mixed-media library, if you do not specifically state which media format to use, TSM will use both LTO1 and LTO2 media in an LTO-designated storage pool.

You want me to explain further? OK, here is how it affected us. We added LTO2 drives and an additional 2 frames to our library and set up the devclasses accordingly, setting FORMAT to DRIVE so each would use the highest format supported by the assigned drive. Since we didn't partition the library, TSM grabs whatever drive becomes available and mounts the appropriate scratch tape. So if TSM assigns an LTO1 drive, you'll use an LTO1 tape. The only way to force LTO2 media to be used is to set FORMAT in the devclass to ULTRIUM2 or ULTRIUM2C (with compression). We didn't think about that and were bitten by it when we ran out of LTO1 scratch. We didn't catch it because our script only looked for scratch in the library and could not distinguish between the two media types (which you can really only do if a different volume series is used for labeling). Without LTO1 scratch we basically lost 2/3 of the drives and didn't know it. So I had to change our script to monitor both scratch types, and we had to force the LTO devclasses to their appropriate formats. Once I realized what was happening, it was a "NO BRAINER" that TSM would work that way.

The big problem is that TSM did not separate out the media format for LTO2 as a different devclass type, like they did with the 3592s. So be aware of how TSM works and make sure you don't make the same mistake I did.
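To give an idea of what monitoring both scratch types can look like, here is a minimal sketch in Python. It is not our actual script. It assumes you already have a list of scratch volume names from the library and that LTO1 and LTO2 tapes are labeled with different volume series; the "L1" and "L2" prefixes, the function names, and the sample labels are all made up for illustration.

```python
# Hypothetical label prefixes; substitute the volume series used at your site.
PREFIX_TO_MEDIA = {"L1": "LTO1", "L2": "LTO2"}


def count_scratch_by_media(scratch_volumes, prefix_to_media=PREFIX_TO_MEDIA):
    """Count scratch volumes per media type based on the label prefix."""
    counts = {media: 0 for media in prefix_to_media.values()}
    for vol in scratch_volumes:
        name = vol.strip().upper()
        if not name:
            continue  # ignore blank entries
        for prefix, media in prefix_to_media.items():
            if name.startswith(prefix):
                counts[media] += 1
                break
        else:
            # Label matched no known series; track it separately.
            counts["UNKNOWN"] = counts.get("UNKNOWN", 0) + 1
    return counts


def low_scratch(counts, threshold):
    """Return the media types whose scratch count is below the threshold."""
    return sorted(
        media for media, n in counts.items()
        if media != "UNKNOWN" and n < threshold
    )


# Made-up sample labels: one LTO1 scratch tape, two LTO2, one unrecognized.
counts = count_scratch_by_media(["L10001", "L20001", "L20002", "XX0001"])
print(counts)
print("Low on scratch:", low_scratch(counts, threshold=2))
```

The point is simply that the alert is raised per media type, so running out of LTO1 scratch shows up even when the library as a whole still has plenty of scratch tapes.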