Tuesday, July 18, 2006

Have You Checked Your Drive Firmware Lately?

Well the weekend is over and I thought I would update you on the problem SAP system.  In the last update we had discussed how this one TSM server was consistently having its mounts go into a RESERVED and hanging. The only thing we could do was cycle the TSM server. So after talking to support they stated it was a known problem listed in APAR IC49066 and fixed in the TSM server 5.3.3.2 release. So we upgraded the TSM server, updated ATAPE on the problem server to the same level as on the controller, and turned on SANDISCOVERY. The system came back up and started to mount tapes, albeit slowly. So we let it run and all seemed well for about the first 8 hours then all hell broke loose. TSM started taking longer and longer to mount tapes until it couldn’t mount anything. We began receiving the following errors also:

7/13/2006 9:38:45 AM ANR8779E Unable to open drive /dev/rmt25, error number=16.
7/13/2006 9:38:45 AM ANR8311E An I/O error occurred while accessing drive DR25 (/dev/rmt25) for SETMODE operation, errno = 9.

We reopened the problem ticket and talked to a number of Tivoli support reps and almost had to force them to have us run a trace. The original problem fix began Friday and here we were still trying to fix it well into early Sunday morning. So after getting the trace to Tivoli they looked it over and seemed perplexed at the errors. The errors seemed as if Tivoli was trying to mount tapes for drives that the library client was not pathed for. Also it seemed TSM was polling the library for drives but was unable to get a response so it went into a polling loop until it found a drive. This caused our mount queue to get as high as 100+ tape mounts waiting and the mounts that did complete sometimes took 30-45 minutes to do so. It was NUTS! So Sunday morning a new Tivoli Rep was assigned (John Wang). As we discussed the problem and Tivoli was trying to get a developer to analyze the trace John mentioned how he had seen this type of error before in a call about an LTO-1 library. He stated that it took 3 weeks to determine the problem but that the end result was that the firmware on the drives was down-level and causing the mount issues. So I checked my 3584’s web interface (I feel bad for all those people out there without web interfaces on their libraries) and found the drive firmware level at 57F7. This seemed down-level from what little information we had so I had my oncall person call 1-800-IBM-SERV and place a SEV-1 service call. The CE called me and we discussed the firmware level. When he saw how down-level it was he was surprised and lectured me on making sure we always check with CE’s before we do any changes to the environment. The CE then gathered his needed software and came to the account to update the drives. I brought all the TSM servers down and after 30 minutes the library had dismounted all tapes from the drives. The CE then proceeded to update the firmware, which actually only took 15-20 minutes. I expected longer. So we went from firmware level 57F7 to 64D0. Huge jump! So after the firmware upgrade I audited the library and brought the controller back up. Viola! It started mounting tapes as soon as the library initialized and the response was back to what it should have been. It’s now Tuesday morning and there have been no problems. So before you upgrade TSM be sure to have checked your libraries firmware (both library and drives). It could mean the difference between sink and swim!

Saturday, July 15, 2006

Shared Library Problem Solved?

Well after trying numerous things to get the problematic shared library working I finally got support on the line to fix this problem. The other day the problem reoccurred after about a week. I thought I had fixed the problem by changing the IP/VLAN the servers communicated over, and I thought it was until it all came crashing down. So I upgraded the problem instance to the same release level as the controller (controller has to be at or higher then the client servers). That didn’t fix it! I turned RESETDRIVES to NO and that didn’t do it either (Thanks for the advice Scott). So we got Tivoli on the line and Andy Ruhl with Tivoli support worked with us to identify the problem. First thing we did was check the ATAPE level and it turned out the controller was at a higher version. So we updated it to the same level as the controller. Then Andy found this APAR IC49066 which interestingly enough describes my problem. Turns out there was a known issue with this type of behavior and although it was due to be fixed in 5.3.4 it was added to the 5.3.3.2 update. Supposedly the explanation states it occurs when the client is accessing two different controllers, but Andy stated it was a misprint and it applies to single controller instances as well. So we updated all 5 TSM servers to the fixed level and hope we don’t experience any more issues. So far so good!

Sunday, July 09, 2006

Help! Problem With Shared Library Client!

Ok, here is my problem.  I have a large shared library that supports 5 TSM servers. All of the TSM servers, save one, run without any problems. The one with problems is pathed to 32 of the 5 drives (they are designated just for its use) and periodically its mount requests go into a RESERVED state but never mount.  I have tried FORCESYNC'ing both sides to see if it was communication but that didn't work.  I have tried setting paths offline then back online and that didn't work either. The only thing that works is to cycle the TSM server then it comes back up and starts mounting tapes like nothing was wrong.  This happens on average every 24-48 hours.  This is a large DB client TSM instance so the reboots are hurting critical business apps backups. I don't see any glaring errors in the activity log or in the AIX errpt. Any suggestions would be gladly received.

Wednesday, July 05, 2006

Import/Export Question!

Can someone explain to me why moving data (exporting/importing) from one TSM server to another TSM server is such a pain in the neck? Here is my scenario, I have some servers that moved from one location to another, network wise, and they now need to backup to a different TSM server. Both servers use the same media type (LTO-3). So why is it I have to either copy all the data across the network to the new server as it creates new tapes, or dump it all to tape(s) and then rewrite it to new tape(s) when imported? My question is this - why can’t I just export the DB info and pointers for the already existing tape to the new TSM server from the old? Why can’t the old server “hand-over” the tape to the new TSM server?  It seems like a lot of wasted work to constantly have to copy the data (server to server export/import) or dump it to tape and then write the data to new tapes. I think the developers ought to work on a way of doing this.  I would also think that this process could be done on a DB level so you could in a sense reorg the DB without the long DB Dump/Load and audit process, and the tapes would be handed over to the new TSM server. No rewrites necessary.