Tuesday, July 18, 2006

Have You Checked Your Drive Firmware Lately?

Well the weekend is over and I thought I would update you on the problem SAP system.  In the last update we had discussed how this one TSM server was consistently having its mounts go into a RESERVED and hanging. The only thing we could do was cycle the TSM server. So after talking to support they stated it was a known problem listed in APAR IC49066 and fixed in the TSM server 5.3.3.2 release. So we upgraded the TSM server, updated ATAPE on the problem server to the same level as on the controller, and turned on SANDISCOVERY. The system came back up and started to mount tapes, albeit slowly. So we let it run and all seemed well for about the first 8 hours then all hell broke loose. TSM started taking longer and longer to mount tapes until it couldn’t mount anything. We began receiving the following errors also:

7/13/2006 9:38:45 AM ANR8779E Unable to open drive /dev/rmt25, error number=16.
7/13/2006 9:38:45 AM ANR8311E An I/O error occurred while accessing drive DR25 (/dev/rmt25) for SETMODE operation, errno = 9.

We reopened the problem ticket and talked to a number of Tivoli support reps and almost had to force them to have us run a trace. The original problem fix began Friday and here we were still trying to fix it well into early Sunday morning. So after getting the trace to Tivoli they looked it over and seemed perplexed at the errors. The errors seemed as if Tivoli was trying to mount tapes for drives that the library client was not pathed for. Also it seemed TSM was polling the library for drives but was unable to get a response so it went into a polling loop until it found a drive. This caused our mount queue to get as high as 100+ tape mounts waiting and the mounts that did complete sometimes took 30-45 minutes to do so. It was NUTS! So Sunday morning a new Tivoli Rep was assigned (John Wang). As we discussed the problem and Tivoli was trying to get a developer to analyze the trace John mentioned how he had seen this type of error before in a call about an LTO-1 library. He stated that it took 3 weeks to determine the problem but that the end result was that the firmware on the drives was down-level and causing the mount issues. So I checked my 3584’s web interface (I feel bad for all those people out there without web interfaces on their libraries) and found the drive firmware level at 57F7. This seemed down-level from what little information we had so I had my oncall person call 1-800-IBM-SERV and place a SEV-1 service call. The CE called me and we discussed the firmware level. When he saw how down-level it was he was surprised and lectured me on making sure we always check with CE’s before we do any changes to the environment. The CE then gathered his needed software and came to the account to update the drives. I brought all the TSM servers down and after 30 minutes the library had dismounted all tapes from the drives. The CE then proceeded to update the firmware, which actually only took 15-20 minutes. I expected longer. So we went from firmware level 57F7 to 64D0. Huge jump! So after the firmware upgrade I audited the library and brought the controller back up. Viola! It started mounting tapes as soon as the library initialized and the response was back to what it should have been. It’s now Tuesday morning and there have been no problems. So before you upgrade TSM be sure to have checked your libraries firmware (both library and drives). It could mean the difference between sink and swim!

3 comments:

  1. Unfortunately Global Services is treated like a red headed stepchild by IBM's own reps and CE's. I can't even tell you the last time I actually saw my Tivoli rep. Trust me, don't get me start on the whole internal relationship issue within IBM, it's a mess!

    ReplyDelete
  2. upgraded tape library firmware but still the same issue. Now back to formatting the server. anything else I can do?

    ReplyDelete
  3. What library and drive types are you using? What version of TSM/Spectrum Protect?

    ReplyDelete