Wednesday, February 22, 2017

Learn From My Mistakes

We recently did a DR test and ran into a serious issue with the recovery of the TSM instance. The DR test was to restore the TSM DB from data center A (DC-A) to data center B. The DC-A instance had been created the previous year so we only needed to remove the DB (DSMSERV REMOVEDB TSMDB1) and restore the DB. I followed the steps and restored the DB and after restoring the DB 99% the restore failed. The restore failure error pointed us to the size available being too small for the required. We were off by a couple GB in our DB directory sizes so we had to add another directory to the list. I tried to rerun the format of the space and quickly received and error that the DB was present. I then ran the DSMSERV REMOVEDB TSMDB1 command again then tried to rerun the format and received the same results. We tried multiple commands

db2 uncatalog db TSMDB1

db2idrop serverA

/opt/tivoli/tsm/db2/bin/db2greg -delinstrec service=db2,version=,instancename=serverA,instancepath=/drtsmserver/serverA/sqllib

This last command allows you to manually drop the DB from the global registry. When we tried this a couple times the instance would be removed from the global registry and when we would try to rerun the restore we would get the error that the DB was present and the instance would reappear in the global registry. (Notice the instance path....that is going to come play into the final solution.)

So at this point we went round and round with various commands to drop/remove/delete the instance from DB2 and the commands would fail saying that the instance was not present and then an attempt to restore the DB would fail saying the DB was present. We went round and round and round until I finally restored the DB under a new ID and instance directory which was successful.....BUUUUUUTTTTTT.....there in lies another issue. Once the DB restore was successfully restored it failed to allow us to bring up the TSM instance because the new ID was not the original owner of the DB instance. Immediately I realized my mistake as this had happened to me years ago and I totally forgot this DB2 requirement. You see DB2 does not allow an ID other than those that have been granted permission to bring the DB up. I get it, it's a security thing, but the TSM instances almost all Admins create only has one active ID allowed ownership of the DB2 instance. You can add as many as you like but it's not something IBM or Tivoli openly suggest or recommend. So if the ID you typically use wont allow the restore or for some reason can't be used and you want to use a new ID for the DB your S.O.L. If it's after the fact, IBM cannot provide a way for you to assign an additional owner of the DB2 TSM instance. IBM's recommendation was go back to DC-A and grant the new ID rights to the instance, run a TSM DB backup, copy/replicate the instance to DC-B, then restore the DB and it would work under the new ID.

By this point in the DR process we were so behind in the recovery of the TSM server we scrapped our portion of the DR test (we were not a significant portion of test, which was to test Recovery Point capabilities and TSM was a fall back). I continued to work with IBM on the restore under the original ID and the looping issue and after a couple attempts, the IBM level 2 DB2 support person identified that the creation of the TSM instance had created DB files in


and a simple rename of that directory should the the DB present/not present issue. We had repeatedly cleared out the DB and Log file systems but had forgotten about the DB folder and files created in the instance directory. TSM had created that subdirectory with DB related files. We had renamed the /drtsmserver/serverA/sqllib directory more than once, but we had forgotten about the other instance directory as IBM states in this technote.  When we renamed the second serverA directory and tried the restore and it ran to completion successfully. HEAD SLAP! This was one of the times where even with another set of eyes we were not seeing the bigger picture. IBM level one support was probably looking at the wrong issue and level two was stumped because we moved away from the initial issue we called about and were now asking them how bring up the TSM DB that we restored under the alternate ID.

OK at this point I was exhausted and frustrated with myself. So what did I come away with in this situation....

  1. Triple verify your DB file space size is adequate. (DOH!)
  2. From the instance home directory copy the critical files to an alternate location.  
    • This is in case the restore fails and clears them or updates them incorrectly (trust me this happened more than once)
  3. If the restore fails or a rerun of the restore is required rename both <inst home>/<inst name>/<inst name> and <inst home>/<inst name>/sqllib
  4. IMHO I would also create another ID or two on the production server and grant them access to the TSM DB instance just in case you had an issue with the primary ID. 
    • This is not required but cant hurt.