Monday, April 11, 2016

DR Test - Things learned

I just completed a DR test from one data center to another involving TSM and our Data Domain (DD), which we have configured for both NFS and VTL use. Things to know:


  1. We back up the TSM DB to the DD NFS file system.
  2. The TSM server was not brought up on its own LPAR at the DR site; it shared an LPAR with another TSM instance.
  3. The DR site could not support LAN-Free the way the primary site does.

So we built the secondary instance on the LPAR that currently runs the TSM server for the customer's development environment. Then I disabled the replication pair and we mounted the replicated DD NFS file system on the LPAR so we could restore the TSM DB. This is where our first main problem reared its head: the NFS file system from the DD mounted under the primary TSM instance's ID. We wrestled with this for an hour or so before I realized, after Googling the issue and reading other people's DD notes, that the problem was the configuration.

Disabling the replication pair and mounting the file system on the TSM LPAR would have been enough if it had been owned by the default user ID. But the primary instance was the owner, and we could not change the permissions because of what the default ID and the DD settings allow. So I had to unmount the DD NFS file system, delete the pair on the DD, and then remount the file system with full read/write permissions. After that I was able to mount it under an alternate ID. Once we overcame this we were able to start the TSM DB restore, which is where our second issue arose.
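
Something like the following would have told us sooner that the instance ID could not write to the mount. It is only a sketch: check_mount_writable is a name I made up, it assumes a POSIX shell, and it has to be run as the ID that will own the restore, not as root.

```shell
#!/bin/sh
# Hypothetical helper (not a TSM or DD command): report whether the current
# user can create a file under a mount point.
check_mount_writable() {
    # $1 = mount point to test
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    probe="$dir/.dr_write_probe.$$"
    # Try to create an empty probe file; hide the shell's own error message.
    if ( : > "$probe" ) 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
        return 0
    fi
    # Show who owns the directory so the ownership problem is obvious.
    echo "not writable: $dir"
    ls -ld "$dir"
    return 1
}
```

Call it with the NFS mount point as the only argument. A "not writable" result with the primary instance listed as the owner is the situation described above.
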

While restoring the TSM DB, the active logs were not being restored to the active log directory. The first time, I ran a plain dsmserv restore db. It ran fine until all the DB records were restored, and then I received the following error:

ANR2970E Database rollforward terminated - DB2 sqlcode -1004 sqlerrmc TSMDB1

The restore process restored the logs to the instance's home directory, eventually filling the filesystem to 100% and erroring out. I thought the logs were recovery log related, so I added the RECOVERYLOGDir option to the restore command and got the same result. That wasted an hour. After some more Google searches and talking to IBM support, I decided to add the ACTIVELOGDirectory option to the restore. I didn't add it because the IBM support tech suggested it (he didn't); I just realized that the recovery directory was not filling with any logs, so the only other logs they could be were active log files. With the ACTIVELOGDirectory option added to the restore command, the DB restore worked without any errors.

The question is: why didn't TSM use the ACTIVELOGDirectory option set in dsmserv.opt? The RECOVERYLOGDir option was honored, but the recovery logs never amounted to more than maybe 1GB, while the active log was over 53GB, and db2diag.0.log registered an error saying that no recovery log directory was listed so the default would be used. What the hell??? It is listed in the dsmserv.opt...

ACTIVELOGDirectory          /drtsmserver/tsm30log
ARCHLOGDirectory            /drtsmserver/tsm30arch
MAXSESS 300
COMMTIMEOUT 6000
IDLETIMEOUT 6000
MAXSESSIONS 400
...
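
Since the failure showed up as a filesystem hitting 100%, a free-space check before starting the restore would have saved some time. This is a rough sketch, assuming a POSIX shell and a df that accepts -P and -k; has_free_kb is a hypothetical helper name, and the 53GB figure is just the active log size I saw in this test.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the filesystem holding a directory
# has at least the requested number of kilobytes available.
has_free_kb() {
    # $1 = directory, $2 = required free space in KB
    avail=$(df -Pk "$1" 2>/dev/null | awk 'NR == 2 { print $4 }')
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}

# 53GB expressed in KB (assumption: size the check from your own active log).
required_kb=$((53 * 1024 * 1024))

if has_free_kb /drtsmserver/tsm30log "$required_kb"; then
    echo "active log directory has room"
else
    echo "active log directory is missing or too small"
fi
```

Running the same check against the instance's home directory shows how quickly it would fill if the logs land there instead.
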

I'm posting this so you can learn from my mistakes. The final restore DB command was:

dsmserv restore db on=db.list recoverydir=/drtsmserver/tsm30fail activelogdir=/drtsmserver/tsm30log
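
To avoid depending on the restore picking up dsmserv.opt, the command can be assembled from the options file with the active log directory spelled out. This is a minimal sketch, assuming a POSIX shell with awk and an options file laid out like the excerpt above. The get_opt and build_restore_cmd helpers are my own invention, and the script only prints the command, it does not run it.

```shell
#!/bin/sh
# Hypothetical helper: print the value of an option from a dsmserv.opt-style
# file. The option name in the first column is matched case-insensitively.
get_opt() {
    # $1 = option name, $2 = path to the options file
    awk -v name="$1" 'toupper($1) == toupper(name) { print $2; exit }' "$2"
}

# Hypothetical helper: print a restore command with activelogdir set explicitly.
build_restore_cmd() {
    # $1 = options file, $2 = file listing the DB directories, $3 = recovery directory
    active=$(get_opt ACTIVELOGDirectory "$1")
    if [ -z "$active" ]; then
        echo "ACTIVELOGDirectory not found in $1" >&2
        return 1
    fi
    echo "dsmserv restore db on=$2 recoverydir=$3 activelogdir=$active"
}
```

For example, build_restore_cmd dsmserv.opt db.list /drtsmserver/tsm30fail prints the command to review before running it as the instance owner.
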
