Wednesday, December 14, 2005

Unload/Load of DB pitfall

I am writing this article to save others from the situation we had at one of our customers. It began with a request to replace the RAID array that hosts the primary storage pools. I took that as an excellent opportunity to reorganize the DB, which was 66GB at roughly 50% utilization (after a massive deletion in the last month or so). Planned downtime was already needed for copying the data from one RAID to the other, so I thought I had plenty of time for an unload/load.
The "dsmserv unloaddb" utility went smoothly, taking about 4hrs and creating a roughly 20GB dump (which was expected, as ESTIMATE DBREORGSTAT revealed we could save about 7GB). Then I proceeded with "dsmserv loadformat" and "dsmserv loaddb" (taking approx. 5hrs). So far everything was seemingly OK. I was able to start the server, reduce the DB, and run some tests. The problem appeared when I tried to create a new stgpool:


12/13/05 12:42:21 ANR2017I Administrator HARRY issued command: DEFINE STGPOOL archivedisk disk (SESSION: 3884)
12/13/05 12:42:21 ANR0102E sspool.c(1648): Error 1 inserting row in table "SS.Pool.Ids". (SESSION: 3884)
12/13/05 12:42:21 ANR2032E DEFINE STGPOOL: Command failed - internal server error detected. (SESSION: 3884)



Google revealed that this error is a known one and is fixed in 5.3.2.1 (we were on 5.3.2.0, having upgraded to that level just a few days before the fix appeared .. bad luck). Basically, there is an error in LOADFORMAT/LOADDB that corrupts some DB values:


http://www-1.ibm.com/support/docview.wss?uid=swg1IC47516


I did not want to run loaddb again (there were already changes made to the DB, some migrations had run - luckily for me they were from a stgpool with caching set on - etc.).
So I tried "dsmserv auditdb inventory fix=yes" - IBM says it can help if you run it after a loaddb. Long story short: after 8hrs of audit (with messages that some values were corrected) the problem was still there ...
So the only option was to apply the patch and run loaddb again - another 4hrs of waiting - and now it seems to work (still running tests). So watch for this problem and check your TSM level before reorganizing your DB.
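For anyone planning the same exercise, the rough sequence looks like the sketch below. This is from memory with placeholder device class, volume, and path names, so check the Administrator's Reference for your level before trusting my syntax.

# halt the server first, then run the utilities from the server directory
dsmserv unloaddb devclass=dbdump scratch=yes
dsmserv loadformat 1 /tsm/log/log01 1 /tsm/db/db01
dsmserv loaddb devclass=dbdump volumenames=unload01
dsmserv auditdb inventory fix=yes   # optional sanity pass before going live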

Tuesday, December 06, 2005

System State Backup Issue

We recently had to rebuild a box (thank goodness it was not production) and found that the system state and system objects had not been backed up.  Stranger still, there was no error stating they failed. In fact the only error was that the LVSA directory was not set and it would not be used during the backup.  Why would a failure to completely set up the LVSA affect the system objects and state? TSM doesn't require the LVSA to back them up.  We checked multiple systems and found this happening on more than one.  The client level we are on is 5.2.4 and higher.  Has anyone else experienced this?  BTW - the backup schedules were showing in the event log as COMPLETED, not Failed.

Saturday, December 03, 2005

Got HACMP?

Well our group got stung again this last week. While working an HACMP implementation we were asked to install the Oracle TDP and TSM client. We installed the TSM client and all went well; we then installed the TDP and the DBA's could not get the password file to generate. Each time they tried, the utility would core dump, no matter how we ran it. After calling support we were informed that the current 64-bit Oracle TDP is not compatible with HACMP 5.3 (and from what I have found through Google it could be any 64-bit TDP). It seems they rebuilt HACMP for version 5.3 and the TDP is not compatible with the new way it was compiled. A patch is projected to be released at the end of the month, but we cannot wait that long, so our options are to use temporary disk and mirror the DB, then break and back up the mirrored copy when we want a backup, or roll back to HACMP 5.2. The problem with the latter is that we basically have to rebuild the boxes. So be warned: there are issues with the latest and greatest version of HACMP at this time.

Tuesday, November 22, 2005

Setting Up A Secondary TSM Instance

Someone requested a post on how to set up a secondary instance of TSM on a UNIX server, so here is the skinny on how that is done:

First create a directory where the config files for the new server will be stored.

mkdir /usr/tivoli/tsm/serverb
mkdir /usr/tivoli/tsm/serverb/bin

Then copy the dsmserv.opt over and modify the needed settings in it, such as devconfig and volhist, so they point to the new directory.  Then create the DB and log volumes that this instance will use. Once those are created you need to export the following environment variables:

export DSMSERV_CONFIG=/usr/tivoli/tsm/serverb/bin/dsmserv.opt
export DSMSERV_DIR=/usr/tivoli/tsm/serverb/bin
export DSMSERV_ACCOUNTING_DIR=/usr/tivoli/tsm/serverb/bin
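A minimal sketch of the volume creation and format step that comes next, assuming dsmfmt-created files (the file names and sizes here are invented):

dsmfmt -m -log /usr/tivoli/tsm/serverb/log01.dsm 512
dsmfmt -m -db /usr/tivoli/tsm/serverb/db01.dsm 1024
dsmserv format 1 /usr/tivoli/tsm/serverb/log01.dsm 1 /usr/tivoli/tsm/serverb/db01.dsm
# then run the dsmserv runfile commands listed in the server README for your level
# to load the sample scripts and web images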

Now you can run the dsmserv format command to initialize the DB and log volumes; it will create the dsmserv.dsk file in the serverb directory.  Make sure you also run the dsmserv runfile commands to load the scripts and web images (even on TSM 5.3).  The final step is to create the startup script so that TSM initializes correctly.  Here is our script:

#!/bin/ksh

exec > /tmp/libserv.out 2>&1  #optional -> sends output to out file

ulimit -d unlimited

export DSMSERV_CONFIG=/usr/tivoli/tsm/serverb/bin/dsmserv.opt
export DSMSERV_DIR=/usr/tivoli/tsm/serverb/bin
export DSMSERV_ACCOUNTING_DIR=/usr/tivoli/tsm/serverb/bin

print "$(date '+%D %T') Starting Tivoli Storage Manager Server"
cd /usr/tivoli/tsm/serverb/bin

dsmserv

We use the following command to start TSM so we don’t have to deal with nohup:

echo “/usr/tivoli/tsm/serverb/bin/rc.adsmserv” | at now

This uses the at command to run the script immediately.  You can then edit the inittab and place a line in it to start this instance on boot (a sample entry is below), or put a script in the run-level startup folders; your choice.  You should now be ready to run the second instance.  One note to those thinking of doing this and sharing a library: Tivoli recommends that you create a TSM instance to be just a library manager - no clients, no real work other than handling the library and tape mounts.  I agree with this, and it has been a lot easier to manage and handle library issues.  Not knowing how large the DB could get, I gave it 2GB and it is currently 3.8% utilized after being in place for over a year and a half. Swapping a library manager from one system to another is not as hard as it would seem, so consider it, and if anyone wants docs on how to do the switch let me know and I'll post them.
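If you go the inittab route, the entry is a one-liner. The identifier and run level below are just examples, and it assumes the startup script above was saved as /usr/tivoli/tsm/serverb/bin/rc.adsmserv:

tsmb:2:once:/usr/tivoli/tsm/serverb/bin/rc.adsmserv >/dev/console 2>&1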

Monday, November 21, 2005

UNIX Permission Issues

Awhile back I ran into an issue with how TSM handles permissions on UNIX files and wanted to get some feedback from you readers out there on how you would handle it.  What happened was a user somehow was given root and he chown'ed the /home dir recursively.  It was made worse by the fact that he did it on Friday and didn't alert anyone until the following Monday, and by the time it got to us another day had passed.  The customer of course wanted us to restore the directory and file permissions, but the kicker was that TSM does not back a UNIX file up again when only its permissions change.  It just updates the database to reflect the permission change (I got that directly from support and was floored; I had no idea it handled UNIX that way).  So here was our dilemma: if the file had only one version in backup, I would have no way of resetting its permissions.  Is the gravity of the situation hitting home?  Because it doesn't back the file up again or track the permission history, I could not successfully restore to a point in time.  Sure, I might get a good portion of the files fixed, but there would still have been a large portion whose permissions we would be unable to correct.  The customer wasn't happy, and our only out was that the customer should not have been doing chown's as root.  I thought I once saw someone post an undocumented option you can set in the options file that will back a file up if it changes in any way, permissions included, but I can't find it.  I thought I saw it on the new ADSM.org but am unable to locate it.  Anyone know the option or have an idea on how to approach this?  I brought it up with some Tivoli people who asked me what I thought should be added or changed in TSM, but so far I haven't seen any change in their processing.

Monday, November 14, 2005

Poll Results

I closed the poll covering things we would like to see in future releases of TSM. I expected the removal of the ISC interface to win hands down, but actually the votes were spread pretty evenly between 5 of the selections. Although "Get rid of the ISC interface" and "Return of the old web interface" were the two higher vote-getters, I was surprised how evenly spread the voting was. I personally voted for conversion to a DB2 database, but I have been whining about that for years... ask the Tivoli folks, I am sure they are familiar with my whining (I'm hoping for a cheese basket this Christmas from the developers... oh, and please no stinky cheese, the wife is pregnant and foul odors make her sick).

ISC+TSM AC

Many people have been complaining about the ISC+TSM AC and I have been one of them.  The concerns have been plentiful: "We have to learn a new interface", "There are no DRM functions", and "It requires a server just to support it!"  All of these are valid protests, and the one that bothers me the most is the fact that in a DR situation you would have to rebuild the ISC system/instance along with the TSM server to have web accessibility.  This adds time to an already urgent situation.  So what good does the ISC+TSM AC provide?  For starters, a single interface for accessing all your servers, a single login, more functions when it comes to hardware management, and, in the event of a disaster, it forces you to learn the command line.  I know that last item might frustrate a lot of people, but the truth is you need to know the command line to be a proficient TSM administrator.  I love the web interface and I recommend TSMManager, but in a DR situation you have to be able to handle the command line if you want to get back up and running. Granted, you have no choice other than the command line until the system comes back up and is running again, but afterwards you'll need to do typical admin work, and the ability to do it through the command line will increase your rebuild speed, helping you meet SLA time frames.  I don't think Tivoli had this in mind when they went to the ISC+TSM AC, but in my opinion too many people rely on the web and don't learn the commands needed to be truly proficient.  I find it amazing how many don't even know how to use the HELP command.  So I could complain about the change in interface, but change happens, and although we don't always like it (I don't care for this one) we have to be able to change and adapt with it if we want to last in this work force.

Sunday, November 13, 2005

King Of All Backups! (AKA LAN-Free to Disk)

About a year ago we were tasked to set up a large multi-clustered Exchange environment and provide the best possible backup and restore performance.  After much debate and research we decided on LAN-Free to disk.  The system was a 7-node Windows 2003 cluster connected to a SAN disk array (I can't remember if it was EMC or Dell). The first 5 nodes were Exchange servers, the 6th was the failover node, and the 7th was turned into a TSM server.  The TSM server instance had five 500GB secondary disks assigned to it for the backup of the five Exchange servers.  These five disks would be mapped one to each Exchange server, allowing the backup to occur across the SAN to the disks owned by the TSM server. To utilize the LAN-Free to disk capability we had to install Tivoli's SANergy product.  SANergy is no longer a separate product but is now part of a TDP/Agent-type install package for TSM.  We actually installed and configured SANergy first, which was easier than the directions made it seem, then mapped the drives.  When configured with SANergy, the mapped drives become accessible across the SAN as long as the clients are on the same disk SAN fabric.  So we now had mapped SAN-accessible drives and could back up the Exchange servers to disk using the FILE device class.  The FILE device class is used because TSM does not support LAN-Free backups to disk pools at this time.  A FILE device class works like virtual tape, and it was configured to migrate the data a few hours before the next backup would occur, or when the storage pool reached a specific usage threshold.  The reason for this was to allow almost a 24 hr. window for a restore, and along with Exchange 2003's new internal restore capabilities it provided a high performance backup/restore solution.  We tested the backups against a 360GB DB and backed it up in 90 minutes.  People were impressed, but they wanted to see how it performed on restore. We then restored the same amount, 360GB, in 91 minutes. WOW!  It was amazing to see those numbers (68MB/s).  We even tested it with the failover node by mapping all 5 SANergy-defined drives to the failover node and still saw the same numbers. We had everything ready to go when the account decided they wanted to go in another direction.  Weeks spent configuring and implementing the solution, all for naught!  At least I have the experience and know it works.  So if anyone is looking to do LAN-Free to disk: it works, it's fast, it takes a lot of admin work, and it is a good solution for anyone needing a high performance backup/restore environment.
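For anyone curious what the server-side pieces look like, here is a rough sketch of a FILE device class and its storage pool with migration thresholds. All of the names, sizes, and limits below are invented for illustration, not what we actually ran:

define devclass exchfile devtype=file directory=/tsmdata/exch mountlimit=5 maxcapacity=50G
define stgpool exchfilepool exchfile maxscratch=100 nextstgpool=tapepool highmig=90 lowmig=0
# an admin schedule that drops highmig/lowmig a few hours before the backup window
# will push the previous night's data off to tape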

Wednesday, November 02, 2005

Submit A Question Or Topic

If anyone would like to submit a topic they would like to see covered, or has a question about TSM or tape/SAN issues or strategy, please e-mail me at chadsmal@us.ibm.com. With 5.3.2 out soon there will be some things to cover. I am also thinking of posting an article on my LAN-Free to disk trial, which was a great success. If anyone is interested in LAN-Free to disk let me know and I'll post my experience. If you would like to submit a post I am open to having guest contributors. Even though the name says TSMExpert I do not profess to know it all. Your contributions or questions help others out there.

Sunday, October 30, 2005

Managing RAW Volumes

Well, I was recently called by another IBMer and asked how to use RAW volumes. The person called to ask why DSMFMT will sometimes format quite fast on one machine and then take forever on another. Well, one thing you have to understand about DSMFMT is that it's taking that file and making the space within it raw. If you've ever looked inside an unused TSM volume you'll see it is a text file filled with ADSM over and over (they might have changed the fill, but when I was teaching TSM that's what we saw). So why use RAW instead of actual files (other than files being a redundant process)? FAST! EASY! And when speed in DR is key it's the only way to go. It's actually easier than one would think, and with a little script you can manage your RAW volumes and hdisks easily. I'll post the script along with a script to query the serial and WWN of your tape drives. These two scripts come courtesy of Hari Patel, my co-worker, who is a PERL mad man. (Download tar/zip)

Tuesday, October 25, 2005

Free Web Trainings!

I am providing this link for those of you unaware of some free TSM training available from IBM.  They have a number of web-based training classes available, like TSM 5.3 Overview and Differences for TSM 5.3. You can find the site here.  They are also providing a free TSM concepts poster for those that request it.  I would also suggest the free Linux training available on the IBM developerWorks website for those that would like to learn more about it.  You can find a list of web-based trainings here, and personally, if you can't become a proficient Linux user with these trainings then you need to look for a new job!

Friday, October 14, 2005

NetApp TOC Issues

We recently found out that TOC file creation in TSM can fail when the NetApp volume has special characters in a filename.  This has led people to believe that the backups are unsuccessful and that our group would be unable to restore data. That assumption could not be farther from the truth: we can still restore an individual file, we just can't load a graphical representation into the web-based TSM client. Anyway, the response from Tivoli was that we could identify the offending file because when TOC creation fails an error is reported stating the filename that caused the problem.  So we would have to do this hundreds of times, since we have, on our own, identified at least 400+ files with special characters. So I have good backups, I just can't restore them easily; the question then is how does TSM react when trying to restore files with special characters?

Tuesday, September 27, 2005

Restoring An Image Backup To A Larger Disk

Well, just the other day my group was asked to help the NT System Admins with combining two large disks into one. Both of the disks had been getting image backups, so we decided that the larger image would be restored first and then the second drive would be restored to the new disk normally. To combine the disks the SA's made one huge partition (400+ GB) and an SA on my team started the image restore. The following warning was issued when the image restore was initiated:

***************************** WARNING ********************************
The destination volume is larger than the source volume.  This will reduce the file system '\\machinename\x$' size to '    xxx.xx MB'.
Do you wish to continue? (Yes (Y)/No (N))

This was due to the target disk being larger than the image. We figured the image would not affect the unused space, but we were wrong. After the image was restored, Explorer saw the disk as the size of the image backup but the Windows Disk Management tool saw it as the 400+ GB it really was. So the problem was how to recover the space.  Here is what MS support instructed, and the problem was resolved.

When doing an image restore:

If the destination volume is larger than the source, after the restore operation you will lose the difference between the sizes. If the destination volume is located on a dynamic disk the lost space can be recovered by increasing the size of the volume. This will also increase the size of the restored volume.

We also have the following knowledge base doc:

When a Tivoli Storage Manager (TSM) image restore is performed to a target volume that is larger in size than the original volume (from which the image backup was taken), TSM will concatenate the target volume down to the original volume size.  However, within Windows Disk Management (within the Administrative Tools -> Computer Management utility), the disk will appear to still exist as the original larger volume size, even though Windows Explorer will show that the real volume size is now smaller.  In essence, this means that space has been lost, since it is not accessible on the drive and not available for use.  To prevent the loss of this space, the same volume size should be used on the target volume as on the original volume.

As far as I know this fix only works if the disk is dynamic and not a basic disk. So the fix was to increase the volume size again so Windows could see the space. I don't think you need to increase it by much, but once done it resolves the issue. So if you happen to restore an image backup to a larger partition than it originated from, just be aware of what you'll need to do to reclaim your unused space after the restore finishes.
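On a dynamic disk the resize can also be done from the command line with diskpart; the volume number below is a placeholder, so check the "list volume" output first:

diskpart
DISKPART> list volume
DISKPART> select volume 3
DISKPART> extend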

Saturday, September 24, 2005

Question On Image Backups

The image backup feature is a wonderful piece that really saves you when restore time is of the essence, but what do you do when SA's decide to combine two drives into one and want to use TSM to do it? Well, we decided to use the image from the larger drive and then do a standard restore of the other drive afterwards. Since I have never done this and don't restore images frequently, something I was alerted to was the message from TSM stating that the file system size would be changed (the drive was now bigger to allow combining the data). Will the image resize the partition? I don't recall TSM doing that, but then again I can't remember. For anyone who has done a lot of image backups (particularly to drives larger than what the image came from), input would be appreciated.

Monday, August 29, 2005

New TSM Features

For those unaware, IBM has added two new items to the TSM portfolio that I think take its service offering up a step. The first is Continuous Data Protection for Files. This software add-on allows for invisible, real-time file replication: basically, the minute you create a new file it is replicated to a backup location. The following replication targets are available:

  • A copy is stored locally

  • A copy can be saved to a fileserver or NAS

  • A copy can be sent to a TSM server

The product does not require TSM and looks to solve a lot of problems with large fileservers. I'll admit, I want this as soon as it becomes available. The one problem I see is: who is going to buy the disk required for this?

The second piece is something I have long seen a need for: HSM for Windows. Well, wait no longer, it's here! IBM has just announced HSM for Windows. As with other HSM products it allows you to migrate files off of a server when they reach a certain age, but it provides file-level granularity; the migrated file is replaced by a stub linked to the actual file in TSM storage. To learn more about both products you can read up on them here.

NetApp Filer TOC Issue

I thought I would pass along a notice that we have been informed of a problem with TOC file corruption on NetApp filers that have files with dates before Jan. 1, 1970, or after Jan. 19, 2038. Somehow this causes the TOC backup to abend. It also looks like the TOC can be corrupted when the bitfile in which the TOC is stored is damaged. If that is the case and the corruption is recent, it is possible that the TOC is undamaged in the copypool and could be used for the restore. If the TOCs are unavailable, then a file-level restore will need the absolute path to succeed.

Wednesday, August 24, 2005

CHKDSK Utility Flaw Explained

I was recently asked about the file system issue with Windows and thought I would go a little more in depth. The file system problem is resolved with the following patches:

KB831375 and KB873437 for Windows 2000

Here is a good description of the problem (3rd paragraph down) and Microsoft’s page.

http://www.windowsitpro.com/Article/ArticleID/41569/41569.html

http://support.microsoft.com/?kbid=831375

Basically, when a large volume has over 4,194,303 files, a flaw in the chkdsk utility run in fix or repair mode can strip the permissions from the files. Patch KB873437 is related. Here is the link.

http://support.microsoft.com/default.aspx?scid=kb;en-us;873437

KB831374 is for Windows 2003 and addresses the same chkdsk issue.

http://support.microsoft.com/default.aspx?scid=kb;en-us;831374

According to our administrators, we had systems go into chkdsk after other patch reboots and automatically start running in fix or repair mode, so in our case it was not something a user initiated. Hope this helps, and have fun patching!

Thursday, August 18, 2005

Web Interface And The ISC

I was recently informed that the LTO3 format setting was never added to the old web interface. Is this true? I guess you can use the command line or the ISC, but I was a little saddened to hear there was no update for that. What can I expect - they are trying to get rid of the interface. Does anyone know why the DRM feature was/is not in the ISC interface?

Wednesday, August 03, 2005

Oracle RMAN Catalogue Cleanup

Why do people love Oracle? When I hear mention of Oracle I think of Luke Skywalker when he saw the Millennium Falcon: "What a piece of junk!" Like the Falcon it looks clunky, breaks down easily, and has the most temperamental behavior. When it's running, however, it screams. The problem is that the RMAN catalogue sometimes doesn't do appropriate cleanup. If you want to check whether a particular node is performing cleanup within TSM, run the following select command (substitute your own node name and cutoff date) -

select object_id from backups where node_name='YOUR_NODE_NAME'
and backup_date < '2005-07-01 00:00:00'


This can be redirected to a file and then used later to delete the objects with an undocumented delete command. I will give the delete command out to those who need it, but remember: any deletion from the TSM DB is done AT YOUR OWN RISK! It's unsupported because Tivoli doesn't trust you not to screw stuff up, and although I don't think you will, it's better safe than sorry.
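One way to capture the list is to run the select through dsmadmc with output redirected to a file. The ID, password, and node name below are placeholders, and I am assuming your admin client supports the -commadelimited and -outfile options:

dsmadmc -id=admin -password=secret -commadelimited -outfile=orphans.txt "select object_id from backups where node_name='YOUR_NODE_NAME' and backup_date < '2005-07-01 00:00:00'"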

Tuesday, August 02, 2005

The Return Of The Web Interface (Update!)

Previously I posted an article on the web interface that linked to IBM's page, which then linked to an FTP server for the files needed to activate the web interface on 5.3 servers. Well, it looks like the link was bad, so I will provide it correctly. The documentation and files for Unix are here; for Windows, here.

Wednesday, July 27, 2005

Volume History Issues

I have a very large shared library environment and currently have 3 TSM instances connecting to a 9-frame 3584. When I first configured this library we had one of the 3 instances doing double duty as the library manager (2 of the instances are on the same server; the third is on another system). The problems we ran into were numerous: whenever we needed to do library maintenance we adversely affected the server acting as the library manager, which was also a production TSM server for backups. So, after a suggestion from Tivoli that we at least create a dedicated library manager instance, we did so. It has helped a lot with handling tape issues, but one issue we were unaware of, and had difficulty resolving, was that the volume history on the old library manager was not releasing volumes listed as REMOTE. The new library manager could not force the old manager to "let go" of the tapes, so they were never freed back into the scratch pool. We tried deleting all volumes of TYPE=REMOTE, but it said more parameters were needed. Here is an example:

DELete VOLHistory TODate=TODAY Type=REMOTE FORCE=Yes
ANR2022E DELETE VOLHISTORY: One or more parameters are missing.


So no luck on that working to free up all the REMOTE tapes. I looked through the documentation on deleting volume history information and found nothing on REMOTE volumes. You'll also notice the documentation doesn't mention that individual volumes can be deleted. So we were in a serious jam. I looked up the issue on ADSM.org and found the following command posted by a contributor.

DELete VOLHistory TODate=TODAY Type=REMOTE VOLume=NT1904 FORCE=Yes

This command allowed us to delete the individual volume entry and freed the tape to return to scratch status. Thank goodness for search.ADSM.org or I'd never find the answer to half my problems. The command worked as advertised: the tapes were deleted from the old library manager's volume history and went back to scratch.
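If you have a pile of REMOTE volumes to clean up, a small loop around dsmadmc saves some typing. This is only a sketch: the admin ID and password are placeholders, and it assumes your dsmadmc level supports -dataonly=yes (otherwise parse the select output yourself):

for v in $(dsmadmc -id=admin -password=secret -dataonly=yes "select volume_name from volhistory where type='REMOTE'")
do
  dsmadmc -id=admin -password=secret -noconfirm "delete volhistory todate=today type=remote volume=$v force=yes"
done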

Monday, July 25, 2005

ETA Please!

Well, I just had to perform a restore for a small server and the speed at which it ran was atrocious. I mean it was slower than rush hour in LA. Let's discuss, and THIS TIME I WANT AND REQUEST FEEDBACK! It turned out a disk went bad on a webserver and the system admins requested a number of filesystem restores. The combined amount was about 22-25GB. OK! No problem! Well, that probably would have been the case if the restore request had come during the day, but it came in at night and the restore was competing with the nightly backups. Over a gig-ether (fiber gig-ether) connection I was able to get 1.3MB/s and an aggregate rate of 668KB/s. Do the math: 22-25GB at that aggregate rate works out to roughly 9-11 hours. The other thing that didn't help was that it was a web server with TONS of little objects. It's livable, but there were a lot of small files. The problem was everyone and their brother wanted an ETA. "How long? It's small! It should only take a couple hours max!" and so on. Well, people now want some solution to this situation, but of course the problem will be keeping it somewhat cheap. Even though everyone asks if we can halt the backups while we perform the restores, we all know that's not really a viable option, so I came up with this idea; tell me what you think. Since major restores are few and far between, I am proposing we create a new VLAN and run a single cable to each row of servers in the server room with enough slack to stretch to any server in the row. If a restore is required we simply plug in the "restore" connection, set an IP, and away it rips. When finished we put the system back on its assigned backup network, roll up the excess ethernet cord, and place it in the rack of the server in the middle of the row. I am only thinking of this for major restores, and since I am not requesting that we buy more NIC's I think it's doable. Let me know what restore process you have in place for when the network is saturated. I'd love suggestions!

Tuesday, July 19, 2005

TSM Server Upgrade Issue

Just this last week a co-worker and I were trying to get a TSM 5.3 upgrade working. The server kept saying it needed the dsmserv upgradedb command run against it. Every time we tried, the DB upgrade failed with a TIVGUID error. Since it was eating into backup time we rolled back to 5.2.4. Unfortunately the dsmserv upgradedb attempt caused 5.2.4 to report that the DB was at a higher level and TSM could not work with it - even though the upgrade on 5.3 had failed. Since support had not responded at this point, we restored the DB from a full+incremental taken just before the upgrade (WHEW! LUCKY WE HAD THAT!). When Tivoli finally responded this is what we were told:

This is a known problem with the 5.3.0 upgrade.

From TSM Support:

The problem that you are probably experiencing is a known problem with the 5.3 upgrade. If an admin has an expired password, or if a password is too short for the 5.3 password enforcement, then the upgrade can fail. Here are some steps that usually fix the problem:

1) Disable AES using the hidden option AllowAES No in dsmserv.opt file.
2) Re-initiate the upgrade db
3) Start the server. Preferably lock out all sessions & other activity
4) Use show node to identify admin & node ids that have expired or have passwords that are too short. Fix these.
(4 Alternative) Set the minimum password length to 0
5) Halt the server
6) Remove the AllowAES option
7) Start the server -- this will upgrade the passwords to AES encryption in the background

If you follow these steps then the upgrade db will probably continue successfully.

After this, start the server in the background as usual and run a DB backup.


SO ALL THIS OVER A PASSWORD!
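For reference, the relevant pieces look roughly like this; the admin name and password are placeholders, and remember to pull the hidden option back out once the upgrade completes:

# in dsmserv.opt, during the upgrade only
ALLOWAES NO

# once the server is up, from an admin session
set minpwlength 0
update admin some_admin new_password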

Thursday, July 07, 2005

FYI: Using Mixed Media In An LTO Library

We recently upgraded one of our 3584's at work with some LTO2 drives. This is our first attempt at a mixed media environment and, according to IBM and the following Redbook technote here, you must work out the issues with MOUNTLIMITs or all the LTO2 drives could end up in use when you need them, since they can also read LTO1 media. So we set our mount limits accordingly, but it does not help that we have more LTO1 drives than LTO2, and setting the mount limit to the number of LTO1 drives didn't stop TSM from accessing the LTO2 drives. This they did not explain well, and they left one crucial piece out of the puzzle: when using a mixed media library, if you do not specifically state which media format to use, TSM will use both LTO1 and LTO2 media in an LTO-designated storage pool. You want me to explain further? Ok, here is how it affected us. We added LTO2 drives and an additional 2 frames to our library and set up the devclasses accordingly, setting the FORMAT to DRIVES so each mount would use the highest format available on the drive assigned. Well, since we didn't partition the library, TSM will grab whatever drive comes available and mount the appropriate scratch tape. So if TSM assigns an LTO1 drive then you'll use an LTO1 tape. The only way to force LTO2 media to be used is to set the FORMAT setting in the devclass to ULTRIUM2 or ULTRIUM2C (w/compression). We didn't think about that and were bitten by it when we ran out of LTO1 scratch. We didn't catch it because our script only looked for scratch in the library and couldn't distinguish between the two media types (which you can really only do if a different volser series is used for labeling). So without LTO1 scratch we basically lost 2/3 of the drives and didn't know it. I had to switch our script to monitor both scratch types and we had to force the devclasses to their appropriate formats. Once I realized what was happening it was a "NO BRAINER" that TSM would work that way. The big problem is that TSM did not separate the LTO2 media format out into a different devclass type, like they did with the 3592's. So be aware of how TSM works and make sure you don't make the same mistake I did.
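In other words, if you want a pool to write only LTO2 media you have to pin the format on its device class instead of relying on FORMAT=DRIVES. A sketch with made-up library, devclass, and pool names:

define devclass lto1class library=lib3584 devtype=lto format=ultrium
define devclass lto2class library=lib3584 devtype=lto format=ultrium2c
define stgpool lto2pool lto2class maxscratch=200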

Saturday, June 18, 2005

Restoring NT Shares

Well, it inevitably happens that a large fileserver is being replaced, goes down, or goes bad and you have to restore it. If you are restoring to the same type of hardware then life is beautiful and NT/2000/2003 has no problems and acts like the good little boy it should be, but what happens when you have to restore to new hardware, or are refreshing the server to newer, more powerful hardware? This is when NT can be worse than that bratty little kid I wanted to strangle in the movie Problem Child. Personally, John Ritter should have shot him and buried him in the back yard, but I digress. So how can you restore shares or any other piece of the registry with TSM when NT doesn't like having the registry of a machine with different hardware restored? Well, that's where a little ingenuity and some preemptive strategy come in handy. The easiest way is to set up a script using the command-line registry tool REG.EXE that runs a REG EXPORT KEYNAME each day and stores it as a file. Here is the string needed to back up the Shares key. (The following is all one line)

REG EXPORT HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Shares C:\BACKUP\SHARESEXP.TXT

The command backs up the Shares key to a folder I created on the C: drive called Backup, but you can put it wherever you like. A simple batch file will suffice to execute the export, and you can schedule it either through Windows daily or through TSM - yes, you can do that, and TSM will gladly back the export file up. This allows you to then restore the export file and re-import it into the server of choice. Just make sure that with any schedule where you are executing a command/batch file you put the full path to the file, or else TSM can fail to execute it, even if it's in the TSM BACLIENT directory.
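To drive the export from TSM, a client action-type schedule works. Everything below (the domain, schedule name, the exportshares.cmd wrapper around the REG EXPORT line, and the start time) is invented for the example:

define schedule standard shares_export action=command objects="c:\backup\exportshares.cmd" starttime=06:00
define association standard shares_export fileserver01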

Ok, not everyone has that process in place and needs to restore the Shares key now! Have no fear, TSM can help you accomplish it. Let's look at the process for this situation and I'll provide some links. First off, you can restore the registry without activating it. The commands to do so are:

RESTORE REGISTRY -ACTIVATE=NO (for Win2K/Win2003)

RESTORE REGISTRY ENTIRE -ACTIVATE=NO (for WinNT)

Now, for each process you can restore specific keys, but for ease of use the entire restore will work. TSM restores the registry to the c:\adsm.sys\computername\ directory. From this point you can load the hive containing the Shares key in RegEdit and export it to a file. Once exported, you should be able to import it into the current registry, thereby restoring the shares. Here is a Microsoft document that covers loading the hive and the process to use. There is also a good document on backing up and restoring the registry here. If you have any questions feel free to comment and I'll respond as soon as possible.

Tuesday, June 14, 2005

TSMManager

Well, we have been using TSMManager for some time now and I have to say I rely on it a lot for day-to-day monitoring of our servers. Our environment is pretty large, so we need something that will allow us to see problems quickly and easily. TSMManager has a slew of tools and the interface is fairly easy to use. We have a number of Jr.-level admins who are able to handle multiple TSM servers with TSMManager's help. The key piece of TSMManager is the administrator's console. It is a tabbed interface that lets me see things at a glance without having to query the activity log in search of the problem. The first tab is basically the admin interface running in console mode. The second tab is the greatest gift to admins in that it is an error tab: it parses out and shows all warnings, errors, and severe errors.


This is the error tab. It consolidates all errors into a simple interface - great for those times you're having server issues.

TSMManager (Cont.)

When not checking for errors, the majority of your time will be spent using the Admin tab. The admin tab allows for entering TSM commands, and you'll notice down the left side of the window there are preset queries you can issue to the TSM server you have selected. Below the preset commands you'll see the list of the servers you have defined to the TSMManager collector, and a simple click on the server you wish to monitor switches the ENTIRE console to that machine. It's instantaneous, and you can immediately select the other tabs to see what is happening on your server or enter commands to query for yourself. Another nice feature, shown in the picture below, is the ability to save commonly used commands and queries and select them from the drop-down list whenever you wish to use them.


Here we have the admin command line with stored and preset commands.

TSMManager is not DB based, so it has a few limitations. We have noticed, with it monitoring 30 large TSM servers, that the collector will have issues with the dsmadmc executables failing (TSMManager opens a dsmadmc on the collector server for each TSM server it is monitoring, up to 30 servers per collector). This could be the host server it is running on, but other than that the product has been a godsend for our workload and daily monitoring. I would recommend it for shops with limited TSM skills and those that need to give end users the ability to monitor backups. It has e-mailing and alert capabilities, individual TSM server reports, consolidated server reports (combining the info for all TSM servers monitored by TSMManager), and even a DRM/tape management feature for those without DRM. With TSMManager's admin web interface you can monitor your systems, and it also allows end users to log in to a secure client website that gives them READ-ONLY access to reports and results of server backups. I have had to use it a number of times to make our DBA's and application owners feel more secure about their systems' backups, and it works very well. There are numerous other features within TSMManager, but just the ones I've mentioned so far make it worth the price of purchase. If you feel like TSM is causing the onset of early-stage dementia, check out TSMManager; you'll be happy you did.

Monday, June 13, 2005

Just Google It!

Ok, if you have ever had a problem with TSM then you have probably wished Tivoli support was a little more responsive. I'll be honest with you: the large majority of my problems have been resolved with two web-based resources. The first is search.ADSM.org, which has always been a great resource for finding the answers to most problems. The second, and probably the least used, is Google. When I Google my problems I am amazed at how often I can find my answer. Sometimes the problems are OS related and not a true TSM issue, and Google helps me identify the correct resolution. I will say this: if ADSM.org and Google don't have the solution, check the TSM client README for any OS-dependent patches, since that has bitten me in the rear a number of times. Hey, Google is great, it's free, and it's faster at responding than Tivoli support. You might still have to get with support on the REALLY tough stuff, but before you go down that road try the alternatives and you'll be greatly surprised.

Saturday, June 11, 2005

The Case For Raw Volumes!

If you are serious about TSM server rebuild times and want the quickest way to get up and running, then I suggest you look into raw logical volumes for all your TSM DB, log, and storage pool needs. Of course if you are running on NT I can't say I know of any way TSM can use raw volumes, but in our AIX shop we live by them. The creation time is quick, and with a little script I can have my volumes created and ready for the DB restore in no time (a sketch is below). I have been down the road of DSMFMT and know how long large volumes can take to create, and since TSM does not like more than 16 volumes, some older Unix servers can take time to format. The other nice thing about raw volumes is that if the server crashes it's rare, except for disk failure, for volume corruption to occur. I have had too many dirty superblocks to deal with in my time, and I don't miss them. Remember, all TSM is really doing with DSMFMT is creating a file and, in a way, converting it back into raw. So why do the extra steps? Save yourself some time if you ever are in a true DR situation.
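The script itself is nothing fancy; on AIX it boils down to mklv plus a define against the raw character device. A minimal sketch with an invented volume group, names, and sizes:

# 64 logical partitions for the DB volume, 16 for the log
mklv -y tsmdb01 tsmvg 64
mklv -y tsmlog01 tsmvg 16

# then, from an admin session, point TSM at the raw devices
define dbvolume /dev/rtsmdb01
define logvolume /dev/rtsmlog01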

Wednesday, June 08, 2005

TSM 5.3 Server And The Return Of The Web Interface

There has been such an uproar over TSM 5.3 dumping the web interface that IBM/Tivoli quietly released documentation on how to turn it back on for 5.3 servers. The sad thing is that Tivoli, in their push to integrate WebSphere, did not look at the problems created when you take a critical process like backup and turn the tables on system and storage administrators overnight. The truth is that the new interface is clunky and adds time to the whole rebuild process if your people are not capable of using the command line. I have used the ISC, but not enough to have a solid verdict. It looks like it has some nice features, but by not making it somewhat similar to the old web interface they have only made a lot of people angry. Here is the documentation for turning the web interface back on for 5.3 servers. For now it works; we'll have to see how long they allow it.

Critical Windows Filesystem Issues

Well this is a copy from my other less TSM centric blog Storage Admin Blues but it needs to be known. Here is the post from February -

So I have been working a major data issue on a server with a certain Redmond operating system for almost a week now. As it turns out, the aforementioned OS has a known issue with volumes that contain over 4 million files. What happens is that the system will reboot after a patch or update is applied and on startup begin a check-disk. That wouldn't be so bad, but the check-disk strips permissions from almost all files. The response from Redmond was that we would have to restore the data if we want to fix the permissions. OK! Great! Restore 6 million files when the server is used 24/7. The volume in question has over 570GB of space used and it has a gig-Ethernet connection. I swear there is a conspiracy against storage administrators when it comes to restore SLA's. I got called in to fix the problem and have had almost no sleep for a week, and on top of that I have a virus that is causing me to cough incessantly and making it hard to breathe. Thank goodness for telecommuting or I'd be in the hospital by now. Let's just hope the people in charge listen this time (it has happened twice before) when we warn about volume size/file management. If it wasn't for Arrested Development and Scrubs I would go nuts.

The resolution for this problem was to restore the directory structures and then have the system admins apply a script that cascaded the permissions down to the files within the directory structure, since they all inherited their permissions from the parent folder. We also decided to change the environment to back up all directory structures to disk and retain them there as long as possible before migrating to tape, using the migration delay and migration continue features on the disk pool. Trust me, restoring directories off of tape is no picnic... very slow. If you have run into something like this let me know how you resolved it. Sharing info is how we learn more, and although my e-mail might say TSM Expert, I don't know everything.
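The disk pool side of that is just two parameters; the pool name and delay value here are examples, not what we actually run:

update stgpool dirpool migdelay=30 migcontinue=no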

TSM and Windows Volume Shadow Copy

First off, let me just say that the name TSMExpert was more of a joke. I was setting up aliases on Notes Mail and for fun put in TSMExpert and voila, I got it. I didn't think it would go through, but hey, as long as I have it I might as well use it. I can be reached at tsmexpert@us.ibm.com and will answer questions as much as possible. Remember, TSM is a fickle beast, but when set up correctly it is the best enterprise backup/archive tool on the market.

I have been working through many issues with Windows 2003 and the TSM client. We upgraded many of our clients to 5.2.4, and some to 5.3, due to the shadow copy issues. TSM seems to have serious issues with VSC and gives RC 12 on schedules when it's in use. I am not quite sure how VSC is supposed to help when it resides on the same disk as the data it's supposed to protect, but I get that it's more like the snapshot feature on NetApps. I would recommend you upgrade all 2003 servers to 5.3 if you are having this problem, since it is supposed to resolve the issue. Also be aware there are stability and reliability problems with older client versions on 2003 servers; we have experienced numerous crashes and System State backup failures with the older clients. This is another reason for upgrading.