TSM Topics Feed

Friday, March 28, 2014

Poor Performance

Currently I work in an environment where we have a dedicated TSM instance for a large SAP DB (99TB currently). We just upgraded the drives in the tape library (yes, we use tape! I know... I know...) from Magstar 3592 TS1130 (E06) drives to TS1140 (E07) drives. The upgrade was pushed in hopes of a jump in write/backup performance, but I was skeptical. TSM adds so much overhead that you cannot expect the raw tape read/write numbers any manufacturer quotes. Typically IBM is somewhat reasonable with their numbers, but in this case I have seen NO performance increase whatsoever. Here is a query of the processes for the storage pool backup.

UPDATE (04/04/2014): Let me give you some more specs. We have the 99TB DB split between 4 TSM storage agents, each with four 8Gb HBAs. Each storage agent runs 4 sessions (allocating 4 drives) for its backup process, so the 4 storage agents account for 16 simultaneous sessions, and it still takes over 24 hours to perform the 99TB backup. The backups are averaging around 70-78MB/sec. Is this a TSM overhead issue or do I have a tuning issue with the TDP and TSM? I'm getting less than 50% of the throughput I should see.
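
As a sanity check on those numbers (my arithmetic, assuming the 70-78MB/sec figure is per session and 1TB = 1,048,576MB):

99 TB ≈ 103,809,024 MB
16 sessions x 78 MB/s ≈ 1,248 MB/s aggregate
103,809,024 MB / 1,248 MB/s ≈ 83,180 s ≈ 23.1 hours

That lines up with the 24-hour wall clock almost exactly, so the reported throughput is at least internally consistent. At the roughly 200MB/sec an E07 drive should sustain, the same math gives 16 x 200 = 3,200 MB/s, or about 9 hours.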

Here's the command that is run to execute the DB backup:

ksh -c 'export DB2NODE=7; db2 "backup db DB8 LOAD /usr/tivoli/tsm/tdp_r3/db264/libtdpdb264.a OPEN 4 SESSIONS OPTIONS /db2/DB8/dbs/tsm_config/vendor.env.7 WITH 14 BUFFERS BUFFER 1024 PARALLELISM 8 WITHOUT PROMPTING"; echo BACKUP_RC=$?'
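
For context, here is my reading of the tuning handles in that command (worth double-checking against your DB2 level):

OPEN 4 SESSIONS   # TSM API sessions, i.e. tape drives used by this partition
WITH 14 BUFFERS   # number of in-memory backup buffers
BUFFER 1024       # size of each buffer in 4KB pages, so 4MB apiece
PARALLELISM 8     # tablespaces read in parallel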

    PROCESS: Backup Storage Pool
 START_TIME: 03-27 23:21:54
   DURATION: 00 23:20:13
      BYTES: 6.0TB
 AVG_THRPUT: 75.87 MB/s

    PROCESS: Backup Storage Pool
 START_TIME: 03-27 23:21:55
   DURATION: 00 23:20:12
      BYTES: 6.2TB
 AVG_THRPUT: 78.48 MB/s

    PROCESS: Backup Storage Pool
 START_TIME: 03-27 23:21:55
   DURATION: 00 23:20:12
      BYTES: 6.2TB
 AVG_THRPUT: 77.99 MB/s

    PROCESS: Backup Storage Pool
 START_TIME: 03-27 23:21:55
   DURATION: 00 23:20:12
      BYTES: 6.4TB
 AVG_THRPUT: 80.13 MB/s

I average anywhere from 75 to 80 MB/sec. According to IBM's Magstar performance chart, I take a small performance hit for using JB media rather than JC.

Even so, with JB media I should be able to get as high as 200MB/sec, but I am not even at 50% of that number. Is there any specific tuning parameter I should look at that could be hindering the performance?
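
For anyone wondering what I mean by TDP tuning, these are the first knobs I would look at in the Data Protection for SAP profile (the .utl file referenced by vendor.env). This is a sketch only; the values below are illustrative assumptions, not our production settings:

# initSID.utl - illustrative values, not production settings
MAX_SESSIONS    4        # must line up with OPEN n SESSIONS in the db2 backup command
MULTIPLEXING    4        # interleave data files per session to keep a fast drive streaming
RL_COMPRESSION  NO       # skip software compression; the E07 drives compress in hardware
BUFFSIZE        131072   # data buffer size in bytes

MULTIPLEXING is usually the big one for fast drives like the E07: a single data file often cannot feed 200MB/sec on its own, so interleaving several files per session is what keeps the drive out of start/stop mode.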

FYI - The backup of the 99TB DB runs LAN-Free using 16 tape drives over 26 hrs.
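
While chasing this, one cheap LAN-Free sanity check (available in TSM 5.3 and later) is to have the server validate the storage agents' LAN-Free path definitions. The node and storage agent names here are placeholders:

validate lanfree SAP_DB8_NODE USTSM07_STA1

It reports whether each destination storage pool is actually LAN-Free capable for that node; a path quietly falling back to LAN traffic would show up as exactly this kind of throughput loss.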

Friday, January 10, 2014

New TSM Admin In The House!

Just thought I should let everyone know that my wife and I had a son on December 3rd. The holidays and the lead-up to his birth have kept me busy. My son makes 8 kids total, so I'm a very busy man. Don't worry, I shall return, but the last 9 months have been a blur.

Sunday, December 08, 2013

Full TSMExplorer for TSM version 5 is free now

I got this info from Dmitry Dukhov, the creator of TSMExplorer.

The registration procedure for obtaining the free license, along with the TSMExplorer download for TSM version 5, is available at http://www.s-iberia.com/download.html

Tuesday, October 22, 2013

Archive Report

Where I work we have a process that generates a mksysb bi-monthly and then archives it to TSM. Recently an attempt to use an archived mksysb revealed that the mksysb process sometimes does not create a valid file, but the file is archived to TSM anyway. So the other AIX admins asked me to generate a report showing how much data was archived and on what date. Now, I would have told them it was impossible if they had asked for data from the backup table, but our archive table is not as large as the backups table, so I gave it a go.

The first problem was determining the best table(s) to use. I could use the summary table, but it doesn't tell me which schedule ran, and some of these UNIX servers have archive schedules other than the mksysb process. The idea I came up with was to query the contents table and join it with the archives table on the object_id field. Here's an example of the command:

select a.node_name, a.filespace_name, a.object_id, cast((b.file_size/1048576) as decimal(9,2)) as SIZE_MB, cast((a.archive_date) as date) as ARCHIVE from archives a, contents b where a.node_name=b.node_name and a.filespace_name='/mksysb_apitsm' and a.filespace_name=b.filespace_name and a.object_id=b.object_id and a.node_name like 'USA%'

This select takes at least 20 hours to run across 6 TSM servers. I guess I should be happy it returns at all, but TSM is DB2 now! It should be a lot faster, so I am wondering if I can clean up the query or add something that would let it use the indexes more effectively. I am considering dropping the "like" and just matching node_name between the two tables. Would matching node_name first and then object_id be faster? Would I be better off running it straight out of DB2 (a sketch of that follows below)? Suggestions appreciated.
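
To sketch that "straight out of DB2" option: on TSM 6.x the server database is DB2 (database TSMDB1), and the instance owner (tsminst1 is just the usual default name) can connect directly. IBM does not support direct access, and I am assuming the archives/contents views are visible under the TSMDB1 schema, so treat this as an experiment for a test instance only:

su - tsminst1
db2 connect to tsmdb1
db2 set schema tsmdb1
db2 "select a.node_name, a.object_id, cast(b.file_size/1048576 as decimal(9,2)) as size_mb, cast(a.archive_date as date) as archive from archives a join contents b on b.node_name=a.node_name and b.filespace_name=a.filespace_name and b.object_id=a.object_id where a.filespace_name='/mksysb_apitsm' and a.node_name like 'USA%'"

The real payoff is not the query itself but that DB2's explain facilities can then show whether the join is driving off the object_id index or scanning the contents table.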

Monday, August 12, 2013

TSM Command Processing Tip

I constantly have to run a large list of commands and sometimes just don't want to deal with running them through a shell script. So what's the best way to run a list of commands without having to deal with TSM prompting for a YES/NO confirmation? I can use a batch command with the -NOPROMPT option from an admin command line, but sometimes that's more work than I want to deal with. There has to be a better way. Well, the simple answer is to define the TSM server to itself and use server routing in the command when you run it. Here's an example: I have to delete empty volumes from storage pools rather than wait out the one-day reuse delay.

select 'ustsm07:del vol', cast((volume_name) as char(8)) as VOLNAME from volumes where pct_utilized=0 and devclass_name <> 'DISK'


Unnamed[1]           VOLNAME   
----------------     --------- 
ustsm07:del vol      K00525    
ustsm07:del vol      K00526    
ustsm07:del vol      J00789    
ustsm07:del vol      J00197    
ustsm07:del vol      J00303    
ustsm07:del vol      J01172    
ustsm07:del vol      J01233    
ustsm07:del vol      J00850    
ustsm07:del vol      J00861    
ustsm07:del vol      K00018    
ustsm07:del vol      J01613    
ustsm07:del vol      J01624    
ustsm07:del vol      J01671    
ustsm07:del vol      J01687    
ustsm07:del vol      K00116    
ustsm07:del vol      K00130    
ustsm07:del vol      K00340    
ustsm07:del vol      K00348 

tsm: USTSM07>USTSM07:del vol       K00525
ANR1699I Resolved USTSM07 to 1 server(s) - issuing command DEL VOL K00525 against server(s).
ANR1687I Output for command 'DEL VOL K00525' issued against server USTSM07 follows:
ANR2208I Volume K00525 deleted from storage pool TAPE_A.
ANR1688I Output for command 'DEL VOL K00525' issued against server USTSM07 completed.
ANR1694I Server USTSM07 processed command 'DEL VOL K00525' and completed successfully.
ANR1697I Command 'DEL VOL K00525' processed by 1 server(s):  1 successful, 0 with warnings, and 0 with errors.

So I copy the output and paste it into my command line, and because I am using server routing (even to the same server I am on) TSM does not prompt for confirmation. So make sure you have defined your TSM servers to themselves so you can take advantage of this simple feature. Also note that TSM won't delete a tape that still holds data unless you add DISCARDDATA=YES, so I leave that option off and only EMPTY tapes get deleted.
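
For reference, the one-time setup is something like this (the password, address, and port below are placeholders; the server name should be your own):

set serverpassword secret
define server USTSM07 serverpassword=secret hladdress=127.0.0.1 lladdress=1500

After that, any administrative command prefixed with USTSM07: gets routed, and routed commands are never stopped for a confirmation prompt.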

Wednesday, July 31, 2013

IBM P7 Strange Behaviour

We have a P7 frame with 4 LPARs that are used as TSM storage agents, from which snapshots of our SAP DBs are mounted for backup. They had always had great performance until one LPAR had a bad HBA that phoned home and was replaced. After the replacement, backup throughput dramatically decreased from 800MB/s to 150MB/s, and overall performance of the server would drastically drop. When the DB requiring backup is over 25TB, that is a huge hit, and we could not find the root cause.

At first IBM said our Hitachi disk was the problem. We eliminated that right away, so we then replaced the new HBA, checked our fiber, and checked the GBIC, and nothing fixed the situation. During the first week I asked the IBM service technician if we could possibly have a bad drawer or slot, and he emphatically said "No! If you did you would have errors all over the place." So we checked firmware, we moved cards within the frame (again), we double-checked the fiber; now we were going into the third week. I kept asking if something could be wrong with the drawer/slots and I kept getting the same answer. The reason I suggested it was previous experience: I have seen hardware go bad without totally going "out".

After exhausting everything other than replacing the slots, IBM finally replaced the slots. Voila! Backup speeds went back to normal and the system degradation during the backup disappeared. So the slots/drawer was the issue. No errors pointing to a slot/drawer hardware problem were ever logged, but something caused the slots to degrade performance. It took almost a month to resolve the issue; I wouldn't say IBM support was very thorough, and at times they tried to push the problem off to other vendors (i.e. Hitachi). I can only suggest that in the future you trust your instincts and push the CEs to follow down every avenue. My headache is over, but now the RCA begins.