Monday, January 23, 2012

TSM Backup Issue

Has anyone had an issue where their backups were extremely slow and their receive interrupts were huge? I've got 400GB DBs taking 40hrs to back up over a 4-port EtherChannel connection. No errors in my AIX errpt, and the network guys are telling me they don't think it's them. Any suggestions on what to look at are appreciated. Below is an example of what I see when I run entstat.

ETHERNET STATISTICS (en8) :
Device Type: IEEE 802.3ad Link Aggregation
Hardware Address: 00:14:5e:e7:26:41
Elapsed Time: 9 days 19 hours 20 minutes 35 seconds

Transmit Statistics:                          Receive Statistics:
--------------------                          -------------------
Packets: 5470416553                           Packets: 24510516113
Bytes: 440661650021                           Bytes: 32245892708954
Interrupts: 0                                 Interrupts: 6027433898
Transmit Errors: 0                            Receive Errors: 691
Packets Dropped: 0                            Packets Dropped: 0
                                              Bad Packets: 0
Max Packets on S/W Transmit Queue: 298
S/W Transmit Queue Overflow: 0
Current S/W+H/W Transmit Queue Length: 355

Broadcast Packets: 8786                       Broadcast Packets: -1346793420
Multicast Packets: 225928                     Multicast Packets: 136913
No Carrier Sense: 0                           CRC Errors: 0
DMA Underrun: 0                               DMA Overrun: 691
Lost CTS Errors: 0                            Alignment Errors: 0
Max Collision Errors: 0                       No Resource Errors: 0
Late Collision Errors: 0                      Receive Collision Errors: 0
Deferred: 141004                              Packet Too Short Errors: 0
SQE Test: 0                                   Packet Too Long Errors: 0
Timeout Errors: 0                             Packets Discarded by Adapter: 0
Single Collision Count: 0                     Receiver Start Count: 0
Multiple Collision Count: 0
Current HW Transmit Queue Length: 355

General Statistics:
-------------------
No mbuf Errors: 0
Adapter Reset Count: 0
Adapter Data Rate: 1701737521
Driver Flags: Up Broadcast Running
        Simplex 64BitSupport ChecksumOffload
        PrivateSegment LargeSend DataRateSet



20 comments:

  1. Hi Chad,

    Have you tried an FTP copy from the source to the TSM server to check that the throughput is what you expected?

    ReplyDelete
  2. Can you please post your options files (.sys and .opt) and the output of a QUERY SYSTEM?

    ReplyDelete
  3. Please post your .opt and .sys files, run a QUERY SYSTEM, and try a
    backup direct to tape/disk...

    and any relevant info.

    Thanks

    ReplyDelete
  4. The problem is with the backup server, not my clients. All my clients are being affected by this, and it seems to be occurring on ALL my TSM servers on the backup network VLAN. The network admins say they don't see any errors, so I'm left with the settings in AIX. Any ideas on what I should look at?

    ReplyDelete
  5. Hi Chad,

    As always, first question is: what changed?

    If you can, try ivorblognow's suggestion and copy a large file over via FTP or scp. That takes TSM and the DB out of the equation.

    Also, can you clear the stats (entstat -r) and see what kind of rate you are getting? Not sure if the interrupts are anything to be concerned about.
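
    Something like this would give you a rate over a known window (just a sketch; en8 is the aggregate from your entstat output):

      entstat -r en8             # reset the adapter statistics
      sleep 60                   # let a minute of backup traffic pass
      entstat en8 | grep Bytes   # receive bytes / 60 = average bytes/sec for the window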

    Also also, what is the output of `netstat -p tcp`?

    Good luck,

    Tom

    ReplyDelete
  6. We cleared the stats and the receive interrupts immediately started climbing again. I tried the FTP test, but nothing conclusive.

    ReplyDelete
  7. Transferring a 1GB file across the backup network averaged 17.4MB per second. No errors.

    ReplyDelete
  8. Check my math; here's what I'm getting from your posted entstat output:

    9 days 19 hours 20 minutes 35 seconds

    (((9 * 24) + 19) * 60 + 20) * 60 + 35 = 847235 seconds

    32245892708954 bytes / 847235 seconds = 38,060,151 bytes/sec

    Average network throughput of 290.4 Mb/s

    ReplyDelete
  9. Well, if you take

    38,060,151 bytes / 1024 (KB) / 1024 (MB) = 36.3MB/sec

    ReplyDelete
  10. 36.3MB/sec * 8b/B = 290.4Mb/s

    So... assuming the system is running flat-out all the time, you might be maxing out a GigE connection there.

    What mode is that aggregated link?
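
    On AIX you can check with something like this (assuming ent8 is the underlying EtherChannel device behind en8):

      lsattr -El ent8 -a mode -a hash_mode   # aggregation mode and hash policy

    Keep in mind that whatever the hash mode, any single flow will only ever use one 1Gb member link.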

    ReplyDelete
  11. Hi,
    DMA Overrun: 691 indicates that your PCI bus is overloaded, or that there is not enough CPU to service the 4Gb aggregated network adapter on your system.

    ReplyDelete
  12. Interestingly enough, I didn't notice the DMA overruns; I got caught up in the interrupt count. I'll definitely look into that, because I'm seeing it on all my TSM servers in the Power 6 frame. I think we may need to tweak the VIO servers' settings.
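
    A first thing I'll probably check on the VIOS side is the virtual Ethernet buffer pools (standard AIX virtual adapter tunables; the device name here is just an example):

      lsattr -El ent0 | grep buf               # current buffer pool min/max values
      chdev -l ent0 -a max_buf_small=4096 -P   # example increase, applied at next reboot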

    ReplyDelete
  13. I have the same issue with TSM 6.2.1.

    ReplyDelete
  14. Hi.
    Did you consider that the issue could be on the node's DB side?
    Please try to transfer (archive) 50GB before, during, and after the DB backup.
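
    For example, from the client command line (the file path is just a placeholder):

      dsmc archive /some/50gb/testfile -description="throughput test"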

    ReplyDelete
  15. We're having a similar problem, and have so far tracked it down to the client side.

    Try using iperf. Run iperf -s on the TSM server, and "iperf -c $tsmname -l 1M -w 10M" on the client. This will tell you how the link is working between client and server. Iperf will either eliminate or spotlight the network between client and server.

    Assuming you're writing to disk pools, be sure to test your disk w/ something like dd.
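
    For example (the pool path is a placeholder; write a file big enough to get past any cache):

      time dd if=/dev/zero of=/tsmpool/dd.test bs=1024k count=10240   # ~10GB; divide by elapsed time for MB/s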

    Keep in mind that LACP is not a load-balancing protocol. If this client is one IP, then it's only talking to one port of your LACP interface on one VIO server.

    On the DMA overrun, we see this regularly, but interestingly, it's when the adapters are congesting at 107-109MB/s.

    ReplyDelete
  16. We had a similar problem, seeing the exact same number of "Receive Errors" and "DMA Overrun" errors.

    Our admin found our TSM Server was running with these AIX default values:
    rfc1323 = 0
    tcp_recvspace = 16384
    tcp_sendspace = 16384

    This is significantly smaller than necessary for adequate throughput with our network setup (we also have an 802.3ad bonded link). He followed the "TSM Performance Tuning Guide v6.2" (GC23-9788-02), which states on pages 6-7 (22-23/90 in the PDF) that AIX should use a minimum 64KB window instead of 16KB.

    Here is the AIX config now, with greatly improved throughput:
    rfc1323 = 1
    tcp_recvspace = 64512
    tcp_sendspace = 64512
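
    If it helps, these can be set persistently with the standard AIX no command:

      no -p -o rfc1323=1 -o tcp_recvspace=64512 -o tcp_sendspace=64512

    One caveat: interface-specific network options (ISNO) set on the adapter can override these global values.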

    Hope this helps!

    Robert L.

    ReplyDelete
  17. Hi Chad,
    Did you get to the bottom of this? I am experiencing the same issue in a TSM 6.2.4 Windows 2008 R2 (no SP1) environment.

    ReplyDelete
  18. Hi Chad,

    I have a TSM server at 6.2.4.0 on AIX. I have 2 client nodes which are taking a long time to complete their system state backup during the daily incremental backup.
    The client OS is Windows 2008 and the BA client version is 6.2.1.3.

    Impact: my data expiration is taking almost 24 hours to complete, and it overlaps the next day's schedule because the previous process is still running. Checking the expiration activity in the summary table, I found 2 Wintel nodes which start as per the daily admin schedule in the morning but don't complete until the next morning, sometimes not until the afternoon. When I checked those two nodes, each has two different schedules: one for the normal filesystem backup and the other for the system state backup.



    Something like this:
    *******************************************
    tsm> q assoc G0_WIN TEMP_SYSSTATE

    Policy Domain Name    Schedule Name    Associated Nodes
    ------------------    -------------    -------------------
    G0_WIN                TEMP_SYSSTATE    PRS02991 PRS04811

    tsm> q sched n=PRS02991 f=d

    Policy Domain Name: G0_WIN
    Schedule Name: G0_WIN_INC_2000
    Description: NON-LIVE Schedule
    Action: Incremental
    Subaction:
    Options:
    Objects:
    Priority: 5
    Start Date/Time: 05/20/11 20:00:00
    Duration: 4 Hour(s)
    Schedule Style: Classic
    Period: 1 Day(s)
    Day of Week: Any
    Month:
    Day of Month:
    Week of Month:
    Expiration:
    Last Update by (administrator): IN026082
    Last Update Date/Time: 08/10/12 13:22:47
    Associated Nodes: PRS02991
    Managing profile:

    Policy Domain Name: G0_WIN
    Schedule Name: TEMP_SYSSTATE
    Description: Schedule for SYSTEMSTATE Backup
    Action: Backup
    Subaction: Systemstate
    Options:
    Objects:
    Priority: 5
    Start Date/Time: 11/23/12 17:30:00
    Duration: 15 Minute(s)
    Schedule Style: Classic
    Period: 1 Day(s)
    Day of Week: Any
    Month:
    Day of Month:
    Week of Month:
    Expiration:
    Last Update by (administrator): IN005166
    Last Update Date/Time: 11/23/12 16:31:08
    more...   (<ENTER> to continue, 'C' to cancel)

    Associated Nodes: PRS02991
    Managing profile:




    ******************************************

    Node details:

    PRS04811 --> development

    PRS02991 --> production

    OS level = Windows 2008 R2 Standard, SP1
    Memory 12GB, virtual server, 2 CPUs at 2.7GHz.

    RMDV application is running.

    1Gbps full duplex.

    Autonegotiation




    Client Version: Version 6, release 2, level 1.3


    *****************************

    Can you please help me resolve this, or give me any hints?

    ReplyDelete
  19. You could move the database to solid-state disk to increase performance.

    ReplyDelete
  20. Interesting case. Have you found the cause?

    ReplyDelete