Monday, July 25, 2005

ETA Please!

Well, I just had to perform a restore for a small server, and the speed at which it ran was atrocious. I mean, it was slower than rush hour in LA. Let's discuss, and THIS TIME I WANT AND REQUEST FEEDBACK!

It turned out a disk went bad on a web server, and the system admins requested a number of filesystem restores totaling about 22-25GB. OK! No problem! That probably would have been true if the restore request had come in during the day, but it came in at night, so the restore was competing with the nightly backups. Over a gigabit Ethernet (fiber) connection I was able to get 1.3 MB/s, with an aggregate rate of 668 KB/s. Do the math and you'll see it took a long time. The other thing that didn't help was that it was a web server with TONS of little objects; it's livable, but a lot of small files slows a restore down. The real problem was that everyone and their brother wanted an ETA. "How long? It's small! It should only take a couple hours max!" and so on.

Now people want a solution to this situation, but of course the catch will be keeping it somewhat cheap. Even though everyone asks if we can halt the backups while we perform the restores, we all know that's not really a viable option, so I came up with this idea; tell me what you think. Since major restores are few and far between, I am proposing we create a new VLAN and run a single cable to each row of servers in the server room, with enough slack to stretch to any server in the row. If a restore is required, we simply plug in the "restore" connection, set an IP, and away it rips. When finished, we put the system back on its assigned backup network, roll up the excess Ethernet cord, and stow it in the rack in the middle of the row. I'm only thinking of this for major restores, and since I'm not asking that we buy more NICs, I think it's doable. Let me know what restore process you have in place for when the network is saturated. I'd love suggestions!
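For anyone who wants to actually "do the math": here's a minimal back-of-the-envelope sketch (my own illustration, not any TSM tool) that turns a restore size and an observed throughput into a rough wall-clock estimate. It assumes binary units (1 GB = 1024 × 1024 KB) and a steady transfer rate, so it ignores the extra per-file overhead that all those small web objects add.

```python
def restore_eta_hours(size_gb, rate_kb_per_s):
    """Rough restore ETA: payload size divided by observed throughput."""
    size_kb = size_gb * 1024 * 1024          # GB -> KB, binary units
    seconds = size_kb / rate_kb_per_s        # steady-rate assumption
    return seconds / 3600.0

# 22 GB at the ~668 KB/s aggregate rate seen while backups were running
print(round(restore_eta_hours(22, 668), 1))   # roughly 9.6 hours
```

That's why "a couple hours max" was never in the cards: at the contended nightly rate the 22 GB alone works out to nearly ten hours, before you even account for small-file overhead.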


  1. I think that would work; it's not the prettiest solution, but it's workable. Maybe another solution would be to identify critical and/or large clients and create a permanent backup VLAN for those clients. It would probably be tough to implement for all your nodes, so evaluating on a case-by-case basis may be best.

  2. Yeah, we have dedicated backup VLANs for all large and medium servers. The problem is we have four, and they all become saturated during backup hours. So the idea was to have a dedicated restore VLAN, but to keep costs down by not connecting it to every server and instead making a drop available for each row in the server room.

  3. Chad, you seem set on blaming the infrastructure. I don't have all the information, but how can you be sure that the TSM server itself was not the restore bottleneck because it was busy with backups? E.g., if the TSM server has only one gigabit Ethernet channel, or four channels, that are saturated during nightly backups, then adding a special channel to the client won't help.
    Just my opinion.



  4. I don't discount the TSM server being a bottleneck. By no means is it running full tilt during backups, but the throughput was definitely affected by the backups running: response time improved as backups over that particular VLAN subsided. As you know, when customers start screaming about their data, no matter what the restore SLA is, no one sticks with the program and everyone starts hounding you on how long it will take. So to alleviate any restore issues due to backups, I am exploring having the separate restore VLAN. Of course, since this article, we are finally getting our SAs and architects to look into etherchanneling (teaming) two cards for better throughput.