Friday, January 06, 2006

Archiving - The Great Debate!

I knew when I decided to cover archiving that many of you would have comments on it, so I am expecting your feedback and a good discussion on the matter. The problems I see with archiving are many, but most companies fall into one of three categories:

Category 1 - They don’t know what they need archived
  • They don’t know where all the data needing archiving is
  • They don’t know how long they should keep their data
  • They don’t know where to begin to discover data that meets archive requirements
  • They don’t want to spend the money on a true archiving solution

Category 2 - They somewhat know what they need archived
  • They know some of the data needing archiving
  • They don’t have all data identified
  • They have an idea on how to identify data needing long term retention
  • They are willing to investigate purchasing a true archiving solution

Category 3 - They know exactly what they need archived
  • They know all data needing archiving
  • They know retention times for all data
  • They constantly review systems and apps to identify data that meets archiving requirements
  • They are willing to pay for a true archiving solution

The major problem with archiving is how many companies fall within the first two categories, and getting them to category 3 is sometimes impossible. It’s amazing to see huge companies scratch their heads and get that perplexed look when you discuss archiving with TSM. Most people come from the old school of taking a weekend or month-end full backup and keeping it forever. They think this protects them in case they need any data in the future, so the customer follows this pattern and doesn’t see the problems inherent in that scenario. The problems are numerous, but the most glaring one is that 99.99% of the data on those weekend or month-end tapes will never be needed, and you are now paying a huge amount of money for tapes and offsite storage. Now introduce TSM into the mix: the customer or management is accustomed to the old process and wonders why they now need to identify their data for archiving.

Let’s be honest: most companies become overwhelmed when asked to do discovery on the specific locations and data types that should be archived, so to make it easier and less work they try to make TSM conform to the old process. Unfortunately, as TSM admins we tend either not to argue the case or, when we do dissent, to be overruled. So you end up doing backupsets (if you are actually archiving whole machines, please see a psychologist immediately) and relying on the customer to keep his restores to a minimum. The problem is that backupsets sound good: they give management a false sense of security, which gets them off your back, and they are independent of any particular TSM server. The truth is that they stink! Backupsets are the worst archiving process you could use. Sure, Tivoli has supposedly updated them to make them more functional in 5.3, but the truth is you still end up using too many resources, wasting tape, and paying more for offsite storage due to that increase in tape usage. We won’t even talk about the restore times, or what happens when the syntax is wrong. So backupsets are the wrong solution for anything but DR needs or portability.

TSM is an adequate archiving tool. It does a good job for small to moderate archiving, but when the customer needs very descriptive metadata stored with the archives to make retrieval easier, you need an enterprise tool like IBM Content Manager, Xerox DocuShare, or one of the many others out there. The problems always seem to come down to cost. What do you do when the customer or management can’t part with the money to truly protect themselves? That is where you need to work with them to make sure that the data they need archived is explicitly identified, that retention requirements are met, that management classes and include statements are used to match data with retention times, that the owners of the data are documented for future reference, and that the documents and contact information are reviewed at least once per year. I had a situation where data was being archived and, a few years down the road, someone asked for data. The person who had been managing the archive had left the company, and no one knew what process was in place or what data was being archived. They didn’t know who all the owners of the data were, and the previous manager had not done any transition or handover to other personnel.
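
One way to keep that ownership and retention documentation from going stale is to hold it as structured records and flag the ones nobody has reviewed in the past year. The sketch below is only an illustration: the record fields, paths, contact addresses, and management class names (ARCH_7YR, ARCH_3YR) are all hypothetical, and nothing here talks to TSM.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ArchiveEntry:
    path: str            # location of the data being archived (hypothetical)
    owner: str           # person or team answerable for the data
    contact: str         # how to reach the owner
    mgmt_class: str      # management class the data is bound to (made-up name)
    retention_days: int  # agreed retention period
    last_reviewed: date  # when this record was last confirmed with the owner


def overdue_for_review(entries, today, max_age_days=365):
    """Return the entries whose documentation is older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    return [e for e in entries if e.last_reviewed < cutoff]


inventory = [
    ArchiveEntry("/finance/ledgers", "Finance", "finance-ops@example.com",
                 "ARCH_7YR", 2555, date(2005, 11, 1)),
    ArchiveEntry("/hr/records", "HR", "hr-admin@example.com",
                 "ARCH_3YR", 1095, date(2004, 6, 15)),
]

stale = overdue_for_review(inventory, today=date(2006, 1, 6))
for e in stale:
    print(f"{e.path}: owner {e.owner}, last reviewed {e.last_reviewed}")
```

Even a flat file like this would have answered the questions in the story above: what is archived, for how long, and who to ask about it.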

You need to do constant review and audit of archiving processes and standards. Too many times requirements, laws, and applications change and you find yourself without the data required. Archiving tends to be like Ron Popeil’s Rotisserie: “Set it and forget it!” This is the breaking point. As a TSM admin even I have fallen into the trap of forgetting about archive jobs and assuming they are working, so I had to make changes to how we handled archive jobs and retention. Typically I recommend reviewing requirements and processes at least twice a year, if not quarterly. This will hopefully allow you to identify any issues with new data brought online, changes in requirements, and application changes. Schedules need to be reviewed, shell scripts need to be checked, and archive data should periodically be audited to make sure the jobs are performing correctly. DO NOT RELY ON YOUR SCHEDULE EVENT RECORDS! THEY DON’T GIVE YOU A COMPLETE PICTURE!

What if the customer or management decides to change the location where the data is stored? What do you do when the customer or management wants data archived from a directory weekly but does not want the data to be deleted? What if the customer wants data kept online (in the library) and also wants a copy sent offsite? These are the issues you will have to deal with as you work with archives.

If you were expecting solutions and answers, I only have suggestions. There is no one way to do archiving, so you have to find the process that best fits your needs. The key is helping your company or customer understand what is best for them even if they don’t initially like what they hear. When it comes to data the customer is not always right. Of course you can’t make the company or customer do exactly what you’d like, but you’ll have to do your best to help them understand how much they stand to lose if they don’t follow the right procedures.
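
As a starting point for the periodic audit recommended above, you can compare the dates on which a weekly archive job should have produced data against the dates for which archives actually exist. The sketch below assumes you have already pulled the list of archive dates from the server's view of the stored archives (not from the schedule event records); how you obtain that list, and the sample dates themselves, are assumptions. Matching on exact dates is a simplification.

```python
from datetime import date, timedelta


def missing_runs(found_dates, start, end, interval_days=7):
    """List expected run dates from start to end (inclusive) with no archive."""
    found = set(found_dates)
    missing = []
    current = start
    while current <= end:
        if current not in found:
            missing.append(current)
        current += timedelta(days=interval_days)
    return missing


# Hypothetical dates on which archives for one weekly job were actually found.
found = [date(2005, 12, 2), date(2005, 12, 9), date(2005, 12, 23)]

gaps = missing_runs(found, start=date(2005, 12, 2), end=date(2005, 12, 30))
for d in gaps:
    print(f"no archive found for {d}")
```

A gap in this list is a prompt to go and look, not proof of a failure: the job may have run on a different day, or the source directory may have moved.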


  1. Hi all,

    Well, archiving is a tricky question.
    My largest client is an IT company (which means computer-literate people), and to be honest, it took us (their backup administrator and me) more than a year to get any usable specification from their management. The worst thing you can hear is "everything, forever".

    ... back to archives ...
    IMHO the worst problem with archives is that they are DB-space eaters. If you need to make an archive every week and you have to archive a million files, that means roughly a 0.5 GB DB increase per week ... If you are lucky (and I am), the customer can pack the files into big zip/tar/... files and you end up with a few huge files, which is much better :)

    Backupsets, on the other hand, take just one entry in the DB. The other good thing is that you can use them without a server, with only the client code installed. I think the ideal use of backupsets is DR at remote locations: create a backupset and send it there with a TSM client CD. If you have to restore GBs over a slow link, it is the only solution.
    The bad thing (at least in my environment) is the time you need to create a backupset: on some of my servers it takes days ...

    So which one to choose? I do not know of any "golden rule". I want to hear your opinions :) It all begins with the data structure.
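
The database arithmetic in the comment above can be sketched as follows. The only input taken from the comment is the rough figure of 0.5 GB of DB growth per million archived files; the per-object size derived from it, the assumption that growth is linear in object count, and the "ten bundles per week" alternative are illustrative assumptions, not documented TSM constants.

```python
def db_growth_gb(objects_per_run, bytes_per_object, runs):
    """Estimated DB growth in GB, assuming a fixed cost per archived object."""
    return objects_per_run * bytes_per_object * runs / 1e9


# Implied by the comment: 0.5 GB per million files, i.e. about 500 bytes each.
BYTES_PER_OBJECT = 0.5e9 / 1_000_000

weekly = db_growth_gb(1_000_000, BYTES_PER_OBJECT, runs=1)
yearly = db_growth_gb(1_000_000, BYTES_PER_OBJECT, runs=52)
# Same data packed into ten tar/zip bundles per week (hypothetical number).
packed = db_growth_gb(10, BYTES_PER_OBJECT, runs=52)

print(f"weekly: {weekly:.2f} GB, yearly: {yearly:.1f} GB, packed: {packed:.6f} GB")
```

Under these assumptions a year of weekly archives costs about 26 GB of database space when every file is its own object, and a negligible amount when the files are bundled first.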


  2. Hi all again :)

    I want to discuss an idea I had with one of my colleagues.

    As a lot of our servers are replaced over time and their names are reused, we had to figure out how to archive them.

    What we are testing now is:
    1) we have designated one node as an "Archive server": it is a client running on the TSM server itself, as this node is never going to be decommissioned.
    2) we archive the data from other nodes using a script that runs (among other things)
    dsmc archive -virtualnodename=archive_node_name objects -description=real_node_name/date/description ....

    So we know where to find all archives, we can delete/manage them using the description field, and with archives we have no problem with identical paths (as would happen with backups) ... and so on

    When we first came up with this idea, I was wondering whether the "virtualnodename" option would work with archiving (it seems to me it is designed for restore/retrieve purposes) ... but it works :)

    Our archive retention policy is the same for all files, so no problem there ...

    What do you think?


  3. Harry,

    In the case of decommissioning servers, I do like backupsets. In my experience the customer rarely requests the data from a decommed server, since the data is either from an app no longer in use or was moved to another, newer server, so no real data is gone from the environment. So I like backupsets in this case because they meet the request without being too much of a headache. Does the customer really expect to need the data frequently? If so, why is it considered decommissioned?

  4. Hi,
    Archived data are mostly access logs, needed frequently for legal purposes. They have to be kept for several months/years (new EU legal requirements). The services/applications producing these logs are being developed and retired over time, so the servers are reconfigured frequently. In a few months no one is going to know whether a given service was running on serverA or serverB, so finding the archive can be tricky. Using the service name and date as the description field in the archive, and storing all archives under one node, seems to be a solution.
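
The scheme discussed in these comments (one long-lived archive node, with the service name and date carried in the description) could be wrapped in a small helper that builds the command line consistently. This is a sketch only: it uses just the two options quoted in the comment above (-virtualnodename and -description), it builds the argument list without running anything, and the node name, path, and service name are hypothetical.

```python
from datetime import date


def build_archive_command(archive_node, objects, service, run_date, note=""):
    """Build a dsmc archive argument list with a service/date/note description."""
    # Skip the note when it is empty so the description has no trailing slash.
    parts = [service, run_date.isoformat(), note]
    description = "/".join(p for p in parts if p)
    return [
        "dsmc", "archive",
        f"-virtualnodename={archive_node}",
        *objects,
        f"-description={description}",
    ]


cmd = build_archive_command(
    archive_node="ARCHIVE_NODE",        # hypothetical long-lived node name
    objects=["/var/log/webshop/*"],     # hypothetical path to the logs
    service="webshop",
    run_date=date(2006, 1, 6),
    note="access-logs",
)
print(" ".join(cmd))
```

Keeping the description format in one place matters more than the helper itself: if every script builds it the same way, whoever inherits the archives can search by service and date without knowing which server produced them.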