Introduction

The multi data set archiver is a tool that archives several data sets together in chunks of relatively large size. When a group of data sets is selected for archiving, it is verified that their combined size is within the configured limits, and they are then stored as one big container file (tar) on the destination storage.

When unarchiving data sets from a multi data set archive, the following rules are obeyed:

At the moment deletion from the multi data set archive is not supported. That is, DeleteFromArchiveMaintenanceTask (see Maintenance Tasks) will throw a NotImplementedException.

To test the archiver, find the data sets you want to archive in the openBIS GUI and use "add to archive".

Important technical details

The archiver requires configuration of three important entities.

The multi data set archiver is not compatible with other archivers. You should have all data available before configuring this archiver.

Workflows

The multi data set archiver can be configured for four different workflows. The workflow is selected by the presence/absence of the properties staging-destination and replicated-destination.

Simple workflow

Neither of the properties staging-destination and replicated-destination is present.

  1. Wait for enough free space on the archive destination.
  2. Store the data sets in a container file directly on the archive destination.
  3. Perform sanity check. That is, fetch the container file back to the local disk and compare its content with the content of all data sets in the store.
  4. Add mapping data to the PostgreSQL database.
  5. Remove data sets from the store if requested.
  6. Update archiving status for all data sets.
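
For orientation, the sketch below shows how the simple workflow might be set up in the DSS service.properties. Only staging-destination and replicated-destination are property names taken from this page; the plugin class and the other keys (final-destination, minimum-container-size-in-bytes, maximum-container-size-in-bytes) are assumptions based on a typical openBIS setup and should be verified against the archiver documentation of your openBIS version.

    # Sketch only: apart from staging-destination/replicated-destination the key names are assumptions
    archiver.class = ch.systemsx.cisd.openbis.dss.generic.server.plugins.standard.archiver.MultiDataSetArchiver
    # Archive destination where the tar container files are written
    archiver.final-destination = /mnt/archive/multi-dataset
    # Size window the selected data sets have to fit into together (example values)
    archiver.minimum-container-size-in-bytes = 10000000000
    archiver.maximum-container-size-in-bytes = 80000000000
    # Neither staging-destination nor replicated-destination is set, so the simple workflow is used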

Staging workflow

Property staging-destination is specified but replicated-destination is not.

  1. Store the data sets in a container file in the staging folder.
  2. Wait for enough free space on the archive destination.
  3. Copy the container file from the staging folder to the archive destination.
  4. Perform sanity check.
  5. Remove container file from the staging folder.
  6. Add mapping data to the PostgreSQL database.
  7. Remove data sets from the store if requested.
  8. Update archiving status for all data sets.
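
To switch to the staging workflow, only the staging-destination property has to be added to the sketch above; the path is just an example for a fast local scratch area.

    # Local staging folder where the container file is built before it is copied to the archive
    archiver.staging-destination = /mnt/local-scratch/archiver-staging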

Replication workflow

Property replicated-destination is specified but staging-destination is not.

  1. Wait for enough free space on the archive destination.
  2. Store the data sets in a container file directly on the archive destination.
  3. Perform sanity check.
  4. Add mapping data to the PostgreSQL database.
  5. Wait until the container file has also been copied (by some external process) to a replication folder.
  6. Remove data sets from the store if requested.
  7. Update archiving status for all data sets.
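
Analogously, the replication workflow is selected by adding only the replicated-destination property to the sketch above; this is the folder into which some external process copies the container files.

    # Replication folder watched by the archiver; the copying itself is done by an external process
    archiver.replicated-destination = /mnt/archive-replica/multi-dataset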

Some remarks:

Staging and replication workflow

When both properties staging-destination and replicated-destination are present, the staging and replication workflows are combined.

Clean up

In case archiving fails, all half-baked container files have to be removed. By default this is done immediately.

In the context of tape archiving systems (e.g. Strongbox), however, immediate deletion might not always be possible. In this case a deletion request is scheduled. The request is stored in a file and handled in a separate thread at regular time intervals (polling time). If deletion is still not possible after some timeout, an e-mail is sent. Such a deletion request will still be handled, but the e-mail allows manual intervention/deletion. Note that deletion requests for non-existing files are always handled successfully.
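
The deletion-request handling described above is controlled by a few cleaner settings (directory for the request files, polling time, timeout, notification e-mail address). The sketch below only illustrates the kind of parameters involved; the property names themselves are placeholders and have to be looked up in the archiver documentation.

    # Placeholders only: the real property names must be taken from the archiver documentation
    archiver.cleaner.deletion-requests-dir = /mnt/local/dss-deletion-requests
    archiver.cleaner.deletion-polling-time = 10 min
    archiver.cleaner.deletion-time-out = 2 d
    archiver.cleaner.email-address = admin@example.org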

Configuration steps

Recovery from corrupted archiving queues

If the queues with the archiving commands get corrupted, they cannot be used any more; they have to be deleted before the DSS starts, and new ones will be created. The typical scenario where this happens is running out of space on the disk where the queues are stored.

The following steps describe how to recover from such a situation.

  1. Find the data sets that are in 'ARCHIVE_PENDING' status.

    SELECT data_id, size, present_in_archive, share_id, location FROM external_data WHERE status = 'ARCHIVE_PENDING';
     
    openbis_prod=> SELECT data_id, size, present_in_archive, share_id, location FROM external_data WHERE status = 'ARCHIVE_PENDING'; 
     data_id |    size     | present_in_archive | share_id |                               location                                
    ---------+-------------+--------------------+----------+-----------------------------------------------------------------------
        3001 | 34712671864 | f                  | 1        | 585D8354-92A3-4C24-9621-F6B7063A94AC/17/65/a4/20170712111421297-37998
        3683 | 29574172672 | f                  | 1        | 585D8354-92A3-4C24-9621-F6B7063A94AC/39/6c/b0/20171106181516927-39987
        3688 | 53416316928 | f                  | 1        | 585D8354-92A3-4C24-9621-F6B7063A94AC/ca/3b/93/20171106183212074-39995
        3692 | 47547908096 | f                  | 1        | 585D8354-92A3-4C24-9621-F6B7063A94AC/b7/26/85/20171106185354378-40002


  2. The data sets found may or may not still be in the archiving process. This is not easy to find out instantly; it is easier just to execute the above statement again on subsequent days.

  3. If the data sets are still in 'ARCHIVE_PENDING' after a sensible amount of time (one week, for example) and there are no other issues, such as the archiving destination being unavailable, there is a good chance that they are really stuck in the process.
  4. At this point, the data sets are most likely still in the data store, at the place indicated by the combination of share ID and location. Verify this! If they are not there, hope that they have been archived; otherwise you are in trouble.
  5. If they are in the store, you need to set their status back to available using an SQL statement.

     openbis_prod=> UPDATE external_data SET status = 'AVAILABLE', present_in_archive = 'f'  WHERE data_id IN (SELECT id FROM data where code in ('20170712111421297-37998', '20171106181516927-39987')); 

     

    If there are half-copied container files on the archive destination, these need to be deleted too. To find them, run the following queries.


    -- To find out the containers:
     
    SELECT * FROM data_sets WHERE CODE IN('20170712111421297-37998', '20171106181516927-39987', '20171106183212074-39995', '20171106185354378-40002');
    
    multi_dataset_archive_prod=> SELECT * FROM data_sets WHERE CODE IN('20170712111421297-37998', '20171106181516927-39987', '20171106183212074-39995', '20171106185354378-40002');
     id  |          code           | ctnr_id | size_in_bytes 
    -----+-------------------------+---------+---------------
     294 | 20170712111421297-37998 |      60 |   34712671864
     295 | 20171106185354378-40002 |      61 |   47547908096
     296 | 20171106183212074-39995 |      61 |   53416316928
     297 | 20171106181516927-39987 |      61 |   29574172672
    (4 rows)
    
    multi_dataset_archive_prod=> SELECT * FROM containers WHERE id IN(60, 61);
     id |                    path                     | unarchiving_requested 
    ----+---------------------------------------------+-----------------------
     60 | 20170712111421297-37998-20171108-105339.tar | f
     61 | 20171106185354378-40002-20171108-130342.tar | f
     
    

    NOTE: We have never seen it, but if there is a container with data sets in different archiving statuses, you need to recover the ARCHIVED data sets from the container and copy them manually back to the data store before being able to continue. The following query lists all data sets that share a container with the affected ones:

    multi_dataset_archive_prod=> SELECT * FROM data_sets WHERE ctnr_id IN(SELECT ctnr_id FROM data_sets WHERE CODE IN('20170712111421297-37998', '20171106181516927-39987', '20171106183212074-39995', '20171106185354378-40002'));


  6. After deleting the files, clean up the multi data set archiver database.

    multi_dataset_archive_prod=> DELETE FROM containers WHERE id IN (SELECT ctnr_id FROM data_sets WHERE CODE IN('20170712111421297-37998', '20171106181516927-39987', '20171106183212074-39995', '20171106185354378-40002'));