formats – Stefano Costa

I found a CD and a floppy disk inside one of the lockers in our laboratory:

Legacy archaeological data and storage devices

They are from 8 years ago, and they contain data about the ceramic finds from the 2003 campaign in Gortyna. This looked like a very interesting find, from two related points of view:

legacy storage devices;
legacy data formats.

Storage devices

Nowadays only some desktop PCs are equipped with a floppy disk drive, and ordinary laptops have lacked one since at least 2004. I verified that the contents of the floppy disk and of the optical disk were exactly the same. My colleague therefore made a good choice when she decided to back up the floppy disk to the CD-R. With the exception of netbooks and tablets, all computers have an optical drive for CDs and DVDs, and they are probably going to have one for a long time.
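A check like the one described above, that two media hold exactly the same content, can be sketched in a few lines of Python. This is only an illustration, not a record of how the comparison was actually done: the function names are made up for the example, and it assumes both media are mounted as ordinary directories and that the files are small enough to read into memory.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file path (relative to root) to the SHA-256 digest of its bytes."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def same_content(root_a, root_b):
    """Return True when both trees contain the same files with identical bytes."""
    return tree_digests(root_a) == tree_digests(root_b)
```

With the floppy and the CD mounted at, say, `/media/floppy` and `/media/cdrom` (hypothetical mount points), `same_content("/media/floppy", "/media/cdrom")` would tell whether the backup is complete. File names and bytes are compared; timestamps and permissions are deliberately ignored.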

On the other hand, optical disks are the least resistant medium for digital archiving. Good-quality CDs have an average life of 10-15 years, if they are stored properly and never taken out of their case. This specific disk was manufactured by TDK, which is generally speaking a good-quality manufacturer. Magnetic disks (like 3.5″ floppy disks) can last much longer if they are kept away from electromagnetic sources (e.g. a TV, radio, PC, electric cables in general). Magnetic tape is generally considered the safest archival medium, but it’s not readily available to ordinary users.

Whatever storage medium is chosen, the true difference is actually made by the long-term archival strategy of your team or (better) institution. Having regular backups at secure and distant locations is perhaps easier nowadays with cloud providers like Dropbox and Ubuntu One, but a research institution should (must?) be able to define its own archival strategy under full control of its own staff. What I see in a lot of universities is instead that long-term storage of digital archives is handled at the level of individual teams, with a diversity of approaches. I’ve written elsewhere about existing general recommendations, and it’s no surprise that all of them are from the UK, Germany or the EU.

Data formats

As with storage devices, the issue of data formats is two-fold:

the digital formats in which information is stored;
the way information is structured.

Digital formats are well known and usually associated with file extensions, such as .jpg for JPEG images, .odt for OpenDocument text files and so on. Open formats are better than proprietary ones, especially in the light of digital preservation. Some formats are nevertheless so widespread that it’s impossible to avoid them: even if you don’t use them yourself, your colleagues will. This is the case with Microsoft Office file formats.

The data I found on the CD are in a Microsoft Excel spreadsheet (.xls extension). Apart from that, it’s just a table that could have been saved in any other format, including a plain-text comma-separated values file (.csv) or inside a relational database. By this I mean that the format in which data is translated into bits on a machine is one thing, and how information (or data, if you prefer) is structured inside a file or any other container, and how much it is actually structured, is another.
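To make the distinction concrete, here is a minimal sketch in Python that stores the same small table once as CSV text and once in a relational database (SQLite, used here only because it ships with Python), then reads it back unchanged from both. The column names and values are invented for the example and are not taken from the Gortyna dataset.

```python
import csv
import io
import sqlite3

# Hypothetical columns and values, invented for illustration only.
COLUMNS = ["sherd_id", "context", "fabric_id"]
ROWS = [
    ("S001", "US 12", "F1"),
    ("S002", "US 12", "F3"),
    ("S003", "US 40", "F1"),
]


def to_csv_text(columns, rows):
    """Serialise the table as plain-text CSV, header line first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def from_csv_text(text):
    """Read CSV text back into a list of row tuples, skipping the header."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header line
    return [tuple(row) for row in reader]


def via_sqlite(columns, rows):
    """Store the table in an in-memory SQLite database and read it back."""
    connection = sqlite3.connect(":memory:")
    column_list = ", ".join(columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f"CREATE TABLE sherds ({column_list})")
    connection.executemany(f"INSERT INTO sherds VALUES ({placeholders})", rows)
    query = f"SELECT {column_list} FROM sherds ORDER BY rowid"
    result = connection.execute(query).fetchall()
    connection.close()
    return result
```

Both `from_csv_text(to_csv_text(COLUMNS, ROWS))` and `via_sqlite(COLUMNS, ROWS)` give back the original rows: the container changes, the structure of the information does not.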

Let’s take a look at this dataset then. As is common in spreadsheets, there are three separate sheets:

drawings catalogue;
fabrics catalogue;
general catalogue of all diagnostic sherds.

The third one is the main table and contains more than 660 entries from 59 contexts. It also has pointers to the two previous tables, so this should have been a relational database if it had been done right. Or not. A relational database made in 2003 would have been either Microsoft Access or FileMaker (still two incredibly popular choices). In both cases, recovering data if you don’t own a copy of the software is almost impossible. Then? Hooray for relational databases in spreadsheets! In other words, always try to use formats that can be accessed by a variety of programs, and leave the logic of your data inside the data themselves instead of assuming that other people will be able to infer it by means of conventional knowledge.
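The pointers between sheets can still be checked without a database engine. The sketch below, again in Python, assumes each sheet has been exported to a list of dictionaries (for example with `csv.DictReader`) and that the main table refers to the fabrics catalogue through a column called `fabric_id`. That column name, the function name and the sample rows are all hypothetical.

```python
def dangling_references(main_rows, lookup_rows, key):
    """Return the sorted key values used in main_rows but absent from lookup_rows."""
    known = {row[key] for row in lookup_rows}
    # Empty or missing cells are not treated as pointers.
    used = {row[key] for row in main_rows if row.get(key)}
    return sorted(used - known)


# Invented sample rows standing in for the exported sheets.
FABRICS = [{"fabric_id": "F1"}, {"fabric_id": "F2"}]
SHERDS = [
    {"sherd_id": "S001", "fabric_id": "F1"},
    {"sherd_id": "S002", "fabric_id": "F9"},  # points to a fabric not in the catalogue
    {"sherd_id": "S003", "fabric_id": ""},    # no fabric recorded
]
```

Here `dangling_references(SHERDS, FABRICS, "fabric_id")` returns `["F9"]`, the one pointer with no matching catalogue entry. A relational database would enforce this with a foreign key; in a spreadsheet the same rule has to be checked explicitly.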

I was able to recover the entire archive and it’s now being integrated into our current documentation system.
