Pottery and archaeology on the Web

Today marks five years since Tiziano Mannoni passed away.

There’s one thing that always characterised his work in publications and lectures: the need to visualise anything, from research processes to production processes and complex human-environment systems, in a schematic, understandable way. The most famous of these diagrams is perhaps the “material culture triangle”, in which artifacts, behaviours and meanings are the pillars on which archaeology is (or should be) based.

As a student, I was fascinated by those drawings, to the point of trying to create new ones myself. In 2012, in a rare moment of lucidity, I composed the diagram below, trying to bring together several loosely related activities I had been doing in the previous years. Not much has changed since then, but it’s interesting to look back at some of the ideas and the tools.


Pottery 2.0

Kotyle is the name I gave to a prototype tool and data format for measuring the volume/capacity of ceramic vessels. The basic idea is to make volume/capacity measurements machine-readable and to allow automated measurement from digital representations of objects (such as SVG drawings). Some of the ideas outlined for Kotyle are now available in usable form from the MicroPasts project, with the Amphora Profiling tool (I’m not claiming any credit for the MicroPasts tool, I just discussed some of the early ideas behind it). Kotyle is proudly based on Pappus’s theorems and sports Greek terminology whenever it can.
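The core of the idea fits in a few lines. Pappus’s second (centroid) theorem gives the capacity of a solid of revolution directly from its 2D profile: V = 2π · r̄ · A, where A is the area of the profile polygon and r̄ the distance of its centroid from the rotation axis. The sketch below is my own minimal illustration of the approach, not Kotyle’s actual API, and the profile representation is an assumption:

```python
import math

def pappus_volume(profile):
    """Capacity of a solid of revolution from a closed profile polygon.

    `profile` is a list of (r, z) points, with r = distance from the
    rotation axis. By Pappus's centroid theorem: V = 2 * pi * r_bar * A.
    """
    # Shoelace formula for signed area and r-coordinate of the centroid
    a = cr = 0.0
    for (r0, z0), (r1, z1) in zip(profile, profile[1:] + profile[:1]):
        cross = r0 * z1 - r1 * z0
        a += cross
        cr += (r0 + r1) * cross
    a /= 2.0
    cr /= 6.0 * a  # centroid distance from the axis
    return 2.0 * math.pi * abs(cr * a)

# A cylinder of radius 5 and height 10, as a rectangular profile:
# expected capacity = pi * 5**2 * 10 = 250 * pi ≈ 785.4
print(pappus_volume([(0, 0), (5, 0), (5, 10), (0, 10)]))
```

The same function works for any profile extracted from a drawing, which is exactly why machine-readable drawings matter.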

SVG drawings of pottery are perhaps the only finalised item in the diagram. I presented this at CAA 2012 and the paper was published in the proceedings volume in 2014. In short: stop using [proprietary format] and use SVG for your drawings of pots, vases, amphoras, dishes, cups. If you use SVG, you can automatically extract geometric data from your drawings ‒ and maybe calculate the capacity of one thousand different amphoras in one second. Also, with SVG you can put links to other archaeological resources, such as stratigraphic contexts, bibliographic references, photographs, production sites etc., directly inside the drawing, by means of metadata and RDFa.
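As a sketch of what machine-readable drawings buy you, here is how a profile and an embedded link could be pulled out of an SVG file with nothing but the Python standard library. The `data-context` attribute and its URI are hypothetical stand-ins (real RDFa would be richer):

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"

# A toy drawing: the vessel profile as a <polyline>, carrying a link
# to a (hypothetical) stratigraphic context resource.
svg = """<svg xmlns="http://www.w3.org/2000/svg">
  <polyline id="profile" points="0,0 5,0 5,10 0,10"
            data-context="http://example.org/contexts/US100"/>
</svg>"""

root = ET.fromstring(svg)
profile = root.find(f"{{{SVG_NS}}}polyline")
# Parse the points attribute into (r, z) coordinate pairs
points = [tuple(float(n) for n in pt.split(","))
          for pt in profile.get("points").split()]
print(points)
print(profile.get("data-context"))  # the linked archaeological resource
```

From here, the coordinate pairs can feed any geometric computation, capacity included.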

Linked Open Archaeological Data (with the fancy LOAD acronym) is without doubt the most ambitious idea and ‒ unsurprisingly ‒ the least developed. Based on my own experience studying and publishing archaeological data from excavation contexts, I came up with a simplified (see? I did this more than once) ontology, building on what I had seen in ArchVocab (by Leif Isaksen), that would enable the publication of ceramic typologies and type-series on the Web, linked to their respective bibliographic references and production centres (Pleiades places, obviously), and then expand to virtually any published find, context, dig or site. Everything would be linked, machine-readable and obviously open. Granularity is key here, and it is perhaps the only thing that is missing (or could be improved) in Open Context. GQBWiki gives a narrow view of what this might look like for a single excavation project. I don’t see anything similar to LOAD happening in the near future, however, so I hope that stating its virtual existence can help nurture further experiments in this direction.
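To make “everything linked and machine-readable” concrete, here is a toy sketch of what LOAD-style statements could look like serialised as N-Triples. Every `example.org` URI and the `ntriple` helper are my own inventions for illustration, not part of any existing LOAD tooling; only the Pleiades URI pattern is real:

```python
def ntriple(s, p, o):
    """Serialise one statement as an N-Triples line.
    http(s) objects become URI references, anything else a plain literal."""
    obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
    return f"<{s}> <{p}> {obj} ."

statements = [
    ntriple("http://example.org/ware/ARS",
            "http://www.w3.org/2000/01/rdf-schema#label",
            "African Red Slip ware"),
    # Production centre linked to a Pleiades place URI
    ntriple("http://example.org/ware/ARS",
            "http://example.org/vocab/producedAt",
            "https://pleiades.stoa.org/places/314921"),
]
print("\n".join(statements))
```

The point is granularity: once each type, context and place has its own URI, statements like these can accumulate from many publications and still be queried as one graph.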

The original case study for LOAD is ARSILAI: African Red Slip in Late Antique Italy, my master’s thesis. The web-based application I wrote in Django naturally became the inspiration for a published resource that could be constantly updated, based on feedback and contributions from the many scholars in the field of late Roman pottery. Each site, dig, context, sherd family, sherd type and ware has a clean URI, with sameAs links where available (e.g. sites can be Pleiades places, digs can be FastiOnLine records). Bibliographic references are just URIs of Zotero resources, since creating bibliographic databases from scratch is notoriously a bad idea. In 2012 I had this briefly online on an AWS free tier server, but since then I have never found the time to deploy it again (in the meantime, the release lifecycle of Django and other dependencies means I need to upgrade parts of my own source code to make it run smoothly again).

One of the steps I had taken to make the web application less resource-hungry when running on a web server was to abandon Matplotlib (which I otherwise love and have used extensively) and create the plots of chronology distribution with a JavaScript library, based on JSON data: the server just creates a JSON payload from the database query instead of a static image rendered by Matplotlib. GeoJSON as an alternate format for sites was another small but useful improvement (it can be loaded by mapping libraries such as Leaflet and OpenLayers). One of the main aims of ARSILAI was to show the geospatial distribution of African Red Slip ware, with the relative and absolute quantities of finds. Quantitative data is the actual focus of ARSILAI, with all the implications of using sub-optimal “data” from literature, sometimes 30 years old (although, honestly, most current publications of ceramic contexts are terrible at providing quantitative data).
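This is not the actual ARSILAI source, just a sketch of the GeoJSON idea: rows from a hypothetical site query become a FeatureCollection that Leaflet or OpenLayers can load directly, with quantities attached as feature properties:

```python
import json

# Hypothetical rows as they might come back from a database query:
# (site name, longitude, latitude, sherd count)
rows = [
    ("Site A", 10.32, 43.54, 120),
    ("Site B", 12.49, 41.89, 45),
]

collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name, "sherds": count},
        }
        for name, lon, lat, count in rows
    ],
}
print(json.dumps(collection, indent=2))
```

Serving this payload instead of a pre-rendered image moves all plotting work to the client, which is exactly what made the application lighter on the server.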

So the last item in the “digital approaches to archaeological pottery” toolbox is statistics. Developing open source function libraries for R and Python that deal with commonly misunderstood methods like estimated vessel equivalents and their statistical counterpart, pottery information equivalents (pie-slices). Collecting data from body sherds with one idea in mind (assessing quantity based on the volume of pottery, calculated from weight and thickness sherd by sherd), only to find an unintended phenomenon that I think was previously unknown: sherd weight follows a log-normal or power-law distribution, at any scale of observation. Realising that there is not one way to do things well, but rather multiple approaches to quantification depending on what your research question is, including the classic trade networks but also depositional histories and household economics. At this point, it’s full circle: the diagram is back at square one.
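A back-of-the-envelope illustration of the sherd-weight observation, with simulated weights standing in for a real assemblage (the parameters are invented): in a log-normal sample the mean sits well above the median, while the log-transformed values are roughly symmetric:

```python
import math
import random
import statistics

random.seed(0)

# Simulated sherd weights in grams; log-normal by construction here,
# standing in for weights recorded sherd by sherd in a real assemblage.
weights = [random.lognormvariate(2.5, 0.8) for _ in range(1000)]

# Crude skewness proxy: mean/median ratio, ~1 for a symmetric sample.
raw_ratio = statistics.mean(weights) / statistics.median(weights)
logs = [math.log(w) for w in weights]
log_ratio = statistics.mean(logs) / statistics.median(logs)

print(round(raw_ratio, 2), round(log_ratio, 2))
```

On real data the interesting step is the reverse: seeing the raw weights straighten out under a log transform, whatever the scale of observation.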

Being a journal editor is hard

I’ve been serving as co-editor of the Journal of Open Archaeology Data (JOAD) for more than a year now, since I joined Victoria Yorke-Edwards in the role. It is my first editorial role at a journal. I am learning a lot, and the first thing I have learned is that being a journal editor is hard: it takes time, effort and self-esteem. I had been meaning to write down a few thoughts for months, and today’s post by Melissa Terras about “un-scholarly peer review practices […] and predatory open access publishing mechanisms” was an unavoidable inspiration (go and read her post).

Some things are peculiar to JOAD, such as the need to ensure data quality at a technical level; often, though, improvements on the technical side reflect substantially on the general quality of the data paper. Some requirements may seem trivial, like using CSV for tabular data instead of PDF, or describing the physical units of each column/variable. Often, archaeological datasets related to PhD research are not forged in highly standardised database systems, so there may be small inconsistencies in how the same record is referenced in various tables. In my experience so far, reviewers look at data quality even more than at the paper itself, which is a good sign: they are assessing the “fitness for reuse” of a dataset.
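A trivial example of the kind of fix involved, using an invented sherd table: the physical units live in the column names themselves (`weight_g`, `rim_diameter_mm`), so the CSV is self-describing rather than relying on the paper’s text:

```python
import csv
import io

# A made-up sherd table; every column name carries its unit.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["context_id", "ware", "weight_g", "rim_diameter_mm"])
writer.writerow(["US100", "ARS", "34.2", "180"])
writer.writerow(["US101", "coarse ware", "121.9", ""])
print(buf.getvalue())
```

Any statistics package can ingest this directly, which is the whole point of preferring it over a table typeset in a PDF.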

The data paper: you have to try authoring one before you get a good understanding of how a good data paper is written and structured. Authors seem to prefer terse and minimal descriptions of the methods used to create their dataset, taking many steps for granted. The JOAD data paper template is a good guide to structuring a data paper and to the minimum required metadata, but we have seen authors rely almost exclusively on the default sub-headings. I often point reviewers and authors to some published JOAD papers that I find particularly good, but the advice isn’t always heeded. True, the data paper is a rather new and still unstable concept of the digital publishing era: Internet Archaeology has been publishing some beautiful data papers, and I like to think there is mutual inspiration in this regard. Data papers should be a temporary step towards open archaeology data as the default, and continuous open peer review as the norm for improving the global quality of our knowledge, wiki-like. However, data papers without open data are pointless: choose a good license for your data and stick with it.

Peer review is the most crucial and exhausting activity: as editors, we have to give a first evaluation of the paper based on the journal scope and then proceed to find at least two reviewers. This requires a broad knowledge of ongoing research in archaeology and related disciplines, including very specific sub-fields of study ‒ our list of available reviewers is quite long now, but there’s always some unknown territory to explore, so asking other colleagues for help and suggestions is vital. Still, there is a sense of inadequacy, a variation on the theme of impostor syndrome, when you have a hard time finding a good reviewer: someone who will provide the authors with positive and constructive criticism, becoming truly part of the editorial process. I am sorry that our current publication system doesn’t allow for the inclusion of both the reviewers’ names and their commentary ‒ that’s the best way to give readers an immediate overview of the potential of what they are about to read, and a very effective reward system for reviewers themselves (I keep a list of all the peer reviews I do, but that doesn’t feel as satisfying). Peer review at JOAD is not double-blind, and I think it would often be ineffective and useless to anonymise a dataset and a paper in a discipline so territorial that everyone knows who is working where. It is incredibly difficult to get reviews in a timely manner, and while some of our reviewers are perfect machines, others keep us (editors and authors) waiting for weeks after the agreed deadline has passed. I understand this, of course, being too often on the other side of the fence. I’m always a little hesitant to send e-mail reminders in such cases, partly because I don’t like receiving them, but being an annoyance is kind of necessary here.
The reviews are generally remarkable in their quality (at least compared to my previous editorial experience), quite long and honest: if something isn’t quite right, it has to be pointed out very clearly. As an editor, I have to read the paper, look at the dataset, find reviewers, wait for reviews, solicit reviews, read reviews, and sometimes have a conversation with reviewers to make sure their comments are clear and their phrasing/language is acceptable (an adversarial, harsh review must never be accepted, even when formally correct). All this is very time-consuming, and since journal (co)editor is an unpaid role at JOAD and other overlay journals at Ubiquity Press (perhaps obvious, perhaps not!), this usually means procrastinating: summing the dose of impostor syndrome that comes from criticising the review of a more experienced colleague with the dose that comes from being always late on editorial deadlines yields frustration. Lots. Of. Frustration. When you see me tweet about a new data paper published at JOAD, it’s not an act of deluded self-promotion, but rather a liberating moment of achievement. All this may sound naive to experienced practitioners of peer review, especially those with academic careers. I know, and I would still like to see a more transparent discussion of how peer review should work (not on StackExchange, preferably).

JOAD is Open Access ‒ true Open Access. The distinction that matters is not between gold and green (a dead debate, it seems) but between two radically different outputs. JOAD is openly licensed under the Creative Commons Attribution license, and we require that all datasets be released under open licenses, so readers know they can download, reuse and incorporate published data in their new research. There is no “freely available only in PDF”: each article is primarily presented as native HTML and can be obtained in other formats (including PDF and EPUB). We could do better, sure ‒ for example, provide the ability to interact directly with the dataset instead of just linking to the repository ‒ and I think we will be giving more freedom to authors in the future. Publication costs are covered by an Article Processing Charge of £100, to be paid by the authors’ institutions; where this is not possible, the fee is waived. Ubiquity Press is involved in some of the most important current Open Access initiatives, such as the Open Library of Humanities, and most importantly does a wide range of good things to ensure research integrity, from article submission to … many years in the future.

You may have received an e-mail from me inviting you to contribute to JOAD, either by submitting an article or by making yourself available as a reviewer ‒ or you may receive one in the next few weeks. Either way, you have now had a chance to learn what goes on behind the scenes at JOAD.

Joining the Advisory Board of the Journal of Open Archaeology Data

I’m joining the Advisory Board of the Journal of Open Archaeology Data.

The Journal of Open Archaeology Data (JOAD ‒ @up_joad) is an open access, peer-reviewed journal for data papers describing deposited archaeological datasets. JOAD is published by Ubiquity Press, whose

flexible publishing model makes humanities journals affordable, and enables researchers around the world to find and access the information they need, without barriers.

Ubiquity Press began publishing at University College London (UCL) and is now the largest open access publisher of UCL journals.

JOAD aims to bridge the gap between standard publishing processes and the dissemination of open data on the Web, by following existing standards (such as DOI) and pushing for a novel approach to the publication of datasets, based on data papers that describe the methods used to obtain and create the data, the way it is structured, and its potential for reuse by others.

As its name implies, JOAD is not a data repository: your dataset should already be deposited with one of the recommended repositories, which will take care of its digital preservation. As with most open access journals, it’s the author(s) who pay the costs involved in the publishing process, not the readers. JOAD aims to be a low-cost and effective way to disseminate your data to a wide audience, without the limitations and slowness of pre-existing publication venues.

A look at pollen data in the Old World

Since the 19th century, the study of archaeobotanical remains has been very important for combining “strictly archaeological” knowledge with environmental data. Pollen data make it possible to assess the introduction of certain domesticated plant species, or the presence of other species that typically grow where humans dwell. Not all pollen data come from archaeological fieldwork, but the relationship between the two sets is strong enough to take an interested look at pollen data worldwide, their availability and, most importantly, their openness, for which we follow the Open Knowledge Definition.

The starting point for finding pollen data is the NOAA website.

The Global Pollen Database hosted by the NOAA is a good starting point, but apparently its coverage is quite limited outside the US. Furthermore, data from 2005 onwards aren’t available via FTP in simple, documented formats, but are instead downloadable as Access databases from another, external website. Calling MS Access databases a Bad Choice™ for data exchange is perhaps an understatement.

Unfortunately, a growing number of databases cover single continents or smaller regions, and their approaches to data dissemination show marked differences.

Americas

For both North and South America, you can get data from more than one thousand sites directly via FTP. There are no explicit terms of use. Usually, data retrieved from federal agencies are public domain data.

The README document only states NOTE: PLEASE CITE ORIGINAL REFERENCES WHEN USING THIS DATA!!!!!. Fair enough: the requirement for attribution is certainly compatible with the Open Knowledge Definition.

Europe

From the GPD website we can easily reach the European Pollen Database, which is however found at another website (and things can be even more confusing, given that the NOAA website has some dead links).

You can download EPD data in PostgreSQL dump format (one file for each table, with a separate SQL script, create_epd_db.sql). Data in the EPD can be restricted or unrestricted. That’s fine: let’s see how many unrestricted datasets there are. Following the database documentation, the P_ENTITY table contains the use status of each dataset:

steko@gibreel:~/epd-postgres-distribution-20100531$ cat p_entity.dump \
 | awk -F "\t" '{ print $5 }' | sort | uniq -c
 154 R
 1092 U

which is pretty good, because almost 88% of them are unrestricted (NB: I write most of my programs in Python, but I love one-liners that involve awk, sort and uniq). We could easily create an “unrestricted” subset and make it available for easy download to all those who don’t want to mess with restricted data.
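For the record, the same count in Python with `collections.Counter`, using a few invented rows as a stand-in for the real `p_entity.dump` (tab-separated, use status in the fifth column):

```python
from collections import Counter

# Fake rows standing in for p_entity.dump; the real file has 1246 of
# them, with the use status ("R" restricted / "U" unrestricted) fifth.
rows = [
    "1\tname\tx\tx\tU",
    "2\tname\tx\tx\tU",
    "3\tname\tx\tx\tR",
]
status = Counter(row.split("\t")[4] for row in rows)
print(status["U"] / sum(status.values()))  # share of unrestricted entries
```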

But what does “unrestricted” mean for EPD data? Let’s take a more careful look (emphasis mine):

  1. Data will be classified as restricted or unrestricted. All data will be available in the EPD, although restricted data can be used only as provided below.
  2. Unrestricted data are available for all uses, and are included in the EPD on various electronic sites.
  3. Restricted data may be used only by permission of the data originator. Appropriate and ethical use of restricted data is the responsibility of the data user.
  4. Restrictions on data will expire three years after they are submitted to the EPD. Just prior to the time of expiration, the data originator will be contacted by the EPD database manager with a reminder of the pending change. The originator may extend restricted status for further periods of three years by so informing the EPD each time a three-year period expires.

Sounds quite good, doesn’t it? “For all uses” is reassuring, and the short time limit is a good trade-off. The horror comes a few paragraphs below, with the following scary details:

  1. The data are available only to non-profit-making organizations and for research.

Profit-making organizations may use the data, even for legitimate uses, only with the written consent of the EPD Board, who will determine or negotiate the payment of any fee required.

Here the false assumption that only academia is entitled to perform research is taken for granted. And there are even more rules about the “normal ethics”: basically, if you use EPD data in a publication, the original data author should be listed among the authors of the work. I always thought citation and attribution were invented for that exact purpose, but it looks like they have a distinctly different approach to attribution. The EPD even decides what the “legitimate” uses of pollen data are (I can hardly think of any possible illegitimate use).

Africa

You write “Africa” but you read “Europe” again, because most research projects are run by French and English universities. For this reason, the situation is almost the same. What is even worse is that in developing countries there are far fewer people or organisations that can afford to buy those data, notwithstanding the fact that in regions under rapid development the study and preservation of environmental resources are of major importance.

Data are downloadable for individual sites using a search engine, in Tilia format (not ASCII, unfortunately). The problems come with the license, whose wording is almost exactly the same as for the EPD seen above:

Normal ethics pertaining to co-authorship of publications applies. The contributor should be invited to be a co-author if a user makes significant use of a single contributor’s site, or if a single contributor’s data comprise a substantial portion of a larger data set analysed, or if a contributor makes a significant contribution to the analysis of the data or to the interpretation of the results. The data will be available only to non-profit-making organisations and for research. Profit-making organisations may use the data for legitimate purposes, only with the written consent of the majority of the members of the Advisory board, who will determine or negotiate the payment of any fee required. Such payment will be credited to the APD.

Conclusions

As with dendrochronological data, there is a serious misunderstanding by universities and research centres of their role in society as places of research and innovation available to everyone. In other words, academia is a closed system producing data (at very high cost for society) that are only available inside its walls, even though it is all done with public money.

The only positive bit of the story, if any, is that these datasets are nevertheless available on the web, and their terms of use are clearly stated, no matter how restrictive. It would be simply impossible to write a similar article about archaeological pottery or zooarchaeological finds.

Appendix: Using pollen data

Pollen data are usually presented in the form of synthetic charts where both stratigraphic data and quantitative pollen data are easily readable. Each “column” of the chart stands for a species or genus. You can create this kind of visualisation with free software tools.

The stratigraph package for R can be used for

plotting and analyzing paleontological and geological data distributed through time in stratigraphic cores or sections. Includes some miscellaneous functions for handling other kinds of palaeontological and paleoecological data.

See the chart below for an example of what they look like.

An example plot using the R stratigraph package

Copyright not applicable to geodata?

There has been some buzz on the Italian GFOSS mailing list, and also on many other forums discussing geodata copyright issues, after this exhaustive post by Jo Walsh.

The issue is quite simple: many open geodata advocates have been thinking for months that a good compromise for public institutions that own public geodata would be a Creative Commons-family license. First of all, let’s be clear about Creative Commons: just saying CC means nothing, because you should always specify which license you intend to use for your creation.

That last word, creation, introduces the major problem here. Geodata are mostly factual: there is no creativity in them. They just describe facts ‒ geofacts, if you like. People do not create data, they just put into digital shape something that already exists. Or, if you prefer: there is no creativity in geodata, as opposed to music, poetry or photography, the creative works that Creative Commons licenses were designed for.

This could mean that public geodata can only go into the public domain. I have no hope that European states will choose the public domain for our geodata, which have been paid for with our hard-earned taxes. GFOSS will try to convince administrators to follow this innovative path.

The geodata of the Italian Regions

OK. Now I want all visitors, readers, lovers and haters of my blog to go and take a look at this page. Get an idea. Learn that this thing exists and believe me, it is important. You will find it hard to think of anything, however trivial, that could not benefit enormously from a new course in access to geodata.

It is also interesting, by the way, to see how, despite the rush towards cyber-everything, such a large slice of technology is devoted to managing the real world of earth, stones and rubbish. You would not imagine it.

Think about it. And note how the political right and left are completely absent on this topic too. Not only on this one, of course.