As a winner of the Publishing better science through better data Naturejobs competition (you can find my winning entry here), I had the honour of participating in the Scientific Data conference held in London last week and of reporting on the event (watch their blog if you also want to read my second entry).
Before attending the conference, I had no idea how profoundly big data are changing the way we think about science. The invited speakers walked us through a series of topics: from how current technologies make data collection so easy that even people who are not directly connected to research (for example, people from the sewage companies) can collect very interesting data sets, to the issues related to storing such an enormous amount of information.
In particular, I was fascinated by a relatively new trend, which I believe will contribute to making science better: more reproducible and more collaborative. Apparently, it is now possible, and I would also say advisable, to publish your science by directly publishing your big data set.
In a publication from 2009, Penev and Chavan define a data paper as “a scholarly publication of a searchable metadata document describing a particular online accessible dataset, or a group of datasets, published in accordance with the standard academic practices”.
Today several data journals are available, such as Scientific Data (published by the Nature Publishing Group and also sponsored during the conference), Geoscience Data, Earth System Science Data and the Journal of Chemical and Engineering Data, to name a few (you can find a good list of further journals here).
In these journals, the raw data sets are prepared and modified so as to be available and comprehensible to anyone who would like to consult them (a strategy in line with an Open Science attitude).
But what are the advantages of publishing data?
First of all, making the data sets accessible to everyone allows researchers to reproduce and reuse the data of others. This is very powerful because researchers coming from different fields might have different perspectives and use your data for analyses you did not think of. At the same time, this also ensures that credit is always given to the people who collected and manage the data.
Another important consequence is that publication guarantees high quality of the data (which are peer-reviewed). In the long term, this might contribute to resolving the issue of scientific reproducibility (it has recently been reported that only between 10 and 30% of published scientific results are reproducible).
Data sets are curated in-house, which allows content to be standardized and uniformly discoverable. They are also linked to trusted specialised repositories (such as Figshare and Dryad), which help with data management and storage.
The advantages of publishing your data are therefore many. But if we really want to move in this new direction, we will need to deal with several challenges too, such as deciding which data are worth storing and sharing, how to handle open access to sensitive data (for example, data deriving from clinical trials) and how to cope with the increasing amount of information to be stored.
The first important step is to inform the community (for example, through conferences such as the one I attended) and to make people understand that big data are a reality that, sooner or later, everyone will have to engage with. Technological solutions to the aforementioned challenges are already out there; we just need to tune them to our personal needs. The advantages deriving from embracing the change are worth a little effort.