Data Management

Good data management is becoming increasingly important because of the growing emphasis on reproducibility and replicability in science.

Making data available to reviewers and others allows them to verify your analyses and to re-use the data for other purposes, such as meta-analyses or projects that you had not thought of.

Most public funding agencies (e.g., NSF [USA], NERC [UK]) require deposition of data in a data archive, as do many reputable journals. There can also be good reasons for not making your data available, for example if they contain sensitive information about endangered species or personal information about interviewees.

Ensuring that your data are collected, managed, and stored well is therefore another skill you will have to learn as data scientists.

There are several steps in this process, and following them will also help you as you manage, analyse, and write up your data.

Collecting data

Quality control during data collection is important because it is often either impossible or prohibitively expensive, in terms of time and money, to collect the data again.

Key things to consider:

Processing data

Once you have your raw data, it will likely need to be processed into some form of digital spreadsheet or database.

First, keep all raw data in a read-only form so that they cannot be overwritten by accident. You may need to go back to them. Make digital copies of all paper field notes.
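One way to protect raw files is to remove write permission from them once they have been entered. The sketch below is only an illustration: the function name `make_read_only` is hypothetical, and it assumes the raw files sit together under one folder. On Windows, `chmod` only toggles the read-only flag.

```python
import stat
from pathlib import Path


def make_read_only(folder: Path) -> list[Path]:
    """Remove write permission from every file below `folder`.

    Hypothetical helper: returns the files it touched, in sorted order.
    """
    locked = []
    for path in sorted(folder.rglob("*")):
        if path.is_file():
            mode = path.stat().st_mode
            # Clear the owner, group, and other write bits; keep everything else.
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            locked.append(path)
    return locked
```

A call such as `make_read_only(Path("Flowering/raw_data"))` would then lock everything in the raw data folder. Files can still be read and copied, so processing scripts keep working as long as they write their output somewhere else.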

File format

Use non-proprietary formats as much as possible (.txt, .csv, …), to ensure that anyone (including you!) will be able to use and open them.

File and folder names

Names should be unique, descriptive, ordered, and consistent.

Avoid spaces, which can cause problems with software.

Dates should be ISO: YYYY-MM-DD, to ensure correct ordering.
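These naming rules can be enforced with a small helper. The following is a minimal sketch that assumes one possible convention (ISO date, underscore, lowercase words joined by hyphens, extension); the pattern and the names `make_file_name` and `is_well_formed` are illustrative choices, not a standard.

```python
import re
from datetime import date

# Assumed convention: YYYY-MM-DD, underscore, hyphen-joined lowercase words, extension.
NAME_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}_[a-z0-9]+(-[a-z0-9]+)*\.[a-z0-9]+$")


def make_file_name(day: date, description: str, extension: str) -> str:
    """Build a space-free name that starts with an ISO date."""
    # Keep only letters and digits, so spaces and punctuation are dropped.
    words = re.findall(r"[a-z0-9]+", description.lower())
    if not words:
        raise ValueError("description must contain at least one letter or digit")
    return f"{day.isoformat()}_{'-'.join(words)}.{extension.lstrip('.').lower()}"


def is_well_formed(name: str) -> bool:
    """Check an existing file name against the assumed convention."""
    return NAME_PATTERN.match(name) is not None
```

For example, `make_file_name(date(2015, 6, 1), "Flowering counts", ".CSV")` gives `2015-06-01_flowering-counts.csv`. Because the date comes first and is zero-padded, a plain alphabetical sort of such names is also a chronological sort.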

Folders and directory structure

Organise your files in a sensible, planned, and consistent way.

Draw a folder map to aid others.

─ Flowering
  |
  ├─ raw_data
  |   ├─ 2015
  |   |   └─ 2015-flowering.csv
  |   ├─ 2016
  |   └─ 2017
  ├─ processed_data
  |   ├─ script-to-process-raw-flowering-data.R
  |   └─ data_all_yrs.csv
  ├─ results
  └─ figures
      └─ plot1.png
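A folder layout like the one in the map can be created by a short script, so that every project starts from the same skeleton. This is a sketch only: the folder names are copied from the map, while `SUBFOLDERS` and `create_project` are hypothetical names.

```python
from pathlib import Path

# Folder names copied from the folder map; adjust the years to your project.
SUBFOLDERS = [
    "raw_data/2015",
    "raw_data/2016",
    "raw_data/2017",
    "processed_data",
    "results",
    "figures",
]


def create_project(root: Path) -> list[Path]:
    """Create the standard sub-folders under `root` and return them."""
    created = []
    for sub in SUBFOLDERS:
        folder = root / sub
        # exist_ok=True makes the script safe to run again on an existing project.
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created
```

Calling `create_project(Path("Flowering"))` builds the structure without touching any files that are already there.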

Consider version control

Version control can be handled either by software such as Git, or by a simplified system that you implement yourself.


Documenting data

It is important to include as much information about the data as possible, so that they can be understood and interpreted correctly in the long term.

Information may include: project aim, objective, and hypotheses; personnel; sponsors and funders; methods and instruments; standards used; software; known issues and limitations; intellectual property.

This kind of information is frequently referred to as metadata, and is required by all data archives.
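Metadata are easier to keep with the data if they are stored in a plain-text file next to the data file. The sketch below writes a JSON "sidecar" file and refuses to do so when a field is empty. The field names in `REQUIRED_FIELDS`, the `.metadata.json` suffix, and the function `write_metadata` are all assumptions made for illustration; a real archive will specify its own metadata schema.

```python
import json
from pathlib import Path

# A subset of the information listed in the text; the key names are an assumption.
REQUIRED_FIELDS = ["project_aim", "personnel", "funders", "methods", "software", "known_issues"]


def write_metadata(data_file: Path, metadata: dict) -> Path:
    """Write `metadata` as JSON next to `data_file` and return the new path."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError(f"metadata is missing: {', '.join(missing)}")
    # e.g. data_all_yrs.csv -> data_all_yrs.csv.metadata.json
    out = data_file.with_name(data_file.name + ".metadata.json")
    out.write_text(json.dumps(metadata, indent=2, ensure_ascii=False), encoding="utf-8")
    return out
```

JSON is used here because it is an open, text-based format, in keeping with the advice on file formats.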

Storing data

Data should be stored in a way that ensures they can be found and used again.

This means using open and non-proprietary formats, keeping multiple redundant back-ups (online, external devices, hard copies), and making it clear who is responsible for maintaining them.
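A back-up is only useful if it still matches the original. One common way to check this is to compare checksums of the original files and their copies. The following is a minimal sketch, assuming the back-up mirrors the folder structure of the original; `sha256_of` and `find_backup_problems` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_backup_problems(original_dir: Path, backup_dir: Path) -> list[str]:
    """List files that are missing from the back-up or differ from the original."""
    problems = []
    for original in sorted(original_dir.rglob("*")):
        if not original.is_file():
            continue
        relative = original.relative_to(original_dir)
        copy = backup_dir / relative
        if not copy.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif sha256_of(copy) != sha256_of(original):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

An empty list means every original file has an identical copy in the back-up. The check does not report extra files that exist only in the back-up.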

Sharing data

Following the practices above will help ensure that your data can easily be found by you and others, and easily shared, archived, and re-used.

Consider depositing data in an online archive.


References

Borer et al. 2009. Some simple guidelines for effective data management. Bulletin of the Ecological Society of America.

British Ecological Society. 2014. A Guide to Data Management in Ecology and Evolution.

Cook et al. 2001. Best practices for preparing ecological data sets to share and archive. Bulletin of the Ecological Society of America.

Cook et al. Updated. Best Practices for Preparing Environmental Data Sets to Share and Archive.

White et al. 2013. Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution 6, 1–10.

Wickham, H. Tidy Data. Journal of Statistical Software.

UK Data Archive. 2013. Managing and Sharing Data: Best practice for researchers.

Online Resources

Ecological Society of America data sharing

UK Data Archive