(Re)usable data

Sharing data with others only makes sense if they can be reused quickly and easily. A number of simple measures will help improve the usability of your data. These are:

  • make sure your data are ‘tidy’
  • document your data
  • use open data formats as much as possible

Make sure your data are ‘tidy’
Usable data are first and foremost data that can be easily altered and processed, i.e. data that can be quickly and easily:

  • imported into data management systems;
  • analyzed by analysis software;
  • combined with other data, and;
  • visualized.

For data in tables this will be the case when the structure of a table is ‘tidy’, i.e.:

  • each column (field) represents a single variable (parameter);
  • each row (record) in the table represents a single observation;
  • each cell contains a single value, and;
  • a table is provided for each type of information.

“The problem is that people like to view data in a totally different way than a computer likes to process it.” (Kien Leong)

‘Messy’ data are the opposite of tidy data. Various tools are available for making messy data tidy, e.g. OpenRefine. In R, the tidyr package serves this purpose.
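The kind of reshaping such tools perform can be sketched in plain Python. The example below, using invented illustrative data (the country names and years are not from the text), converts a ‘messy’ wide table, with one column per year, into a tidy one in which each row is a single observation and each column a single variable:

```python
import csv
import io

# Hypothetical 'messy' table: measurement years spread across columns
# instead of being values of a single 'year' variable.
messy = """country,2019,2020
NL,100,110
BE,80,85
"""

reader = csv.DictReader(io.StringIO(messy))
tidy = []
for row in reader:
    for year in ("2019", "2020"):
        # One observation per row: country, year, value.
        tidy.append({"country": row["country"],
                     "year": int(year),
                     "value": int(row[year])})

for record in tidy:
    print(record)
```

After reshaping, the table can be imported, analyzed, combined and visualized without further manual restructuring.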

Document your data
Documentation of (tabular) data begins with documenting the data table itself. Normally the top row of the table contains the names of the variables. These names should be clear and indicative of their content. For the values in the cells, standard names (e.g. derived from a taxonomy) or formats should be used. A simple example of the latter is the ISO date format: YYYY-MM-DD.
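Normalizing to one standard format is straightforward to automate. The sketch below (the input dates and the list of accepted formats are illustrative assumptions) converts dates written in assorted local conventions to the YYYY-MM-DD form:

```python
from datetime import datetime

# Hypothetical raw dates in assorted local formats.
raw_dates = ["03/04/2021", "2021-4-3", "3 Apr 2021"]

# Formats we assume may occur in the raw data.
formats = ["%d/%m/%Y", "%Y-%m-%d", "%d %b %Y"]

def to_iso(value):
    """Normalize a date string to the standard form YYYY-MM-DD."""
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value}")

iso_dates = [to_iso(d) for d in raw_dates]
print(iso_dates)  # all three become '2021-04-03'
```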

“Research outputs that are poorly documented are like canned goods with the label removed (…)” (Carly Strasser)

Dataset documentation should comprise at least the following elements:

  • the size of the dataset, i.e. the number of observations and variables;
  • a clear explanation of the variables, how they were measured and the measurement units (code book);
  • a dataset description including the scope of the dataset;
  • the provenance and history of the raw dataset: how were the data obtained or gathered, what research methodology was used, what apparatus or instruments were used, and which computations (cleaning, organizing, analyzing, producing final outputs) did the data undergo? These computations are nowadays often performed with software such as R, so part of the documentation consists of software scripts that should be saved along with the data.

A simple readme file will often suffice as a dataset’s documentation. Sometimes, however, this will not be enough and something more like a ‘data guide’ may prove necessary.

Last but not least: if the dataset is to be made available to the public, it is useful to assign it a usage license, specifying the conditions under which others may (re)use the data. So-called Creative Commons licenses were developed specifically for this purpose.

Use open data formats
Tidiness and dataset documentation both relate to the usability of the dataset itself. The use of open (non-proprietary) data formats relates to the ‘sustainability’ of data, meaning their usability in the long run. Can the dataset still be read and processed in twenty years’ time? Saving data in simple and open data formats (e.g. CSV for tabular data) helps guarantee this. Data archives focused on the long-term preservation of data therefore often use these formats.
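Writing data out in such an open format requires no special tooling. The example below (records and file name are illustrative) saves a small set of observations as plain CSV, which virtually any current or future software can read:

```python
import csv

# Hypothetical observations to be archived.
records = [
    {"country": "NL", "year": 2019, "value": 100},
    {"country": "BE", "year": 2019, "value": 80},
]

# Plain-text CSV with an explicit header row and UTF-8 encoding:
# a simple, open, tool-independent representation of the table.
with open("observations.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["country", "year", "value"])
    writer.writeheader()
    writer.writerows(records)
```

Because the result is human-readable text, the file remains inspectable even without any software that ‘knows’ the format.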

Use data formats that are compatible with storage in data archives, for instance the preferred formats published by the 4TU.Centre for Research Data.