Data documentation, organization, and metadata

Metadata is data that describes other data, and it is necessary for data to be reused. From a researcher's perspective, there are two main considerations:

  • A description of the research data: how it was created, what it means, the software needed to use it, and so on.
  • Basic bibliographic information that is needed to retrieve the research data and make citations, including information about the creator, license, relevant dates, title, year of publication, repository, and identifier.
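As a rough illustration of the second point, the bibliographic fields can be kept in a small structured record. This is only a sketch: the class name `DatasetRecord`, its field names, and all example values (including the placeholder DOI) are assumptions made for illustration, not the schema of any particular repository.

```python
from dataclasses import dataclass, asdict
from typing import List
import json


@dataclass
class DatasetRecord:
    """Basic bibliographic metadata for one dataset (illustrative field names)."""
    creators: List[str]
    title: str
    publication_year: int
    repository: str
    identifier: str  # persistent identifier, e.g. a DOI
    license: str

    def citation(self) -> str:
        """Format a simple 'Creators (Year). Title. Repository. Identifier' citation."""
        names = "; ".join(self.creators)
        return (f"{names} ({self.publication_year}). {self.title}. "
                f"{self.repository}. {self.identifier}")


# Hypothetical example values; the DOI is a placeholder, not a real dataset.
record = DatasetRecord(
    creators=["Example, Researcher"],
    title="Example measurement dataset",
    publication_year=2020,
    repository="Example Repository",
    identifier="doi:10.0000/example",
    license="CC-BY-4.0",
)

print(record.citation())
print(json.dumps(asdict(record), indent=2))
```

In practice the repository decides which fields are required; a record like this is mainly useful for keeping the information together until the data is deposited.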

Basic description of data

One of the most common reasons data becomes unusable is that its contents, collection parameters, fields, and so on are forgotten. You should therefore always record this information in whatever way you can. The simplest way is an unstructured README file.
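A README can be written by hand, but a small helper makes it harder to forget a section. The sketch below is one possible approach; the function name `build_readme`, the section headings, and the example dataset are all hypothetical.

```python
from pathlib import Path
import tempfile


def build_readme(title, contents, collection, fields):
    """Return plain-text README content describing a dataset."""
    lines = [
        title, "=" * len(title), "",
        "Contents", contents, "",
        "How the data was collected", collection, "",
        "Fields",
    ]
    for name, description in fields.items():
        lines.append(f"  {name}: {description}")
    return "\n".join(lines) + "\n"


readme_text = build_readme(
    title="Example temperature measurements",
    contents="One CSV file with hourly readings.",
    collection="Hypothetical sensor, sampled once per hour.",
    fields={
        "timestamp": "ISO 8601 time, UTC",
        "temp_c": "temperature in degrees Celsius",
    },
)

# A temporary directory stands in for the real data folder here.
data_dir = Path(tempfile.mkdtemp())
(data_dir / "README.txt").write_text(readme_text, encoding="utf-8")
print(readme_text)
```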

An unstructured data description is sufficient for most data, but large or high-value datasets, including their metadata, should be more structured. They are more likely to be reused, sometimes by automated tools without a person first interpreting the data by hand. Data may become more structured over time as its value becomes apparent.

Structuring data is very field-dependent, and the best advice is to search for the standards of your field and follow them. For general examples of structured data, see 5-star data. The most basic step is to use an open, machine-readable data format that is unlikely to become obsolete.
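As a minimal sketch of that basic step, the example below stores a table as CSV and describes its columns in a JSON file next to it. The file names, column names, and the layout of the description file are assumptions for illustration and do not follow any particular field standard.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical measurements and a matching description of each column.
rows = [
    {"timestamp": "2020-01-01T00:00:00Z", "temp_c": 1.5},
    {"timestamp": "2020-01-01T01:00:00Z", "temp_c": 1.2},
]
data_dictionary = {
    "timestamp": {"type": "datetime", "format": "ISO 8601, UTC"},
    "temp_c": {"type": "number", "unit": "degrees Celsius"},
}

out_dir = Path(tempfile.mkdtemp())  # stands in for the real data folder

# Write the data as CSV, an open and widely readable text format.
csv_path = out_dir / "measurements.csv"
with csv_path.open("w", newline="", encoding="utf-8") as handle:
    writer = csv.DictWriter(handle, fieldnames=list(data_dictionary))
    writer.writeheader()
    writer.writerows(rows)

# Write the column descriptions as JSON next to the data.
json_path = out_dir / "measurements.fields.json"
json_path.write_text(json.dumps(data_dictionary, indent=2), encoding="utf-8")

# Read the data back and check that every column is documented.
with csv_path.open(newline="", encoding="utf-8") as handle:
    reader = csv.DictReader(handle)
    loaded = list(reader)
    undocumented = set(reader.fieldnames) - set(data_dictionary)

print(len(loaded), "rows; undocumented columns:", undocumented)
```

Because both files are plain text, they can still be read without the software that produced them.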

Organizing data

If you don't have a clear organization strategy, your data will become unmaintainable and unfindable, even by you. Every type of project has different requirements, so it is difficult to generalize. However, try to be strict with your data folders: sort things rigorously and early, and give each project space a unique name. Have projects refer to each other rather than copying, pasting, or embedding files between them.

Within a project space, sort files by type or usage instead of allowing everything to become mixed.
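One way to enforce this from the start is to create the same folder skeleton for every new project. The layout and the project name below are hypothetical; adapt them to the conventions of your field.

```python
from pathlib import Path
import tempfile

# Hypothetical layout: one folder per kind of file.
SUBFOLDERS = ["data/raw", "data/processed", "code", "docs", "results"]


def create_project(root: Path, name: str) -> Path:
    """Create a uniquely named project space with a fixed folder layout."""
    project = root / name
    if project.exists():
        # Refuse to reuse a name so that each project space stays unique.
        raise FileExistsError(f"Project already exists: {project}")
    for sub in SUBFOLDERS:
        (project / sub).mkdir(parents=True)
    return project


root = Path(tempfile.mkdtemp())  # stands in for the real storage location
project = create_project(root, "2020-sensor-study")
folders = sorted(
    p.relative_to(project).as_posix() for p in project.rglob("*") if p.is_dir()
)
print(folders)
```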

Metadata for repositories and archiving

Each data repository requires certain fields for each dataset, just as journals require authors, affiliations, publication dates, and so on. Collectively, these fields are referred to as metadata. For the most part, researchers should be concerned with finding an appropriate repository and following its instructions for depositing data, which includes filling out the relevant metadata.

Sometimes a repository requires, or can make better use of, data that is structured in a particular way: the data itself carries certain metadata and is stored in certain formats. This is related to the structuring of data mentioned in the previous section. A repository of structured data allows large-scale, automated processing and data mining.

Repositories can also provide persistent identifiers for datasets, similar to the Digital Object Identifier (DOI) system used to identify research articles. In fact, DOIs are the most common persistent identifier for datasets, and allow citation of datasets in the same way as papers.

One can find evaluations of repositories, including the quality of their metadata, from the Aalto information and in the Registry of Research Data Repositories. One should strongly prefer repositories that provide persistent identifiers and standards suitable for the quality of the data.

How to describe datasets in the data repositories

Common repositories and their metadata standards

Within Aalto, data are cataloged in the research information management system, ACRIS. In the future, the metadata will be harvested into the national metadata catalogue, Etsin.

  • ACRIS uses a CERIF-compliant metadata model
  • FSD (Finnish Social Sciences Data Archive) uses the DDI metadata model. FSD staff can help add the metadata to materials that are stored and opened in the repository, for example interviews.
  • Zenodo accepts several metadata standards

Discipline-specific repositories may require special fields in order to catalog and archive data in a way that is useful to their discipline.

Metadata standards

In general, one does not select a metadata standard directly; the repository determines it.

Global persistent identifier

Persistent identifiers identify online resources, such as datasets, by providing a permanent "name" and link to them. Even if the data changes location on the Internet, the identifier remains the same and will still link to the data, regardless of the new location. Just like with articles, it provides a unique way to cite data. It can also provide versioning.

Two common types of persistent identifier are the DOI (Digital Object Identifier) and the Handle, for example:

  • doi:10.3998/3336451.0003.204
  • hdl:102.100.100/15
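Identifiers written in this form can be turned into web links through the public resolvers at doi.org and hdl.handle.net. The sketch below assumes the `doi:` and `hdl:` prefixes used in the examples above; the function name `resolver_url` is hypothetical.

```python
from urllib.parse import quote

# Public resolver services for the two identifier types.
RESOLVERS = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
}


def resolver_url(identifier: str) -> str:
    """Turn 'doi:...' or 'hdl:...' into a resolver URL."""
    scheme, sep, value = identifier.partition(":")
    scheme = scheme.lower()
    if not sep or scheme not in RESOLVERS or not value:
        raise ValueError(f"Unrecognised persistent identifier: {identifier!r}")
    # Percent-encode unusual characters but keep the '/' between prefix and suffix.
    return RESOLVERS[scheme] + quote(value, safe="/")


for pid in ["doi:10.3998/3336451.0003.204", "hdl:102.100.100/15"]:
    print(pid, "->", resolver_url(pid))
```

The resolver redirects to wherever the resource currently lives, which is why the link keeps working when the data moves.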

In general, one gets a persistent identifier for a dataset when the data is put into a repository that provides one. Only use repositories that provide persistent identifiers.

This service is provided by:

Research and Innovation Services
