Data documentation, folder organization, and describing datasets in repositories

If you don't have a clear organization strategy, your research data may become an unmaintainable mess. Documenting and organizing your data, and publishing and describing datasets in data repositories, help to ensure the usability and comprehensibility of your research data. A well-documented dataset is also easier to reuse and cite.

Data documentation

One of the most common reasons data become unusable is that the contents, collection parameters, fields, and similar details are forgotten. The simplest way to record them is a so-called README-style metadata file that accompanies the data files. If your field has discipline-specific documentation formats, use them as the basis for describing datasets. A README file may cover the following.

  • Who the creators are and what their affiliations are
  • What license has been chosen to allow reuse
  • How, when and by whom the data has been collected/created
  • How the data has been prepared for analysis
  • What kind of data manipulations have taken place
  • Which methods have been used to analyse the data, and how
  • What instruments and devices have been used
  • Which scientific publications are based on this data
  • File formats and file naming conventions
  • What software has been used to process and analyse the data
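
The checklist above can be turned into a reusable template. The following is a minimal Python sketch, not an Aalto-provided tool: the section headings simply mirror the list above, and the helper names `build_readme_skeleton` and `write_readme_skeleton`, as well as the output file name `README.txt`, are assumptions made for illustration.

```python
from pathlib import Path

# Section headings mirror the checklist above; adapt them to any
# discipline-specific documentation format used in your field.
README_SECTIONS = [
    "Creators and affiliations",
    "License",
    "Data collection (how, when, by whom)",
    "Preparation for analysis",
    "Data manipulations",
    "Analysis methods",
    "Instruments and devices",
    "Related publications",
    "File formats and naming conventions",
    "Software used",
]


def build_readme_skeleton(title: str) -> str:
    """Return README text with one empty section per checklist item."""
    lines = [title, "=" * len(title), ""]
    for section in README_SECTIONS:
        # Each section gets an underlined heading and a TODO placeholder.
        lines += [section, "-" * len(section), "TODO", ""]
    return "\n".join(lines)


def write_readme_skeleton(folder: Path, title: str) -> Path:
    """Write the skeleton to README.txt in `folder` and return its path.

    Hypothetical helper: an existing README.txt is left untouched.
    """
    readme = folder / "README.txt"
    if not readme.exists():
        readme.write_text(build_readme_skeleton(title), encoding="utf-8")
    return readme
```

Calling `write_readme_skeleton(Path("."), "My dataset")` would create the file next to the data files, where each TODO is then replaced with the actual information.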

File naming can also serve as data documentation. A simple file naming pattern is “AR_21022022_XRD_sample01”, where

  • AR = Researcher’s initials
  • 21022022 = The date of data collection (ddmmyyyy)
  • XRD = Abbreviation of the measurement type
  • sample01 = Sample number

Use, for example, README-style metadata files to describe the file naming system, including the abbreviations used. Agree within your research group on the data documentation, including the naming of data files.
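
A naming convention is easier to keep consistent when file names are built and checked by a script. The sketch below is one possible way to do this for the “AR_21022022_XRD_sample01” pattern. The function names `build_name` and `parse_name` are hypothetical, and the regular expression assumes upper-case initials, an alphanumeric measurement abbreviation, and a numeric sample number.

```python
import re
from datetime import date, datetime

# Matches names such as "AR_21022022_XRD_sample01":
# initials, date (ddmmyyyy), measurement type, sample number.
NAME_PATTERN = re.compile(
    r"^(?P<initials>[A-Z]+)_(?P<date>\d{8})_"
    r"(?P<measurement>[A-Za-z0-9]+)_sample(?P<sample>\d+)$"
)


def build_name(initials: str, collected: date, measurement: str, sample: int) -> str:
    """Compose a file name (without extension) from its four parts."""
    return f"{initials}_{collected:%d%m%Y}_{measurement}_sample{sample:02d}"


def parse_name(stem: str) -> dict:
    """Split a file name into its parts, or raise ValueError if it does not fit."""
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"File name does not follow the convention: {stem!r}")
    return {
        "initials": match["initials"],
        # strptime also rejects impossible dates such as 31022022.
        "date": datetime.strptime(match["date"], "%d%m%Y").date(),
        "measurement": match["measurement"],
        "sample": int(match["sample"]),
    }
```

For example, `build_name("AR", date(2022, 2, 21), "XRD", 1)` gives `"AR_21022022_XRD_sample01"`, and `parse_name` recovers the four parts from that string.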

Data folder organization

Aalto Science-IT provides guidance for organizing folder structures for research data in single-person and multiuser research projects, and a framework for a long-term master directory for the whole research group. According to the Aalto Science-IT instructions, one of the key aspects of data organization is to store the code separately from the data. See Aalto Science-IT’s Data organization pages for more helpful tips on data organization.

It is important to choose one place for storing research data. Use Aalto IT’s storage solutions that allow shared access, as this reduces the risk of different file versions being edited simultaneously in different locations. Consider who is granted access to the data. Be careful when handling and storing personal and sensitive data, and comply with IT’s privacy policy and guidelines. Also use the Aalto Version Control System and Aalto Notebook when applicable. Decide on the directory hierarchy and the storage solutions within your research group.

Describing datasets in repositories  

Data repositories have various descriptive metadata elements and may name them differently. This instruction covers the commonly included elements. It also addresses how to set the access and re-use conditions, which you typically have to define when depositing data in a repository.


Title

The title is the most important element for finding your data and for determining whether the dataset meets the user’s needs. Provide a unique title by focusing on the data you are sharing. Even if your data relates to an article, consider giving your data a distinctive title. An informative title covers the topic, the time period of the data, and specific information about place and geography. Below are some title examples.

  • Time series of microbial carbon release from soil as carbon dioxide under different nitrogen and phosphorus treatments with a low glucose concentration added as a carbon source in the Conwy catchment, North Wales, UK (2016) 
  • Finnish National Election Study 2011: Telephone Interviews among Finnish-speaking Voters 


Description

Be specific and quantify when possible to give enough information about your data. Use the checklist below to make sure you answer all relevant questions.

  • What are the data about?
  • How, when and by whom were the data collected or generated?
  • How were the data processed?
  • What methods were used?
  • What equipment and software were used?
  • Why were the data collected or generated?
  • What is the geographical location and the temporal coverage of the data?


Keywords

Keywords are typically not a mandatory field, but adding them is highly recommended. Choose the keywords with your intended audiences in mind. Use field-specific keywords, for example from specialized thesauri when available, as well as free words and phrases.
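
Before depositing, it can help to collect the title, the answers to the checklist questions, and the keywords in one place and check that nothing is left empty. The following is a minimal sketch under stated assumptions: the class `DatasetDescription` and its field names are hypothetical and do not follow any particular repository's metadata schema, and the keyword in the usage lines is a placeholder.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetDescription:
    """Hypothetical container for descriptive metadata elements.

    Field names are illustrative; repositories name these elements differently.
    """

    title: str
    description: str
    keywords: list[str] = field(default_factory=list)

    def missing_elements(self) -> list[str]:
        """Return the names of the elements that are still empty."""
        missing = []
        if not self.title.strip():
            missing.append("title")
        if not self.description.strip():
            missing.append("description")
        if not self.keywords:
            missing.append("keywords")
        return missing


# Usage: the title is one of the examples above; the keyword is a placeholder.
record = DatasetDescription(
    title="Finnish National Election Study 2011: "
    "Telephone Interviews among Finnish-speaking Voters",
    description="",
    keywords=["elections"],
)
print(record.missing_elements())  # ['description']
```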

Persistent identifier 

A persistent identifier is a unique string of characters given to the data by the repository. For example, the DOI (Digital Object Identifier) and the URN (Uniform Resource Name) are commonly used identifiers. Persistent identifiers identify online resources, such as datasets, by providing a permanent "name" and link to them. It is important to choose a repository that provides a persistent identifier along with the data deposit. 

Version control

One important thing to know, at least for general-purpose repositories such as Zenodo, is version control. You may need to update your dataset in the future, for example after noticing a mistake. You cannot delete a published dataset, but you can upload a new version. The dataset as a whole keeps the same DOI, while each version gets its own DOI.

Terms of access to the data

When you deposit your data in a repository, you need to define the conditions under which your data can be accessed. Typically, you first define how users can access the data and then grant the right to re-use the data by licensing it. Some data repositories offer more access options than others.

  • Open access to data: the data can be freely used, re-used and redistributed by anyone.
  • Embargoed access to data: the data is not available during a defined period of time.
  • Restricted access / access by request to data: the data is shared under specific conditions. Re-users have to request access, and the data creator grants or denies it.
  • Closed access to data: re-users have no access to the data.
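
The four access options can be read as a simple decision rule. The sketch below only illustrates that logic and is not how any repository implements it: the names `Access` and `is_accessible` are hypothetical, and it assumes that embargoed data becomes available the day after the embargo ends.

```python
from datetime import date
from enum import Enum
from typing import Optional


class Access(Enum):
    OPEN = "open"
    EMBARGOED = "embargoed"
    RESTRICTED = "restricted"
    CLOSED = "closed"


def is_accessible(
    access: Access,
    today: date,
    embargo_end: Optional[date] = None,
    request_approved: bool = False,
) -> bool:
    """Return True if a re-user can get the data under the given access option."""
    if access is Access.OPEN:
        return True
    if access is Access.EMBARGOED:
        if embargo_end is None:
            raise ValueError("Embargoed access needs an embargo end date")
        # Assumption: the data becomes available once the embargo has passed.
        return today > embargo_end
    if access is Access.RESTRICTED:
        # The data creator has to grant the request first.
        return request_approved
    # Closed access: re-users have no access to the data.
    return False
```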

Licensing the data for re-use

The users’ rights to use the deposited data are commonly defined by licenses. The license protects your rights as an author and ensures that users credit you by citing your data. It also reduces uncertainty by letting potential users know how your data can be re-used, combined, mined, or redistributed. The Creative Commons license CC BY 4.0 is compulsory for publicly funded research data published in a data repository, in compliance with the Open Data Directive, which is implemented in Finland by the national law Laki julkisin varoin tuotettujen tutkimusaineistojen uudelleenkäytöstä (the act on the re-use of research data produced with public funds).

Links to research data management instructions

Follow these links to navigate through research data management instructions.

Publishing and reusing open data

Overview and instructions to services for sharing and publishing research data

Research Data Management (RDM) and Open Science

Properly managed research data creates competitive edge and is an important part of a high-quality research process. Here you will find links to support, services and instructions for research data management.

This service is provided by:

Research and Innovation Services
