An unstructured data description is sufficient for most data, but large or high-value datasets should be more structured, including their metadata. This is because they are more likely to be reused, including automated reuse without a human manually interpreting the data. Data may become more structured over time as its value becomes apparent.
Structuring data is very field-dependent, and the best advice is to find the standards of your field and follow them. For general examples of structured data, see 5-star data. The most basic step is to use an open, machine-readable data format that is unlikely to ever become obsolete.
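As a concrete (hypothetical) illustration of that basic step, the sketch below writes tabular data to CSV, an open, plain-text format that essentially any tool can read; the column names and values are invented for the example.

```python
import csv

# Hypothetical measurement data; in practice this would come from an
# instrument or an analysis pipeline.
measurements = [
    {"sample_id": "S1", "temperature_c": 21.5, "ph": 7.2},
    {"sample_id": "S2", "temperature_c": 22.1, "ph": 6.9},
]

# Write to CSV: open, machine-readable, and readable by both humans
# and software for the foreseeable future, unlike proprietary formats.
with open("measurements.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["sample_id", "temperature_c", "ph"])
    writer.writeheader()
    writer.writerows(measurements)
```

The same data saved in a closed, binary format would depend on one vendor's software remaining available; the CSV file does not.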
If you don't have a clear organization strategy, your data will become unmaintainable and unfindable, even by you. Every type of project has different requirements, so it is difficult to generalize. However, be strict with your data folders: sort things rigorously and early, and give each project space a unique name. Link projects to each other by reference, rather than copying, pasting, or embedding.
Within a project space, sort files by type or usage instead of allowing everything to become mixed.
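One possible layout following these principles can be sketched as below; the project and folder names are purely illustrative, not a standard.

```python
from pathlib import Path

# A hypothetical project space with a unique, descriptive name.
project = Path("phase-transitions-2024")

# Sort files by type/usage into dedicated subfolders from the start.
for sub in ["raw-data", "processed-data", "code", "docs"]:
    (project / sub).mkdir(parents=True, exist_ok=True)

# A short top-level README records what belongs where.
(project / "README.txt").write_text(
    "raw-data: unmodified instrument output (treat as read-only)\n"
    "processed-data: derived datasets, reproducible from raw-data via code\n"
    "code: analysis scripts\n"
    "docs: notes and documentation\n"
)
```

Keeping raw data separate and read-only means every processed file can be regenerated, and nothing ends up mixed together.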
Metadata for repositories and archiving
Each data repository requires certain fields for each dataset, just like journals require authors, affiliations, publication dates, and so on. Collectively, this is referred to as metadata. For the most part, researchers should be concerned with finding an appropriate repository and following its instructions for depositing data, which includes filling in the relevant metadata.
Sometimes a repository requires, or can make better use of, data that is structured in a certain way: the data itself carries certain metadata and uses certain formats. This relates to the structuring of data mentioned in the previous section. A repository of structured data allows large-scale, automated processing and data mining.
Repositories can also provide persistent identifiers for datasets, similar to the Digital Object Identifier (DOI) system used to identify research articles. In fact, DOIs are the most common persistent identifier for datasets, and allow citation of datasets in the same way as papers.
Evaluations of repositories, including the quality of their metadata, can be found in Aalto's information pages and in the Registry of Research Data Repositories (re3data). Strongly prefer repositories that provide persistent identifiers and metadata standards appropriate to the data.
How to describe datasets in data repositories
Common repositories and their metadata standards
Within Aalto, data are cataloged in the research information management system, ACRIS, and in the future metadata will be harvested to the national metadata catalogue, Etsin.
- ACRIS uses a CERIF-compliant metadata model
- FSD (Finnish Social Sciences Data Archive) uses the DDI metadata model. FSD staff can help add metadata to materials that are deposited and opened in the repository, for example interview data.
- Zenodo accepts several metadata standards
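To make "metadata" concrete, here is a minimal record in the shape used by Zenodo's deposit API. The field names (`title`, `upload_type`, `creators`, and so on) follow Zenodo's documented API at the time of writing, but the dataset, names, and affiliation below are invented; verify the current field list against Zenodo's own documentation before depositing.

```python
import json

# Minimal Zenodo-style deposit metadata (hypothetical dataset).
metadata = {
    "metadata": {
        "title": "Example interview dataset",       # illustrative title
        "upload_type": "dataset",                   # Zenodo resource type
        "description": "Transcribed interviews collected for an example study.",
        "creators": [
            {"name": "Researcher, Example", "affiliation": "Example University"}
        ],
    }
}

# Serialized as JSON, this is the body one would send with a deposit request.
payload = json.dumps(metadata)
```

Discipline-specific repositories add their own required fields on top of a generic core like this one.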
Discipline-specific repositories may have special fields required in order to catalog and archive in a way that is useful to their discipline.
In general, one does not select a metadata standard directly; it is determined by the repository.
Global persistent identifier
Persistent identifiers identify online resources, such as datasets, by providing a permanent "name" and link to them. Even if the data changes location on the Internet, the identifier remains the same and will still link to the data, regardless of the new location. Just like with articles, it provides a unique way to cite data. It can also provide versioning.
Two common types of persistent identifiers are the DOI (Digital Object Identifier) and the Handle.
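Both kinds of identifier work the same way: the identifier is a permanent name, and a central resolver maps it to the data's current location. DOIs resolve at doi.org and Handles at hdl.handle.net; the sketch below builds the resolver URLs (the identifier values themselves are illustrative).

```python
def doi_url(doi: str) -> str:
    """Return the resolver URL for a DOI, e.g. for use in a data citation."""
    return f"https://doi.org/{doi}"

def handle_url(handle: str) -> str:
    """Return the resolver URL for a Handle."""
    return f"https://hdl.handle.net/{handle}"

# Even if the dataset moves to a new server, these URLs stay valid:
# the resolver is updated, not the identifier.
citation_link = doi_url("10.1234/example-dataset")  # hypothetical DOI
```

This is why a dataset with a DOI can be cited exactly like a paper: the link in the citation never goes stale.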
In general, one gets a persistent identifier for a dataset when the data is put into a repository that provides one. Only use repositories that provide persistent identifiers.