Data documentation

Data documentation means describing the content, collection methods, variables, and other aspects of the data that matter for the research. Describe your research data so that even outsiders can understand why and how the data was collected and how they can use it in their own research. The key is to describe the content and structure of the research data, not the publication or results of the research. Produce descriptive data throughout the study and record basic information about data collection and processing as you go. This makes it easier to write the published description when the data is published. Descriptive data can be recorded alongside the data or in a separate file, such as a README file.
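As an illustration of recording descriptive data in a separate file, the following minimal Python sketch renders a README from a set of fields. The field names and example values are hypothetical and chosen only for this example; adapt them to your discipline and project.

```python
from datetime import date

# Hypothetical README fields (an assumption of this sketch, not a standard).
README_FIELDS = [
    "Title",
    "Creators",
    "Purpose of collection",
    "Collection methods",
    "Variables and units",
    "Processing steps",
    "Last updated",
]


def render_readme(values: dict) -> str:
    """Return README text; fields without a value are marked TODO."""
    lines = [f"{field}: {values.get(field, 'TODO')}" for field in README_FIELDS]
    return "\n".join(lines) + "\n"


# Example values are invented for illustration.
readme_text = render_readme({
    "Title": "Example interview dataset",
    "Collection methods": "Semi-structured interviews",
    "Last updated": date(2024, 1, 15).isoformat(),
})
print(readme_text)
```

Fields that are still marked TODO show at a glance which parts of the description need to be completed before publication.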

Describing the data is important because well-described research data is easier to find and use. Descriptive data matters most at the publication stage, so always publish metadata about your data in a national or international data storage service. This is especially important when you cannot make the actual dataset openly available: publishing metadata gives even closed data visibility and promotes the implementation of the FAIR principles (findable, accessible, interoperable, reusable) for your data.

We recommend using a metadata standard suited to your discipline or research method. Using a standard is not mandatory, but if you publish your data in a data repository, the metadata standard that repository uses can help you produce your own description. The following resources can help:

Digital Curation Centre (DCC). A website specialising in digital curation and research data management. The website contains discipline-specific metadata standards.
Research Data Alliance (RDA). A movement promoting open sharing and reuse of data. The site contains metadata standards for different fields of science and tools for fulfilling them.
GO FAIR. The project promotes the implementation of the FAIR principles.
FAIRsharing. The project promotes research data management and curates metadata standards for describing research data.
Finnish Social Science Data Archive (FSD). Metadata standards listed on FSD's website.

In practice, the metadata you provide when publishing research data falls into three categories: content, access rights, and identifiers. Content metadata helps other users find the data and understand its purpose. Access rights tell the user what can be done with the research data and who owns it. Identifiers make research data citable, among other things.
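The three categories can be sketched as a simple structured record. The Python example below is only an illustration: the field names, the placeholder identifier, and the list of required fields are assumptions made for this sketch and do not follow the schema of any particular publishing service.

```python
import json

# Hypothetical metadata record grouped into the three categories.
# All names and values are illustrative placeholders.
metadata = {
    "content": {
        "title": "Example interview dataset",
        "description": "Transcripts of semi-structured interviews.",
        "keywords": ["interviews", "example"],
    },
    "access_rights": {
        "access_type": "restricted",
        "license": "CC-BY-4.0",
        "rights_holder": "Example University",
    },
    "identifiers": {
        "persistent_identifier": "doi:10.1234/example",  # placeholder, not a real DOI
    },
}

# Fields this sketch treats as required (an assumption, not a standard).
REQUIRED = {
    "content": ["title", "description"],
    "access_rights": ["access_type", "license", "rights_holder"],
    "identifiers": ["persistent_identifier"],
}


def missing_fields(record: dict) -> list:
    """List required fields that are absent or empty, as 'category.field'."""
    missing = []
    for category, fields in REQUIRED.items():
        section = record.get(category, {})
        for field in fields:
            if not section.get(field):
                missing.append(f"{category}.{field}")
    return missing


print(json.dumps(metadata, indent=2))
print(missing_fields(metadata))
```

A check like this can be run before publication to confirm that none of the three categories has been left empty.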

A suitable tool for describing your data at publication is Qvain, produced and maintained by CSC. Qvain supports the use of controlled vocabularies, provides a comprehensive list of licenses, and creates a permanent identifier for your metadata. On Qvain's website, you will find field-specific description instructions. In addition, Qvain allows you to publish your metadata directly in the Etsin service, which is a national research data search service. Descriptive data can also be published in other services, such as Zenodo.

File formats and folder structures

If possible, choose a file format that allows digital preservation. Favour file formats that are established in your research field and that have the following features:

Multi-platform and multi-app interoperability.
Available without fees or restrictions.
Supported by a variety of software (does not cause intellectual property rights issues).

Common, documented, and open file formats also support the implementation of the FAIR principles, as they improve interoperability and accessibility. The file formats you use during research may differ from those suited to digital preservation. During research, the choice of data format is affected by, for example, how and with which software you process and analyse your data. You may need to convert these working files to file formats used for digital preservation.

We recommend the following file formats for digital preservation:

Compression: TAR, GZIP, ZIP
Databases: XML, CSV
Geographical data: SHP, DBF, GeoTIFF, NetCDF
Videos: MOV, MPEG (MPEG-1/2, MPEG-4), AVI, MXF
Audio: WAV, AIFF, MP3, MXF
Numerical data: ASCII, DTA, POR, SAS, SAV
Images: TIFF, JPEG 2000, PDF/A, PNG, GIF, BMP
Tabular data: CSV
Texts: XML (ODT, DOCX), PDF/A, HTML, ASCII (RTF, TXT)

You can read more about file formats in FSD's Data Management Guidelines and UK Data Service's file format recommendations.
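To see which working files may still need converting before preservation, a short script can compare file extensions against the recommended formats. The Python sketch below is a rough aid under stated assumptions: the mapping from format names to file extensions is this example's own, PDF is left out because an extension cannot show whether a file is PDF/A, and the file paths are invented.

```python
from pathlib import Path

# Extensions assumed to correspond to the recommended formats listed above.
# The name-to-extension mapping is an assumption of this sketch.
PRESERVATION_EXTENSIONS = {
    ".tar", ".gz", ".zip",                              # compression
    ".xml", ".csv",                                     # databases, tabular data
    ".shp", ".dbf", ".tif", ".tiff", ".nc",             # geographical data
    ".mov", ".mpg", ".mp4", ".avi", ".mxf",             # video
    ".wav", ".aiff", ".mp3",                            # audio
    ".dta", ".por", ".sas", ".sav",                     # numerical data
    ".jp2", ".png", ".gif", ".bmp",                     # images
    ".odt", ".docx", ".html", ".rtf", ".txt",           # text
}


def needs_conversion(paths: list) -> list:
    """Return the paths whose extension is not in the assumed set."""
    return [
        p for p in paths
        if Path(p).suffix.lower() not in PRESERVATION_EXTENSIONS
    ]


# Invented example paths.
files = [
    "data/survey.csv",
    "images/site01.TIF",
    "raw/log.xlsx",
    "figures/plot.psd",
]
print(needs_conversion(files))
```

An extension alone does not prove that a file is valid in that format, so treat the result as a list of candidates to review rather than as a verdict.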

Folder structure and file naming are key parts of everyday documentation work during a project. A systematic folder structure and consistent file names make information easier to find. The key is to create and follow a consistent naming convention, agreed jointly within the research group. For better control and discoverability, name folders and files descriptively (for example, interviews, images, measurements, statistics). If there are many files and different data types, descriptive main folders and subfolders make the whole easier to manage. When planning and implementing naming, also aim for machine readability, so that software can parse the names and process the files further. However, folder and file names should not contain personal, confidential, classified, or sensitive information.
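A machine-readable naming convention can be checked automatically. The Python sketch below assumes one hypothetical convention, date_project_description_vNN.extension in lowercase with underscores between the parts; it is an example only, and your research group should agree on its own pattern.

```python
import re
from typing import Optional

# Hypothetical convention: YYYYMMDD_project_description_vNN.ext
# Parts use lowercase letters, digits, and hyphens; underscores separate parts.
NAME_PATTERN = re.compile(
    r"^(?P<date>\d{8})"
    r"_(?P<project>[a-z0-9-]+)"
    r"_(?P<description>[a-z0-9-]+)"
    r"_v(?P<version>\d{2})"
    r"\.(?P<ext>[a-z0-9]+)$"
)


def parse_name(filename: str) -> Optional[dict]:
    """Split a conforming file name into its parts, or return None."""
    match = NAME_PATTERN.match(filename)
    return match.groupdict() if match else None


# Invented example names.
good = parse_name("20240115_soilstudy_ph-measurements_v02.csv")
bad = parse_name("Final version (2).xlsx")
print(good)
print(bad)
```

Because every part sits in a fixed position, the same pattern can be used to sort files by date, group them by project, or find the latest version of a file.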