Research Guides: Research Data Management: Best Practices for Managing Research Data

Types of Research Data

Examples of Research Data include:

Documents (text, Word), spreadsheets, print outs
Laboratory notebooks, field notebooks, diaries
Questionnaires, transcripts, codebooks
Audio, video
Photographs, films, x-rays, negatives
Protein or genetic sequences
Spectra, spectroscope data
Test responses
Slides, artifacts, specimens, samples
Collection of digital objects acquired and generated during the process of research
Database contents (video, audio, text, images)
Models, algorithms, scripts, code, software
Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
Methodologies and workflows
Standard operating procedures and protocols
Computers and computer data storage devices
Synthetic compounds
Organisms, cell lines, viruses, cell products
Cloned coordinates, plants, animals

File Formats

File formats used to capture, store and deliver research data are an important consideration as they influence future file/program accessibility. It is important to plan for software obsolescence.

Formats more likely to be accessible in the future are:

Non-proprietary
Open, documented standard
Common usage by research community
Standard representation (ASCII, Unicode)
Unencrypted
Uncompressed

Examples of preferred file format choices include:

ODF or PDF/A, not Word
ASCII, not Excel
MPEG-4, not Quicktime
TIFF or JPEG2000, not GIF or JPG
XML or RDF, not RDBMS

Consider migrating your data into a format with the above characteristics, in addition to keeping a copy in the original software format. Note that not all repositories are able to migrate data files to newer file formats for preservation.
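As a rough illustration of the migration advice above, the sketch below flags files whose extension suggests a less preferred format and names the alternative from the list of preferred choices. The extension mapping and the function name `suggest_migrations` are assumptions made for this example, not part of any official recommendation.

```python
from pathlib import Path

# Hypothetical mapping from common extensions to the preferred formats
# listed above; the extensions themselves are assumptions for illustration.
PREFERRED_ALTERNATIVES = {
    ".doc": "ODF or PDF/A",
    ".docx": "ODF or PDF/A",
    ".xls": "ASCII (for example CSV)",
    ".xlsx": "ASCII (for example CSV)",
    ".mov": "MPEG-4",
    ".gif": "TIFF or JPEG2000",
    ".jpg": "TIFF or JPEG2000",
    ".jpeg": "TIFF or JPEG2000",
}


def suggest_migrations(file_names):
    """Return {file name: suggested format} for files worth migrating."""
    suggestions = {}
    for name in file_names:
        # Compare extensions case-insensitively (".XLSX" equals ".xlsx").
        suffix = Path(name).suffix.lower()
        if suffix in PREFERRED_ALTERNATIVES:
            suggestions[name] = PREFERRED_ALTERNATIVES[suffix]
    return suggestions


if __name__ == "__main__":
    files = ["report.docx", "plots.tiff", "counts.XLSX"]
    for name, target in suggest_migrations(files).items():
        print(f"{name}: consider also keeping a copy as {target}")
```

The sketch only reports candidates. The conversion itself, and the decision to keep the original alongside the migrated copy, stays with the researcher.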

For more, see the UK Data Service Recommended Formats or the Recommended Formats Statement of the Library of Congress.

OSU Research Data Services Libguide: Data Types and File Formats

File Names

File names should be unique, consistent and informative, and should be easy to sort and update. Before beginning your project, determine any file naming hierarchy and file naming conventions. File names should easily indicate which project they belong to. Elements that may be included in your file names are date, project name, type of data, location, and version. Other features to consider as you design your file naming plan are described in this Google Doc.

When organizing files, it's important to standardize file naming and directories so they're descriptive.

Best Practice

File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.

When choosing a file name, check for any database management limitations on file name length and use of special characters. Also, in general, lower-case names are less software and platform dependent. Avoid using spaces and special characters in file names, directory paths and field names. Automated processing, URLs and other systems often use spaces and special characters to parse text strings. Instead, consider using underscores ( _ ) or dashes ( - ) to separate meaningful parts of file names. Avoid $ % ^ & # | : and similar.

If versioning is desired, a date string within the file name is recommended to indicate the version.

Avoid using file names such as mydata.dat or 1998.dat.
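The guidance on spaces and special characters can be checked automatically. The following is a minimal sketch that treats letters, digits, underscores, dashes and dots as safe. That allowed set is one possible reading of the advice above, and the function names `file_name_problems` and `suggest_file_name` are hypothetical.

```python
import re

# Characters treated as safe here: letters, digits, underscore, dash, dot.
# This is an assumed reading of the guidance, not an official rule.
SAFE_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def file_name_problems(name):
    """Return a list of human-readable problems found in a file name."""
    problems = []
    if " " in name:
        problems.append("contains spaces")
    # Spaces are reported separately, so ignore them for this check.
    if not SAFE_NAME.fullmatch(name.replace(" ", "")):
        problems.append("contains special characters")
    return problems


def suggest_file_name(name):
    """Replace each run of spaces or special characters with one underscore."""
    return re.sub(r"[^A-Za-z0-9_.-]+", "_", name).strip("_")
```

For example, `suggest_file_name("my data$ 2001.csv")` returns `my_data_2001.csv`. A check like this catches unsafe characters only; it cannot tell whether a name is descriptive.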

Description Rationale

Clear, descriptive, and unique file names may be important when your data file is combined in a directory or FTP site with your own data files or with the data files of other investigators. File names that reflect the contents of the file and uniquely identify the data file enable precise search and discovery of particular files.

Examples

An example of a good data file name:

Sevilleta_LTER_NM_2001_NPP.csv

Sevilleta_LTER is the project name
NM is the state abbreviation
2001 is the calendar year
NPP represents Net Primary Productivity data
csv is the file type: ASCII comma-separated values

Source: DataONE
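A file name like the example above can be assembled from its elements so that every file in a project follows the same pattern. This is a sketch only: the function `build_file_name`, the element order, and the YYYYMMDD form of the optional version date are assumptions chosen for illustration.

```python
from datetime import date


def build_file_name(project, location, year, data_type, extension,
                    version_date=None):
    """Join name elements with underscores: project_location_year_type.ext."""
    parts = [project, location, str(year), data_type]
    if version_date is not None:
        # Optional date string used as the version, written as YYYYMMDD
        # (an assumed format; any consistent, sortable form would do).
        parts.append(version_date.strftime("%Y%m%d"))
    return "_".join(parts) + "." + extension


print(build_file_name("Sevilleta_LTER", "NM", 2001, "NPP", "csv"))
# Sevilleta_LTER_NM_2001_NPP.csv
```

Passing `version_date=date(2002, 3, 15)` would give `Sevilleta_LTER_NM_2001_NPP_20020315.csv`, which keeps versions in chronological order when sorted by name.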

Metadata Standards

Metadata (data about data) standards help to describe data in a consistent manner. Metadata can include descriptive information, provenance, quality and access/use of data. Here are a few standards that may be useful in describing your data for access and preservation.

Metadata: Distributed Active Archive Center (Oak Ridge National Laboratory)
Schema Library (Altova) of common industry and cross-industry standards
Seeing Standards: A Visualization of the Metadata Universe documents 105 standards used by the cultural heritage community
DDI: Metadata specification for the social and behavioral sciences
Astronomy Visualization Metadata
Content Standard for Digital Geospatial Metadata (more on metadata from the FGDC)
Darwin Core
Data Documentation Initiative
Dublin Core
Ecological Metadata Language

Data Dictionaries

USGS defines a Data Dictionary as a repository of structured data names that define and describe a resource.

See Best Practices for Data Dictionary Definitions and Usage (PDF) by Northwest Environmental Data Network

Source: USGS Data Dictionaries and Thesauri

How to Document Your Data

Documenting your data includes capturing sufficient metadata (descriptive information) about your data in order to make it discoverable, identifiable and usable in the future. Information you capture should include some, if not all, of the following elements:

Title of the dataset or research project
Creator names of individuals or institutions responsible for creating the data
Unique Identifier used to distinguish and identify the data
Dates: Project start and end dates, release date, any other date of importance during the length of the research study
Subject: Keywords or phrases describing the subject or content of the data
Funding Agency responsible for funding the research
Intellectual Property Rights associated with the data
Language(s) in which data is generated
Sources for data derived from other sources
Geographical location or coverage where data was collected
Methodology for data collection
Version of the dataset if updated

Using sustainable metadata standards is highly recommended to ensure that data are accessible in the future. Such standards are open (not proprietary), widely used, uncompressed, use standard encoding and contain enough information to analyze the context, content and structure of a record.
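Where no disciplinary standard has been chosen yet, the elements listed above can at least be captured in a plain, openly readable file stored next to the data. The sketch below writes them as UTF-8 JSON. The key names, the placeholder values, and the helper functions are all assumptions for illustration and do not follow any particular metadata standard.

```python
import json

# All values below are placeholders for illustration only.
metadata = {
    "title": "Example dataset title",
    "creator": ["Example Researcher"],
    "identifier": "example-identifier-0001",
    "dates": {"project_start": "2001-01-01", "project_end": "2001-12-31"},
    "subject": ["example keyword"],
    "funding_agency": "Example Funding Agency",
    "rights": "Example rights statement",
    "language": ["en"],
    "sources": [],
    "geographic_coverage": "Example location",
    "methodology": "Example description of how the data was collected",
    "version": "1",
}


def write_metadata(record, path):
    """Write the record as UTF-8 JSON next to the data file it describes."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(record, handle, indent=2, ensure_ascii=False)


def missing_elements(record, required=("title", "creator", "identifier")):
    """Return the required elements that are absent or empty."""
    return [key for key in required if not record.get(key)]
```

A record like this can later be mapped onto a formal standard, since the information itself has already been collected.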

Metadata schema sources

The UK's Digital Curation Centre maintains a comprehensive list of disciplinary metadata standards on their website.
Metadata: Distributed Active Archive Center (Oak Ridge National Laboratory)
Schema Library (Altova) of common industry and cross-industry standards
Seeing Standards: A Visualization of the Metadata Universe contains a list of 105 standards used by the cultural heritage community
DDI: Metadata specification for the social and behavioral sciences
Cornell University has an excellent guide to writing readme style metadata.

CalTech Library's File Naming Convention Worksheet
This worksheet helps researchers build their own file names.

Reproducibility of Data

When searching for data, whether locally on one's machine or in external repositories, one may use a variety of search terms. In addition, data are often housed in databases or clearinghouses where a query is required in order to access data. To reproduce the search and obtain similar, if not the same, results, it is necessary to document which terms and queries were used.

Note the location of the originating data set
Document which search terms were used
Document any additional parameters that were used, such as any controls that were used (pull-down boxes, radio buttons, text entry forms)
Document the query term that was used, where possible
Note the database version and/or date, so you can exclude any data sets added since the query was last performed
Note the name of the website and URL, if applicable

Description Rationale

In order to reproduce a data set or result set, it is necessary to document which terms were originally used to capture that data. By documenting this information while the search is being conducted, one greatly enhances the chance of being able to reproduce the results at a later date.

Source: DataONE
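One way to record a search while it is being conducted is to append each search to a log file. The sketch below writes one JSON record per line. The function `log_search`, the field names, and the JSON Lines layout are assumptions for illustration rather than a DataONE recommendation.

```python
import json
from datetime import datetime, timezone


def log_search(log_path, source_name, url, search_terms, query=None,
               parameters=None, database_version=None):
    """Append one search record to a JSON Lines file and return the record."""
    record = {
        # Timestamp of the search in UTC, ISO 8601 format.
        "searched_at": datetime.now(timezone.utc).isoformat(),
        "source_name": source_name,
        "url": url,
        "search_terms": list(search_terms),
        "query": query,
        # Controls such as pull-down boxes or radio buttons, as name: value.
        "parameters": parameters or {},
        "database_version": database_version,
    }
    with open(log_path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(record) + "\n")
    return record
```

Because each call appends a line instead of overwriting the file, the log keeps the full history of searches in the order they were run.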

Data Storage and Preservation

Storage

Storing data reliably is an important function of data management. There are several options for storing your data files:

Personal computers, external hard drives, departmental or university servers
Cloud storage services that may suit your data storage/backup needs include Amazon S3, Elephant Drive, Jungle Disk, Mozy, Carbonite
CDs or DVDs are not recommended because they fail frequently.

Security

Unencrypted storage is ideal so that you and others can easily read your data, but if encryption is required because the data is sensitive:
- Keep passwords and keys on paper (2 copies) and in a PGP (Pretty Good Privacy) encrypted digital file.
- Don’t rely on third-party encryption alone.
Uncompressed storage is also ideal, but if you need to compress data to conserve space, limit compression to your third backup copy.

To make sure your backup system is working properly, test your system periodically. Try to retrieve data files and make sure you can read them.
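One simple way to test a retrieved file is to compare its checksum with that of the original. The following is a minimal sketch using SHA-256 from Python's standard library; the function names `sha256_of` and `backup_matches` are hypothetical, and the paths are whatever your own backup setup uses.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large files do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_matches(original_path, backup_path):
    """Return True if the backup copy has the same contents as the original."""
    return sha256_of(original_path) == sha256_of(backup_path)
```

Matching checksums show that the copy is byte-for-byte identical. They do not show that the file can still be opened in its software, so opening a sample of files remains worthwhile.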

The UK Data Archive provides additional guidelines on data storage, back-up, and security.

Sensitive Research Data Management

Purdue University Libraries has a very useful guide for addressing issues with sharing research data involving human subjects or other sensitive data sets.
