Research Data Management

This guide contains information about research data management and best practices for faculty, researchers, and graduate students.

Types of Research Data

Examples of Research Data include:

  • Documents (text, Word), spreadsheets, print outs
  • Laboratory notebooks, field notebooks, diaries
  • Questionnaires, transcripts, codebooks
  • Audio, video
  • Photographs, films, x-rays, negatives
  • Protein or genetic sequences
  • Spectra, spectroscope data
  • Test responses
  • Slides, artifacts, specimens, samples
  • Collection of digital objects acquired and generated during the process of research
  • Database contents (video, audio, text, images)
  • Models, algorithms, scripts, code, software
  • Contents of an application (input, output, logfiles for analysis software, simulation software, schemas)
  • Methodologies and workflows
  • Standard operating procedures and protocols
  • Computers and computer data storage devices
  • Synthetic compounds
  • Organisms, cell lines, viruses, cell products
  • Cloned coordinates, plants, animals

File Formats

File formats used to capture, store and deliver research data are an important consideration as they influence future file/program accessibility. It is important to plan for software obsolescence.

Formats more likely to be accessible in the future are:

  • Non-proprietary
  • Open, documented standard
  • Common usage by research community
  • Standard representation (ASCII, Unicode)
  • Unencrypted
  • Uncompressed

Examples of preferred file format choices include:

  • ODF or PDF/A, not Word
  • ASCII, not Excel
  • MPEG-4, not Quicktime
  • TIFF or JPEG2000, not GIF or JPG
  • XML or RDF, not RDBMS

Consider migrating your data into a format with the above characteristics, in addition to keeping a copy in the original software format. Note that not all repositories are able to migrate data files to newer file formats for preservation.

For more, see the UK Data Service Recommended Formats or the Recommended Formats Statement of the Library of Congress.

File Names

File names should be unique, consistent, informative, and easy to sort and update. Before beginning your project, decide on a file naming hierarchy and file naming conventions. File names should clearly indicate which project they belong to. Elements that may be included in your file names are date, project name, type of data, location, and version. Other features to consider as you design your file naming plan are described in this Google doc. When organizing files, standardize file names and directories so that they are descriptive.

Best Practice

File names should reflect the contents of the file and include enough information to uniquely identify the data file. File names may contain information such as project acronym, study title, location, investigator, year(s) of study, data type, version number, and file type.

When choosing a file name, check for any database management limitations on file name length and on the use of special characters. In general, lower-case names are less software and platform dependent. Avoid using spaces and special characters in file names, directory paths, and field names, because automated processing, URLs, and other systems often use spaces and special characters to parse text strings. Instead, use underscores ( _ ) or dashes ( - ) to separate meaningful parts of file names. Avoid $ % ^ & # | : and similar characters.

If versioning is desired, a date string within the file name is recommended to indicate the version.

Avoid using file names such as mydata.dat or 1998.dat.
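The naming guidance above can be checked automatically before files are shared. The following is a minimal sketch, not an official tool: the function name file_name_problems, the list of generic names, and the 255-character limit are all assumptions made for illustration, and the limit in particular should be replaced with whatever your own systems allow.

```python
import re

# Hypothetical list of stems considered too generic; extend it for your project.
GENERIC_STEMS = {"data", "mydata", "file", "untitled"}


def file_name_problems(name, max_length=255):
    """Return a list of naming problems; an empty list means none were found.

    max_length=255 is an assumed limit; check the limits of your own systems.
    """
    problems = []
    stem, dot, extension = name.rpartition(".")
    if not dot:  # no "." at all: treat the whole name as the stem
        stem, extension = name, ""
    if not stem or not extension:
        problems.append("missing a name or a file extension")
    if len(name) > max_length:
        problems.append(f"longer than {max_length} characters")
    # Anything other than letters, digits, underscore, or dash is flagged.
    bad = sorted(set(re.findall(r"[^A-Za-z0-9_-]", stem)))
    if bad:
        problems.append("characters to avoid: " + " ".join(repr(c) for c in bad))
    # Names made only of digits, or taken from the generic list, say too little.
    if stem.isdigit() or stem.lower() in GENERIC_STEMS:
        problems.append("not descriptive enough to identify the data")
    return problems


for candidate in ["Sevilleta_LTER_NM_2001_NPP.csv", "mydata.dat", "my data#2.csv"]:
    print(candidate, "->", file_name_problems(candidate) or "ok")
```

A check like this only catches mechanical problems. Whether a name actually reflects the contents of the file still needs human judgment.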

Description Rationale

Clear, descriptive, and unique file names may be important when your data file is combined in a directory or FTP site with your own data files or with the data files of other investigators. File names that reflect the contents of the file and uniquely identify the data file enable precise search and discovery of particular files.

Examples

An example of a good data file name:

Sevilleta_LTER_NM_2001_NPP.csv

  • Sevilleta_LTER is the project name
  • NM is the state abbreviation
  • 2001 is the calendar year
  • NPP represents Net Primary Productivity data
  • csv is the file type: ASCII comma-separated values
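A name like the one above can be assembled from its elements so that every file in a project follows the same pattern. This is a sketch under stated assumptions: build_file_name and its parameters are hypothetical, the element order simply mirrors the example, and the optional YYYYMMDD date suffix is one possible way to apply the date-string versioning recommended earlier.

```python
import re
from datetime import date


def build_file_name(project, location, year, data_type, extension,
                    version_date=None):
    """Join naming elements with underscores, replacing unsafe characters."""
    parts = [project, location, str(year), data_type]
    if version_date is not None:
        # Assumed convention: YYYYMMDD sorts chronologically in file listings.
        parts.append(version_date.strftime("%Y%m%d"))
    # Replace each run of spaces or special characters with one underscore.
    cleaned = [re.sub(r"[^A-Za-z0-9-]+", "_", part.strip()).strip("_")
               for part in parts]
    return "_".join(cleaned) + "." + extension.lower().lstrip(".")


print(build_file_name("Sevilleta LTER", "NM", 2001, "NPP", "csv"))
print(build_file_name("Sevilleta LTER", "NM", 2001, "NPP", "csv",
                      version_date=date(2002, 3, 15)))
```

The first call reproduces the example name; the second appends a version date (an invented one, used only for illustration).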

Source: DataONE

Metadata Standards

Metadata (data about data) standards help to describe data in a consistent manner. Metadata can include descriptive information, provenance, quality and access/use of data.  Here are a few standards that may be useful in describing your data for access and preservation.

Data Dictionaries

USGS defines a Data Dictionary as a repository of structured data names that define and describe a resource.

See Best Practices for Data Dictionary Definitions and Usage by Northwest Environmental Data Network

Source: USGS Data Dictionaries and Thesauri

How to Document Your Data

Documenting your data includes capturing sufficient metadata (descriptive information) about your data in order to make it discoverable, identifiable and usable in the future.  Information you capture should include some, if not all, of the following elements:

  • Title: Name of the dataset or research project
  • Creator: Names of individuals or institutions responsible for creating the data
  • Unique Identifier: An identifier that distinguishes the data from other data sets
  • Dates: Project start and end dates, release date, and any other date of importance during the research study
  • Subject: Keywords or phrases describing the subject or content of the data
  • Funding Agency: Agency responsible for funding the research
  • Intellectual Property Rights: Rights associated with the data
  • Language(s): Language(s) in which the data is generated
  • Sources: Origin of data derived from other sources
  • Geographical Location: Location or coverage where the data was collected
  • Methodology: Methods used for data collection
  • Version: Version of the dataset, if updated

Using sustainable metadata standards is highly recommended to ensure that data are accessible in the future. Such standards are open (not proprietary), widely used, uncompressed, use standard encoding, and contain enough information to analyze the context, content, and structure of a record.
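One lightweight way to capture the elements listed above is a plain-text metadata file stored next to the data. The sketch below is illustrative only and does not follow any particular metadata standard: the key names, the missing_elements helper, and every value in the example record are hypothetical placeholders.

```python
import json

# Hypothetical key names, one per documentation element listed above.
ELEMENTS = [
    "title", "creator", "identifier", "dates", "subject", "funding_agency",
    "rights", "language", "sources", "geographic_location", "methodology",
    "version",
]


def missing_elements(record):
    """Return the elements that are absent or empty in a metadata record."""
    return [key for key in ELEMENTS if not record.get(key)]


# Every value below is an invented placeholder, not real project data.
record = {
    "title": "Example productivity dataset",
    "creator": ["Example Researcher", "Example University"],
    "identifier": "example-dataset-0001",
    "dates": {"start": "2001-01-01", "end": "2001-12-31"},
    "subject": ["net primary productivity", "grassland"],
    "funding_agency": "Example Funding Agency",
    "rights": "Example license statement",
    "language": ["English"],
    "sources": ["Example source dataset"],
    "geographic_location": "Example field site",
    "methodology": "Example sampling protocol",
    "version": "1",
}

print("Missing elements:", missing_elements(record) or "none")
print(json.dumps(record, indent=2))  # plain text, readable without special software
```

If your discipline has an established metadata standard, map these elements onto that standard rather than relying on an ad hoc file like this one.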
 

Metadata schema sources

CalTech Library's File Naming Convention Worksheet
This worksheet helps researchers to build their own work file names

Reproducibility of Data

When searching for data, whether locally on one's machine or in external repositories, one may use a variety of search terms. In addition, data are often housed in databases or clearinghouses where a query is required in order to access data. To reproduce the search and obtain similar, if not identical, results, it is necessary to document which terms and queries were used.

  • Note the location of the originating data set
  • Document which search terms were used
  • Document any additional parameters that were used, such as any controls that were used (pull-down boxes, radio buttons, text entry forms)
  • Document the query term that was used, where possible
  • Note the database version and/or date, so you can exclude any data sets added since the query was last performed
  • Note the name of the website and URL, if applicable

Description Rationale

In order to reproduce a data set or result set, it is necessary to document which terms were originally used to capture that data. By documenting this information while the search is being conducted, one greatly enhances the chance of being able to reproduce the results at a later date.
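The items recommended above can be recorded in a structured form while the search is being conducted. The following is a minimal sketch: the SearchRecord class, its field names, and the example values (including the example.org URL) are hypothetical and only illustrate one way to keep such a log.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class SearchRecord:
    """One documented search, following the checklist above."""
    source_location: str      # where the originating data set lives
    search_terms: list        # terms typed into the search
    query: str                # the exact query string, where available
    database_version: str     # version and/or date of the database
    date_searched: str        # ISO date the search was run
    parameters: dict = field(default_factory=dict)  # pull-downs, radio buttons, etc.
    website_name: str = ""
    url: str = ""


def to_json_line(record):
    """Serialize a record as one line of JSON, suitable for appending to a log."""
    return json.dumps(asdict(record), sort_keys=True)


# All values are invented placeholders for illustration.
example = SearchRecord(
    source_location="Example Data Repository",
    search_terms=["net primary productivity", "New Mexico"],
    query='subject:"net primary productivity" AND state:NM',
    database_version="2024-01 release",
    date_searched="2024-02-01",
    parameters={"data type": "tabular", "year range": "2001-2001"},
    website_name="Example Data Portal",
    url="https://example.org/search",
)

print(to_json_line(example))
```

Appending one such line per search to a text file kept with the project gives a chronological record that can be reread without special software.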

Source: DataONE

Data Storage and Preservation

Storage

Storing data reliably is an important function of data management. There are several options for storing your data files:

  • Personal computers, external hard drives, departmental or university servers
  • Cloud storage services that may suit your data storage/backup needs include Amazon S3, Elephant Drive, Jungle Disk, Mozy, and Carbonite
  • CDs or DVDs are not recommended because they fail frequently.

Security

  • Unencrypted storage is ideal so that you and others can easily read your data, but if encryption is required because the data is sensitive:
    • Keep passwords and keys on paper (2 copies) and in a PGP (Pretty Good Privacy) encrypted digital file.
    • Don’t rely on third-party encryption alone.
  • Uncompressed storage is also ideal, but if you need to compress data to conserve space, limit compression to your third backup copy.

To make sure your backup system is working properly, test your system periodically. Try to retrieve data files and make sure you can read them.
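One way to test a backup is to restore it to a separate folder and compare checksums against the originals. The sketch below assumes you have already restored the files yourself; the function names sha256_of and verify_backup are hypothetical, and SHA-256 is simply one commonly used checksum.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_backup(original_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored copy."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue  # directories are covered through the files they contain
        relative = source.relative_to(original_dir)
        copy = restored_dir / relative
        if not copy.is_file() or sha256_of(source) != sha256_of(copy):
            problems.append(str(relative))
    return problems
```

Calling verify_backup with the path to your working data and the path to the restored copy returns an empty list when every file matches. Matching checksums show that the bytes were preserved; you should still open a sample of files to confirm that your software can read them.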

The UK Data Archive provides additional guidelines on data storage, back-up, and security.

Sensitive Research Data Management

Purdue University Libraries has a very useful guide for addressing issues with sharing research data involving human subjects or other sensitive data sets.