Englander Institute for Precision Medicine

Linking big biomedical datasets to modular analysis with Portable Encapsulated Projects.

Title: Linking big biomedical datasets to modular analysis with Portable Encapsulated Projects.
Publication Type: Journal Article
Year of Publication: 2021
Authors: Sheffield NC, Stolarczyk M, Reuter VP, Rendeiro AF
Journal: Gigascience
Volume: 10
Issue: 12
Date Published: 2021 Dec 06
ISSN: 2047-217X
Keywords: Computational Biology, Documentation, Metadata, Software
Abstract

BACKGROUND: Organizing and annotating biological sample data is critical in data-intensive bioinformatics. Unfortunately, metadata formats from a data provider are often incompatible with requirements of a processing tool. There is no broadly accepted standard to organize metadata across biological projects and bioinformatics tools, restricting the portability and reusability of both annotated datasets and analysis software.

RESULTS: To address this, we present the Portable Encapsulated Project (PEP) specification, a formal specification for biological sample metadata structure. The PEP specification accommodates typical features of data-intensive bioinformatics projects with many biological samples. In addition to standardization, the PEP specification provides descriptors and modifiers for project-level and sample-level metadata, which improve portability across both computing environments and data processing tools. PEPs include a schema validator framework, allowing formal definition of required metadata attributes for data analysis broadly. We have implemented packages for reading PEPs in both Python and R to provide a language-agnostic interface for organizing project metadata.
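The idea of sample-level metadata, project-level modifiers, and required-attribute validation can be illustrated with a minimal sketch. This is not the PEP specification's syntax or the API of the authors' Python and R packages. The table columns, the `append` and `required` keys, and both helper functions are hypothetical names chosen only for illustration, and the example uses the Python standard library alone.

```python
import csv
import io

# Hypothetical sample table: one row per biological sample.
# Column names and values are illustrative assumptions, not taken from the paper.
SAMPLE_TABLE = """sample_name,protocol,file
s1,ATAC-seq,s1.fastq.gz
s2,RNA-seq,
"""

# Hypothetical project-level settings: "append" adds a constant attribute to
# every sample, and "required" lists what a downstream tool expects to find.
PROJECT_CONFIG = {
    "append": {"organism": "human"},
    "required": ["sample_name", "protocol", "file", "organism"],
}


def load_samples(table_text, config):
    """Parse the sample table and apply the project-level 'append' modifier."""
    samples = []
    for row in csv.DictReader(io.StringIO(table_text)):
        sample = dict(row)
        sample.update(config.get("append", {}))
        samples.append(sample)
    return samples


def missing_attributes(samples, required):
    """Map each sample name to the required attributes that are absent or empty."""
    problems = {}
    for sample in samples:
        missing = [key for key in required if not sample.get(key)]
        if missing:
            problems[sample.get("sample_name", "<unnamed>")] = missing
    return problems


samples = load_samples(SAMPLE_TABLE, PROJECT_CONFIG)
problems = missing_attributes(samples, PROJECT_CONFIG["required"])
print(problems)
```

In this sketch the second sample has no file path, so the check reports it rather than passing incomplete metadata on to a processing tool. Keeping the required-attribute list next to the project settings, instead of inside each tool, is one way to read the abstract's point about portability across tools.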

CONCLUSIONS: The PEP specification is an important step toward unifying data annotation and processing tools in data-intensive biological research projects. Links to tools and documentation are available at http://pep.databio.org/.

DOI: 10.1093/gigascience/giab077
Alternate Journal: Gigascience
PubMed ID: 34890448
PubMed Central ID: PMC8673555
Grant List: R35 GM128636 / GM / NIGMS NIH HHS / United States
