Data Release Pipeline

Data agreement (DAF)

A data agreement is a contract designed to ensure that correctly classified and properly formatted datasets of a given assay type (DNase-seq, TFBS, RNA-seq, etc.) are released by our laboratory and accepted, without error, by the data repository (the ENCODE DCC or another public data repository such as the SRA). In the initial phases of setting up a DAF, lab staff work closely with repository personnel to identify the complete scope of a data type and provide definitions for a controlled vocabulary. New terms (new cell types, treatments, protocols, etc.) are produced and registered on an ongoing basis.
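
For illustration, a registered controlled-vocabulary term can be thought of as a small structured record. The sketch below is hypothetical: the field names and values are assumptions for illustration, not the DCC's actual registration schema.

    # Hypothetical controlled-vocabulary entry for a newly registered
    # cell type; field names are illustrative, not the DCC's schema.
    cv_term = {
        "type": "Cell Line",
        "term": "K562",
        "description": "chronic myelogenous leukemia cell line",
        "protocol": "K562_growth_protocol.pdf",  # assumed attachment name
    }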

Internal data preparation

Selection criteria for data release are derived from specific grant requirements, the availability of high-quality pilot datasets, and overall capacity. Once a decision has been made to submit a dataset, additional sequencing is performed if required. When the read count meets the required threshold, the dataset is marked as “submittable” and identified as such in the LIMS.
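
A minimal sketch of that threshold check, assuming a hypothetical LIMS client object and illustrative per-assay read-count thresholds (the real values are set by the release requirements):

    # Hypothetical sketch: mark a dataset "submittable" once its read
    # count meets an assay-specific threshold. The LIMS client interface
    # and the threshold values are illustrative assumptions.
    READ_THRESHOLDS = {
        "DNase-seq": 20_000_000,
        "RNA-seq": 30_000_000,
    }

    def update_submittable(lims, dataset):
        threshold = READ_THRESHOLDS[dataset["assay_type"]]
        if dataset["read_count"] >= threshold:
            # Flag the dataset so it appears in the periodic lists of
            # "submittable" identifiers (hypothetical LIMS call).
            lims.set_flag(dataset["id"], "submittable")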

Periodically, lists of “submittable” identifiers are generated. The data release pipeline software uses this listing to query the LIMS via a RESTful API, producing a complete accounting of the available data files for each listed item (for many assay types this corresponds to sequencing-lane FASTQ files).
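
In outline, that LIMS query might look like the following Python sketch; the endpoint URL, identifiers, and response field names are assumptions, not the actual API:

    import requests

    LIMS_URL = "https://lims.example.org/api"  # hypothetical endpoint

    def files_for(identifier):
        """Return the data files (for many assay types, lane FASTQs)
        recorded in the LIMS for one listed identifier."""
        resp = requests.get(f"{LIMS_URL}/datasets/{identifier}/files")
        resp.raise_for_status()
        return resp.json()["files"]  # response field name is an assumption

    # Build the complete accounting of available files per listed item.
    submittable_ids = ["DS0001", "DS0002"]  # hypothetical identifiers
    accounting = {ident: files_for(ident) for ident in submittable_ids}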

If required by the data repository, lane data is re-aligned with the appropriate aligner to the specified genome and formatted for an enrichment-calling pipeline appropriate to the data type being submitted (RNA-seq, DNase-seq, DGF, ChIP-seq, etc.), such as the RNA-seq RPKM pipeline, the DNase-seq Hotspot pipeline, or the DGF footprint-calling pipelines.
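
This data-type-to-pipeline routing can be expressed as a simple dispatch table. The sketch below is illustrative: the caller names follow the text, but the aligner choices and function interfaces are assumptions.

    # Hypothetical dispatch table: assay type -> aligner and enrichment
    # caller. Tool choices here are assumptions, not our actual stack.
    PIPELINES = {
        "RNA-seq":   {"aligner": "tophat", "caller": "rpkm_pipeline"},
        "DNase-seq": {"aligner": "bwa",    "caller": "hotspot"},
        "DGF":       {"aligner": "bwa",    "caller": "footprint_caller"},
        "ChIP-seq":  {"aligner": "bwa",    "caller": "peak_caller"},
    }

    def align(aligner, fastq, genome):
        # Placeholder: a real pipeline would shell out to the named aligner.
        print(f"aligning {fastq} to {genome} with {aligner}")
        return fastq.replace(".fastq", ".bam")

    def process_lane(assay_type, fastq, genome):
        steps = PIPELINES[assay_type]
        bam = align(steps["aligner"], fastq, genome)
        # The aligned reads are then formatted for the enrichment-calling
        # pipeline named in the dispatch table.
        return bam, steps["caller"]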

After processing, data is loaded into an internal mirror of the UCSC genome browser. This is treated as a “pseudo-release” or “internal release” and gives us an opportunity to ensure that all metadata is current. Additionally, data formatting is validated prior to submission by passing data files through local copies of the ENCODE DCC validation software. The resulting data and browser tracks are available internally to researchers and managers as a resource for visualization and analysis, as well as for data tracking and accounting.
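
Validation can be scripted around a local copy of the validator; the sketch below assumes the UCSC validateFiles utility is installed on the PATH, and the exact flags should be treated as assumptions.

    import subprocess

    def validate(path, file_type):
        """Run a local copy of the DCC validator on one data file before
        submission. The tool name and -type flag follow the UCSC
        validateFiles utility; the exact invocation is an assumption."""
        result = subprocess.run(
            ["validateFiles", f"-type={file_type}", path],
            capture_output=True,
            text=True,
        )
        return result.returncode == 0, result.stderr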

As the final step, the pipeline creates a package (a tar archive containing metadata, compressed data files, and a manifest document) in accordance with the repository's standard operating procedure. A data submission stub is created at the repository website (Figure 1) and the package is uploaded (via FTP, HTTP, Aspera, or other specified data transfer software).
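
The packaging step itself is straightforward with Python's standard library; the sketch below assumes hypothetical file names and a gzip-compressed archive.

    import tarfile

    def build_package(archive, metadata, manifest, data_files):
        """Bundle metadata, compressed data files, and a manifest into
        one tar archive, per the repository's packaging conventions."""
        with tarfile.open(archive, "w:gz") as tar:
            tar.add(metadata)
            tar.add(manifest)
            for path in data_files:
                tar.add(path)

    # Hypothetical usage:
    # build_package("submission.tar.gz", "metadata.txt", "manifest.txt",
    #               ["lane1.fastq.gz", "lane2.fastq.gz"])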

Figure 1. The data submission stub created at the repository website.

Public release

For ENCODE, public release of data is preceded by a formal list of checks that the DCC maintains to ensure data consistency and the correct visual display of the data on the UCSC browser. As part of this process, managers and scientists on our end review the data in the UCSC test browser and give approval once any issues have been addressed to satisfaction.