YAP is an extensible parallel framework, written in Python using OpenMPI libraries. It allows researchers to quickly build high-throughput big data pipelines without extensive knowledge of parallel programming. The user interacts with the framework through simple configuration files that capture analysis parameters and user-directed metadata, enabling reproducible research. Using YAP, analysts have achieved speedups of up to 36× in RNASeq workflow execution time.
YAP has been designed to be scalable and flexible. We implemented YAP with a focus on next-generation sequencing (NGS), to meet the large data processing challenges at NIBR. However, the framework can be adapted to other kinds of analysis. It can be executed on a local Linux workstation or on large HPC cluster systems. The framework achieves efficiency through data handling mechanisms such as parallel data distribution and the use of data streams and named pipes to avoid file I/O.
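To illustrate the named-pipe idea mentioned above, here is a minimal Python 3 sketch that passes records from a producer to a consumer through a FIFO, so no intermediate file is written to disk. It is not taken from the YAP source; the function name `stream_through_fifo` and the file name `reads.fifo` are made up for this example, and it assumes a POSIX system where `os.mkfifo` is available.

```python
import os
import tempfile
import threading


def stream_through_fifo(records):
    """Send records from a producer thread to a consumer via a named pipe."""
    workdir = tempfile.mkdtemp()
    fifo_path = os.path.join(workdir, "reads.fifo")  # hypothetical name
    os.mkfifo(fifo_path)  # POSIX only; creates the pipe, not a regular file

    def producer():
        # Opening a FIFO for writing blocks until a reader opens it.
        with open(fifo_path, "w") as fifo:
            for record in records:
                fifo.write(record + "\n")

    writer = threading.Thread(target=producer)
    writer.start()
    try:
        # The consumer reads lines as they arrive; data stays in memory.
        with open(fifo_path) as fifo:
            received = [line.rstrip("\n") for line in fifo]
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(workdir)
    return received


if __name__ == "__main__":
    print(stream_through_fifo(["@read1", "ACGT", "+", "IIII"]))
```

In a real pipeline the producer and consumer would typically be separate processes, for example a preprocessing tool writing to the pipe while an aligner reads from it.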
YAP compared to analysts' scripts
Columns: Number of cores; Analyst methods (hrs)
Workflows: RNASeq QC and Counts; Bacterial studies using Mothur; ChIPSeq Peak Calls
Data sets: 3 billion reads (150 samples); 190 million reads (6 samples); 400 million reads (5 samples)
File I/O: 1400 file reads, 1200 file writes; 200 file reads, 800 file writes; 1 MPI job
File-based reads reduced by 70%. File-based writes reduced by 30%.
Example Analysis Output
The following images are the results of various applications run within the YAP framework, such as FastQC, FastQScreen, PicardTools, etc.
YAP consolidates results from across the samples for the various packages, such as gene counts from HTSeq and normalized counts from Cufflinks.
HTSeq gene counts
Normalized counts from Cufflinks
YAP only runs on Linux systems!
The following dependencies must be installed in your environment first. Once they are installed, make sure they are added to your PATH.
Recent versions of gcc (gcc 4.8.x is well tested)
mpi4py - 1.3
pyPdf - 1.13
NumPy - 1.7.1
netsa-utils - 1.4.3
bedtools - 2.15.0
samtools - 0.1.18
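A quick way to confirm that the dependencies listed above are visible is a small check script. The following Python 3 sketch is an illustration only, not part of YAP: the function name `missing_dependencies` is hypothetical, it checks presence but not versions, and the Python import names (`mpi4py`, `numpy`, `pyPdf`) are assumed to match the packages listed. netsa-utils is left out because its import name is not given here.

```python
import importlib.util
import shutil

# Command-line tools from the dependency list that should be on PATH.
REQUIRED_TOOLS = ["gcc", "bedtools", "samtools"]
# Assumed Python import names for the listed packages.
REQUIRED_MODULES = ["mpi4py", "numpy", "pyPdf"]


def missing_dependencies(tools=REQUIRED_TOOLS, modules=REQUIRED_MODULES):
    """Return (missing_tools, missing_modules); versions are not checked."""
    missing_tools = [t for t in tools if shutil.which(t) is None]
    missing_modules = [
        m for m in modules if importlib.util.find_spec(m) is None
    ]
    return missing_tools, missing_modules


if __name__ == "__main__":
    tools, modules = missing_dependencies()
    print("Missing tools:", tools or "none")
    print("Missing Python modules:", modules or "none")
```

Run it with the same interpreter and shell environment you intend to use for YAP, since both PATH and the module search path can differ between environments.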
YAP provides a framework to run external tools on your data, so the tools used in a workflow drive the system requirements. It can be installed on a multicore Linux workstation with a decent amount of memory for small data sets, or on large cluster systems to scale to large data processing. The framework has been tested extensively for NGS data on clusters with a minimum system configuration of 8-12 cores and 24-48 GB of memory.
Download the YAP source from here.
Uncompress the source directory, for example as /home/packages/YAP.
Set YAP_HOME environment variable to the source directory.
$ export YAP_HOME=/home/packages/YAP
Add the bin directory to your PATH.
$ export PATH=$PATH:$YAP_HOME/bin/
Set the YAP_LOCAL_TEMPDIR environment variable to the directory to use for temporary computation. For best performance, point it to a location on the machine's local disk.
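The environment settings above can be sanity-checked before running a job. This is a minimal Python 3 sketch, not a YAP utility: the function name `check_yap_environment` is hypothetical, and the requirement that YAP_LOCAL_TEMPDIR be an existing writable directory is an assumption based on its stated purpose.

```python
import os


def check_yap_environment(environ=os.environ):
    """Return a list of problems found in the YAP-related environment."""
    problems = []

    yap_home = environ.get("YAP_HOME")
    if not yap_home:
        problems.append("YAP_HOME is not set")
    else:
        # Compare normalized paths so a trailing slash does not matter.
        bin_dir = os.path.normpath(os.path.join(yap_home, "bin"))
        entries = [
            os.path.normpath(p)
            for p in environ.get("PATH", "").split(os.pathsep)
            if p
        ]
        if bin_dir not in entries:
            problems.append("$YAP_HOME/bin is not on PATH")

    temp_dir = environ.get("YAP_LOCAL_TEMPDIR")
    if not temp_dir:
        problems.append("YAP_LOCAL_TEMPDIR is not set")
    elif not (os.path.isdir(temp_dir) and os.access(temp_dir, os.W_OK)):
        # Assumption: temporary computation needs a writable directory.
        problems.append("YAP_LOCAL_TEMPDIR is not a writable directory")

    return problems


if __name__ == "__main__":
    for problem in check_yap_environment() or ["Environment looks fine"]:
        print(problem)
```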
Once you have set up your environment, it is best to run a quick demo job to get a feel for running YAP. The following section is meant to be interactive, so you will need a Linux account and access to the cluster.
After downloading the project, please see the demo configuration files in yap/cfg.
There are three stages in YAP: Preprocess, Alignment and Postprocess. You have command-level control of these three stages in their namesake configuration files, and workflow-level control in the workflow_configuration.
Alignment configuration: bwa, bowtie, bowtie2, tophat or insert your own aligner
Postprocess configuration: post-alignment packages, generate counts or metrics
Preprocess configuration: pre-alignment packages to massage your sequence data
Workflow configuration: manage metadata, specify input files, paths and output directories
submitting your job to the cluster
The demo runs an RNASeq QC and counts workflow consisting of:
Preprocess: FastQC, FastQScreen
Alignment: Bowtie, both queryname and coordinate sorted
Postprocess: YAP junction and exon counts, Picard tools (PostQC), HTSeq (raw counts) and Cufflinks (normalized counts)
We run this workflow on 2 nodes on the UGE cluster.
Before running the yap_demo job, check that the configuration files are correct using the following commands.
$ cd <your_working_directory>
$ yap --check workflow_configuration.cfg
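If you prefer to drive the check from a script, for instance as a step before job submission, the same command can be wrapped in Python 3. This is a sketch under stated assumptions: the helper names `build_check_command` and `run_check` are hypothetical, only the `--check` invocation shown above is used, and how `yap` reports problems (exit code versus printed messages) is not documented here, so both are returned for inspection.

```python
import subprocess


def build_check_command(config_path="workflow_configuration.cfg"):
    """Return the argument list for the configuration check."""
    return ["yap", "--check", config_path]


def run_check(working_dir, config_path="workflow_configuration.cfg"):
    """Run the check from working_dir; return (exit code, combined output).

    Requires yap to be on PATH; otherwise FileNotFoundError is raised.
    """
    result = subprocess.run(
        build_check_command(config_path),
        cwd=working_dir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    return result.returncode, result.stdout
```

Calling `run_check("<your_working_directory>")` is equivalent to the two shell commands above; inspect the returned output rather than relying on the exit code alone.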