The CWL alignment workflow included in this case study was designed by Data Biosphere. It adapts the alignment pipeline originally developed at the Abecasis Lab, University of Michigan. The workflow is part of the NIH Data Commons initiative and comprises four stages.
The first stage, "Pre-align", accepts a Compressed Alignment Map (CRAM) file (a compressed format for BAM files developed by the European Bioinformatics Institute (EBI)) and the human reference genome sequence as input. Using the SAMtools utilities view, sort and fixmate, it returns a list of FASTQ files that serve as input for the next stage.
The next stage, "Align", also accepts the human reference genome as input, along with the output files from "Pre-align", and uses BWA-MEM to generate aligned reads as BAM files. SAMBLASTER is used to mark duplicate reads, and SAMtools view converts the read files from SAM to BAM format.
The BAM files generated by "Align" are then sorted with SAMtools sort.
Finally, in the "Post-align" stage, these sorted alignment files are merged into a single sorted BAM file using SAMtools merge.
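The four stages above can be sketched as command pipelines. This is a minimal illustration, not the workflow's actual CWL definition: the file names, the function names and the specific flags are assumptions chosen for readability, and the use of "samtools fastq" for the FASTQ conversion is a stand-in, since the description above does not name the tool used for that part. The functions only build and render the commands; they do not run them.

```python
import shlex


def pre_align(cram, reference, prefix):
    """Stage 1 sketch: CRAM -> name-sorted, mate-fixed reads -> FASTQ."""
    return [
        ["samtools", "view", "-u", "-h", "-T", reference, cram],
        ["samtools", "sort", "-n", "-"],
        ["samtools", "fixmate", "-", "-"],
        # Assumption: the source does not name the FASTQ converter;
        # "samtools fastq" is used here purely as a stand-in.
        ["samtools", "fastq", "-1", f"{prefix}_1.fastq",
         "-2", f"{prefix}_2.fastq", "-"],
    ]


def align(reference, fastq_1, fastq_2, out_bam):
    """Stage 2 sketch: BWA-MEM -> SAMBLASTER (mark duplicates) -> BAM."""
    return [
        ["bwa", "mem", reference, fastq_1, fastq_2],
        ["samblaster"],
        ["samtools", "view", "-b", "-o", out_bam, "-"],
    ]


def sort_bam(in_bam, sorted_bam):
    """Stage 3 sketch: sort one aligned BAM file."""
    return [["samtools", "sort", "-o", sorted_bam, in_bam]]


def post_align(sorted_bams, merged_bam):
    """Stage 4 sketch: merge the sorted BAM files into one."""
    return [["samtools", "merge", merged_bam, *sorted_bams]]


def render(pipeline):
    """Join the commands of one stage into a shell-style pipe string."""
    return " | ".join(shlex.join(cmd) for cmd in pipeline)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    print(render(pre_align("sample.cram", "ref.fa", "sample")))
    print(render(align("ref.fa", "sample_1.fastq", "sample_2.fastq", "sample.bam")))
    print(render(sort_bam("sample.bam", "sample.sorted.bam")))
    print(render(post_align(["a.sorted.bam", "b.sorted.bam"], "merged.bam")))
```

Keeping each stage as a list of argument lists makes the data flow between tools explicit and avoids shell quoting problems if the commands are later passed to a process runner.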
This dataset folder is a CWLProv Research Object that captures the provenance of the Common Workflow Language execution. See https://w3id.org/cwl/prov/0.6.0 for details, or use https://pypi.org/project/cwlprov/ to explore it.
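Because a CWLProv Research Object is packaged as a BagIt bag, a first sanity check before exploring it is to verify the payload checksums. The sketch below uses only the Python standard library and is not part of the cwlprov tool; the function name verify_bag is hypothetical, and it assumes the folder contains manifest files named manifest-<algorithm>.txt with one "checksum path" entry per line.

```python
import hashlib
from pathlib import Path


def verify_bag(bag_dir):
    """Return (manifest, path, problem) tuples for payload files that fail."""
    bag = Path(bag_dir)
    problems = []
    for manifest in sorted(bag.glob("manifest-*.txt")):
        # "manifest-sha1.txt" -> "sha1"
        algorithm = manifest.stem.split("-", 1)[1]
        for line in manifest.read_text(encoding="utf-8").splitlines():
            if not line.strip():
                continue
            # Paths may contain spaces, so split only once.
            expected, rel_path = line.split(maxsplit=1)
            target = bag / rel_path
            if not target.is_file():
                problems.append((manifest.name, rel_path, "missing"))
                continue
            digest = hashlib.new(algorithm)
            with target.open("rb") as handle:
                # Read in 1 MiB chunks so large data files are not loaded whole.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            if digest.hexdigest() != expected.lower():
                problems.append((manifest.name, rel_path, "checksum mismatch"))
    return problems
```

An empty list means every file listed in the manifests is present and matches its recorded checksum; for example, verify_bag(".") checks the current folder.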