Creating a custom workflow on Basepair

Modified on Fri, 7 Apr, 2023 at 12:57 PM

Basepair allows you to import and run a custom bioinformatics workflow in a few simple steps.

We recommend packaging the individual steps of the workflow into Docker containers for portability, version control, and reproducibility. The individual steps (modules) are then defined in YAML text files and stitched together to create the workflow. YAML is supported by all major programming languages, is easy to read and edit, and lets you keep your modules and workflows in version control. Once the workflow is created, it needs to be tested to confirm that it runs properly and produces the expected results.

Alternatively, you can provide your custom pipeline to Basepair as-is, and we can Dockerize and build the pipeline for you.

Feel free to check out our pages on creating and storing Docker images for more information.

Define the modules

A "module" refers to a single tool or a group of tools that run as a single unit.

For example, a module may be:
- a single command using Picard MarkDuplicates to remove duplicate reads from a BAM file

- or -

- a call to BWA to align the data as well as a SAMtools call to sort and index the alignments

In the second example above, the BWA and SAMtools calls are combined in a single module because many tools require sorted and indexed BAM files for downstream analysis. This way, the module performs the complete logical task of alignment, even though it calls two different software tools.

Modules are often used across multiple workflows, so it is important that command options and input files are not hard-coded, but instead exposed as parameters in the docker run call.
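
As a minimal sketch of this idea, the Python snippet below assembles a docker run call from caller-supplied parameters instead of hard-coding them. The function name, image name, and file paths are hypothetical and do not represent Basepair's module format; the sketch also assumes the image's entrypoint is the Picard command line. Only `--rm` and `-v` are standard Docker CLI flags.

```python
import shlex


def build_docker_command(image, tool_args, host_dir, container_dir="/data"):
    """Assemble a `docker run` argument list from caller-supplied parameters.

    Nothing about the tool, its options, or its files is hard-coded here.
    """
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:{container_dir}",  # mount the data directory
        image,
        *tool_args,
    ]


# Hypothetical image and file names, for illustration only.
cmd = build_docker_command(
    image="example/picard:latest",
    tool_args=[
        "MarkDuplicates",
        "I=/data/sample.bam",
        "O=/data/sample.dedup.bam",
        "M=/data/sample.metrics.txt",
        "REMOVE_DUPLICATES=true",
    ],
    host_dir="/home/user/run1",
)
print(shlex.join(cmd))
```

Because the image, options, and files are all arguments, the same function can serve any workflow that needs this step with different data or settings.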

The module YAML file defines 3 primary pieces of information:

- Path to executables and command structure
- Inputs
- Outputs
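
To illustrate how these three pieces fit together, here is a rough Python sketch that models a module as a command template plus declared inputs and outputs. The field names (`command`, `inputs`, `outputs`) and the helper functions are assumptions made for this example, not Basepair's actual YAML schema.

```python
import string

# Hypothetical module definition: field names are illustrative only.
align_module = {
    "command": (
        "bwa mem -t {threads} {reference} {fastq} "
        "| samtools sort -o {bam} - && samtools index {bam}"
    ),
    "inputs": ["threads", "reference", "fastq"],
    "outputs": ["bam"],
}


def placeholders(template):
    """Return the set of {name} placeholders used in a command template."""
    return {name for _, name, _, _ in string.Formatter().parse(template) if name}


def render(module, values):
    """Fill the command template, rejecting undeclared or missing parameters."""
    declared = set(module["inputs"]) | set(module["outputs"])
    undeclared = placeholders(module["command"]) - declared
    if undeclared:
        raise ValueError(f"placeholders not declared: {sorted(undeclared)}")
    missing = declared - set(values)
    if missing:
        raise ValueError(f"values not provided: {sorted(missing)}")
    return module["command"].format(**values)


command = render(
    align_module,
    {"threads": 4, "reference": "hg38.fa", "fastq": "reads.fq", "bam": "reads.bam"},
)
print(command)
```

Checking the template against the declared inputs and outputs catches a mistyped parameter name before any container is started.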

Define the workflow

A Basepair workflow is a collection of modules that run together.

A workflow may consist of a single module, or of 20 or more modules that perform QC, alignment, figure generation, and so on. The workflow is defined as a directed acyclic graph (DAG), where each module is a node in the graph. This structure allows a module to have multiple parent modules and gives further flexibility in designing a workflow.

The workflow YAML file defines 3 primary pieces of information:

- Nodes: a collection of nodes, each representing a module with a unique node id. If a module is required multiple times in a workflow, each instance is assigned a different node id. Default parameters can also be set for each module.
- Edges: a collection of edges, each representing the parent-child relationship between two modules. Starting with a root module, all other modules are connected in a way that defines how the workflow will run.
- Mappings: common global metadata, such as the genome assembly being used, is passed directly on to the modules.
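
The nodes, edges, and mappings described above can be sketched in Python as follows. The dictionary layout, node ids, and module names are hypothetical and only stand in for the workflow YAML; the run order comes from the standard library's `graphlib.TopologicalSorter` (Python 3.9+), which raises `CycleError` if the edges do not form a DAG. Letting node parameters override the global mappings is also an assumption of this sketch.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow definition: layout and names are illustrative only.
workflow = {
    "nodes": {
        "fastqc_1": {"module": "fastqc", "params": {}},
        "align_1": {"module": "bwa_align", "params": {"threads": 8}},
        "dedup_1": {"module": "mark_duplicates", "params": {}},
        "report_1": {"module": "report", "params": {}},
    },
    # Each edge is (parent node id, child node id).
    "edges": [
        ("fastqc_1", "align_1"),
        ("align_1", "dedup_1"),
        ("fastqc_1", "report_1"),
        ("dedup_1", "report_1"),  # report_1 has two parents
    ],
    "mappings": {"genome": "hg38"},
}


def run_order(wf):
    """Return node ids in an order where every parent precedes its children."""
    parents = {node_id: set() for node_id in wf["nodes"]}
    for parent, child in wf["edges"]:
        if parent not in parents or child not in parents:
            raise KeyError(f"edge refers to unknown node: {parent} -> {child}")
        parents[child].add(parent)
    return list(TopologicalSorter(parents).static_order())


def node_params(wf, node_id):
    """Merge global mappings with a node's own default parameters."""
    return {**wf["mappings"], **wf["nodes"][node_id]["params"]}


order = run_order(workflow)
print(order)
print(node_params(workflow, "align_1"))
```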

Testing

Basepair provides testing frameworks for two purposes:

- Technical testing: testing the structure of modules and workflows to check that parameters are passed on correctly, commands are formed as expected, data flows as expected, etc.
- Scientific testing: testing the output files of a workflow against validated results to ensure that the data is being analyzed correctly. This testing is important to establish the scientific accuracy of the workflows.
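
The two kinds of testing can be illustrated with a small Python sketch. It does not use Basepair's testing frameworks; the helper names and file contents are hypothetical. The scientific check here assumes that a correct output is byte-identical to the validated result, which will not hold for outputs that embed timestamps or depend on non-deterministic steps.

```python
import hashlib
import tempfile
from pathlib import Path


def render_command(template, params):
    """Fill a command template with parameter values."""
    return template.format(**params)


def sha256_of(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def matches_validated(output_path, validated_path):
    """True if the workflow output is byte-identical to the validated result."""
    return sha256_of(output_path) == sha256_of(validated_path)


# Technical test: is the command formed as expected?
command = render_command("samtools index {bam}", {"bam": "sample.sorted.bam"})
technical_ok = command == "samtools index sample.sorted.bam"

# Scientific test: does the output match a validated result?
with tempfile.TemporaryDirectory() as tmp:
    validated = Path(tmp) / "validated_counts.tsv"
    output = Path(tmp) / "output_counts.tsv"
    validated.write_text("gene\tcount\nTP53\t120\n")
    output.write_text("gene\tcount\nTP53\t120\n")
    scientific_ok = matches_validated(output, validated)

    # A changed value in the output must be detected as a mismatch.
    output.write_text("gene\tcount\nTP53\t119\n")
    mismatch_detected = not matches_validated(output, validated)

print(technical_ok, scientific_ok, mismatch_detected)
```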