Effective use of Snakemake for workflow management on high performance compute clusters

Overview

Welcome! This is a learning resource on how to effectively use the workflow management tool Snakemake to make your life easier when processing large amounts of data on high performance compute clusters.

Goals

Hopefully, this resource will help you:

  1. Understand why workflow management tools are useful
  2. Create generalizable workflows with Snakemake
  3. Create reproducible workflows with Snakemake
  4. Create Snakemake workflows that run efficiently on compute clusters (parallelized job submission and resource management)

Getting started

  1. If you’ve never used Snakemake before, or if you don’t understand what Snakemake is or why you’d use it, start in the Snakemake Basics section, with About Snakemake and Snakemake Essentials.
  2. If you’ve used Snakemake before but don’t know how to specify params and resource requirements, how to customize conda environments, or how to integrate config files with Snakemake, look toward the More Snakemake tab, with Elaborating Rules and Config Files.
  3. If you have the Snakemake fundamentals down, check out Cluster-friendly Snakemake under the Cluster Integration tab.

Applications to shotgun metagenomics

Some examples focus on high-throughput sequencing data, such as metagenomics, since this is what I (and the original audience of this resource) work with on a daily basis. However, the contents and principles still apply to other data types.

A primer on shotgun metagenomic data:

  • The Fastq format is how sequencing data is stored. It’s a (sometimes compressed) text file containing each read’s sequence along with a per-base quality string (see the sketch after this list).
  • If any of these concepts are unfamiliar to you, don’t worry: as noted above, the Snakemake principles covered here apply to other data types all the same.
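
To make the format concrete, here is a minimal sketch in Python of what a single FASTQ record looks like and how you might peek at one inside a compressed file. The read name, sequence, and file path are made up for illustration, not taken from this resource.

```python
import gzip

# A FASTQ record is four lines: an @-prefixed read ID, the sequence,
# a '+' separator line, and a quality string with one character per base.
# This example record is invented purely for illustration.
example_record = (
    "@read_001\n"
    "ACGTACGTACGTACGT\n"
    "+\n"
    "IIIIIIIIIIIIIIII\n"
)

def peek_first_record(path):
    """Print the first four lines (i.e. one record) of a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        for _ in range(4):
            print(next(handle), end="")

# Hypothetical usage (the file name is an assumption):
# peek_first_record("sample_R1.fastq.gz")
```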

Applications to 16S rDNA data

Other examples may be based on the analysis of 16S rDNA sequencing data, which is commonly performed for microbiome profiling and is what the other contributor to this resource works with on a daily basis. Once again, don’t fret if you’ve never encountered 16S rDNA data; the basic principles of Snakemake apply in a variety of situations.

A brief background on 16S rDNA data:

  • Like shotgun metagenomics, the sequencing data is stored in Fastq format as a compressed file.
  • These fastq.gz files are then fed into microbiome profiling programs such as Qiime2 or mothur (a rough sketch of how such a step could look as a Snakemake rule follows below).
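
Since this resource is all about Snakemake, here is a rough, hypothetical sketch of what one such processing step could look like as a Snakemake rule. The directory layout, sample wildcard, and shell command are placeholder assumptions, not the actual pipeline used here; the point is simply that one rule with a wildcard generalizes across many fastq.gz files.

```
# Minimal Snakefile sketch; paths and the command are placeholders.
rule summarize_fastq:
    input:
        "data/{sample}.fastq.gz"
    output:
        "results/{sample}_summary.txt"
    threads: 1
    shell:
        # Stand-in command: count the lines in the compressed file
        # (divide by four to get the number of reads). In a real workflow
        # this would be a Qiime2 or mothur step, or another profiling tool.
        "gzip -cd {input} | wc -l > {output}"
```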

Contributing to this resource

If you have questions or suggestions, feel free to open an issue or pull request on the GitHub repo for this site :)