Snakemake Essentials

How do I get myself one of these snakemakers? (Installation)

Installation instructions can be found in the Snakemake documentation. In short, you want to install conda (for example via the Mambaforge distribution, which includes mamba), then run:

mamba create -c conda-forge -c bioconda -n snakemake_env snakemake

Alternatively, if you want to install Snakemake in an environment you already have set up, first activate that environment, then run:

pip install snakemake

Installing Snakemake into an existing environment like this can be especially useful if you will be using more than one conda environment in your pipeline.

Running Snakemake

The installation command above creates a conda environment with Snakemake available in it. Any time you want to use Snakemake, run conda activate snakemake_env to activate that environment; then you can run the snakemake command. Try running snakemake -h to view the help options and confirm the installation worked.
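
For example:

conda activate snakemake_env
snakemake -h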

Rules

Snakefiles are structured as rules: each step in your pipeline is a rule. For example, if you want to run the following three steps:

  1. Trim fastq files
  2. Run fastQC
  3. Aggregate fastQC reports using multiQC

and the main focus is the multiQC report file at the end, your snakefile should look like this:


rule all:
    input:
        "multiQC_report.html"

rule trim_fastq:
    ...

rule run_fastQC:
    ...

rule run_multiQC:
    ...

What is “rule all”?

At this point, you may be asking “Wait! You said I should have a rule for each step! What’s this rule all mess??”

rule all is how we specify the output files we want, and it’s located at the very top of your snakefile. You can specify target files (pipeline endpoints) using the input to rule all. In a scientific case, consider these files to be the input to the paper you’ll write from your analysis.

Snakemake tracks which rules need to be run in order to generate the inputs for other rules:

- rule run_multiQC’s outputs -> rule all’s inputs
- rule run_fastQC’s outputs -> rule run_multiQC’s inputs
- rule trim_fastq’s outputs -> rule run_fastQC’s inputs

Therefore, it knows that in order to produce the final files, it must run rule trim_fastq -> rule run_fastQC -> rule run_multiQC -> rule all:

flowchart LR
  C[run_multiQC] --> D(all)
  B[run_fastQC] --> C[run_multiQC]
  A[trim_fastq] --> B[run_fastQC]
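
You can watch Snakemake work this out for itself with a dry run, which prints the jobs it would execute (and why) without actually running anything:

snakemake -n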
  

Structure of rules

Rules provide crucial information to Snakemake, such as a step’s inputs, outputs, and the command to run. These are the bare bones of each rule; as we develop further, we will also start to include other aspects of each rule, such as the conda environment, resource requirements (time and memory), and other parameters.

rule do_things:
    input:
        "input_file"
    output:
        "output_file"
    shell:
        """
        # do things to input file to make output file
        """

Instead of shell:, you can also use run:, which runs Python code.

rule do_things:
    input:
        "input_file"
    output:
        "output_file"
    run:
        # python things ...
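
For example, here is a minimal sketch of a run: rule (a hypothetical copy_file rule) that copies its input to its output using Python; inside the block, input and output are available and can be indexed (input[0]) or accessed by name:

rule copy_file:
    input:
        "input_file"
    output:
        "output_file"
    run:
        # run: blocks execute plain Python, so we can import and use any module
        import shutil
        shutil.copy(input[0], output[0])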

Inputs and outputs

Snakemake traces the inputs and outputs of each rule to know which rules need to be run (and in what order). These are specified explicitly in each rule using input: and output:, each followed by an indented, comma-separated list with one entry per line. Entries in the list can also be named. Another great attribute of Snakemake is that these inputs and outputs can be referenced in the command it runs.

Examples

1 input, 1 output

rule rename_file:
    input:
        "old_name.txt"
    output:
        "new_name.txt"
    shell:
        """
        mv {input} {output}
        # same as running
        # mv old_name.txt new_name.txt
        """

2 named inputs, 2 named outputs

rule rename_multiple_files:
    input:
        file_1="first_file.txt",
        file_2="second_file.txt"
    output:
        file_1="file_1.txt",
        file_2="file_2.txt"
    shell:
        """
        mv {input.file_1} {output.file_1}
        mv {input.file_2} {output.file_2}
        """

Example with trimming reads

Simple

Imagine you want to trim one fastq file (10 base pairs from the beginning, 5 base pairs from the end) using SeqTK. This is what a very simple snakefile could look like:

rule all:
    input:
        "trimmed_reads.fq"

rule trim_fastq:
    input:
        "raw_reads.fq"
    output:
        "trimmed_reads.fq"
    shell:
        """
        seqtk trimfq -b 10 -e 5 {input} > {output}
        """

Chaining rules

Now, imagine the raw reads are compressed. We want to unzip them, trim the reads, and recompress them. You could do this in one step, but let’s break it up for the sake of learning. That would look like this:

rule all:
    input:
        "trimmed_reads.fq.gz"

rule unzip_fastq:
    input:
        "raw_reads.fq.gz"
    output:
        "raw_reads.fq"
    shell:
        """
        # -c writes to stdout, so the compressed input stays in place for Snakemake to track
        gunzip -c {input} > {output}
        """

rule trim_fastq:
    input:
        "raw_reads.fq"
    output:
        "trimmed_reads.fq"
    shell:
        """
        seqtk trimfq -b 10 -e 5 {input} > {output}
        """

rule zip_trimmed:
    input:
        "trimmed_reads.fq"
    output:
        "trimmed_reads.fq.gz"
    shell:
        """
        # -c writes to stdout, so the uncompressed input stays in place
        gzip -c {input} > {output}
        """

Which would create a workflow like this:

flowchart LR
  C[zip_trimmed] --> D(all)
  B[trim_fastq] --> C[zip_trimmed]
  A[unzip_fastq] --> B[trim_fastq]

(We will build on this example)

Wildcards

About wildcards

Wildcards are a big part of how we can expand and generalize a Snakemake pipeline. Consider a wildcard to be a list of values for which you want to run a Snakemake rule multiple times. This is a bit like a for loop:

for value in wildcards: 
    run rule...

Syntax

One wildcard

We can expand a rule using wildcards by expand()ing the input to a rule that requires this rule’s output. For example:

rule all:
    input:
        expand("file_{sample}.txt",
               sample=["1","2","3"])

rule create_file:
    output:
        "file_{sample}.txt"
    shell:
        """
        touch {output}
        """

In this case, Snakemake will create a workflow that looks like this:

flowchart LR
  A[file_1] --> F(all)
  B[file_2] --> F(all)
  C[file_3] --> F(all)

About the syntax

All you need for a wildcard is a list, which you pass to the expand() function. The first argument to expand() is your filepath string, with the wildcard in {curly brackets}. Then, you pass the list to use for expanding the filepath. In the previous example, sample=["1","2","3"] defines the values of the wildcard sample in file_{sample}.txt.
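
It may help to see what expand() actually returns: it is just a function that builds a list of strings. For the example above:

expand("file_{sample}.txt", sample=["1","2","3"])
# -> ["file_1.txt", "file_2.txt", "file_3.txt"]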

Multiple wildcards

This doesn’t have to be sample, and you can have multiple wildcards in an expand() function. Imagine you’re plating a 5-course meal for 3 people - your snakefile would look like this:

course_numbers=["1","2","3","4","5"]
people=["John", "Madi", "Casey"]

rule all:
    input:
        expand("course_{course_number}_for_{person}.plate",
               course_number=course_numbers,
               person=people)

rule make_food:
    output:
        "course_{course_number}_for_{person}.plate"
    shell:
        """
        touch {output}  # stand-in for actually making the food
        """

Wildcards to run snakemake rules for each sample

This is a very common use for wildcards! We often use wildcards when we want to run the same rules for each sample. This is one of the ways Snakemake starts to shine. If you have a metadata file with 500 sample IDs, you can read that list of sample IDs into Snakemake using Python/pandas, then run your Snakemake pipeline for all samples. This is what that looks like:

import pandas as pd
metadata = pd.read_csv("metadata.csv")
samples = metadata["SampleID"] # Samples are in a column named SampleID

rule all:
    input:
        expand("file_{sample}.txt",
               sample=samples)

rule create_file:
    output:
        "file_{sample}.txt"
    shell:
        """
        touch {output}
        """

Wildcards for parameter exploration

Another common use of wildcards is to explore the effects of using different parameters in your analysis.

Let’s look back to our example of trimming sequencing reads. If we wanted to look at changes in read quality if we trim each sample’s reads at different base positions, this is what our snakefile could look like:

import pandas as pd
metadata = pd.read_csv("metadata.csv")
samples = metadata["SampleID"] # Samples are in a column named SampleID

read_trim_positions = ["0","10","20"]
read_trunc_positions = ["0","10","20"] 

rule all:
    input:
        "multiqc_report/multiqc_report.html"

# Trim fastq files with varying parameters
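# Note: inside the shell command, wildcard values must be referenced with the
# "wildcards." prefix (e.g. {wildcards.start}), while {input} and {output}
# work as before.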
rule trim_fastq:
    input:
        "raw_reads_{sample}.fq"
    output:
        "trimmed_reads_{sample}_b{start}_e{end}.fq"
    shell:
        """
        seqtk trimfq -b {wildcards.start} -e {wildcards.end} {input} > {output}
        """

# Run each of the trimmed fastq files through fastQC for quality control
rule run_fastQC:
    input:
        "trimmed_reads_{sample}_b{start}_e{end}.fq"
    output:
        "trimmed_reads_{sample}_b{start}_e{end}_fastqc.zip"
    shell:
        """
        fastqc {input} -o .
        """

# Aggregate fastQC reports using MultiQC
rule run_multiQC:
    input:
        ### LOOK HERE, THIS IS OUR EXPAND COMMAND #################
        expand("trimmed_reads_{sample}_b{start}_e{end}_fastqc.zip",
                sample=samples,
                start=read_trim_positions,
                end=read_trunc_positions)
    output:
        "multiqc_report/multiqc_report.html"
    shell:
        """
        multiqc . -o multiqc_report
        """