Snakemake Essentials
How do I get myself one of these snakemakers? (Installation)
Installation instructions can be found on the Snakemake documentation page. In short, you want to install Conda and Mamba (for example, via the Mambaforge distribution), then run:
mamba create -c conda-forge -c bioconda -n snakemake_env snakemake
Alternatively, if you want to install Snakemake in an environment you already have set up, first activate that environment, then run:
pip install snakemake
This is especially important if you will be using more than one conda environment in your pipeline.
Running Snakemake
The installation command above creates a conda environment with Snakemake accessible. Any time you want to use Snakemake, run conda activate snakemake_env to activate that environment, then you can run the command snakemake. Try running snakemake -h to view the help options and make sure you have it installed.
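For example, once the environment is active, a typical session might look like this (the core count is just an example):
snakemake -n          # dry run: print the jobs Snakemake would run, without running them
snakemake --cores 4   # actually run the workflow, using up to 4 cores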
Rules
Snakefiles are structured in rules. Each step in your pipeline is a rule. For example, if you want to run the following 3 steps:
- Trim fastq files
- Run fastQC
- Aggregate fastQC reports using multiQC
and the main focus is the multiQC report file at the end, then your snakefile should look like this:
rule all:
    input:
        "multiQC_report.html"

rule trim_fastq:
    ...

rule run_fastQC:
    ...

rule run_multiQC:
    ...
What is “rule all”?
At this point, you may be asking “Wait! You said I should have a rule for each step! What’s this rule all mess??”
rule all is how we specify the output files we want, and it’s located at the very top of your snakefile. You can specify target files (pipeline endpoints) using the input to rule all. In a scientific case, consider these files to be the input to the paper you’ll write from your analysis.
Snakemake is tracking what rules need to be run in order to generate the inputs for other rules, so it will track that:
- rule run_multiQC’s outputs -> rule all’s inputs
- rule run_fastQC’s outputs -> rule run_multiQC’s inputs
- rule trim_fastq’s outputs -> rule run_fastQC’s inputs
Therefore, it knows that in order to have the final files, it will use rule trim_fastq -> rule run_fastQC -> rule run_multiQC -> rule all:
flowchart LR C[run_multiQC] --> D(all) B[run_fastQC] --> C[run_multiQC] A[trim_fastq] --> B[run_fastQC]
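If you want to see this graph for one of your own workflows, Snakemake can print it in Graphviz dot format (this assumes Graphviz’s dot tool is installed to render the image):
snakemake --dag | dot -Tsvg > dag.svg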
Structure of rules
Rules provide crucial information to Snakemake, such as a step’s inputs, outputs, and the command to run. These are the bare bones of each rule, but as we develop more, we will also start to include other aspects of each rule, such as the Conda environment, resource requirements (time and memory), and other parameters.
rule do_things:
    input:
        "input_file"
    output:
        "output_file"
    shell:
        """
        # do things to the input file to make the output file
        """
Instead of shell:, users can also use run:, which will run Python code.
rule do_things:
    input:
        "input_file"
    output:
        "output_file"
    run:
        # python things ...
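As a minimal, hedged sketch of what that Python code could be (the rule name and the upper-casing step are just illustrative), the run: block can use the rule’s input and output objects directly:
rule uppercase_file:
    input:
        "input_file"
    output:
        "output_file"
    run:
        # read the input and write an upper-cased copy as the output
        with open(input[0]) as fin, open(output[0], "w") as fout:
            fout.write(fin.read().upper())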
Inputs and outputs
Snakemake traces the inputs and outputs for each rule to know what rules need to be run (and what order to run them in). These are specified very explicitly in the rule, using input: and output:, followed by an indented, comma-separated list, with one entry per line. Entries in the list can also be named. Another great attribute of Snakemake is that these can be referenced in the command it runs.
Examples
1 input, 1 output
rule rename_file:
    input:
        "old_name.txt"
    output:
        "new_name.txt"
    shell:
        """
        mv {input} {output}
        # same as running
        # mv old_name.txt new_name.txt
        """
2 named inputs, 2 named outputs
rule rename_multiple_files:
    input:
        file_1="first_file.txt",
        file_2="second_file.txt"
    output:
        file_1="file_1.txt",
        file_2="file_2.txt"
    shell:
        """
        mv {input.file_1} {output.file_1}
        mv {input.file_2} {output.file_2}
        """
Example with trimming reads
Simple
Imagine you want to trim one fastq file (10 base pairs from the beginning, 5 base pairs from the end) using SeqTK. This is what a very simple snakefile could look like:
rule all:
    input:
        "trimmed_reads.fq"

rule trim_fastq:
    input:
        "raw_reads.fq"
    output:
        "trimmed_reads.fq"
    shell:
        """
        seqtk trimfq -b 10 -e 5 {input} > {output}
        """
Chaining rules
Now, imagine the raw reads are compressed. We want to unzip them, trim the reads, and recompress them. You could do this in 1 step, but let’s break it up for the sake of learning. That would look like this:
rule all:
    input:
        "trimmed_reads.fq.gz"

rule unzip_fastq:
    input:
        "raw_reads.fq.gz"
    output:
        "raw_reads.fq"
    shell:
        """
        gunzip {input}
        """

rule trim_fastq:
    input:
        "raw_reads.fq"
    output:
        "trimmed_reads.fq"
    shell:
        """
        seqtk trimfq -b 10 -e 5 {input} > {output}
        """

rule zip_trimmed:
    input:
        "trimmed_reads.fq"
    output:
        "trimmed_reads.fq.gz"
    shell:
        """
        gzip {input}
        """
Which would create a workflow like this:
flowchart LR C[zip_trimmed] --> D(all) B[trim_fastq] --> C[zip_trimmed] A[unzip_fastq] --> B[trim_fastq]
(We will build on this example)
Wildcards
About wildcards
Wildcards are a big part of how we can expand and generalize how our snakemake pipeline works. Consider a wildcard to be a list of values for which you want to run a snakemake rule multiple times. This is a bit like a for loop:
for value in wildcards:
    run rule...
Syntax
One wildcard
We can expand a rule using wildcards by expand()-ing the input to a rule that requires this rule’s output. For example:
rule all:
    input:
        expand("file_{sample}.txt",
               sample=["1","2","3"])

rule create_file:
    output:
        "file_{sample}.txt"
    shell:
        """
        touch {output}
        """
In this case, Snakemake will create a workflow that looks like this:
flowchart LR A[file_1] --> F(all) B[file_2] --> F(all) C[file_3] --> F(all)
About the syntax
All you need for a wildcard is a list, which you pass to the expand() function. The first argument in the expand function is your filepath string, with the wildcard in {curly brackets}. Then, you pass the list to use for expanding the filepath. In the previous example, the sample=["1","2","3"] defines the values of the wildcard sample in file_{sample}.txt.
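Concretely, that expand() call simply builds a list of file names:
expand("file_{sample}.txt", sample=["1", "2", "3"])
# -> ["file_1.txt", "file_2.txt", "file_3.txt"]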
Multiple wildcards
This doesn’t have to be sample, and you can have multiple wildcards in an expand() function. Imagine you’re plating a 5-course meal for 3 people - your snakefile would look like this:
course_numbers=["1","2","3","4","5"]
people=["John", "Madi", "Casey"]

rule all:
    input:
        expand("course_{course_number}_for_{person}.plate",
               course_number=course_numbers,
               person=people)

rule make_food:
    output:
        "course_{course_number}_for_{person}.plate"
    shell:
        """
        # make food... (here we just create an empty plate file)
        touch {output}
        """
Wildcards to run snakemake rules for each sample
This is a very common use for wildcards! We often use wildcards when we want to run the same rules for each sample. This is one of the ways Snakemake starts to shine: if you have a metadata file with 500 sample IDs, you can read that list of sample IDs into Snakemake using Python/Pandas, then run your snakemake pipeline for all samples. This is what that looks like:
import pandas as pd

metadata = pd.read_csv("metadata.csv")
samples = metadata["SampleID"] # Samples are in a column named SampleID

rule all:
    input:
        expand("file_{sample}.txt",
               sample=samples)

rule create_file:
    output:
        "file_{sample}.txt"
    shell:
        """
        touch {output}
        """
Wildcards for parameter exploration
Another common use of wildcards is to explore the effects of using different parameters in your analysis.
Let’s look back to our example of trimming sequencing reads. If we wanted to look at how read quality changes when we trim each sample’s reads at different base positions, this is what our snakefile could look like:
import pandas as pd

metadata = pd.read_csv("metadata.csv")
samples = metadata["SampleID"] # Samples are in a column named SampleID

read_trim_positions = ["0","10","20"]
read_trunc_positions = ["0","10","20"]

rule all:
    input:
        "multiqc_report/multiqc_report.html"

# Trim fastq files with varying parameters
rule trim_fastq:
    input:
        "raw_reads_{sample}.fq"
    output:
        "trimmed_reads_{sample}_b{start}_e{end}.fq"
    shell:
        """
        seqtk trimfq -b {wildcards.start} -e {wildcards.end} {input} > {output}
        """

# Run each of the trimmed fastq files through fastQC for quality control
rule run_fastQC:
    input:
        "trimmed_reads_{sample}_b{start}_e{end}.fq"
    output:
        "trimmed_reads_{sample}_b{start}_e{end}_fastqc.zip"
    shell:
        """
        fastqc {input} -o .
        """

# Aggregate fastQC reports using MultiQC
rule run_multiQC:
    input:
        ### LOOK HERE, THIS IS OUR EXPAND COMMAND #################
        expand("trimmed_reads_{sample}_b{start}_e{end}_fastqc.zip",
               sample=samples,
               start=read_trim_positions,
               end=read_trunc_positions)
    output:
        "multiqc_report/multiqc_report.html"
    shell:
        """
        multiqc . -o multiqc_report
        """