Elaborating rules

Params

We’ve covered how to provide inputs and outputs for a rule, but we can actually provide anything we want as parameters of a rule. These rule params can be accessed with curly brackets just like we access inputs and outputs. For example:

rule count_ducks:
    output: 
        "ducks_file.txt"
    params:
        ducks=2,
        eggs=4
    shell:
        """
        echo "The {params.ducks} ducks laid {params.eggs} eggs!" > {output}
        """

This would print “The 2 ducks laid 4 eggs!” to the output file, filling the params into the shell command, like it does with {output}.

Threads

For parallelizable jobs, we can pass threads directly to rules as well. In your rule, you can then access your number of threads with curly brackets (yet again). Here is an example (assuming we have 8 separate locations to count ducks at):

rule count_ducks_in_parallel:
    output: 
        "ducks_file.txt"
    threads: 8
    shell:
        """
        count_ducks_in_parallel.py --threads {threads} > {output}
        """

Why would I ever do it this way?

You may be wondering, “can’t I just hard code this in the shell command?” You absolutely could. It would even save you a few lines of code in these examples.

However, providing params and threads as components of the rule provides two main benefits:

  1. It allows us to also start drawing these variables from config files, which we’ll get into on the next page.
  2. It increases readability. You no longer need to sift through your code to figure out which exact parameter you provided; this way, you can look at the rule (or ideally your config file), and it should jump out at you.

Conda environments

A similar syntax can be used to specify conda environments for each rule. You can pass a conda environment in one of three ways:

  1. conda: "name_of_environment"
  2. conda: "path/to/environment"
  3. conda: "environment_specs.yaml", where the .yaml file contains the requirements for setting up a conda environment.
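As a sketch of option 3, a minimal environment_specs.yaml might look like the following (the environment name and package are illustrative, not from this tutorial):

```yaml
# environment_specs.yaml (illustrative): Snakemake builds this environment
# the first time a rule that references it runs with --use-conda.
name: fastqc_env
channels:
  - bioconda
  - conda-forge
dependencies:
  - fastqc=0.12.1
```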

If you then provide the --use-conda flag when running Snakemake (e.g., snakemake --cores 1 --use-conda), each rule will run inside the conda environment specified for it.

Combining use of conda environment name and YAML files

You can combine the name/path and .yaml approaches above if your environments take a long time to install or have installation requirements that can’t be captured in a .yaml file. Create .yaml files for the conda environments you need, use them to build the environments before running your workflow, and then reference the finished environments by name. I like to put mine in a bash script along with any extra installation steps. For example:

#!/bin/bash

set -e
set -u
set -o pipefail

# qiime2 has specific installation requirements and will give you issues if you simply try to use the .yaml file
echo "--------creating qiime environment"
# make sure that the qiime installation that you have in here is for the proper software system
# these instructions are for apple silicon (M1 and M2 chips)
wget https://data.qiime2.org/distro/core/qiime2-2023.5-py38-osx-conda.yml
CONDA_SUBDIR=osx-64 conda env create -n qiime2-2023.5 --file qiime2-2023.5-py38-osx-conda.yml
conda config --env --set subdir osx-64
rm qiime2-2023.5-py38-osx-conda.yml
# "conda activate" only works inside a script after sourcing conda's shell functions
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate qiime2-2023.5
pip install snakemake
conda deactivate

echo "--------creating R environment"
conda env create -f r_env.yml

echo "--------creating picrust environment"
conda env create -f picrust2.yml

And then run the script with bash (not sh, since it uses bash-specific options):

bash my_environments.sh

From there, you can reference the conda environment name or path in your Snakefile. Since the .yaml files can be used to create the environments on any machine, this method keeps the workflow portable and reproducible, splits environment setup out from the workflow run, and lets you work around finicky installation requirements. However, if your conda environment .yaml files work just fine with --use-conda, I would recommend sticking with that method.
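For instance, once the bash script above has built the qiime2-2023.5 environment, a rule can point at it by name instead of a .yaml file (the rule name, files, and shell command below are a hypothetical sketch):

```
rule denoise_reads:
    input:
        "demux.qza"
    output:
        "table.qza"
    # name of the pre-built environment from the bash script above
    conda: "qiime2-2023.5"
    shell:
        """
        # placeholder command; replace with your actual qiime steps
        qiime --version > {output}
        """
```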

Resources

Another way you can elaborate on rules is through specifying resource requirements. Specifying resources has a similar syntax to specifying params, where resources are provided as a subsection of the rule. This isn’t always useful on a personal computer, but it starts to become very important when we’re working on a computer cluster. Here is an example:

rule run_fastqc:
  input:
    "{sample}.trimmed.{read}.fq"
  output:
    "{sample}.trimmed.{read}_fastqc.zip"
  conda: "conda_envs/fastqc.yaml"
  resources:
    partition="short",
    mem_mb=int(2*1000), # MB, or 2 GB
    runtime=int(2*60), # min, or 2 hours
    slurm="<slurm extra parameters>"
  threads: 1
  shell:
    """
    fastqc {input} -o .
    """

Limiting resources

When running Snakemake locally, specifying resources can be useful for making sure Snakemake doesn’t overrun the resources available on your computer. On the command line, a flag for maximum resources can be passed. For example, to limit Snakemake to 8 GB of total memory, you could run it with snakemake --cores 1 --resources mem_mb=8000. If you needed to run the run_fastqc rule above for 50 samples but passed 8 GB as the maximum memory footprint, Snakemake would make sure no more than 4 samples were running at a time, since each job requests 2 GB.
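To make the scheduling arithmetic concrete, here is a small sketch (not Snakemake code, just the same calculation the scheduler effectively performs under this memory budget):

```python
# Simplified version of the resource-budget check: with a total budget of
# 8000 MB (--resources mem_mb=8000) and each run_fastqc job declaring
# mem_mb=2000, at most 8000 // 2000 = 4 jobs can run at once.
def max_concurrent_jobs(total_mem_mb, per_job_mem_mb):
    return total_mem_mb // per_job_mem_mb

print(max_concurrent_jobs(8000, 2000))  # 4 jobs at a time, regardless of sample count
```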

Snakemake doesn’t monitor resource usage

Snakemake does not monitor real-time resource usage; it simply trusts the estimates you provide. If you say a rule needs 2 GB of memory but it actually uses 10 GB, Snakemake will not know and will not adjust how it schedules the rules. More information on resources can be found here.

Resources are most useful for cluster integration

Resource specifications become very useful when integrating Snakemake with a compute cluster that has a resource management system like Slurm. This is discussed in Cluster-friendly Snakemake: a Slurm profile for Snakemake can look at the resource requirements for each rule and request that many resources when submitting batch jobs on the cluster.