Using R with Snakemake

Setting up R with Snakemake

Running R with Snakemake works beautifully and makes your life easier - if you’ve already gone through the pains of setting up your R to be compatible with the scripts you want to run.

The Issue:

All of your R packages (called libraries in R once installed). If you’re running Snakemake locally, it will use R and its packages already installed on your computer, so there’s no problem. BUT if you’re running Snakemake elsewhere, such as a super computer or any other computer that isn’t yours, this becomes a massive issue.

Here’s a way that we’ve gotten around it:

Creating an R-specific conda environment, installing all needed packages into that environment, and referencing it in your snakefile.

Installing your desired R packages through conda is a lot faster than if you were to install them via R (which seems semi counter-intuitive) and the process is fairly fool proof. You can set up your R environment in a .yaml file - similar to your other environments. For example:

name: r_env
channels:
    - bioconda
    - conda-forge
    - anaconda
    - r
dependencies:
    - pip
    - r-base=> 4.0
    - r-essentials
    - r-qiime2r
    - r-rstatix
    - r-ggpubr
    - r-cowplot
    - r-ggh4x
    - r-argparse
    - pip:
        - snakemake

You need to make sure that you’re keeping track of all needed R packages and putting them in your .yaml file. I would also reccommend validating the proper channels and dependencies for the R packages that you need at the Anaconda website by looking up the package and reading the conda installation instructions. Be warned that not all R packages can be installed through conda channels.

Once your .yaml file is set up, you can run:

conda env create -f r_env.yml

Et viola! Your R-specific conda environment has successfully been installed!

Running R scripts in Snakemake

An example workflow using R scripts:

rule all:
    input:
       "faith_plot.pdf",
       "faith_stats.tsv" 


rule running_r_script:
    input:
        "metadata_file.tsv",
        "faith_pd.tsv"
    output:
        "faith_plot.pdf",
        "faith_stats.tsv"
    conda:
        "r_env"
    shell:
        """
        Rscript faith_pd.R 
        """

The workflow looks pretty straightforward, right? Now let’s take a peek at what the inside of the faith_pd.R script looks like:

## this is the most important thing to specify at the beginning of your script
## needed libraries
library(ggpubr)
library(ggplot2)
library(magrittr)
library(tidyverse)
library(broom)

## input file paths 
metadata_FP <- 'metadata_file.tsv'
faith_pd_FP <- 'faith_pd.tsv'

## reading in metadata and faith's pd .tsvs
metadata <- read_tsv(metadata_FP)
faith_pd <- read_tsv(faith_pd_FP)

metadata %>%
    left_join(faith_pd, by = 'sampleid') -> combined_faith_table

## creating plot
combined_faith_table %>%
    ggplot(aes(x = sample_date, y = faith_pd)) +
        geom_boxplot(aes(group = sample_date)) +
        geom_jitter(width = 0.1, height = 0, alpha = 0.6) +
        geom_smooth(se = FALSE) +
        labs(x = 'Day',
            y = "Faith's PD",
            title = "Faith's PD Over Time") -> faith_pd_plot

## running stats on my faith's pd results by sample date
combined_faith_table %>%
    do(tidy(kruskal.test(faith_pd ~ sample_date,
                         data = .))) -> kruskal_test

combined_faith_table %>%
    dunn_test(faith_pd ~ sample_date,
              p.adjust.method = 'BH',
              data = .) -> dunn_test

## saving my outputs
ggsave('faith_plot.pdf',
       plot = faith_pd_plot,
       width = 7,
       height = 5)

write_tsv(dunn_test,
          'faith_stats.tsv')

It looks like a typical R script and that’s all it needs to look like. The most important piece is including all needed libraries (that you installed into your R-specific conda environment) to run that script as well as matching your input and output file paths between your script and Snakemake rule. It can be a bit annoying to need to edit your R scripts if your file paths change so you can install and utilize r-argparse which allows you to edit your file paths directly from the Snakemake rule. Here is an example on implimenting r-argparse in your R scripts and here is an example for it in your snakefile.