Using R with Snakemake
Setting up R with Snakemake
Running R with Snakemake works beautifully and makes your life easier - if you’ve already gone through the pains of setting up your R to be compatible with the scripts you want to run.
The Issue:
The issue is your R packages (called libraries in R once installed). If you’re running Snakemake locally, it will use the R installation and packages already on your computer, so there’s no problem. BUT if you’re running Snakemake elsewhere, such as on a supercomputer or any other computer that isn’t yours, those packages won’t necessarily be there, and this becomes a massive issue.
Here’s a way that we’ve gotten around it:
Creating an R-specific conda environment, installing all needed packages into that environment, and referencing it in your snakefile.
Installing your desired R packages through conda is a lot faster than installing them via R (which seems semi counterintuitive), and the process is fairly foolproof. You can set up your R environment in a .yaml file, similar to your other environments. For example:
name: r_env
channels:
  - bioconda
  - conda-forge
  - anaconda
  - r
dependencies:
  - pip
  - r-base>=4.0
  - r-essentials
  - r-qiime2r
  - r-rstatix
  - r-ggpubr
  - r-cowplot
  - r-ggh4x
  - r-argparse
  - pip:
    - snakemake
You need to make sure that you’re keeping track of all needed R packages and putting them in your .yaml file. I would also recommend validating the proper channels and dependencies for the R packages that you need at the Anaconda website by looking up each package and reading its conda installation instructions. Be warned that not all R packages can be installed through conda channels.
Once your .yaml file is set up, you can run:
conda env create -f r_env.yml
Et voilà! Your R-specific conda environment has been created successfully!
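Since keeping track of every needed package is the part that tends to go wrong, it can help to check the new environment before running the workflow. Below is a minimal sketch of such a check, meant to be run with Rscript after activating r_env. The package list is an assumption based on what the example script in this post uses, and the function name missing_packages is made up for illustration; swap in whatever your own .yaml file lists.

```r
## Packages the workflow is expected to need (an assumed list; edit to match your .yaml)
required_pkgs <- c("ggpubr", "ggplot2", "magrittr", "tidyverse", "broom", "rstatix")

## Return the names of any packages that R cannot find in the current environment.
## requireNamespace() returns TRUE/FALSE without attaching the package.
missing_packages <- function(pkgs) {
  found <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!found]
}

missing <- missing_packages(required_pkgs)

if (length(missing) > 0) {
  message("Missing from this environment: ", paste(missing, collapse = ", "))
} else {
  message("All required packages are available.")
}
```

If anything shows up as missing, add it to the .yaml file (or install it another way if it isn't available through conda) before pointing Snakemake at the environment.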
Running R scripts in Snakemake
An example workflow using R scripts:
rule all:
    input:
        "faith_plot.pdf",
        "faith_stats.tsv"

rule running_r_script:
    input:
        "metadata_file.tsv",
        "faith_pd.tsv"
    output:
        "faith_plot.pdf",
        "faith_stats.tsv"
    conda:
        "r_env"
    shell:
        """
        Rscript faith_pd.R
        """
The workflow looks pretty straightforward, right? Now let’s take a peek at what the inside of the faith_pd.R script looks like:
## this is the most important thing to specify at the beginning of your script
## needed libraries
library(ggpubr)
library(ggplot2)
library(magrittr)
library(tidyverse)
library(broom)
library(rstatix)  ## provides dunn_test(), used below

## input file paths
metadata_FP <- 'metadata_file.tsv'
faith_pd_FP <- 'faith_pd.tsv'

## reading in metadata and faith's pd .tsvs
metadata <- read_tsv(metadata_FP)
faith_pd <- read_tsv(faith_pd_FP)

## joining the two tables by sample id
metadata %>%
  left_join(faith_pd, by = 'sampleid') -> combined_faith_table

## creating plot
combined_faith_table %>%
  ggplot(aes(x = sample_date, y = faith_pd)) +
    geom_boxplot(aes(group = sample_date)) +
    geom_jitter(width = 0.1, height = 0, alpha = 0.6) +
    geom_smooth(se = FALSE) +
    labs(x = 'Day',
         y = "Faith's PD",
         title = "Faith's PD Over Time") -> faith_pd_plot

## running stats on my faith's pd results by sample date
combined_faith_table %>%
  do(tidy(kruskal.test(faith_pd ~ sample_date,
                       data = .))) -> kruskal_test

combined_faith_table %>%
  dunn_test(faith_pd ~ sample_date,
            p.adjust.method = 'BH',
            data = .) -> dunn_test

## saving my outputs
ggsave('faith_plot.pdf',
       plot = faith_pd_plot,
       width = 7,
       height = 5)
write_tsv(dunn_test,
          'faith_stats.tsv')
It looks like a typical R script, and that’s all it needs to look like. The most important piece is loading all the libraries the script needs (the ones you installed into your R-specific conda environment) and matching the input and output file paths between your script and your Snakemake rule. It can be a bit annoying to edit your R scripts whenever your file paths change, so you can install and use r-argparse, which lets you set your file paths directly from the Snakemake rule. Here is an example on implementing r-argparse in your R scripts and here is an example for it in your snakefile.
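To give a rough idea of what that looks like, here is a minimal sketch of the top of faith_pd.R rewritten to take its paths from the command line with the argparse R package. The flag names, the build_parser helper, and the defaults (which simply mirror the hard-coded paths above) are assumptions for illustration, not the only way to set it up. Note that the R argparse package calls Python under the hood, so Python needs to be available in the same environment.

```r
## A sketch only: flag names, helper name, and defaults are illustrative assumptions.
library(argparse)

## Build a parser whose defaults mirror the hard-coded paths used in faith_pd.R
build_parser <- function() {
  parser <- ArgumentParser(description = "Plot and test Faith PD by sample date")
  parser$add_argument("-m", "--metadata", dest = "metadata_FP",
                      default = "metadata_file.tsv",
                      help = "path to the metadata .tsv")
  parser$add_argument("-f", "--faith_pd", dest = "faith_pd_FP",
                      default = "faith_pd.tsv",
                      help = "path to the Faith PD .tsv")
  parser$add_argument("-p", "--plot", dest = "plot_FP",
                      default = "faith_plot.pdf",
                      help = "where to save the plot")
  parser$add_argument("-s", "--stats", dest = "stats_FP",
                      default = "faith_stats.tsv",
                      help = "where to save the stats table")
  parser
}

## Parse whatever was passed on the command line (falls back to the defaults)
args <- build_parser()$parse_args()

## From here on, the script uses args$metadata_FP, args$faith_pd_FP,
## args$plot_FP, and args$stats_FP instead of hard-coded strings.

## In the Snakemake rule, the shell command would then become something like:
##   Rscript faith_pd.R --metadata {input.metadata} --faith_pd {input.faith_pd} \
##       --plot {output.plot} --stats {output.stats}
## (this assumes the rule's inputs and outputs have been given those names)
```

With this in place, changing a file path means editing only the Snakemake rule; the R script itself stays untouched.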