Use Snakemake to build your workflow
Description:
In this section we will build our intended workflow using Snakemake.
Why use Snakemake? It has many features that give us more control over our workflow than a bash script does:
- It is FOSS (Free, open-source software)
- It can run on Windows/MacOS/Linux and works with HPC
- It is Python-based and you can use Python in your code (see the short sketch after this list).
- If you are familiar with make, you understand the basic way Snakemake organizes things.
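For instance, because a Snakefile is read as Python, ordinary imports, variables, and functions can sit alongside rules. Here is a minimal sketch (the sample names and file layout are hypothetical, just to illustrate the idea):

# A Snakefile is Python: imports, variables, and functions work as usual
import os

SAMPLES = ["sampleA", "sampleB"]   # hypothetical sample names

def results_path(name):
    # an ordinary Python function used inside a rule definition
    return os.path.join("results", name)

rule copy_sample:
    input:
        "{sample}.txt"
    output:
        results_path("{sample}.txt")
    shell:
        "cp {input} {output}"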
Snakemake example
Snakemake uses metaphors like rules and recipes to describe how a workflow is organized. A rule states what output should be created and defines the steps (the recipe) that need to be followed to create that output. Let’s work on the first step of our workflow.
Import a sequence from SRA
We will import an SRA file, so the first thing we do is create a rule that states what output we should end up with. In our tutorial, we will be working from a single set of reads from the Plasmodium data:
rule import_from_sra:
    output:
        "SRR8245081.sra"
We build on this by specifying a shell command that will create that output. We will use a Docker container:
rule import_from_sra:
    output:
        "/scratch/learn-snake/sra_files/SRR8245081/SRR8245081.sra"
    shell:
        # Use a docker run command and mount a directory
        "docker run -v /scratch/learn-snake/:/experiment/ "
        # set a working directory - this will be created if needed
        "--workdir /experiment/sra_files "
        # set the docker container
        "quay.io/biocontainers/sra-tools:2.10.0--pl526he1b5a44_0 "
        # use the prefetch tool
        "prefetch "
        # show progress
        "-p 1 "
        # get our desired accession
        "SRR8245081 "
        # output to our desired directory
        "--output-directory /experiment/sra_files/"
Tip
Use a text editor to compose your Snakefile, as Python is sensitive to indentation.
Once we have figured out the correct set of commands, we write the code in a file usually called a Snakefile. We then run the Snakefile.
# This assumes you are in a directory with
# a file `Snakefile`
snakemake -s Snakefile
When run we get some helpful outputs:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       import_from_sra
        1

[Thu Dec  5 04:16:50 2019]
rule import_from_sra:
    output: /scratch/learn-snake/sra_files/SRR8245081/SRR8245081.sra
    jobid: 0

...

[Thu Dec  5 04:17:03 2019]
Finished job 0.
1 of 1 steps (100%) done
Complete log: /scratch/learn-snake/.snakemake/log/2019-12-05T042511.996668.snakemake.log
There is information about how (or if) this job is parallelized, what job(s) will be run, and what the expected output is. Importantly, if the expected outputs are not created, Snakemake will inform you.
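For future runs, you can also ask Snakemake to preview what it would do before executing anything: a dry run (the -n flag) lists the planned jobs without running them, and -p also prints the shell commands they would use:

# preview planned jobs (-n) and their shell commands (-p) without running them
snakemake -n -p -s Snakefile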
We can also visualize the workflow as a diagram:
# install a needed dependency
sudo apt-get install -y graphviz

# make a PNG of your workflow as a "directed acyclic graph"
snakemake --dag | dot -Tpng > dag.png
You can use scp to copy the file to your computer.
# remember to use your username and Atmosphere IP
scp [email protected]:/scratch/learn-snake/dag.png ~/Desktop
The diagram isn’t that impressive yet, but it will be as we build our workflow.
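As the workflow grows, the per-job graph can get crowded. Snakemake can also draw a condensed graph with one node per rule rather than one node per job:

# condensed view: one node per rule rather than one per job
snakemake --rulegraph | dot -Tpng > rulegraph.png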
Tip
To build your own Snakemake workflow, familiarize yourself with the Snakemake grammar.
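As a quick orientation, the grammar this tutorial leans on is: rules with input/output file patterns, {wildcards} that let one rule serve many files, params for options, and the expand() helper for listing targets. A minimal sketch with hypothetical file names:

SAMPLES = ["s1", "s2"]   # hypothetical sample names

rule all:
    input:
        # expand() lists every final file we want built
        expand("{sample}.sorted.txt", sample=SAMPLES)

rule sort_file:
    # the {sample} wildcard lets this one rule produce both outputs
    input:
        "{sample}.txt"
    output:
        "{sample}.sorted.txt"
    params:
        flags="-n"
    shell:
        "sort {params.flags} {input} > {output}"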
Full Snakemake workflow
First, let’s set up our workflow in our tutorial folder.
Create a snakemake folder in the tutorial folder:
mkdir -p /scratch/reproducibility-tutorial/snakemake
Here is a much more advanced, fully assembled workflow. We will walk through and discuss the key features:
#SRA definitions
ACCESSION=["SRR8245081",]
READS=["_1","_2"]
REFERENCE_URLS=["ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/765/GCF_000002765.4_ASM276v2/GCF_000002765.4_ASM276v2_genomic.fna.gz",
                "ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/002/765/GCF_000002765.4_ASM276v2/GCF_000002765.4_ASM276v2_genomic.gff.gz"]
REFERENCES_BASE=["GCF_000002765.4_ASM276v2_genomic",]

#containers
SRA_TOOLS_CONTAINER="quay.io/biocontainers/sra-tools:2.10.0--pl526he1b5a44_0"
FASTP_CONTAINER="biocontainers/fastp:v0.19.6dfsg-1-deb_cv1"
SPADES_CONTAINER="biocontainers/spades:v3.13.0dfsg2-2-deb_cv1"
QUAST_CONTAINER="quay.io/biocontainers/quast:5.0.2--py27pl526ha92aebf_0"

rule all:
    input:
        sra=expand("/scratch/reproducibility-tutorial/experiment/sra_files/{accession}/{accession}.sra",
                   accession=ACCESSION),
        fastq=expand("/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}.sra{reads}.fastq",
                     accession=ACCESSION, reads=READS),
        fastq_trimmed=expand("/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}.sra{reads}_trimmed.fastq",
                             accession=ACCESSION, reads=READS),
        fastp_report=expand("/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}_fastp_report.html",
                            accession=ACCESSION),
        assembly_contigs=expand("/scratch/reproducibility-tutorial/experiment/{accession}_assembly/contigs.fasta",
                                accession=ACCESSION),
        reference_genome=expand("/scratch/reproducibility-tutorial/experiment/reference_genome/{reference_genome_fna}.fna",
                                reference_genome_fna=REFERENCES_BASE),
        reference_annotation=expand("/scratch/reproducibility-tutorial/experiment/reference_genome/{reference_genome_gff}.gff",
                                    reference_genome_gff=REFERENCES_BASE),
        quast_assembly_report=expand("/scratch/reproducibility-tutorial/experiment/{accession}_assembly_stats/report.html",
                                     accession=ACCESSION)

rule SRA_Import:
    output:
        "/scratch/reproducibility-tutorial/experiment/sra_files/{accession}/{accession}.sra"
    params:
        container=SRA_TOOLS_CONTAINER,
        command="prefetch",
        user="root",
        progress=1,
        outputdirectory="/experiment/sra_files"
    shell:
        "docker run --user {params.user} "
        "-v /scratch/reproducibility-tutorial/experiment:/experiment "
        "--workdir /experiment "
        "{params.container} {params.command} "
        "-p {params.progress} "
        "--output-directory {params.outputdirectory} "
        "{ACCESSION}"

rule SRA_to_fastq:
    input:
        sra=expand("/scratch/reproducibility-tutorial/experiment/sra_files/{accession}/{accession}.sra",
                   accession=ACCESSION)
    output:
        "/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}.sra_1.fastq",
        "/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}.sra_2.fastq"
    params:
        container=SRA_TOOLS_CONTAINER,
        command="fasterq-dump",
        user="root",
        outputdirectory="/experiment/fastq_files"
    threads: 8
    shell:
        "docker run --user {params.user} "
        "-v /scratch/reproducibility-tutorial/experiment:/experiment "
        "--workdir /experiment/sra_files/{ACCESSION}/ "
        "{params.container} {params.command} "
        "{ACCESSION}.sra "
        "-O {params.outputdirectory} "

rule fastq_qc:
    input:
        fastq=expand("/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}.sra{reads}.fastq",
                     accession=ACCESSION, reads=READS)
    output:
        "/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}.sra_1_trimmed.fastq",
        "/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}.sra_2_trimmed.fastq",
        "/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}_fastp_report.html"
    params:
        container=FASTP_CONTAINER,
        command="fastp",
        user="root",
        outputdirectory="/experiment/fastq_files"
    threads: 8
    shell:
        "docker run --user {params.user} "
        "-v /scratch/reproducibility-tutorial/experiment:/experiment "
        "--workdir /experiment/fastq_files/ "
        "{params.container} {params.command} "
        "-V "
        "--cut_front "
        "--cut_front_mean_quality 30 "
        "-R '{ACCESSION} fastp report' "
        "--html {ACCESSION}_fastp_report.html "
        "-i /experiment/fastq_files/{ACCESSION}.sra_1.fastq "
        "-o /experiment/fastq_files/{ACCESSION}.sra_1_trimmed.fastq "
        "-I /experiment/fastq_files/{ACCESSION}.sra_2.fastq "
        "-O /experiment/fastq_files/{ACCESSION}.sra_2_trimmed.fastq"

rule SPAdes_assembly:
    input:
        fastq_trimmed=expand("/scratch/reproducibility-tutorial/experiment/fastq_files/{accession}.sra{reads}_trimmed.fastq",
                             accession=ACCESSION, reads=READS)
    output:
        "/scratch/reproducibility-tutorial/experiment/{accession}_assembly/contigs.fasta"
    params:
        container=SPADES_CONTAINER,
        command="spades.py",
        user="root",
        outputdirectory="/experiment/{accession}_assembly"
    threads: 8
    shell:
        "docker run --user {params.user} "
        "-v /scratch/reproducibility-tutorial/experiment:/experiment "
        "--workdir /experiment/fastq_files/ "
        "{params.container} {params.command} "
        "-1 /experiment/fastq_files/{ACCESSION}.sra_1_trimmed.fastq "
        "-2 /experiment/fastq_files/{ACCESSION}.sra_2_trimmed.fastq "
        "-o {params.outputdirectory}"

rule import_reference_genome:
    output:
        reference_genome="/scratch/reproducibility-tutorial/experiment/reference_genome/{REFERENCES_BASE}.fna",
        reference_annotation="/scratch/reproducibility-tutorial/experiment/reference_genome/{REFERENCES_BASE}.gff"
    shell:
        "mkdir -p /scratch/reproducibility-tutorial/experiment/reference_genome && "
        "cd /scratch/reproducibility-tutorial/experiment/reference_genome && "
        "wget {REFERENCE_URLS} && "
        "gzip -d {REFERENCES_BASE}*.gz"

rule assembly_stats:
    input:
        assembly_contigs=expand("/scratch/reproducibility-tutorial/experiment/{accession}_assembly/contigs.fasta",
                                accession=ACCESSION),
        reference_genome=expand("/scratch/reproducibility-tutorial/experiment/reference_genome/{reference_genome_fna}.fna",
                                reference_genome_fna=REFERENCES_BASE),
        reference_annotation=expand("/scratch/reproducibility-tutorial/experiment/reference_genome/{reference_genome_gff}.gff",
                                    reference_genome_gff=REFERENCES_BASE)
    output:
        "/scratch/reproducibility-tutorial/experiment/{accession}_assembly_stats/report.html"
    params:
        container=QUAST_CONTAINER,
        command="quast.py",
        user="root",
        outputdirectory="/experiment/{accession}_assembly_stats",
        container_input=expand("/experiment/{accession}_assembly/contigs.fasta", accession=ACCESSION)
    threads: 8
    shell:
        "docker run --user {params.user} "
        "-v /scratch/reproducibility-tutorial/experiment:/experiment "
        "--workdir /scratch/reproducibility-tutorial/experiment/reference_genome/ "
        "{params.container} {params.command} "
        "-o {params.outputdirectory} "
        "--features gene:/experiment/reference_genome/{REFERENCES_BASE}.gff "
        "-r /experiment/reference_genome/{REFERENCES_BASE}.fna "
        "-t 8 "
        "--eukaryote "
        "--circos "
        "{params.container_input}"
Save this as a Snakefile in the snakemake folder, where you can run the job (warning: even using only one SRA input, this can take an hour or more to complete).
snakemake --cores all
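You do not have to build everything at once: pass a target file on the command line and Snakemake runs only the rules needed to produce it. For example, to build just the trimmed reads (which pulls in the SRA download and fastq conversion steps automatically):

# build a single target plus whatever it depends on
snakemake --cores all /scratch/reproducibility-tutorial/experiment/fastq_files/SRR8245081.sra_1_trimmed.fastq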
Tip
Use `tmux`
Since we happen to be working on the cloud, you may want to walk away from an ssh session and have your work run without being connected (e.g. you want to close your laptop). You can use a tool like tmux to do that. See this tmux tutorial.
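A minimal tmux session for this job might look like the sketch below (detach with Ctrl-b then d; the session keeps running after you log out):

# start a named session and run the workflow inside it
tmux new -s snakemake
snakemake --cores all
# detach with Ctrl-b d, close your laptop, then later:
tmux attach -t snakemake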
Discussion - Parallelism
What happens when you run Snakemake with all cores? Look back at the log output: which jobs could run in parallel, and which had to wait for their inputs?
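One thing to look for in the discussion: a rule’s threads: directive is an upper bound that Snakemake scales down to the cores you actually provide, and independent jobs run side by side when cores allow. Our tutorial’s rules hard-code -t 8 in their commands; a more flexible (hypothetical) pattern passes the granted thread count through the {threads} placeholder:

rule example_threaded:
    input:
        "reads.fastq"              # hypothetical input file
    output:
        "reads.processed.fastq"
    threads: 8                     # upper bound; scaled down to available --cores
    shell:
        # {threads} expands to the number of threads Snakemake actually granted
        "some_tool --threads {threads} {input} > {output}"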
- When you finish, don’t forget to generate a diagram of your workflow
snakemake --dag | dot -Tpng > dag.png
Document your work
- As previously, you should copy your results to CyVerse and commit code changes to GitHub (remember to have a .gitignore file so you don’t track large files)
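For example, a .gitignore that keeps the large sequence files and assembly outputs out of version control might look like this (written here with a shell heredoc; the patterns are suggestions to adapt to your own layout):

# create a .gitignore; these patterns are examples - adjust to your layout
cat > .gitignore <<'EOF'
*.sra
*.fastq
*_assembly/
*_assembly_stats/
.snakemake/
EOF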