
Use Biocontainers to assemble your tools

Description:

In this section, we will introduce Biocontainers as an easy way to use bioinformatics tools and their dependencies.

Discussion - Conda vs. containers

Why and when should we use Conda or containers?

We suggest completing the Creating and Running Docker Containers tutorial before this section, especially if you have not used containers before.

Pipeline Description

You should have some rationale for which tools you use and why. For our tutorial example, we will check the quality of a fastq file with fastqc and then trim it with trimmomatic, using containers from Biocontainers.

Exploring containers with Biocontainers

Once you have identified a container of interest, you need to explore how to run the container and the tool itself. For example, let’s build a two-step workflow with some sample data. Before we move ahead with our own workflow, we will try to check the quality of a fastq file with fastqc and then use trimmomatic to trim the file.
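A quick way to explore a container before running it on real data is to pull the image and ask the tool inside for its version or help text. The commands below are a minimal sketch using the fastqc image referenced later in this section.

    # Pull the Biocontainers fastqc image used in this section
    docker pull quay.io/biocontainers/fastqc:0.11.7--4
    
    # Ask the tool inside the container for its version and help text
    docker run quay.io/biocontainers/fastqc:0.11.7--4 fastqc --version
    docker run quay.io/biocontainers/fastqc:0.11.7--4 fastqc --help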

  1. We can get a small fastq file from the CyVerse data store:

    mkdir /scratch/test-workflow
    
    cd /scratch/test-workflow
    
    iget -P /iplant/home/shared/cyverse_training/datasets/PRJNA79729/fastq_files/SRR064156.fastq.gz .
    
  2. We can run fastqc on this file using a container from Biocontainers:

    docker run quay.io/biocontainers/fastqc:0.11.7--4 fastqc SRR064156.fastq.gz
    

This produces an error:

"Skipping 'SRR064156.fastq.gz' which didn't exist, or couldn't be read"

So, we have to go through the (often long) process of getting our individual tools to work before we can assemble them into a pipeline.

The problem is that the Docker container cannot “see” the input file; we need to become more familiar with Docker’s documentation on bind mounts and volumes. We would then learn that we need to mount our local (Atmosphere) disk so that the file can be seen inside the container:

  1. Rerun the docker command using the -v option.

    # Mount the Atmosphere directory /scratch/test-workflow
    # to the directory /work, which will be created inside the container
    docker run -v /scratch/test-workflow/:/work quay.io/biocontainers/fastqc:0.11.7--4 fastqc /work/SRR064156.fastq.gz
    
  2. Try running fastqc on the sample fastq file above using another version available on Biocontainers (one possible approach is sketched below).
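If you want a starting point for the exercise above, the command below is a sketch of the same run with a different image tag. The exact tag (0.11.9--0) is an assumption; check the Biocontainers registry at quay.io/repository/biocontainers/fastqc for the tags that actually exist.

    # Same bind mount as before, but with a different (assumed) fastqc version tag
    docker run -v /scratch/test-workflow/:/work quay.io/biocontainers/fastqc:0.11.9--0 fastqc /work/SRR064156.fastq.gz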

Combining Biocontainers in a bash script

Ideally, we want to automate how we handle data. One way to do this is to develop a script that runs our applications for us. Let’s add one more tool (trimmomatic) to our example workflow before doing the main workflow for the tutorial. You can use the trimmomatic manual to determine how the Docker command should work.
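As a sanity check before writing the full command, you can ask the containerized tool to print its version or usage text. The commands below are a minimal sketch using the trimmomatic image from the question that follows.

    # Print the version of trimmomatic inside the container
    docker run quay.io/biocontainers/trimmomatic:0.39--1 trimmomatic -version
    
    # Running trimmomatic with no arguments prints its usage text
    docker run quay.io/biocontainers/trimmomatic:0.39--1 trimmomatic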

Question

Using the trimmomatic container below, write a docker command that trims the single-end reads using a sliding window of 4 bases, trimming when the average quality within the window falls below 30 (Phred score). Use 8 threads and single-end mode.

quay.io/biocontainers/trimmomatic:0.39--1

Answer

docker run -v /scratch/test-workflow/:/work quay.io/biocontainers/trimmomatic:0.39--1 trimmomatic SE -threads 8 /work/SRR064156.fastq.gz /work/SRR064156_trimmed.fastq.gz SLIDINGWINDOW:4:30

Once you have the commands you want to use, you could write a bash script to automate the use of these tools. For more on writing bash scripts see the Data Carpentry Genomics Lesson.

A script could look something like this:

#!/bin/bash

# Make a directory and stage our data
mkdir -p /scratch/example-script/data
DATADIRECTORY=/scratch/example-script/data

# Import data from the CyVerse Data Store
iget -P /iplant/home/shared/cyverse_training/datasets/PRJNA79729/fastq_files/SRR064156.fastq.gz $DATADIRECTORY

# Make a directory for our analyses
mkdir -p /scratch/example-script/analyses
ANALYSISDIR=/scratch/example-script/analyses

# Use the fastqc Docker container to check read quality
docker run -v $DATADIRECTORY:/work quay.io/biocontainers/fastqc:0.11.7--4 fastqc /work/SRR064156.fastq.gz

# Move results to the analyses directory
mkdir -p $ANALYSISDIR/fastqc
mv $DATADIRECTORY/*fastqc* $ANALYSISDIR/fastqc

# Use the trimmomatic Docker container to trim reads
docker run -v $DATADIRECTORY:/work quay.io/biocontainers/trimmomatic:0.39--1 trimmomatic SE -threads 8 /work/SRR064156.fastq.gz /work/SRR064156_trimmed.fastq.gz SLIDINGWINDOW:4:30

# Move results to the analyses directory
mkdir -p $ANALYSISDIR/trimmomatic
mv $DATADIRECTORY/*_trimmed.fastq.gz $ANALYSISDIR/trimmomatic
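Once saved, the script can be made executable and run. The filename below is just a placeholder, not something the tutorial prescribes.

# Save the script above as run_workflow.sh (hypothetical name), then:
chmod +x run_workflow.sh
./run_workflow.sh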

Discussion - Bash script

Is this Bash script a good solution? What problems could we run into when making our larger workflow? What could improve this script?
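One possible direction, shown here only as a sketch and not as the tutorial's answer, is to make the script stop on the first error and to write the sample name only once instead of hard-coding it throughout.

#!/bin/bash
# Sketch of two possible improvements (assumptions, not part of the tutorial's script):
# exit as soon as any command fails, and keep the sample name in one variable.
set -euo pipefail

SAMPLE=SRR064156                              # hypothetical variable name
DATADIRECTORY=/scratch/example-script/data
mkdir -p $DATADIRECTORY

iget -P /iplant/home/shared/cyverse_training/datasets/PRJNA79729/fastq_files/${SAMPLE}.fastq.gz $DATADIRECTORY
docker run -v $DATADIRECTORY:/work quay.io/biocontainers/fastqc:0.11.7--4 fastqc /work/${SAMPLE}.fastq.gz
# ... the trimmomatic step and the moves into the analyses directory would follow the same pattern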

