Computer and project setup with GitHub, Conda, and the Data Store

Description:

In this section, we will set up our computer with the tools we will need and create our initial project folder.

Note

Using Atmosphere

Most if not all of this tutorial can be done on a local computer. However, we will assume you are using the Atmosphere Image we suggested at the beginning of the tutorial. If you are using your own computer, anticipate that some steps may be slightly different on your own system.


Connect to the Atmosphere VM

  1. Following the instructions in the Atmosphere Guide, launch the recommended Atmosphere Image and then use ssh to connect to the instance.

Make a local clone of your GitHub repository

Once you have connected to Atmosphere, you can clone the GitHub repository to that computer.

  1. Go to your GitHub repository and click the Clone or download button. Copy the URL that is provided.

  2. For much of this tutorial we will be working as the root user. You can switch to root in Atmosphere using this command:

    sudo su
    

    Tip

    Your CyVerse password is your root password

  3. Next, we will need to clone our GitHub repository. In Atmosphere we have a large amount of disk space mounted at /scratch so we will change to that directory and clone there:

    cd /scratch
    
    # make sure to use your actual URL
    git clone "YOUR GITHUB REPO URL"
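
Later steps assume the clone lives at /scratch/reproducibility-tutorial. As a minimal sketch, the directory name that `git clone` picks by default can be predicted from the URL. The helper name `clone_dir_name` and the URL below are hypothetical; substitute the URL you copied from your own repository.

```shell
clone_dir_name() {
    # Strip a trailing slash and a ".git" suffix, mirroring the directory
    # name that `git clone` chooses when no target directory is given
    url="${1%/}"
    basename "$url" .git
}

# Hypothetical URL used only for illustration
repo_url="https://github.com/YOURGITHUBUSERNAME/reproducibility-tutorial.git"
echo "/scratch/$(clone_dir_name "$repo_url")"
```

If your repository has a different name, adjust the paths used in the rest of this section accordingly.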
    

Install Conda

Conda is a popular tool for installing software. Typically, software you want to use requires other software (dependencies) to be installed, and Conda can manage all of this for you. Each available Conda package is built from a “recipe” that includes everything you need to run your software. Packages are distributed through channels, including some specific to bioinformatics, like Bioconda. We will install Conda and then use it to install some of the tools we need. We will install a lightweight version of Conda called Miniconda.

Install Miniconda

  1. We can install Miniconda with the following code:
# download the Miniconda installer
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

# install miniconda silently (-b) in path (-p) /opt/conda
bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda

#make sure all conda packages will be in path by symbolic links to /bin
#this step is a bit of a hack and you may get some warnings about
#symbolic links that cannot be created - it's ok.
ln -s /opt/conda/pkgs/*/bin/* /bin
ln -s /opt/conda/pkgs/*/lib/* /usr/lib
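
The symbolic links above are described as a hack. A commonly used alternative, shown here only as a sketch and not what this tutorial does, is to put /opt/conda/bin on PATH for the current shell. The helper name `prepend_to_path` is made up for illustration, and the sketch assumes Miniconda was installed to /opt/conda as above.

```shell
prepend_to_path() {
    # $1 = directory to place at the front of PATH
    case ":$PATH:" in
        *":$1:"*) ;;               # already present: leave PATH unchanged
        *) PATH="$1:$PATH" ;;      # otherwise put it first
    esac
    export PATH
}

prepend_to_path /opt/conda/bin
echo "$PATH"
```

This change lasts only for the current shell session; the rest of the tutorial continues to rely on the symbolic links and on full paths such as /opt/conda/bin/conda.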

Install and test JupyterLab

  1. Install Jupyter, bash kernel, and dependencies using conda:
# Install Jupyter lab version 1.2.3
/opt/conda/bin/conda install -c conda-forge -y jupyterlab=1.2.3
/opt/conda/bin/conda install -c conda-forge -y nodejs=10.13.0

/opt/conda/bin/pip install bash_kernel
/opt/conda/bin/pip install ipykernel
/opt/conda/bin/python3 -m bash_kernel.install
# if the above fails try /opt/conda/bin/conda install -y ipykernel

Tip

It is a good idea in your code to specifically indicate the versions of software packages you are installing (e.g. jupyterlab=1.2.3). You can search for available package versions on conda using the command:

conda search <PACKAGE NAME>

You can then install the version of choice.

  2. We can start JupyterLab with the following command:
/opt/conda/bin/jupyter lab --no-browser --allow-root --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password='' --notebook-dir='/scratch/reproducibility-tutorial/'
  3. You can access your JupyterLab session by opening a web browser and entering the IP address of your Atmosphere instance followed by :8888 (e.g. 123.45.67.89:8888). To shut down JupyterLab, press Ctrl+C in the terminal where it is running.

Install and test snakemake

Next we will use conda to install Snakemake.

  1. Check for snakemake on the bioconda channel and install it:
# See what snakemake versions are available
/opt/conda/bin/conda search -c bioconda snakemake

# Let's choose version 5.8.1
/opt/conda/bin/conda install -c bioconda -c conda-forge -y snakemake=5.8.1

# hack conda again

ln -s /opt/conda/bin/* /bin
ln -s /opt/conda/lib/* /usr/lib

# verify the installation
snakemake --version
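
Because the install pinned snakemake=5.8.1, the check can go one step further and compare the reported version with the pin. This is a minimal sketch; the helper name `version_matches` is an assumption introduced for illustration.

```shell
version_matches() {
    # $1 = expected version, $2 = version string reported by the tool
    [ "$1" = "$2" ]
}

expected="5.8.1"
if command -v snakemake >/dev/null 2>&1; then
    actual="$(snakemake --version)"
    if version_matches "$expected" "$actual"; then
        echo "snakemake $actual matches the pinned version"
    else
        echo "expected snakemake $expected but found $actual" >&2
    fi
else
    echo "snakemake not found on PATH" >&2
fi
```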

Install and test Docker

We will install Docker as recommended for Linux:

# update ubuntu apt-get package manager
sudo apt-get update

# install some needed packages
# It's ok to say yes to any warnings
sudo apt-get install -y \
apt-transport-https \
ca-certificates \
curl \
gnupg-agent \
software-properties-common

# add the Docker key
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

#add the repository
sudo add-apt-repository \
 "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
 $(lsb_release -cs) \
 stable"

# update apt-get with new repository information
sudo apt-get update

# install docker
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

#test docker
docker run hello-world

Set up your local filesystem

Now that we have set up the basic software we will need, we should set up some of the file system we want for our project. A Quick Guide to Organizing Computational Biology Projects is still a good reference and we will borrow some of those recommendations.

  1. Use the history command to document the installation steps (and software versions) in our installation process:
# change into our project directory
cd /scratch/reproducibility-tutorial/

# append your bash history to the README.md file
history >>README.md

# Use nano to edit the history down to the relevant commands
# use markdown to make this into a readable report
nano README.md
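
Some of the manual editing can be reduced by cleaning the history output before it reaches the README. The sketch below strips the entry numbers that `history` prints and indents each command by four spaces so Markdown renders it as code. The function name `history_to_markdown` is hypothetical, and the two sample lines are made up.

```shell
history_to_markdown() {
    # Remove the leading history entry number, then indent by four
    # spaces so Markdown treats the line as a code block
    sed -E 's/^[[:space:]]*[0-9]+[[:space:]]+//; s/^/    /'
}

# Demonstration on two made-up history lines
printf '  101  cd /scratch\n  102  git status\n' | history_to_markdown
```

In an interactive bash session this could be used as `history | history_to_markdown >> README.md`. You would still prune irrelevant commands by hand afterwards.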

After editing, a well-documented README may look something like…

# reproducibility-tutorial

## Computer Setup

    #download the Miniconda installer
    wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh

    # install miniconda silently (-b) in path (-p) /opt/conda
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda

    #make sure all conda packages will be in path by symbolic links to /bin
    ln -s /opt/conda/pkgs/*/bin/* /bin
    ln -s /opt/conda/bin/* /bin
    ln -s /opt/conda/pkgs/*/lib/* /usr/lib

    # Install Jupyter lab version 1.2.3
    /opt/conda/bin/conda install -c conda-forge -y jupyterlab=1.2.3
    /opt/conda/bin/conda install -c conda-forge -y nodejs=10.13.0

    python3 -m pip install bash_kernel
    pip install ipykernel
    python3 -m bash_kernel.install

    #Test Jupyterlab
    jupyter lab --no-browser --allow-root --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password='' --notebook-dir='/scratch/reproducibility-tutorial/'

    #Install Snakemake
    /opt/conda/bin/conda install -c bioconda -c conda-forge -y snakemake=5.8.1

    #install Docker
    # update ubuntu apt-get package manager
    sudo apt-get update

    # install some needed packages
    sudo apt-get install -y \
    apt-transport-https \
    ca-certificates \
    curl \
    gnupg-agent \
    software-properties-common

    # add the Docker key
    curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

    #add the repository
    sudo add-apt-repository \
     "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
     $(lsb_release -cs) \
     stable"

    # update apt-get with new repository information
    sudo apt-get update

    # install docker
    sudo apt-get install -y docker-ce docker-ce-cli containerd.io

    #test docker
    docker run hello-world

Ordinarily, we might create a few folders for our raw data, and we will get to those steps shortly. For now, let’s get the metadata from our SRA experiment. Unfortunately, SRA does not have a way to do this automatically, so we will go to the SRA Run Selector for the chosen sample data.

  1. Go to SRA Study SRP170758 and click on Metadata to download SraRunTable.txt and save the table (text file) to your computer.

We will use scp to copy SraRunTable.txt from our local system to our Atmosphere instance. (If you are using Windows and don’t have the Windows Subsystem for Linux installed, you could also try pscp.exe from the PuTTY project.)

  1. First, on your Atmosphere instance, change the ownership of /scratch/reproducibility-tutorial/ so that the account associated with your CyVerse username can write to it:

    chown -R YOURCYVERSEUSERNAME /scratch/reproducibility-tutorial/
    
  2. On your local computer, use scp to upload the file to the Atmosphere instance. Remember to adjust the command below to reflect 1) the location of the downloaded SraRunTable.txt, 2) your CyVerse username, and 3) the IP address of your Atmosphere instance. scp may prompt you for your password.

    scp ~/Downloads/SraRunTable.txt YOURCYVERSEUSERNAME@123.456.78.9:/scratch/reproducibility-tutorial/
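
To confirm that a transferred file arrived intact, one option (not part of the original tutorial) is to compare checksums computed on both machines. The sketch below uses the POSIX `cksum` utility and demonstrates the comparison locally on a copy of a made-up file. The helper name `file_checksum` is hypothetical.

```shell
file_checksum() {
    # Print the CRC and byte count from cksum; reading from stdin
    # keeps the file name out of the output
    cksum < "$1"
}

# Local demonstration: a copy should have the same checksum as the original
src="$(mktemp)"
printf 'Run,Assay Type\nSRR0000001,RNA-Seq\n' > "$src"
dst="$(mktemp)"
cp "$src" "$dst"

if [ "$(file_checksum "$src")" = "$(file_checksum "$dst")" ]; then
    echo "checksums match"
else
    echo "checksums differ" >&2
fi
```

In practice you would run the checksum once on your local computer and once on the Atmosphere instance, then compare the two printed values.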
    

Tip

We won’t cover it here, but once you have this table, you can use various bash shell commands to parse the data and even create a metadata file that you can upload to CyVerse. For example:

# get the first column from the SraRunTable.txt delimited by commas
cut -d, -f1 SraRunTable.txt

  3. Create an experiment folder and then an sra_files/metadata folder to keep with the SRA files. This can eventually be used as metadata you associate with files when you upload to CyVerse (See: Data Store Guide - associating data with metadata).

    # -p flag makes the last directory and all prior directories needed to
    # complete the path you have described
    
    mkdir -p /scratch/reproducibility-tutorial/experiment/sra_files/metadata
    mv SraRunTable.txt /scratch/reproducibility-tutorial/experiment/sra_files/metadata
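
As a small illustration of parsing the run table with shell commands, the sketch below extends the `cut` command from the tip to skip the header row and save the run accessions to a file. The three-line table is made up and stands in for the real SraRunTable.txt; the function name `list_runs` and the output file name run_ids.txt are assumptions.

```shell
# Hypothetical sample standing in for the real SraRunTable.txt
workdir="$(mktemp -d)"
cat > "$workdir/SraRunTable.txt" <<'EOF'
Run,Assay Type,Organism
SRR0000001,RNA-Seq,Example organism
SRR0000002,RNA-Seq,Example organism
EOF

list_runs() {
    # Print the first comma-separated field of every line after the header
    tail -n +2 "$1" | cut -d, -f1
}

list_runs "$workdir/SraRunTable.txt" > "$workdir/run_ids.txt"
cat "$workdir/run_ids.txt"
```

Note that `cut` splits on every comma, so this approach only suits columns that come before any quoted field containing commas.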
    

Commit to GitHub and copy your work to the Data Store

We can now save our work to GitHub (although we could have committed after we updated our README - which we may also want to update with a note about the metadata).

  1. Commit and push changes to GitHub

    # check what files git is tracking
    git status
    
    # If you created Jupyter notebook files (.ipynb) or a
    # `.ipynb_checkpoints/` directory, delete these
    
    # add the experiment directory to version control
    git add experiment/
    
    # commit changes with a helpful message
    git commit -am "update readme and record project metadata"
    
    #push changes to github
    git push
    
  2. Make a directory on the CyVerse Data Store and copy your project to it (See Data Store Guide - iCommands if you are unfamiliar with iCommands).

    # Setup iCommands (See Data Store Guide for more info)
    iinit
    
    #make a directory
    imkdir reproducibility_tutorial
    
    # copy files
    iput -rPV /scratch/reproducibility-tutorial reproducibility_tutorial
    

    Note

    Right now we don’t have any substantial data in our repository. Keep in mind that Git and GitHub are not designed for version controlling large files (GitHub rejects files larger than 100 MB), so you will need to avoid tracking them. See ignoring files for tips; files with extensions like .fastq, .fastq.gz, .sra, and .bam are good candidates to exclude. The Data Store, by contrast, was meant for big data like this.
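
One way to avoid tracking the large files mentioned in the note is a .gitignore file in the repository root. The sketch below appends patterns for the extensions listed above, plus the `.ipynb_checkpoints/` directory mentioned in the commit step. The function name `write_gitignore` is hypothetical, and a temporary directory stands in for /scratch/reproducibility-tutorial.

```shell
write_gitignore() {
    # $1 = repository directory; appends ignore patterns for large data files
    cat >> "$1/.gitignore" <<'EOF'
*.fastq
*.fastq.gz
*.sra
*.bam
.ipynb_checkpoints/
EOF
}

repo="$(mktemp -d)"   # stand-in for /scratch/reproducibility-tutorial
write_gitignore "$repo"
cat "$repo/.gitignore"
```

Inside a real repository, `git check-ignore -v FILE` reports which pattern, if any, matches a given file. Remember to commit the .gitignore file itself.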

