Intro and GitHub setup
Description:
In this tutorial, we will assemble the genome of Plasmodium falciparum, the parasite that causes malaria, using data imported from the NCBI Sequence Read Archive. Along the way we will work toward a set of science goals and a set of reproducibility goals:
Scientific Goals
- Importing data from the NCBI Sequence Read Archive
- Checking and filtering sequence reads by quality using fastp
- Assembling a small genome using SPAdes
- Checking the quality of the assembly and visualizing results using QUAST
Reproducibility Goals
- Developing code using version control on GitHub
- Setting up tools using conda on the Atmosphere cloud
- Assembling tools using Biocontainers
- Building a workflow using Snakemake
- Reporting on results using JupyterLab
Note
This tutorial is under active development, so a few caveats and prerequisites apply. It is currently designed for use in a workshop with an experienced instructor and does not yet have the level of detail a novice learner would need to work through all the materials alone.
The goal is to demonstrate how certain tools could be used to enhance the reproducibility of your analyses. The focus is not the science, so this tutorial is not a substitute for a thorough explanation of a genome assembly.
This tutorial requires some substantial background knowledge including:
- Knowledge of the Linux command line
- Some knowledge of Git
- Some knowledge of bioinformatics tools
If you are new to these areas, you can still benefit, but be sure to visit the GitHub Repo Link, find the Issues link, and leave us a comment about anything you did not understand or needed more help with. Where appropriate, we will link to other tutorials that provide some of this background knowledge.
Choose a License
Following the 4OSS recommendations, we should develop our code openly from day one. If you write a single line of code for your project, you are a developer and have a responsibility to scientific ethics, your laboratory, your colleagues, and yourself.
We cannot make specific recommendations about which license you should choose. We suggest checking your laboratory's data management plans and policies and/or consulting your university or institution about its requirements. Choosealicense and Creative Commons have some recommendations on licensing.
Go to Choosealicense and Creative Commons and choose an appropriate license.
For this tutorial, we suggest the GNU GPLv3.
Set up a GitHub Repository
Next, we will create a repository for any scripts, notebooks, etc. that will be a part of our project. To learn more about GitHub and Git, see the Software Carpentry Git Lesson and our FOSS Camp Git.
Go to GitHub and sign into your account.
In the upper-right corner of your GitHub page you should find a “+” link; from this menu select “New repository”.
Name your repository (e.g. reproducibility-tutorial) and give it a description.
Select “Initialize this repository with a README” and also choose a license (e.g. GNU General Public License v3.0).
Click Create Repository to create your repository.
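Once the repository exists, a typical next step is to clone it to the machine where you will work. The sketch below only builds and prints the clone command; it assumes the standard GitHub HTTPS URL layout, and `your-username` and the helper name `clone_command` are placeholders for illustration, not part of this tutorial's materials.

```python
import shlex


def clone_command(owner, repo):
    """Build the git command that clones a GitHub repository over HTTPS."""
    url = f"https://github.com/{owner}/{repo}.git"
    return ["git", "clone", url]


# "your-username" is a placeholder; the repository name follows the example above.
command = clone_command("your-username", "reproducibility-tutorial")
print(shlex.join(command))
```

To actually run the clone on a machine with Git installed, the list could be passed to `subprocess.run(command, check=True)`.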
Tip
Although we want to encourage open source, you do not need to make the repository public at this time. A free GitHub account also lets you create private repositories.
Warning
Never “commit” any sensitive information (e.g. logins, passwords, private keys, protected data) to your GitHub repository. Git and GitHub were designed to make it possible to go back to previous versions of files, so even if you delete a mistake in a later commit, the history may still allow someone else to see it in past commits.
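One way to act on this warning is to scan files for secret-like text before committing them. The following is a minimal sketch, not a substitute for a dedicated secret scanner: the two patterns, the `find_secrets` helper, and the sample text are all assumptions made for illustration.

```python
import re

# Hypothetical patterns for illustration only; real scanners use far larger rule sets.
SECRET_PATTERNS = {
    "private key header": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
    "password assignment": re.compile(
        r"(?i)\b(password|passwd|secret|token)\b\s*[:=]\s*\S+"
    ),
}


def find_secrets(text):
    """Return (line_number, label) pairs for lines that match a pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for label, pattern in SECRET_PATTERNS.items():
            if pattern.search(line):
                hits.append((number, label))
    return hits


# Made-up file content: one harmless setting and two things that should not be committed.
sample = "threads = 4\npassword = hunter2\n-----BEGIN RSA PRIVATE KEY-----\n"
for number, label in find_secrets(sample):
    print(f"line {number}: possible {label}")
```

A check like this could be run by hand on files you are about to add. It will miss many kinds of secrets and may flag harmless lines, so treat a clean result as a hint rather than a guarantee.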