Chirag Gupta

This tutorial will help wet-lab biologists get started with analysis of their first RNA-seq dataset. You can essentially perform all these steps on platforms with user-friendly interface, such as the Galaxy server. But doing it on the terminal is a lot more fun, and of course saves a lot of time. Also, this workflow deals with analyzing RNA-seq data when a reference genome is available. In case if you have no reference genome available, you might want to take a shot at de novo assembly of transcriptomes, which is a different ballgame and hence different combination of software at play. I will write a separate tutorial on that sometime later.

I am assuming that you already have access to all software listed below. If not, install them first using instructions given on their individual pages.

Prerequisites

We will need the following software for the workflow

Fastqc
STAR
Samtools
Htseq
Multiqc
R packages limma and edgeR

Setting up project directory

Open the terminal and type

mkdir rnaproj
cd rnaproj
mkdir raw outputs genome 
mkdir genome/star_index
mkdir outputs/bams outputs/counts
mkdir outputs/bams/tmp

This will create a directory structure we will use throughout this tutorial.

Obtaining tutorial data

Skip this section if you already have data to work with. We will develop our workflow with this Arabidopsis dataset from GEO. It has two conditions (control and drought treatment) with two replicates each. A total of four samples.

Download this data from SRA in the raw directory

cd raw
fastq-dump

Quality check

fastqc *.fastq.gz
multiqc .

Click on the *.html file. How is the data per sample? quality overall? Are the ends good? needs trimming?

Alignment

We will use the STAR aligner to map our reads to the reference genome. But first, we will need to create an index for the reference genome.

cd ../
star index

This will create index files in the star_index directory. Next, we will start aligning the reads in our fastq files of our first sample to the reference genome.

Run STAR

 
star --genomeDir genome/star_index  --runThreadN 32 --readFilesIn raw/pe1 raw/pe2 --outFileNamePrefix outputs/bams/s1.bam --outSAMtype BAM Unsorted
star --genomeDir genome/star_index  --runThreadN 32 --readFilesIn raw/pe1 raw/pe2 --outFileNamePrefix outputs/bams/s1.bam --outSAMtype BAM Unsorted
star --genomeDir genome/star_index  --runThreadN 32 --readFilesIn raw/pe1 raw/pe2 --outFileNamePrefix outputs/bams/s1.bam --outSAMtype BAM Unsorted
star --genomeDir genome/star_index  --runThreadN 32 --readFilesIn raw/pe1 raw/pe2 --outFileNamePrefix outputs/bams/s1.bam --outSAMtype BAM Unsorted

Sort with samtools

samtools sort -@ 32 -T outputs/bams/tmp -O BAM -o outputs/bams/sorted.bam outputs/bams/s1.bam 
samtools sort -@ 32 -T outputs/bams/tmp -O BAM -o outputs/bams/sorted.bam outputs/bams/s1.bam 
samtools sort -@ 32 -T outputs/bams/tmp -O BAM -o outputs/bams/sorted.bam outputs/bams/s1.bam 
samtools sort -@ 32 -T outputs/bams/tmp -O BAM -o outputs/bams/sorted.bam outputs/bams/s1.bam

These four bam file contains information about where each sequenced read aligned in the reference genome. The bam files also contain other information such as the read quality and alignment rate. We can actually check what percentage of reads in each of our fastq files aligned correctly.

cd outputs/bams/
multiqc *.bams

This will produce an html file looking something like this. What percentage of reads aligned in each sample?

Counting reads

Our intention is to estimate expression levels of known gene models, therefore we will use the GTF file of Arabidopsis to count reads per gene.

htseq-count -r pos -f bam sorted.bam genome/gtf > outputs/counts/s1.counts
htseq-count -r pos -f bam sorted.bam genome/gtf > outputs/counts/s1.counts
htseq-count -r pos -f bam sorted.bam genome/gtf > outputs/counts/s1.counts
htseq-count -r pos -f bam sorted.bam genome/gtf > outputs/counts/s1.counts

You should have a total of 4 counts files for the four samples. Let’s check that

ls outputs/counts/*.counts | wc -l

Merge the four count files using this R script.

Expression

Now that we have our count files for the four samples, we will need to first normalize these counts before we go about estimating expression levels. Why is normalization necessary?

I have written an R script to normalize gene count data using voom and DE analysis using limma. This script will also output Transcript Per Million (TPM) units as a measure of gene expression in case if you want to compare genes within the same sample.

mkdir codes
wget Rscript
mv rscript codes
Rscript codes/rscript counts/countmatrix.txt genome/GTF