RNA-seq Analysis

Introduction

This short article will summary how to perform RNA-seq analysis. The raw data in fastq file from Illumina only contain the sequences of short peices of complete RNAs, and each read in the raw data has the following format:

@SRR539279.1412 HWI-ST169:209:B072AABXX:5:1101:4907:3562 length=105
ATAGATTAAAATTCAACACCTACAGAAGATCCTTCGATGTTCCACCTCCCCCAATCGACGCTTCTTCTCCATTCTCTCAAAAGTGTGATGAAAGATACAAGTACG
+SRR539279.1412 HWI-ST169:209:B072AABXX:5:1101:4907:3562 length=105
@CCFFFFFHHHHHJJIJJJJJJJJJIJJJIJJJJJJJJJJJJJIIJJJJJHHIGJJJJJJJHHHFFFFFFEEFEEEEECCACDDDEDDEDEDDDACDDDCD>@CB

The SRR539279.1412 in first and third line is the identifier of the read, and the second line is actual sequence of the read, in this case, it has 105bp length. And the fourth line is the quality score for each base, each character encodes a specific number (see Quality Score Encoding). Quality score 10 means the error probability is 0.1, quality score 20 means the error probability is 0.01, in general,

Error probability = 10-Q/10.

Mapping to reference

The basic idea for RNA-seq analysis is to map each read to the reference genes and then count how many reads are mapped to each gene, which can indicates the expression level of such gene. Tools for RNA-seq reads alignment includes HISAT2, Bowtie, STAR and etc. Detailed discussion can be seen in this paper.

And the shell script in linux command line for RNA-seq analysis is quite simple:

hisat2-build $1 $1
hisat2 -x $1 -U $2 -S $2.sam
samtools view -S -b  $2.sam -o $2.bam
samtools sort -O SAM $2.bam -o $2.sortedsam

If we want to map the paired-end reads to genome, we can substitute the second script with

hisat2 -x $1 -1 $2_r1 -2 $2_r2 -S $2.sam

in which $1 is the reference genome/transcriptome file, $2 is the RNA-seq fastq file.

Counting hits

The next step is to analyze the .sortedsam file. The content in .sortedsam file is

@HD     VN:1.0  SO:coordinate
@SQ     SN:YNCA0001W    LN:564
@SQ     SN:YNCA0002W    LN:72
@SQ     SN:YNCA0003W    LN:102
@SQ     SN:YNCA0004W    LN:73
@SQ     SN:YNCA0005W    LN:82
@SQ     SN:YNCA0006C    LN:82
@SQ     SN:YNCB0001W    LN:84
@SQ     SN:YNCB0002W    LN:73
@SQ     SN:YNCB0003W    LN:88
@SQ     SN:YNCB0004W    LN:74
...
@PG     ID:hisat2       PN:hisat2       VN:2.2.1        CL:"/usr/bin/hisat2-align-s --wrapper basic-0 -x S288C_RNA.fasta -S SRR539279.fastq.sam -U SRR539279.fastq"
@PG     ID:samtools     PN:samtools     PP:hisat2       VN:1.13 CL:samtools view -S -b -o SRR539279.fastq.bam SRR539279.fastq.sam
@PG     ID:samtools.1   PN:samtools     PP:samtools     VN:1.13 CL:samtools sort -O SAM -o SRR539279.fastq.sortedsam SRR539279.fastq.bam
SRR539279.2409018       0       YNCA0001W       1       60      8S97M   *       0       0       ATAGATTCGGGCCCTTTCTTCCGTTTGAACGTAAAGGCATTTTTGAGACCATTACCAAACCTAGCAAATAAACCGGGAGGCTTGACTGCTCGTAGGGATTGTGGT       @@CFFFFFHHHGFHIJIEHEIIJJJJ<GHIIDGGIJIIIJIJJIGHIGIJIJGIIIJGIHGEHHFEDFFFDEEEDD@=BBB?C@?C@(9A>@0<?<@3<A#####       AS:i:-8 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:97 YT:Z:UU NH:i:1
SRR539279.2781666       0       YNCA0001W       1       60      7S98M   *       0       0       ATAGATTGGGCCCTTTCTTCCGTTTGAACGTAAAGGCATTTTTGAGACCATTACCAAACCTAGCAAATAAACCGGGAGGCTTGACTGCTCGTAGGGATTGTGGTT       @BCFFFFFHHGHHGIGIJJ@FHEHJIJJJIJGIJIIFEIJIJJJJJJIIIJJJJJIJJJIGGIDHHEHEEFFFFDDBB?BBC@DDC4<@CB58?B@#########       AS:i:-7 XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:98 YT:Z:UU NH:i:1
SRR539279.4183122       0       YNCA0001W       1       60      8S97M   *       0       0       ATAGATTCGGGCCCTTTCTTCCGTTTGAACGTAAAGGCATTTTTGAGACCATTACCAAACCTAGCAAATAAACCGGGAGGCTTGACTGATCGTAGGGATTGTGGT       @CCFFFFFFHFFFIEHJBGIIIIGIJJEICHFCG=<9;F<B4BBDFHCE;CFFIEGIJJHHGECDEC@7);@(6/92;@B?<C######################       AS:i:-10        XN:i:0  XM:i:1  XO:i:0  XG:i:0  NM:i:1  MD:Z:80C16      YT:Z:UU NH:i:1
...

The above file is in SAM format, introduction for SAM can be found here. Then we need to output the mapped genes (with repeats) into a new file.

cat *.sortedsam | grep -v '^@'| awk '{print $3}' | grep -v '\*' > list

I will use python script to count how many times each gene appears in the list. It is worth noting that this script works on the list that has been sorted by gene name, which has already been done in .sortedsam file.

#!/usr/bin/python3

import sys

file_name = str(sys.argv[1])

i = 1
holdseq = "bob"
with open(file_name, 'r') as input:
    while True:
        res = input.readline().strip()
        if not res: break
        if res == holdseq:
            i = i+1
        if res != holdseq:
            if holdseq != "bob": print(holdseq, i)
            holdseq = res
            i = 1

Then we get the counts table,

YNCA0001W 345
YNCA0003W 8
YNCB0008W 137
YNCB0009C 6
YNCB0010W 714
...

Normalization

Because different samples may have different amount of RNA and different genes may have different transcript length, we need to normalize the reads number by total RNA amount and the transcripts length. TPM and FPKM are two normalization methods. TPM stands for transcript per million and FPKM stands for fragments per kilobase of exon per million mapped fragments, their definition is \[TPM_i = \frac{q_i/l_i}{\sum_j q_j/l_j}\times 10^6, \quad FPKM_i=\frac{q_i/l_i}{\sum_j q_j}\times 10^9,\] where \(q_i\) is the number of reads for gene \(i\), and \(l_i\) is the length of the transcript of gene \(i\), their transformation formula is \[TPM_i =\frac{FPKM_i}{\sum_j FPKM_j}\times 10^6.\]

Here is the script I use to calculate TPM score for each gene.

#!/usr/bin/python3

import sys
import subprocess

file_name = str(sys.argv[1])


#define a function to get protein length of input gene
def GetLength(Gene_ID):
    cmdd = ("seqret S288C.mRNA:" +  Gene_ID + " stdout | grep -v '^>' | wc -m" )
    pss = subprocess.Popen(cmdd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    output = str(pss.communicate()[0], 'ascii').split('\n')
    length = int(output[1])
    return length



table = [] #[[gene,normalized count],...]

Sum_counts = 0

#read the file line by line
with open(file_name, 'r') as input:
    while True:
        res = input.readline().strip()
        if not res : break
        res = res.split(' ')
        gene = res[0]          #gene name
        count = float(res[1])   #count of reads
        gene_len =  GetLength(gene)   #transcript length
        norm_count = count / gene_len #normalized count by length
        table.append([gene,norm_count])
        Sum_counts += norm_count #calculate the sum of normalized count

#output

for i in range(len(table)):
    TPM = table[i][1] /Sum_counts * 1000000
    print(table[i][0], " " , TPM)

In this script, I use shell code

seqret S288C.mRNA:$1 stdout | grep -v '^>' | wc -m
to get the transcript length of input gene, in which S288C.mRNA is the reference file, grep -v '^>' is used to get rid of the identifier line, and wc -m is used to count the character number (i.e. nucleotide number) of transcript.


Huarui Zhou @ 2022