This short article will summary how to perform RNA-seq analysis. The raw data in fastq file from Illumina only contain the sequences of short peices of complete RNAs, and each read in the raw data has the following format:
@SRR539279.1412 HWI-ST169:209:B072AABXX:5:1101:4907:3562 length=105 ATAGATTAAAATTCAACACCTACAGAAGATCCTTCGATGTTCCACCTCCCCCAATCGACGCTTCTTCTCCATTCTCTCAAAAGTGTGATGAAAGATACAAGTACG +SRR539279.1412 HWI-ST169:209:B072AABXX:5:1101:4907:3562 length=105 @CCFFFFFHHHHHJJIJJJJJJJJJIJJJIJJJJJJJJJJJJJIIJJJJJHHIGJJJJJJJHHHFFFFFFEEFEEEEECCACDDDEDDEDEDDDACDDDCD>@CB
The SRR539279.1412 in first and third line is the identifier of the read, and the second line is actual sequence of the read, in this case, it has 105bp length. And the fourth line is the quality score for each base, each character encodes a specific number (see Quality Score Encoding). Quality score 10 means the error probability is 0.1, quality score 20 means the error probability is 0.01, in general,
The basic idea for RNA-seq analysis is to map each read to the reference genes and then count how many reads are mapped to each gene, which can indicates the expression level of such gene. Tools for RNA-seq reads alignment includes HISAT2, Bowtie, STAR and etc. Detailed discussion can be seen in this paper.
And the shell script in linux command line for RNA-seq analysis is quite simple:
hisat2-build $1 $1 hisat2 -x $1 -U $2 -S $2.sam samtools view -S -b $2.sam -o $2.bam samtools sort -O SAM $2.bam -o $2.sortedsam
If we want to map the paired-end reads to genome, we can substitute the second script with
hisat2 -x $1 -1 $2_r1 -2 $2_r2 -S $2.sam
in which $1 is the reference genome/transcriptome file, $2 is the RNA-seq fastq file.
The next step is to analyze the .sortedsam
file. The content in .sortedsam
file is
@HD VN:1.0 SO:coordinate @SQ SN:YNCA0001W LN:564 @SQ SN:YNCA0002W LN:72 @SQ SN:YNCA0003W LN:102 @SQ SN:YNCA0004W LN:73 @SQ SN:YNCA0005W LN:82 @SQ SN:YNCA0006C LN:82 @SQ SN:YNCB0001W LN:84 @SQ SN:YNCB0002W LN:73 @SQ SN:YNCB0003W LN:88 @SQ SN:YNCB0004W LN:74 ... @PG ID:hisat2 PN:hisat2 VN:2.2.1 CL:"/usr/bin/hisat2-align-s --wrapper basic-0 -x S288C_RNA.fasta -S SRR539279.fastq.sam -U SRR539279.fastq" @PG ID:samtools PN:samtools PP:hisat2 VN:1.13 CL:samtools view -S -b -o SRR539279.fastq.bam SRR539279.fastq.sam @PG ID:samtools.1 PN:samtools PP:samtools VN:1.13 CL:samtools sort -O SAM -o SRR539279.fastq.sortedsam SRR539279.fastq.bam SRR539279.2409018 0 YNCA0001W 1 60 8S97M * 0 0 ATAGATTCGGGCCCTTTCTTCCGTTTGAACGTAAAGGCATTTTTGAGACCATTACCAAACCTAGCAAATAAACCGGGAGGCTTGACTGCTCGTAGGGATTGTGGT @@CFFFFFHHHGFHIJIEHEIIJJJJ<GHIIDGGIJIIIJIJJIGHIGIJIJGIIIJGIHGEHHFEDFFFDEEEDD@=BBB?C@?C@(9A>@0<?<@3<A##### AS:i:-8 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:97 YT:Z:UU NH:i:1 SRR539279.2781666 0 YNCA0001W 1 60 7S98M * 0 0 ATAGATTGGGCCCTTTCTTCCGTTTGAACGTAAAGGCATTTTTGAGACCATTACCAAACCTAGCAAATAAACCGGGAGGCTTGACTGCTCGTAGGGATTGTGGTT @BCFFFFFHHGHHGIGIJJ@FHEHJIJJJIJGIJIIFEIJIJJJJJJIIIJJJJJIJJJIGGIDHHEHEEFFFFDDBB?BBC@DDC4<@CB58?B@######### AS:i:-7 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:98 YT:Z:UU NH:i:1 SRR539279.4183122 0 YNCA0001W 1 60 8S97M * 0 0 ATAGATTCGGGCCCTTTCTTCCGTTTGAACGTAAAGGCATTTTTGAGACCATTACCAAACCTAGCAAATAAACCGGGAGGCTTGACTGATCGTAGGGATTGTGGT @CCFFFFFFHFFFIEHJBGIIIIGIJJEICHFCG=<9;F<B4BBDFHCE;CFFIEGIJJHHGECDEC@7);@(6/92;@B?<C###################### AS:i:-10 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:80C16 YT:Z:UU NH:i:1 ...
The above file is in SAM
format, introduction for SAM
can be found here.
Then we need to output the mapped genes (with repeats) into a new file.
cat *.sortedsam | grep -v '^@'| awk '{print $3}' | grep -v '\*' > list
I will use python script to count how many times each gene appears in the list. It is worth noting that this script works on the list that has been sorted by gene name, which has already been done in .sortedsam
file.
#!/usr/bin/python3 import sys file_name = str(sys.argv[1]) i = 1 holdseq = "bob" with open(file_name, 'r') as input: while True: res = input.readline().strip() if not res: break if res == holdseq: i = i+1 if res != holdseq: if holdseq != "bob": print(holdseq, i) holdseq = res i = 1
Then we get the counts table,
YNCA0001W 345 YNCA0003W 8 YNCB0008W 137 YNCB0009C 6 YNCB0010W 714 ...
Because different samples may have different amount of RNA and different genes may have different transcript length, we need to normalize the reads number by total RNA amount and the transcripts length. TPM and FPKM are two normalization methods. TPM stands for transcript per million and FPKM stands for fragments per kilobase of exon per million mapped fragments, their definition is \[TPM_i = \frac{q_i/l_i}{\sum_j q_j/l_j}\times 10^6, \quad FPKM_i=\frac{q_i/l_i}{\sum_j q_j}\times 10^9,\] where \(q_i\) is the number of reads for gene \(i\), and \(l_i\) is the length of the transcript of gene \(i\), their transformation formula is \[TPM_i =\frac{FPKM_i}{\sum_j FPKM_j}\times 10^6.\]
Here is the script I use to calculate TPM score for each gene.
#!/usr/bin/python3 import sys import subprocess file_name = str(sys.argv[1]) #define a function to get protein length of input gene def GetLength(Gene_ID): cmdd = ("seqret S288C.mRNA:" + Gene_ID + " stdout | grep -v '^>' | wc -m" ) pss = subprocess.Popen(cmdd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT) output = str(pss.communicate()[0], 'ascii').split('\n') length = int(output[1]) return length table = [] #[[gene,normalized count],...] Sum_counts = 0 #read the file line by line with open(file_name, 'r') as input: while True: res = input.readline().strip() if not res : break res = res.split(' ') gene = res[0] #gene name count = float(res[1]) #count of reads gene_len = GetLength(gene) #transcript length norm_count = count / gene_len #normalized count by length table.append([gene,norm_count]) Sum_counts += norm_count #calculate the sum of normalized count #output for i in range(len(table)): TPM = table[i][1] /Sum_counts * 1000000 print(table[i][0], " " , TPM)
In this script, I use shell code
seqret S288C.mRNA:$1 stdout | grep -v '^>' | wc -mto get the transcript length of input gene, in which
S288C.mRNA
is the reference file,
grep -v '^>'
is used to get rid of the identifier line, and wc -m
is used to count the character number (i.e. nucleotide number) of transcript.