Today I will introduce you to two methods for analysing RNA and protein at massive scale: RNA sequencing (RNA-seq) and tandem mass tag mass spectrometry (TMT-MS). Both methods produce an enormous amount of data that has to be analysed properly in order to extract the biological meaning of the phenomena you're investigating.
Transcriptome and central dogma of molecular biology
Before we dive into the description of RNA sequencing methodology, first we need to explain certain basics of molecular biology.
According to the central dogma of molecular biology, genetic information contained within the DNA molecule is transcribed into pre-mRNA, which is further processed and spliced into mature mRNA. Mature mRNA is then exported from the nucleus to the cytoplasm, where it is translated into a protein product, which is modified, folded into its final conformation and transported to the right cellular compartment.
Now that we know what mRNA is, we can define the transcriptome, which represents the complete set of mRNA transcripts contained in one cell.
RNA sequencing
RNA sequencing allows RNA analysis through direct determination of cDNA sequence, and it's based on high-throughput next-generation DNA sequencing (NGS) technologies.
Briefly, mRNA is isolated from samples (in our case, a transgenic cell line I personally made) and reverse-transcribed into cDNA (complementary DNA). This collection of cDNA is called a cDNA library and contains cDNA fragments with adapters attached to both ends. After that, each molecule is sequenced, producing millions of short reads.
After sequencing is finished, the reads are filtered and adapters are trimmed, followed by de novo assembly of transcripts (if a reference genome is not available) or mapping (alignment) of sequencing reads to a reference genome or transcriptome.
Besides the steps above, many other "adjustments" of the raw data are necessarily performed before the expression values are finally obtained:
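To make the trimming and filtering step a bit more concrete, here is a minimal Python sketch. It is not the pipeline used for this data: the adapter sequence, the `min_overlap` and `min_length` values and the function names are all assumptions chosen for illustration, and real tools (for example cutadapt or Trimmomatic) also handle mismatches and base quality.

```python
def trim_adapter(read: str, adapter: str, min_overlap: int = 3) -> str:
    """Remove a 3' adapter (complete, or partial at the very end) from a read."""
    # Case 1: the complete adapter occurs inside the read.
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Case 2: the read ends with the first k bases of the adapter.
    longest = min(len(adapter), len(read)) - 1
    for k in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:k]):
            return read[:-k]
    return read


def filter_reads(reads, adapter, min_length=20):
    """Trim every read, then drop reads that became too short to map reliably."""
    trimmed = (trim_adapter(r, adapter) for r in reads)
    return [r for r in trimmed if len(r) >= min_length]


# Hypothetical adapter and reads, for illustration only.
ADAPTER = "AGATCGGAAGAGC"
INSERT = "ACGTACGTACGTACGTACGTAAAA"
reads = [
    INSERT + ADAPTER,        # full adapter at the 3' end
    INSERT + ADAPTER[:5],    # only the start of the adapter was sequenced
    "ACGT" + ADAPTER,        # insert too short after trimming
    "ACGTACGTACGTACGTACGTACGT",  # no adapter at all
]
clean_reads = filter_reads(reads, ADAPTER)
print(clean_reads)
```

Only the reads that survive this step go on to assembly or alignment.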
Screenshot of my RNA-seq data
In the data I'm analyzing, expression values are represented as TPM (Transcripts Per Million). This means the data are normalized for gene length and sequencing depth, which allows us to compare expression values between different samples.
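As a rough illustration of what TPM normalization does, here is a small Python sketch. The gene names, read counts and gene lengths are made up, and the function name `tpm` is just a label I chose. Counts are first divided by gene length in kilobases, and the results are then scaled so that they sum to one million within the sample.

```python
def tpm(counts, lengths_bp):
    """Convert raw read counts to TPM for one sample.

    counts     -- dict: gene name -> number of reads mapped to the gene
    lengths_bp -- dict: gene name -> gene (transcript) length in base pairs
    """
    # Step 1: correct for gene length (reads per kilobase).
    rpk = {gene: counts[gene] / (lengths_bp[gene] / 1000) for gene in counts}
    # Step 2: correct for sequencing depth (scale the sample to one million).
    per_million = sum(rpk.values()) / 1_000_000
    return {gene: value / per_million for gene, value in rpk.items()}


# Hypothetical sample, for illustration only.
counts = {"GENE_A": 100, "GENE_B": 200, "GENE_C": 300}
lengths_bp = {"GENE_A": 1000, "GENE_B": 2000, "GENE_C": 1000}
sample_tpm = tpm(counts, lengths_bp)
print(sample_tpm)
```

GENE_B has twice as many reads as GENE_A but is also twice as long, so both end up with the same TPM. Because every sample sums to one million, values are on the same scale across samples.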
In my RNA-seq data I have 9616 different genes for which I have to analyze expression values and compare them to control values!
To filter the data, I perform a standard Student's t-test to obtain a p value for each gene, which helps me discard genes whose change is not statistically significant.
After filtering for p < 0.05, I was left with 5102 statistically significant, differentially expressed genes - still sounds like a lot, don't you think? :)
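For readers who want to see the filtering idea in code, here is a minimal sketch of a per-gene Student's t-test (equal variances, two-sided) followed by the p < 0.05 cut-off. It uses only the Python standard library and gets the p value by numerically integrating the t distribution; in practice you would more likely call a statistics package such as SciPy (`scipy.stats.ttest_ind`). The gene names, replicate values and function names are invented for illustration and are not my real data.

```python
import math
from statistics import mean, variance


def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c) * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-sided p value: 1 - 2 * (area under the density from 0 to |t|)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps  # Simpson's rule; steps must be even
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def students_t_test(a, b):
    """Return (t statistic, two-sided p value) using the pooled variance."""
    n1, n2 = len(a), len(b)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    if se == 0:
        # No variation at all: the test is undefined, so this sketch
        # simply treats the gene as not significant.
        return 0.0, 1.0
    t = (mean(a) - mean(b)) / se
    return t, two_sided_p(t, df)


# Hypothetical TPM values: gene -> (control replicates, treated replicates).
expression = {
    "GENE_A": ([10.1, 9.8, 10.3], [15.2, 14.9, 15.5]),
    "GENE_B": ([5.0, 5.2, 4.9], [5.1, 4.8, 5.3]),
}
p_values = {g: students_t_test(ctrl, trt)[1] for g, (ctrl, trt) in expression.items()}
significant = [g for g, p in p_values.items() if p < 0.05]
print(significant)
```

With thousands of genes, the same loop simply runs once per gene, and only the genes in `significant` are kept for further analysis.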
Proteome analysis by TMT-MS
So we have determined the expression values for mRNA extracted from our model system, but is that enough to draw conclusions on what's going on in our cells over-expressing our gene of interest?
Well, not exactly, because if you remember the central dogma, mRNA is translated into protein in cells - so it would be useful to know the amount of each protein product as well.
One of the ways to quantify proteins within the sample is to use tandem mass tag mass spectrometry (TMT-MS).
Tandem mass tags (TMT or TMTs) are chemical labels used for the quantification and identification of biological macromolecules (e.g. proteins). TMTs are isobaric mass tags, meaning that all tags in a set have the same total mass. Each tag pairs a lighter or heavier reporter region with a correspondingly heavier or lighter balancing region, so that the entire tag, when attached to a peptide, adds the same mass shift regardless of the sample it labels. During fragmentation in the mass spectrometer the reporter regions are released, and their intensities reveal the amount of each peptide in each sample.
What we get as output after raw data analysis is the normalized relative abundance (%) of each protein:
Screenshot of my TMT data
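One simple way to think about a relative abundance in percent is as each sample's share of a protein's total reporter signal. The sketch below assumes exactly that; the software that produced my table may normalize differently, and the intensities and the function name are made up for illustration. The channel labels follow the usual TMT reporter naming.

```python
def relative_abundance(intensities):
    """Express each channel's reporter intensity as a percentage of the total.

    intensities -- dict: TMT channel label -> summed reporter ion intensity
    """
    total = sum(intensities.values())
    if total == 0:
        # Protein not quantified in any channel.
        return {channel: 0.0 for channel in intensities}
    return {channel: 100 * value / total for channel, value in intensities.items()}


# Hypothetical reporter intensities for one protein in four samples.
protein = {"126": 100.0, "127N": 300.0, "127C": 400.0, "128N": 200.0}
percent = relative_abundance(protein)
print(percent)
```

Expressed this way, the values for one protein always add up to 100%, which makes it easy to see in which samples the protein is enriched.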
Initially, I got 7260 detected proteins, and after I performed a Student's t-test and obtained p values for each protein (same as with the RNA-seq data), I ended up with 2844 statistically significant (p < 0.05), differentially expressed proteins.
In the next Lab Diaries post, I will explain which method I use to analyse such large-scale RNA-seq and TMT-MS data, and how we obtain biological meaning from such an enormous amount of data.
Until then, relax and keep steemSTEM! ;)
I'm curious whether you're using other statistical tests for differential expression? DESeq2 comes to mind. I'm wondering whether using that (or adjusting the alpha of your p-value test for multiple testing) might cut down the initial number of genes and proteins to look at.
I'm familiar with RNA-seq but have no proteomics experience. How big (in terms of memory) do those datasets get?
Thank you for your comment! Basically, DESeq2 would not improve the analysis that much from the biological point of view; an improvement may be visible only in those samples that have a large deviation caused by measuring error...
To obtain the biological meaning of datasets, I use GSEA, which is based on a Kolmogorov–Smirnov-like statistic, and then I analyse biological pathways and the genes within them.
For proteomics data, the size is not even close to that of RNA-seq - you get your results in one Excel file :)
That makes sense. I arrived at DESeq2 through examining microbial population changes and it's been super helpful in that context. It's always fun to see how a tool of choice can vary in efficacy between problems.
As I'm doing analyses in biology and chemistry, I understand your standpoint.
In almost all the cases, we are able to offer them "better" math, but... Those things are alive.
We often start with bad data or we get qualitatively the same thing at the end.
I also noticed that sometimes we have a "one-time hit" after applying exotic math that works well, but only once, on one dataset.
Due to this, if the procedure works for the data (signals) - don't touch it, you will break something :)
If they get stuck with their data, we come to the rescue.
Processing massive data sets like the ones you've generated is one of the biggest challenges in modern molecular biology. Looking forward to reading how you process this data.
It looks amazing! I wish I had more time to read more about this. I will probably be the first to use my brain as a storage device when it becomes technically possible :D
It's very interesting and I envy you for being able to do this, although I don't regret doing my job at all.
Basically I would like to do everything! Could you clone me, pls?
I'm still practicing on cells (creating new species, etc.), but as soon as my skills are good enough you can be the first person to have the honor of being cloned by me :D
Hey, for the people who didn't listen to their 8th-grade teachers
BASICALLY!
There are three basic types of RNA used for protein synthesis in the body
Which are mRNA, tRNA and rRNA
So it basically goes this way
DNA > mRNA > Protein (with tRNA and rRNA doing the translating)
Lol quite funny... Nice breakdown
heheheh
This is quite a post... I really didn't understand much on this topic. Your post was helpful, great work