The first computational research project I worked on in the Gerstein Lab (debugging and quality control analysis for an exRNA-Seq analysis pipeline) was finally published, nearly four years after I started working on it.
Rozowsky J, Kitchen RR, Park JJ, Galeev T, Diao JA, Warrell J, Thistlethwaite W, Subramanian SL, Milosavljevic A, Gerstein M. “exceRpt: a comprehensive analytic platform for extracellular RNA profiling.” Cell Systems. 4 Apr 2019. https://doi.org/10.1016/j.cels.2019.03.004.
Our bodily fluids (e.g., blood and urine) contain many different molecules that can be used to diagnose and track disease. Most non-invasive diagnostic tests–from pregnancy tests to cancer biomarkers–look for abnormal levels of specific proteins. More recently, scientists have discovered that another type of molecule, extracellular RNA (exRNA), can also be measured from fluid samples for diagnostic purposes. To help standardize the study of exRNA, the NIH put together a consortium–the Extracellular RNA Communication Consortium (ERCC).
The analysis of exRNA presents a number of technical problems. For example, there’s not that much of it floating around in most bodily fluids, and sequencing samples are often vulnerable to contamination and artifacts. To address these challenges, our lab developed a pipeline called exceRpt: extra-cellular RNA processing toolkit. exceRpt consists of a series of filtering steps (to remove contamination and artifacts) and alignment steps (to match observed molecular sequences with known sequences). It accepts raw sequence reads as inputs and returns abundance estimates and quality control reports (I contributed to this part!) as outputs.
The hope is that everyone can process their sequencing data using the same pipeline, so that their outputs are all comparable and fully reproducible. So far, exceRpt has been used to process thousands of exRNA-seq datasets in the public exRNA Atlas (our paper on this came out today too!) and is freely available at http://genboree.org/site/exrna_toolset and github.gersteinlab.org/exceRpt.
I started by running a few hundred human and mouse sequence files from the public Sequence Read Archive (SRA) through a preliminary version of the exceRpt pipeline. In the process, I tracked down and debugged errors to make the pipeline more robust to malformed inputs. When I became more comfortable with this workflow, I also worked on testing and visualizing quality control metrics, including the two that ended up in Figure 2D of the exceRpt paper: the number of transcriptome-mapped reads and the ratio of transcriptome-mapped to genome-mapped reads.
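For the curious, those two quality control metrics can be sketched in a few lines of Python. This is only a minimal illustration, not exceRpt's actual code: the `SampleCounts` fields, the `qc_metrics` function, and the default thresholds are all names and values I made up for the example, so check the paper for the real cutoffs.

```python
from dataclasses import dataclass


@dataclass
class SampleCounts:
    """Read counts for one sample (field names are hypothetical)."""
    sample_id: str
    genome_mapped: int          # reads aligned anywhere in the host genome
    transcriptome_mapped: int   # reads aligned to annotated transcripts


def qc_metrics(sample, min_transcriptome_reads=100_000, min_ratio=0.5):
    """Compute the two metrics and a pass/fail flag.

    The default thresholds are illustrative placeholders, not values
    taken from the exceRpt pipeline.
    """
    # Guard against division by zero for samples with no genome-mapped reads.
    if sample.genome_mapped > 0:
        ratio = sample.transcriptome_mapped / sample.genome_mapped
    else:
        ratio = 0.0
    passes = (
        sample.transcriptome_mapped >= min_transcriptome_reads
        and ratio >= min_ratio
    )
    return {
        "sample_id": sample.sample_id,
        "transcriptome_reads": sample.transcriptome_mapped,
        "ratio": ratio,
        "passes": passes,
    }


if __name__ == "__main__":
    # Made-up counts, purely to show the call.
    example = SampleCounts("sample_A", genome_mapped=400_000,
                           transcriptome_mapped=300_000)
    print(qc_metrics(example))
```

A sample with plenty of reads but a low ratio would fail here just like a sample with a good ratio but very few reads, which is roughly why it helps to look at both metrics together.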
This was my first computational experience; it’s been interesting to look back and think about my progress. When I joined the lab, I didn’t have much to offer. I had no experience with sequencing analysis, no experience working with computing clusters, and no experience with R, Python, or UNIX. It took quite a while before I was able to do anything useful. But with the help of my very patient mentor and a whole lot of StackOverflow, I eventually started to get the hang of things. It felt good to see the paper in print, to read the parts that would have been gibberish four years before, and to think about how far I’d come.