As the scale of biological data grows, processing these data in a timely manner to mine
biological knowledge and support clinical use is increasingly challenging. In particular, RNA
abundance quantification has become routine and affordable thanks to
high-throughput “short-read” technologies that provide accurate molecule counts at the
gene level. However, accurate and affordable quantification of definitive, full-length
transcript isoforms has remained a stubborn challenge, despite its obvious biological
significance across a wide range of problems. “Long-read” sequencing platforms now
produce data-types that can, in principle, drive routine definitive isoform quantification.
Nevertheless, some particulars of contemporary long-read data-types, together with
isoform complexity and genetic variation, present bioinformatic challenges. We show
here that fast and accurate quantification of long-read data is possible. To perform
quantifications, we developed lr-kallisto, which adapts the kallisto bulk and single-cell
RNA-seq quantification methods for long-read technologies. We demonstrate
lr-kallisto’s high accuracy in comparison to popular and new long-read tools in
quantifying isoforms of genes with benchmarking datasets, high-depth datasets of
paired long and short reads, and simulations. Furthermore, we show the increased
scalability that is achieved by lr-kallisto. Finally, we share preliminary results of
applying lr-kallisto and a modified “autoencoder” to understand the relationship
between 3’ UTR transcript abundance information and full-length transcript abundance
information.