Blazing the trail to empower agrigenomics research and conservation

Together, Texas A&M AgriLife and Illumina are relentlessly optimizing for higher quality with less input

19 December 2023

Sequencing data for a single human genome, at 30× coverage, takes up to 70 gigabytes of storage. Illumina instruments produced 280 million gigabytes of data in 2021 alone, and by 2025, we’ll need storage capacity for 40 billion gigabytes—and that’s just for human genomes.

The Genomics & Bioinformatics Service of Texas A&M AgriLife (known as “TxGen”) sequences thousands of samples from diverse species for Texas A&M University and agricultural clients around the world: crops like wheat, corn, sorghum, and cotton; an extensive array of insect pests; plant and animal pathogens; and wildlife from minuscule Amazonian frogs to endangered mammalian species across Africa, incorporating archival museum samples that are vital for wildlife conservation. TxGen’s mantra is simple: “If it has DNA, we can sequence it.”

Their genomic insights prove invaluable for selecting optimal plant and animal candidates for selective breeding and gene editing, combating vector-borne diseases, and addressing climatic challenges like drought tolerance. Because agricultural breeding cycles are seasonal, submissions surge at certain times of the year (TxGen often has more than 40 active projects at once), and most of them require rapid results so clients can make timely decisions about what to plant.

TxGen is well equipped to manage this vast breadth and volume of sequencing needs thanks to their twin NovaSeq 6000 Systems and on-premises DRAGEN server for secondary analysis. While DRAGEN is available on the cloud and on instrument, the on-site DRAGEN server was the right choice for TxGen. When they installed the server in 2019, TxGen’s director, Dr. Charles Johnson, said they anticipated it would “solve many of the analysis issues we have been facing,” and that Illumina’s technology was “unprecedented in comparison with anything else on the market.”

His words weren’t just hype—DRAGEN is well known for its speed. It’s capable of analyzing a whole human genome at 30× coverage in about 30 minutes, and it’s demonstrated its accuracy by winning Best Performance in small-variant calling in both the Difficult-to-Map Regions and All Benchmark Regions categories of the PrecisionFDA Truth Challenge V2. It also won in the NCTR Indel Calling from Oncopanel Sequencing Data Challenge, for Best Precision and Best Overall.

But how well does DRAGEN perform for nonhuman applications, like the ones run by Texas A&M AgriLife? This fall, Dr. Marcel Brun, TxGen assistant director and senior bioinformatics scientist, gave a presentation to the European Molecular Biology Laboratory (EMBL), available to watch at the end of this article, explaining the advances they’ve made, thanks in large part to DRAGEN.

DRAGEN reduces demultiplexing time and SNP call time

TxGen’s groundbreaking work demonstrates DRAGEN’s efficacy beyond human genomics. They found that using a large set of unique dual-index barcodes (UDIs), one pair per sample, significantly reduced costs and processing time. These UDIs ensure that reads from the pooled sample libraries can be quickly and accurately assigned back to their samples of origin after sequencing, a process called “demultiplexing.” For instance, they were able to sequence 1536 rice samples at once, which, combined with robotic automation, cut the cost of whole-genome sequencing (WGS) genotyping tenfold.
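The idea behind index-based demultiplexing can be sketched in a few lines of Python. This is only an illustration, not how DRAGEN implements it: the sample sheet, index sequences, and sample names are made up, and the sketch uses exact matching on both indexes, whereas production tools typically tolerate a configurable number of mismatches.

```python
from collections import defaultdict

# Hypothetical sample sheet: each sample owns one unique (i7, i5) index pair.
SAMPLE_SHEET = {
    ("ACGTACGT", "TTGCAAGC"): "rice_0001",
    ("GGATCCAA", "CATGTCAG"): "rice_0002",
}


def demultiplex(reads, sample_sheet):
    """Group reads by sample using exact matching on both index sequences.

    `reads` is an iterable of (i7, i5, sequence) tuples. Reads whose index
    pair is not listed in the sample sheet go to the "undetermined" bin.
    """
    bins = defaultdict(list)
    for i7, i5, sequence in reads:
        sample = sample_sheet.get((i7, i5), "undetermined")
        bins[sample].append(sequence)
    return dict(bins)


reads = [
    ("ACGTACGT", "TTGCAAGC", "GATTACA"),
    ("GGATCCAA", "CATGTCAG", "CCGGTTA"),
    ("ACGTACGT", "CATGTCAG", "TTTTAAA"),  # both indexes valid, but wrongly paired
]
result = demultiplex(reads, SAMPLE_SHEET)
print(result)
```

Because every sample has its own pair of indexes, the third read is rejected rather than assigned to the wrong sample, even though each of its indexes is individually valid.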

You’d think that using so many barcodes would be hard to manage and increase the demultiplexing time. Luckily, this is one of the many tasks where DRAGEN running on-premises soars above the cloud.

Running on the cloud-based Amazon Web Services, DRAGEN can demultiplex 3.18 billion clusters from 300 samples in a little over 50 minutes, according to TxGen. They report that it performs slightly better on Google Cloud Platform. DRAGEN bclConvert, running on a DRAGEN server, demultiplexes the same amount in less than 20 minutes.

For TxGen, DRAGEN running on an on-premises server outperforms cloud-based platforms, with zero upload time, fast demultiplexing, and significantly reduced processing times for all downstream steps. They used it to map and align sequencing reads and to call SNPs for 221 rice samples at 360 million bases sequenced per sample, completing the analysis in 6.5 hours with 99.3% concordance with the original SNPs. The same analysis took 108 hours on a standard high-performance server, which makes DRAGEN roughly 17 times faster.
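The speedup follows directly from the two run times reported above. The short calculation below reproduces it and also derives an average wall-clock time per sample, which is a derived figure rather than one TxGen reported, and which assumes the 6.5 hours covers all 221 samples.

```python
# Figures quoted in the article.
standard_hours = 108.0   # standard high-performance server
dragen_hours = 6.5       # on-premises DRAGEN server
samples = 221            # rice samples in the run

speedup = standard_hours / dragen_hours

# Derived, not reported: average wall-clock minutes per sample on DRAGEN.
minutes_per_sample = dragen_hours * 60 / samples

print(f"speedup: {speedup:.1f}x")
print(f"average time per sample: {minutes_per_sample:.2f} min")
```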

Using reference-panel-based imputation to genotype

Another way to reduce the cost of genotyping for nonclinical applications—particularly in agrigenomics—is to use low-coverage sequencing at 1× rather than 30×, and to impute genotypes through a reference panel.

Imputation is a method of leveraging existing population haplotypes from the same species as the sample of interest to predict the alleles and genotypes of that sample where it had no coverage at 1×. DRAGEN’s imputation process has been discussed in detail on Illumina’s Genomics Research Hub. Imputation is also often used in non-agrigenomics applications, such as enriching variant calls in large human cohorts.

The team at TxGen ran population-based imputation on DRAGEN as part of the low-coverage pipeline, using a probability cutoff to balance the number of new calls obtained with their expected accuracy. They tested four species—tomato, goat, rice, and boll weevil—and found that imputation improved their call rate every time. For rice, their call rate without imputation was a mere 55%; with imputation, it rose to 98.9%, with an imputation accuracy of 98.5%.
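The trade-off behind a probability cutoff can be illustrated with a small Python sketch. It assumes each site comes with posterior probabilities for the three diploid genotypes; the function names, the example probabilities, and the cutoff values are all hypothetical and are not taken from DRAGEN or from TxGen's pipeline.

```python
def apply_cutoff(posteriors, cutoff):
    """Keep the most probable genotype at each site if it meets the cutoff.

    `posteriors` is a list of dicts mapping genotype -> probability.
    Sites whose best genotype falls below the cutoff are returned as None
    (a missing call).
    """
    calls = []
    for site in posteriors:
        genotype, prob = max(site.items(), key=lambda item: item[1])
        calls.append(genotype if prob >= cutoff else None)
    return calls


def call_rate(calls):
    """Fraction of sites with a genotype call."""
    return sum(call is not None for call in calls) / len(calls)


# Made-up imputation output for four sites.
sites = [
    {"0/0": 0.98, "0/1": 0.015, "1/1": 0.005},
    {"0/0": 0.10, "0/1": 0.85, "1/1": 0.05},
    {"0/0": 0.40, "0/1": 0.35, "1/1": 0.25},
    {"0/0": 0.01, "0/1": 0.04, "1/1": 0.95},
]

strict = apply_cutoff(sites, cutoff=0.9)
lenient = apply_cutoff(sites, cutoff=0.8)
print(call_rate(strict), call_rate(lenient))
```

A stricter cutoff keeps only the most confident imputed genotypes and so lowers the call rate, while a more lenient one recovers more sites at the price of admitting less certain calls.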

Encouraged by these results, TxGen collaborated with Illumina to build two proof-of-concept reference panels for rice and sorghum, which are ready to be used with DRAGEN software. For rice, they used a public FASTQ database of 430 megabases (Mb) in 202 samples at 1.5×, producing 6 million variants and only 4.8% missing calls. For sorghum, they sequenced 800 Mb in 96 samples at 10×, producing 7 million variants and only 2% missing calls.

The road to the $10 genome

The icing on the data cake is the excellent support TxGen has experienced with Illumina. “DRAGEN customer support is amazing,” Brun says. “Any time we have an issue, they help us. We get a quick answer. We are so thrilled to be working with them; it’s truly amazing to be able to talk directly to the DRAGEN team at Illumina.”

Even as Illumina technology comes ever closer to achieving the company’s longtime goal of the $100 genome, Brun and Johnson are quick to caution that their clients already want the $10 genome. This aspiration will only be possible through relentless optimization, finding out how little data you can collect and still produce quality results. Reference-panel-based imputation may prove to be an essential part of that optimization, and the scientists at TxGen believe their results demonstrate that DRAGEN can excel in nonhuman applications.

For further detail, please see Marcel Brun's full presentation to EMBL below:
