Informatics advances reveal the TruPath™ Genome towards comprehensive genomic insights

Introduction

Standard short-read sequencing-by-synthesis technology (SBS) produces billions of sequencing reads up to 500 base pairs (bp) in length. Benefits of standard short-read sequencing include high accuracy and throughput at scale. Illumina’s short-read sequencing benefits from a mature and well-established ecosystem of data analysis tools and pipelines. Despite these advantages, short-read technology has limited ability to interrogate a small proportion of challenging, complex regions of the genome, many of which harbor variants that have potential roles in human genetic disease.[1, 2]

Some of the most medically important regions of the genome have remained out of reach until now. Illumina TruPath Genome (for Research Use Only), launched in 2026, delivers an efficient solution for comprehensive human genome sequencing. The simple workflow, with ~10 minutes of hands-on time, allows long-distance genomic insights to be generated using standard short-read sequencing on the NovaSeq X™ series. Here, we describe the various ways proximity information is leveraged within the DRAGEN™ software suite and demonstrate the resulting improvements in variant detection. Together, TruPath Genome and the DRAGEN Germline application fundamentally extend the reach of SBS, enabling long-distance phasing and variant/haplotype discovery in genomic regions that were previously inaccessible.

Highlights

  • Simplest sample-to-sequencer workflow with ~10 minutes of hands-on time

  • Enabled using existing NovaSeq X systems (v1.4 Digital package required)

  • Highly accurate and phased germline Single Nucleotide Variant (SNV) calling

  • Improved coverage of difficult-to-map regions of the genome

  • Ultra-long phasing

  • Reliable de novo haplotype-resolved variant calling in paralogous regions

  • Enhanced detection of structural variants

  • Improved resolution of Short Tandem Repeats (STRs)

Proximity mapped read technology

The TruPath Genome on-flow cell library prep technology unlocks long-distance genomic insights with unprecedented simplicity, enabling >200 kb template reconstruction using standard short-read sequencing.

TruPath Genome with proximity-mapped read technology provides unprecedented workflow simplicity by eliminating standard library prep prior to sequencing (Figure 1A)[3]. The TruPath Genome workflow is compatible with both high molecular weight (HMW) and standard molecular weight (SMW) extraction methods. The spatial proximity of neighboring nanowells allows for reconstruction of long-distance genomic connections extending to over 200 kb (Figure 1B), which can be used in a variety of ways.

Figure 1. Overview of proximity mapped read technology.
A. Intact double-stranded DNA flows over the flow cell surface where it is tagmented, resulting in the binding of the DNA to the nanowells on the flow cell. Attached DNA fragments undergo standard cluster generation. The on-flow cell tagmentation results in clusters originating from the same DNA template molecule occurring near one another on the flow cell surface. B. Distribution of the 95th percentile of template lengths across samples profiled with TruPath Genome. Template lengths are dependent on quality of input DNA.

Overview of TruPath Genome data analysis workflow in DRAGEN Germline

DRAGEN Germline integrates proximity information at every step—from mapping to SV calling—to deliver a more complete, phased, and structurally aware view of the genome.

The proximity-based long-range information available in TruPath Genome datasets can be leveraged in many different components of the analysis workflow to yield a more accurate and complete view of an individual’s genome. Many such improvements have now been implemented as part of DRAGEN beginning in version 4.5.2 (Figure 2), making them accessible to the user with a single command either locally, on the cloud, or through a fully automated sequencing-to-results workflow. A complete analysis (all callers active) of a TruPath Genome (typically 60-70X) can be performed in under three hours per sample on a local DRAGEN v4 server. Previously ‘dark’ regions of the genome often hide variants linked to rare disease, so illuminating them is not just a technical win but also a win for clinical researchers.

Figure 2. Schematic of the DRAGEN Germline secondary analysis workflow for analyzing TruPath Genome data.
Several improvements to the DRAGEN Germline 4.5.2 secondary analysis workflow were made to make use of proximity information available on the TruPath Genome: (a) proximity information is leveraged on the mapper to improve read placement based on long range connections; (b) proximity information along with a pangenome reference are leveraged to phase reads into haplotypes during mapping and are output to a haplotagged BAM file; (c) counts of proximal reads mapping to all pairwise 2 kb genomic bins are assessed to generate a 2-dimensional colocation map providing information about the genome structure in the input sample; (d) phased reads are leveraged in the small variant caller to generate phased small variant calls with long phasing blocks; (e) phased reads are leveraged in the structural variant caller allowing for haplotype-specific local assemblies and structural variant detection; (f) linked reads from a targeted set of clinically relevant paralogous genes are jointly analyzed to perform copy number aware and fully phased genotyping of all copies of such genes in the input sample; (g) proximity information between in-repeat read pairs and the flank of specific STR sites is leveraged to recover and unambiguously assign such in-repeat reads to specific STRs allowing for significantly more accurate STR size estimation.

Mapping and read phasing

TruPath dramatically improves read placement in regions that were previously difficult to map, unlocking >20 Mb of new genomic territory.

DRAGEN Germline read mapping in the context of TruPath Genome datasets leverages proximity information from neighboring clusters to confidently assign a higher proportion of reads to the correct genomic location in regions of high sequence homology. This approach significantly reduces the fraction of the genome with low mapping quality, making >20 megabases of previously challenging genomic regions accessible to variant detection (Figure 3A). Figure 3B shows an example of a clinically relevant gene, RHCE, where read mapping is significantly improved relative to standard short-read WGS.

Further, to efficiently make use of phasing information in all downstream variant calling components, DRAGEN Germline leverages a novel phasing approach that phases reads to inferred ancestral haplotypes during the mapping stage. This approach selects the most closely related haplotype segment pairs from a haplotype database, while considering recombination rates and long-range proximity information, to provide probabilistic assignments of reads to haplotypes and to define phase blocks within which haplotypes remain consistent with high confidence. Such phased reads are output to a haplotagged BAM file and are leveraged within downstream variant calling steps to generate accurate and long-range phasing information between variants (Figure 3C).

Figure 3. Improved mapping and read phasing in TruPath Genome datasets.
A. Through improved mapping of reads in challenging regions of the genome, TruPath Genome reduces the fraction of the genome defined as “dark-by-MAPQ”[1], in which 90% of the reads covering the region have a mapping quality (MAPQ) less than 10. B. Example of improved coverage in the RHCE gene in TruPath Genome compared to standard short-read WGS (Illumina DNA PCR-Free [IDPF]). The RHCE gene has a high homology paralog (RHD) with common gene conversion events between the two genes. Both paralogs are relevant for molecular blood typing (Rh blood group). C. Representation of read phasing in the CFTR gene region in a TruPath Genome dataset generated from a cell line (NA13591) derived from an individual affected by cystic fibrosis who carries compound heterozygous variants.
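The dark-by-MAPQ definition in the Figure 3 caption can be expressed as a small check. The following is a minimal sketch in plain Python, not the DRAGEN implementation: the function names are hypothetical, the input is assumed to be a list of MAPQ values per position (for example, extracted from a BAM pileup), and the 90% cutoff is treated as "at least 90%", which is an assumption about how the boundary is handled.

```python
from typing import Sequence


def is_dark_by_mapq(mapqs: Sequence[int],
                    mapq_threshold: int = 10,
                    min_low_fraction: float = 0.9) -> bool:
    """Return True when at least `min_low_fraction` of the covering reads
    have MAPQ below `mapq_threshold` (thresholds taken from the caption)."""
    if not mapqs:
        # No covering reads: not classified as dark-by-MAPQ in this sketch.
        return False
    low = sum(1 for q in mapqs if q < mapq_threshold)
    return low / len(mapqs) >= min_low_fraction


def dark_fraction(per_position_mapqs: Sequence[Sequence[int]]) -> float:
    """Fraction of positions that are classified as dark-by-MAPQ."""
    if not per_position_mapqs:
        return 0.0
    dark = sum(1 for mapqs in per_position_mapqs if is_dark_by_mapq(mapqs))
    return dark / len(per_position_mapqs)
```

Comparing `dark_fraction` between two alignments of the same sample is one way to quantify how much of a region moves out of the dark category when read placement improves.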

Phased small variant calling

By combining improved mapping with haplotype tagged reads, TruPath delivers the most accurate and complete small variant calls to date—now fully phased across long genomic distances.

TruPath Genome small variant calling benefits from both improved mapping as well as read phasing information. Higher mapping quality and accurate read placement in difficult-to-map regions of the genome enable confident calls in a broader set of genomic regions. Phased reads and their associated phasing quality are directly incorporated into the variant calling model and used for phased variant calling (i.e. treating 0|1 and 1|0 as distinct genotype hypotheses), outputting calls that are phased with respect to each other if they are within the same phase block. Such innovations lead to the most accurate and complete small variant call set to date (Figure 4A) along with long-range phasing between variants (Figure 4B).
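To illustrate what treating 0|1 and 1|0 as distinct genotype hypotheses means, here is a toy likelihood sketch. It is not the DRAGEN variant calling model: the read representation, the `base_error` value, and the way the haplotag confidence enters the likelihood are all assumptions chosen to keep the example small.

```python
import math
from typing import Dict, List, Tuple

# (observed allele: 0 = ref, 1 = alt; haplotag: 1 or 2;
#  probability that the haplotag is correct) -- hypothetical representation
Read = Tuple[int, int, float]

PHASED_GENOTYPES = {"0|0": (0, 0), "0|1": (0, 1), "1|0": (1, 0), "1|1": (1, 1)}


def phased_genotype_log_likelihoods(reads: List[Read],
                                    base_error: float = 0.01) -> Dict[str, float]:
    """Log-likelihood of each phased genotype given haplotagged reads."""
    result = {}
    for name, (hap1_allele, hap2_allele) in PHASED_GENOTYPES.items():
        total = 0.0
        for allele, tag, tag_prob in reads:
            # Allele expected on the tagged haplotype and on the other one.
            if tag == 1:
                own, other = hap1_allele, hap2_allele
            else:
                own, other = hap2_allele, hap1_allele
            p_own = 1 - base_error if allele == own else base_error
            p_other = 1 - base_error if allele == other else base_error
            # Mix the two cases by the confidence in the haplotag.
            total += math.log(tag_prob * p_own + (1 - tag_prob) * p_other)
        result[name] = total
    return result


def best_phased_genotype(reads: List[Read], base_error: float = 0.01) -> str:
    """Phased genotype with the highest log-likelihood."""
    likelihoods = phased_genotype_log_likelihoods(reads, base_error)
    return max(likelihoods, key=likelihoods.get)
```

With confidently tagged reads the two heterozygous hypotheses separate clearly. With a tag probability of 0.5 (no phasing information) they become equally likely, which corresponds to the unphased 0/1 case.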

Figure 4. Accurate and complete small variant calls with long-range phasing between variants.
A. Greater accuracy for small variant calling with TruPath Genome. Small variant calling performance benchmarked against HG002 T2T Q100 truth set. UG100 source: https://cdn.sanity.io/files/l7780ks7/production-2024/0a1b6a62a6da3e3fcafb81cad4c8ff2ffe85dd41.pdf. PacBio SPRQ: downloaded from https://downloads.pacbcloud.com/public/revio/2024Q4/WGS/GIAB_trio/HG002_rep1/analysis/v3.0.2/. Element 1kb: https://www.biorxiv.org/content/10.1101/2025.06.05.657102v1, supplementary table 5. Illumina DNA PCR-Free: IDPF libraries sequenced with 10B flow cells on the NovaSeq X and analyzed with DRAGEN 4.5.1 Germline (median across six technical replicates). TruPath Genome Standard: TruPath Genome datasets generated from DNA extracted with standard kits (not high molecular weight) analyzed with DRAGEN Germline 4.5.2 (median across 63 technical replicates). TruPath Genome HMW: HMW extraction sequenced with TruPath Genome, analyzed with DRAGEN Germline 4.5.2 (median across 64 technical replicates). B. Phased sequencing achieved with TruPath Genome benchmarked with HG002 truth set data. TruPath Genome results represent the median values across 63 (standard molecular weight extraction) or 64 (high molecular weight extraction) datasets. Phase block NG50 is the phase block length at which blocks of that length or longer cover 50% of the target region (chromosomes 1-22). Percent genes fully phased is the percentage of genic regions from a specified gene list (Gencode v44 genes.gtf) that are completely contained within a single phasing block. Phasing hamming error rate is benchmarked against T2T Q100 truth set. HiFi data (PB) Phased VCF was obtained from https://downloads.pacbcloud.com/public/revio/2024Q4/WGS/GIAB_trio/HG002_rep1/analysis/v3.0.2/.
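Two of the phasing metrics defined in the Figure 4 caption are simple to compute once phase blocks are available. The sketch below is illustrative only: the function names are hypothetical, intervals are assumed to be half-open coordinates on a single chromosome, and the inputs are assumed to have been parsed already from a phased VCF and a gene annotation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on one chromosome


def phase_block_ng50(block_lengths: List[int], target_size: int) -> int:
    """Largest length L such that blocks of length >= L together cover at
    least 50% of `target_size`. Returns 0 if 50% is never reached."""
    covered = 0
    for length in sorted(block_lengths, reverse=True):
        covered += length
        if covered * 2 >= target_size:
            return length
    return 0


def percent_genes_fully_phased(genes: List[Interval],
                               blocks: List[Interval]) -> float:
    """Percentage of genes completely contained in a single phase block."""
    if not genes:
        return 0.0
    contained = sum(
        1
        for gene_start, gene_end in genes
        if any(block_start <= gene_start and gene_end <= block_end
               for block_start, block_end in blocks)
    )
    return 100.0 * contained / len(genes)
```

Because NG50 is normalized to the target region rather than to the total phased length, it stays comparable across datasets that phase different fractions of the genome.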

Structural variant calling improvements

Haplotype specific assemblies and colocation maps give TruPath a powerful new lens for detecting large and complex structural variants with higher confidence.

Structural variant (SV) detection in TruPath Genome data also benefits from both improved read mapping and phased reads. Because TruPath phases reads upfront, DRAGEN can assemble each haplotype separately, leading to cleaner assemblies and more accurate SV calls. This cleaner assembly process is responsible for most of the performance improvement in SV detection with TruPath Genome (Figure 5).

Figure 5. Structural variant calling improvements in TruPath Genome.
The analysis uses the Genome in a Bottle NIST T2T-Q100 HG002 SV v1.1 truth set with the SV confident BED file. Benchmarking was performed in accordance with Genome in a Bottle guidance for structural variant benchmarking (https://ftp-trace.ncbi.nlm.nih.gov/ReferenceSamples/giab/data/AshkenazimTrio/analysis/NIST_HG002_DraftBenchmark_defrabbV0.020-20250117/NIST_HG002_DraftBenchmark_defrabbV0.020-20250117_README.md), using “bench” and “refine” commands from Truvari v4.2.2.

Colocation maps for structural variant interpretation

Colocation maps reveal genome structure in two dimensions, exposing inversions, translocations, and other large SVs through intuitive off-diagonal signals.

In addition to improved performance on the NIST structural variant truth set of insertions and deletions in HG002, TruPath Genome also generates a new kind of output referred to as colocation maps. Think of a colocation map as a heatmap of the genome’s 3D structure, where unexpected off-diagonal signals reveal structural variants hiding in plain sight. Regions that have the same structure as the reference genome primarily display proximity signal along the colocation map diagonal, while regions where the individual’s genome structure differs from the reference display strong off-diagonal signals and signal depletion along the diagonal. Different types of structural variants show different patterns in the colocation maps (Figure 6A), and large clinically relevant structural variants can be clearly observed in such a representation (Figure 6B). Because the colocation map signal is independent of the signals currently used for structural variant detection in standard short-read sequencing (e.g. split reads and improperly paired reads), it can be used to filter large SVs that DRAGEN Germline SV detects from those standard signals. DRAGEN Germline applies this filter to break-end calls that are either inter-chromosomal or intra-chromosomal but larger than 200 kb. This drastically reduces the total number of such calls in a sample genome, most of which are putative false positives, allowing for a much smaller call set for these types of large events. Further integration of colocation map signals into SV detection will be incorporated in future DRAGEN Germline versions to improve sensitivity and specificity.
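The idea of a colocation map, counts of proximal reads in pairwise 2 kb bins, can be sketched in a few lines. This is a simplified illustration rather than the DRAGEN implementation: it assumes a single chromosome, takes pairs of mapped positions that have already been identified as proximal on the flow cell (how that is decided is not specified here), and uses hypothetical function names.

```python
from collections import Counter
from typing import Iterable, Tuple

BIN_SIZE_BP = 2_000  # 2 kb bins, as described for the colocation map


def colocation_counts(proximal_pairs: Iterable[Tuple[int, int]],
                      bin_size: int = BIN_SIZE_BP) -> Counter:
    """Count proximal read pairs per (bin_a, bin_b), stored with bin_a <= bin_b
    so the map is symmetric."""
    counts: Counter = Counter()
    for pos_a, pos_b in proximal_pairs:
        bin_a, bin_b = pos_a // bin_size, pos_b // bin_size
        counts[(min(bin_a, bin_b), max(bin_a, bin_b))] += 1
    return counts


def off_diagonal_fraction(counts: Counter, min_bin_distance: int) -> float:
    """Fraction of the signal lying at least `min_bin_distance` bins away
    from the diagonal."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    off = sum(n for (bin_a, bin_b), n in counts.items()
              if bin_b - bin_a >= min_bin_distance)
    return off / total
```

In a region that matches the reference, most counts sit on or near the diagonal. A high off-diagonal fraction between two distant bins is the kind of signal the text associates with a structural difference from the reference.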

Figure 6. Use of colocation maps in identification of structural variants.
A. Schematic representation of the signals observed in colocation maps when different types of structural variants are present. B. Example of colocation map observed in an individual with an inversion with boundary in the first intron of the F8 gene. Note the expected hourglass shaped signal in the off-diagonal region and depletion of diagonal signal in the event boundaries. Below the colocation map is a schematic representation of the locus in the reference genome and in the genome of the individual with the inversion. C. Example of colocation map off-diagonal signals in regions where DRAGEN Germline SV detected a true break-end and in a region where DRAGEN Germline SV detected a false positive break-end signal. Colocation map signals can be used to supplement read evidence and filter false positive break-end calls. D. Effect of the TruPath Genome colocation-based break-end filter implemented in DRAGEN Germline 4.5.2. The filter has no effect on the sensitivity of detection of true manually curated break-end signals. The filter significantly reduces the total number of break-end events detected. This filter is restricted to inter-chromosomal and intra-chromosomal BNDs larger than 200 kb.
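The scope of the colocation-based break-end filter described above (inter-chromosomal calls, and intra-chromosomal calls larger than 200 kb) can be written as a simple predicate. The sketch below is hypothetical: the `Breakend` type, the per-call `colocation_support` values, and the `min_support` threshold are illustrative stand-ins, since the text does not describe how DRAGEN scores colocation evidence.

```python
from dataclasses import dataclass
from typing import Dict, List

MIN_INTRA_SPAN_BP = 200_000  # from the text: intra-chromosomal calls > 200 kb


@dataclass(frozen=True)
class Breakend:
    chrom_a: str
    pos_a: int
    chrom_b: str
    pos_b: int


def is_filter_eligible(bnd: Breakend,
                       min_span_bp: int = MIN_INTRA_SPAN_BP) -> bool:
    """True for inter-chromosomal calls and intra-chromosomal calls whose
    span exceeds `min_span_bp`; smaller calls are outside the filter's scope."""
    if bnd.chrom_a != bnd.chrom_b:
        return True
    return abs(bnd.pos_a - bnd.pos_b) > min_span_bp


def filter_breakends(bnds: List[Breakend],
                     colocation_support: Dict[Breakend, int],
                     min_support: int) -> List[Breakend]:
    """Keep calls outside the filter's scope unchanged; keep in-scope calls
    only when their (hypothetical) colocation support is high enough."""
    kept = []
    for bnd in bnds:
        if not is_filter_eligible(bnd):
            kept.append(bnd)
        elif colocation_support.get(bnd, 0) >= min_support:
            kept.append(bnd)
    return kept
```

Restricting the filter to large and inter-chromosomal calls means smaller events are never removed by it, whatever their colocation signal.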

Small variant calling in paralogous regions

TruPath enables copy-number–aware, haplotype resolved variant calling in highly homologous gene families—finally resolving paralogs long considered inaccessible to short reads.

DRAGEN Germline Multi‑Region Joint Detection (MRJD), paired with TruPath Genome, enables haplotype‑resolved, copy‑number–aware de novo germline small variant calling in segmental duplications. These regions are challenging for standard short‑read sequencing analysis, because high homology and structural complexity can cause ambiguous or incorrect read mapping, which leads to unreliable variant detection. TruPath Genome enables MRJD to retain reads from the paralogous loci regardless of mapping quality, estimate total copy number using read‑depth evidence, and then reconstruct the underlying copies for each paralog set by integrating copy number, read sequences, and long‑range proximity linkage information. MRJD then calls small variants on the reconstructed copies and reports phased variant calls along with their assigned genomic locations or haplotypes. This variant calling process does not rely on known population haplotypes. For clinical researchers investigating Lynch Syndrome, distinguishing PMS2 from PMS2CL has long been a diagnostic challenge. TruPath finally resolves these regions with haplotype level clarity. Figure 7 shows this for the highly homologous PMS2-PMS2CL pair (~21 kb each; ~99% identity), where standard short reads generate ambiguous, unphased variant calls across both loci, while TruPath Genome data with MRJD enables phased haplotypes that are concordant with on-market long read results.
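The read-depth step in the description above, estimating the total copy number of a paralog set, can be illustrated with a back-of-the-envelope calculation. This is not the MRJD algorithm: it assumes that reads from all paralogs are pooled onto one copy's coordinates and that the genome-wide depth corresponds to two copies. The function name and inputs are hypothetical.

```python
def estimate_total_copy_number(pooled_depth: float,
                               diploid_depth: float) -> int:
    """Estimate the total number of copies across a paralog set.

    pooled_depth:  mean depth when reads from all paralogs are counted
                   together, regardless of mapping quality (assumed input).
    diploid_depth: genome-wide mean depth of a normal two-copy region.
    """
    if diploid_depth <= 0:
        raise ValueError("diploid_depth must be positive")
    per_copy_depth = diploid_depth / 2.0
    return round(pooled_depth / per_copy_depth)
```

For example, at a genome-wide depth of 60X, a pooled depth of about 120X across PMS2 and PMS2CL would suggest four copies in total, the expected count for two intact copies of each locus.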

Figure 7. Example of resolution of the high‑homology PMS2-PMS2CL locus with TruPath Genome‑enabled MRJD.
PMS2 and PMS2CL share ~99% sequence identity across ~21 kb, which leads to ambiguous mapping and unphased signal with standard short reads (top). Using TruPath Genome proximity information, MRJD generates haplotype-resolved variant calls in both PMS2 and PMS2CL (shown as copy 1 and copy 2 for each locus), with on‑market long‑read data shown for comparison (bottom).

MRJD with TruPath Genome currently supports 15 clinically relevant genes; Table 1 summarizes the supported loci and concordance versus orthogonal long‑read data.

Table 1. Median SNV concordance of phased germline small variant calls against orthogonal long‑read data for medically relevant paralogous genes supported by MRJD with TruPath Genome at launch. The concordance was measured on 14 diverse cell line samples with both HMW and standard DNA extraction. For CFHR1-CFHR2-CFHR3-CFHR4 and USP18, no orthogonal comparator call set was available, thus, concordance is reported as N/A.

Paralogous gene            Disease relevance                        HMW DNA median concordance   Standard DNA median concordance
PMS2                       Lynch Syndrome                           0.991                        0.951
SMN1-SMN2                  Spinal Muscular Atrophy                  0.941                        0.929
NCF1                       Chronic Granulomatous Disease            0.992                        0.991
CYP21A2                    Congenital Adrenal Hyperplasia           1.000                        1.000
TNXB                       Ehlers-Danlos Syndrome                   1.000                        1.000
STRC                       Recessive Nonsyndromic Hearing Loss      0.983                        0.980
CYP2D6                     Pharmacogenetics                         0.973                        0.976
CYP11B1-CYP11B2            Glucocorticoid-remediable Aldosteronism  0.997                        0.997
CFHR1-CFHR2-CFHR3-CFHR4    Atypical Hemolytic Uremic Syndrome       N/A                          N/A
USP18                      Type I Interferonopathy                  N/A                          N/A

Improved short tandem repeat size estimation accuracy

Proximity signals allow TruPath to recover and assign in-repeat reads, enabling accurate STR sizing across the full expansion range—without the plateau seen in standard WGS.

Short tandem repeat (STR) expansions are associated with a wide range of neurological and neurodevelopmental disorders. While an STR expansion beyond the small size range observed in the healthy population may be a strong indication of pathogenicity, the expansion size is known to modulate the presentation of many associated conditions. Traditional whole-genome sequencing (WGS) is an effective method to distinguish non-expanded from expanded STRs, but its ability to accurately estimate the size of large STR expansions is limited because fully repetitive read pairs cannot be unambiguously assigned to a specific STR. This hampers a more nuanced classification of STR expansion status.

Proximity information available in TruPath Genome can help to resolve ambiguous mappings of reads fully contained within expanded STRs by assessing their proximity to the unique flanks of specific STR loci. The more complete recovery and assignment of in-repeat reads also allows for locus-specific adjustments to in-repeat read counts, which addresses biases associated with decreased sequencing efficiency of certain STR motifs. Furthermore, phasing information available in TruPath Genome allows for haplotype-specific STR size estimation even in cases where STR expansions occur in both parental haplotypes. This combined set of improvements leads to better STR size estimation and more nuanced and accurate STR expansion status classification (Figure 8).
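To show why recovering in-repeat reads matters for sizing, here is a simplified depth-based estimate. It is a sketch under stated assumptions, not the DRAGEN STR model: it assumes in-repeat reads are sampled at the same per-haplotype depth as unique sequence, represents the locus-specific adjustment as a single hypothetical `motif_efficiency` factor, and ignores reads that only partially overlap the repeat.

```python
def estimate_repeat_length_bp(n_in_repeat_reads: int,
                              read_length: int,
                              haplotype_depth: float,
                              motif_efficiency: float = 1.0) -> float:
    """Approximate repeat length (bp) on one haplotype from the number of
    in-repeat reads assigned to it.

    motif_efficiency is a hypothetical correction in (0, 1] for motifs that
    sequence less efficiently than average; 1.0 means no correction.
    """
    if haplotype_depth <= 0 or motif_efficiency <= 0:
        raise ValueError("depth and efficiency must be positive")
    return n_in_repeat_reads * read_length / (haplotype_depth * motif_efficiency)


def estimate_repeat_units(repeat_length_bp: float, motif_length: int) -> float:
    """Convert a repeat length in bp to a number of repeat units."""
    if motif_length <= 0:
        raise ValueError("motif_length must be positive")
    return repeat_length_bp / motif_length
```

Under this model, every in-repeat read that cannot be assigned to a locus lowers the estimate. That is consistent with the plateau described for standard WGS and with the improvement seen when proximity to unique flanks is used to assign those reads.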

Figure 8. More accurate measurements of STR expansions can be achieved using TruPath Genome.
A. Comparison of expected versus estimated STR length for standard WGS datasets using DRAGEN Germline. Size estimates plateau at approximately the fragment length of the standard short-read sequencing libraries (~450 bp) due to lack of unambiguous recovery of in-repeat read pairs. B. Comparison of expected versus estimated STR length for TruPath Genome datasets using DRAGEN Germline. Size estimates are correlated with expected size throughout the full range of STR lengths and plateau no longer occurs. C. STR length classification for Coriell samples with previously characterized expansion classification using standard WGS and DRAGEN. Dot colors reflect the true sample classification. Datapoints that are placed in the region of the swimlane of the same color are correctly classified. D. STR length classification on Coriell cell lines with known classification using TruPath Genome and DRAGEN Germline. Dot colors reflect the true sample classification. Datapoints that are placed in the region of the swimlane of the same color are correctly classified. TruPath Genome-based classifications are significantly more consistent with the true classifications and span a broader range of STR lengths.

Conclusion and next steps

The launch of TruPath Genome marks a critical turning point for the traditional trade-offs between workflow complexity, accuracy, and comprehensive genomic insights. By integrating proximity-based long-distance insights into core components of the DRAGEN Germline data analysis workflow, Illumina has enabled standard short reads to resolve regions of the genome and types of variants long thought inaccessible to, or incompatible with, short-read sequencing.

For labs and researchers, the implications are profound[4]:

  • Identification of variants associated with genetic and rare disease: By resolving paralogous genes, accurately sizing STR expansions, improving SV detection capabilities, and delivering phased variants, TruPath Genome offers a clear path to solving rare disease cases that were previously inaccessible.

  • Operational Efficiency: The ability to achieve best-in-class small variant accuracy and phase up to 98% of genes in under three hours of analysis time means labs can consolidate multiple assays into a single, streamlined workflow.

  • Accessible Long-distance insights: This technology brings high-resolution, haplotype-resolved WGS to existing NovaSeq X systems, making comprehensive human genomics accessible and scalable with up to 16 genomes per run.

TruPath doesn’t just extend the reach of short reads—it redefines what they’re capable of. TruPath Genome is now available commercially; however, this initial product is just the beginning of what can be achieved with this new data modality. Future DRAGEN releases will continue to build on these analysis capabilities. Sign up below to stay updated on the future developments of DRAGEN Germline analysis of proximity mapped read technology, and future TruPath Genome solutions.

Learn more about TruPath Genome

References

1. Ebbert MTW, Jensen TD, Jansen-West K, et al. Systematic analysis of dark and camouflaged genes reveals disease-relevant genes hiding in plain sight. Genome Biol. May 20 2019;20(1):97. doi:10.1186/s13059-019-1707-2

2. Ryan NM, Corvin A. Investigating the dark-side of the genome: a barrier to human disease variant discovery? Biol Res. Jul 20 2023;56(1):42. doi:10.1186/s40659-023-00455-0

3. Illumina. Introducing constellation mapped read technology. Genomics Research Hub blog. 2024. https://www.illumina.com/science/genomics-research/articles/constellation-mapped-read-technology.html

4. Cheng S, Zhang Q, Zheng X, et al. Constellation illuminates rare disease genetics. medRxiv. Nov 10 2025:2025.10.15.25337675. doi:10.1101/2025.10.15.25337675