Genomic Data Compression

Lossless genomic data compression

Technology from Enancio reduces genomic data storage and transfer costs 

Benefits of Genomic Data Compression

Illumina is committed to delivering innovative sequencing technologies and to helping customers manage growing volumes of next-generation sequencing (NGS) data. Lossless genomic data compression technology from Enancio, formerly known as Lena and now known as original read archive (ORA) compression, is built for both speed and storage efficiency.

Genomic data compression allows for:

  • Lower data storage costs
  • High-speed data file transfers
  • Reduced internal network traffic

Lossless Genomic Data Compression Technology

Lossless genomic data compression technology compresses the output from Illumina sequencing systems, reducing the data storage footprint by up to a factor of five relative to gzip-compressed FASTQ files. ORA compression technology uses a reference-based method: an ultra-fast mapping scheme maps reads onto a reference genome, and only the data needed to regenerate each read is stored, namely a position and a list of differences.
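The reference-based idea described above can be illustrated with a minimal Python sketch. This is not the ORA implementation or file format: the function names, the toy reference, and the restriction to substitutions (no insertions, deletions, or unmapped reads) are assumptions made for illustration, and the mapping position is supplied by hand rather than found by a mapper.

```python
from typing import List, Tuple

# Hypothetical representation: (offset within the read, base seen in the read).
Diff = Tuple[int, str]


def encode_read(reference: str, read: str, position: int) -> Tuple[int, int, List[Diff]]:
    """Encode a read as (position, length, differences) relative to a reference.

    Assumes the mapping position is already known and that the read differs
    from the reference only by substitutions.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit inside the reference at this position")
    diffs = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return position, len(read), diffs


def decode_read(reference: str, position: int, length: int, diffs: List[Diff]) -> str:
    """Regenerate the original read from the reference and the stored record."""
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


# Toy data: the read matches the reference at position 5 except for one base.
reference = "ACGTACGTTAGCCGATAGGCTA"
read = "CGTTAGACGAT"

record = encode_read(reference, read, position=5)
print(record)
assert decode_read(reference, *record) == read  # lossless round trip
```

Storing one integer position, a length, and a short list of differences is much smaller than storing every base, which is where the compression gain in a reference-based scheme comes from.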

Other data compression technologies usually suffer from low speed. ORA compression technology is optimized for high compression ratios, as well as fast compression and decompression rates, while preserving data integrity. Quality scores are encoded in a lossless way using a range encoder and context models adapted to the different types of quality schemes.
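To see why context models help when encoding quality scores, the sketch below estimates how many bits an ideal entropy coder (such as a range encoder) would need under two adaptive models: one that ignores context and one that conditions each quality score on the previous one. It is an illustrative assumption, not the ORA model: the four-symbol alphabet, the toy quality string, and the function name are all hypothetical, and no actual range encoder is implemented.

```python
import math
from collections import defaultdict


def adaptive_code_length_bits(quals: str, alphabet: str, use_context: bool) -> float:
    """Ideal code length in bits under an adaptive frequency model.

    With use_context=True, each symbol is predicted from the previous symbol
    (an order-1 context model); otherwise one shared table is used (order-0).
    Counts start at 1 for every symbol so no probability is ever zero.
    """
    counts = defaultdict(lambda: {symbol: 1 for symbol in alphabet})
    total_bits = 0.0
    previous = None
    for symbol in quals:
        context = previous if use_context else None
        table = counts[context]
        probability = table[symbol] / sum(table.values())
        total_bits += -math.log2(probability)  # cost an ideal coder would pay
        table[symbol] += 1                     # adapt the model after coding
        previous = symbol
    return total_bits


# Hypothetical binned quality scheme with four possible symbols.
alphabet = "F:,#"
quality_string = ("F" * 8 + ":" * 8 + "F" * 8 + "," * 8) * 8

order0_bits = adaptive_code_length_bits(quality_string, alphabet, use_context=False)
order1_bits = adaptive_code_length_bits(quality_string, alphabet, use_context=True)
print(f"order-0 model: {order0_bits:.0f} bits")
print(f"order-1 model: {order1_bits:.0f} bits")
```

Because neighbouring quality scores tend to be similar, the context-aware model assigns higher probabilities to the symbols that actually occur, so the coder spends fewer bits while remaining fully lossless.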

Access DRAGEN ORA Decompression Software

All files compressed with ORA compression technology can easily be decompressed using our decompression software. The decompression software is free to download and use.

Once the decompression software is installed, a simple command can be used to directly pipe the output of decompression on the fly into a wide range of popular mapping tools such as BWA, STAR, and Bowtie. The compression and decompression technology is also integrated within DRAGEN secondary analysis software, which provides accurate, ultra-rapid analysis of sequencing data.

DRAGEN ORA and NextSeq 1000-2000
Lossless genomic compression available on-instrument

DRAGEN ORA lossless genomic data compression is now available on-instrument with the NextSeq 1000 and NextSeq 2000 Systems and the NovaSeq X Series, as well as on the DRAGEN secondary analysis server starting with v3.8.

Enancio is a software company recently acquired by Illumina, with proprietary lossless data compression technology designed specifically for genomics data. Based in Cesson-Sévigné, France, it joins a suite of exceptional bioinformatics offerings, with the goal of making genomics data processing, storage, and transfer more efficient and user-friendly.

DRAGEN ORA lossless compression is specifically designed for genomics data. The DNA sequence is compressed using a reference-based method: reads are mapped onto a reference genome using an ultra-fast mapping scheme devised for compression. Each read is then encoded in a compact binary format as a position and a list of differences, and the result is passed through an entropy coder. Quality scores are encoded in a lossless way using a range encoder and context models adapted to the different types of quality schemes.

DRAGEN ORA compression technology reduces the data footprint of FASTQ files by a factor of up to 5 compared to gzip [1]. This translates into direct storage cost savings and faster file transfers.

ORA compression technology is being integrated across the Illumina portfolio in stages and will give users the option to produce compressed FASTQ files that are up to 5x smaller than fastq.gz files [1]. Compression is already available on the NextSeq 1000 and NextSeq 2000 Systems and the NovaSeq X Series. Starting with the v3.8 release, compression is also available on DRAGEN servers, with native ingestion of compressed FASTQ files into the DRAGEN mapper.

During the NGS workflow, you can optionally enable compression to generate compressed fastq.ora files. With the DRAGEN v3.8 release, fastq.ora files can be ingested directly by the DRAGEN mapper. In addition, fastq.ora files can be decompressed on the fly for other mapping and downstream analyses. The integration of compression within DRAGEN BCL conversion streamlines the workflow, as shown in the figure below:

[Figure: ORA compression technology used within DRAGEN secondary analysis. In the legacy process, before the acquisition of Enancio, compression was standalone software and an extra step in the workflow.]

The output of ORA compression technology is a compressed FASTQ binary file format: fastq.ora. This file format can be stored and shared to enable significant storage cost savings and reduced file transfer times. All compressed files can be decompressed with the freely available decompression software.

Files in fastq.ora format can be decompressed on the fly for mapping and downstream analysis, or ingested directly by DRAGEN.

A 235 GB raw FASTQ file can be compressed to 55 GB via gzip. The data footprint is further reduced to 11 GB with DRAGEN ORA compression technology [2].
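The compression ratios implied by these figures can be worked out with a few lines of Python. The sizes are the ones quoted above; the helper name is just for illustration.

```python
def compression_ratio(original_gb: float, compressed_gb: float) -> float:
    """How many times smaller the compressed file is than the original."""
    return original_gb / compressed_gb


raw_gb, gzip_gb, ora_gb = 235, 55, 11  # sizes quoted in the text

print(f"gzip vs raw FASTQ: {compression_ratio(raw_gb, gzip_gb):.1f}x")  # 4.3x
print(f"ORA vs gzip:       {compression_ratio(gzip_gb, ora_gb):.1f}x")  # 5.0x
print(f"ORA vs raw FASTQ:  {compression_ratio(raw_gb, ora_gb):.1f}x")   # 21.4x
```

For this sample, the fastq.ora file is 5 times smaller than the fastq.gz file and roughly 21 times smaller than the raw FASTQ file.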

FASTQ files and BAM or CRAM files are typically stored for different purposes. However, fastq.ora files enable you to store a compressed copy of your raw data with a preserved MD5 sum and smaller footprint than the corresponding CRAM file.
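A preserved MD5 sum means that decompressing the file yields exactly the bytes of the original FASTQ. The sketch below shows how such a check could be done in Python. Because the ORA codec is not available as a Python library, gzip from the standard library is used here purely as a stand-in lossless compressor, and the FASTQ record is toy data.

```python
import gzip
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 checksum of a byte string as a hex digest."""
    return hashlib.md5(data).hexdigest()


def round_trip_preserves_md5(original: bytes) -> bool:
    """Compress and decompress, then compare checksums with the original.

    gzip is only a stand-in for any lossless codec in this sketch.
    """
    restored = gzip.decompress(gzip.compress(original))
    return md5_hex(restored) == md5_hex(original)


# Toy FASTQ record: header, bases, separator, quality scores.
fastq_bytes = b"@read1\nACGTACGT\n+\nFFFFFFFF\n"
print(round_trip_preserves_md5(fastq_bytes))
```

In practice the same comparison can be made between the checksum recorded for the original FASTQ file and the checksum of the decompressed output.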

DRAGEN can now compress two different formats: FASTQ files to fastq.ora and BAM files to CRAM.

Use of compression is entirely optional. DRAGEN users remain free to adopt the storage strategy they want: enable conversion to the DRAGEN ORA compressed file format and store fastq.ora files, disable the conversion and store fastq.gz files, or store BAM or CRAM files.

With the DRAGEN 3.8 release, data compression is seamless and compressed fastq.ora files are directly ingested into the DRAGEN mapper.

Additionally, once the free decompression software is installed, a simple command can be used to directly pipe the output of decompression on the fly into a wide range of popular mapping tools such as BWA [3], STAR [4], and Bowtie [5].
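The general pattern of streaming a decompressor's output straight into a downstream tool, without writing an intermediate FASTQ file, can be sketched in Python as follows. The actual command line of the decompression software and of each mapper is not shown here and should be taken from their documentation. The function name is hypothetical, and the two stand-in commands below are small Python one-liners used only so the example runs anywhere.

```python
import subprocess
import sys
from typing import Sequence


def pipe_commands(producer: Sequence[str], consumer: Sequence[str]) -> bytes:
    """Stream the producer's stdout into the consumer's stdin.

    Returns the consumer's stdout. Nothing is written to disk in between.
    """
    first = subprocess.Popen(producer, stdout=subprocess.PIPE)
    second = subprocess.Popen(consumer, stdin=first.stdout, stdout=subprocess.PIPE)
    first.stdout.close()  # so the producer is notified if the consumer exits early
    output, _ = second.communicate()
    if first.wait() != 0 or second.returncode != 0:
        raise RuntimeError("a pipeline stage exited with a non-zero status")
    return output


# Stand-in for a decompressor: emits one FASTQ record on stdout.
fake_decompressor = [sys.executable, "-c", "print('@r1\\nACGT\\n+\\nFFFF')"]
# Stand-in for a mapper: counts the FASTQ records it reads from stdin.
fake_mapper = [sys.executable, "-c", "import sys; print(sum(1 for _ in sys.stdin) // 4)"]

result = pipe_commands(fake_decompressor, fake_mapper)
print(result.decode().strip())
```

Replacing the two stand-in command lists with the real decompression and mapping commands gives the on-the-fly workflow described above.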

DRAGEN ORA FASTQ compressed files can be shared. The decompression software is freely available.

References
  1. On files generated by the NextSeq 1000 and NextSeq 2000 Systems and NovaSeq 6000 System.
  2. This result was obtained from the DNA sample NA12878 sequenced on the NovaSeq 6000 System with 30x coverage. Data is accessible in this BaseSpace project: basespace.illumina.com/s/3ExEZMlH8Lkq.
  3. Li H. and Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009 Jul 15; 25(14): 1754–1760.
  4. Dobin A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics. 2013 Jan; 29(1): 15–21.
  5. Langmead B. et al. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009; 10: R25.