Zstandard's long range mode works wonders for genome sequences without newlines
First released with Zstandard 1.3.2 in 2017, the `--long` range match finder increases the compressor’s search window to at least 128MiB, improving deduplication inside large files. This optional feature had substantial performance overheads at launch, but various optimisations have since brought its performance within striking distance of Zstandard’s fast defaults. As a fan of Zstandard’s speed and efficiency, I hoped that `--long` might improve genome compression and bridge the chasm between fast general-purpose compressors with low compression ratios (CRs) and much slower specialist DNA sequence compressors capable of far higher CRs.
Grace Blackwell’s 2.6Tbp 661k dataset is a classic choice for benchmarking methods in microbial genomics. Comprising many similar DNA sequences, its 661,405 bacterial genome assemblies in FASTA text format are very compressible. Karel Břinda’s specialist MiniPhy approach takes this dataset from 2.46TiB to just 27GiB (CR: 91) by clustering and compressing similar genomes together. By comparison, naive Zstandard with default parameters compresses an order of magnitude faster, but achieves a CR of just 3.
I was initially underwhelmed by `--long`’s modest reduction of the 661k dataset from 777GiB (Zstandard default, CR: 3.2) to 641GiB (CR: 3.8). I speculated that this poor performance might be caused by the newline bytes (`0x0A`) punctuating every 60 characters of sequence, breaking the hashes used for long range pattern matching. Indeed, removing within-record newlines using `seqtk seq -l 0` tripled `zstd --long`’s CR to 11, yielding a 232GiB file while increasing compression time by only ~20% over Zstandard defaults. Increasing the window size to the 2GiB maximum on 64-bit systems using `--long=31` tripled CR again to 31, yielding an 80GiB file while increasing compression time by ~80% over Zstandard defaults. Using larger-than-default window sizes has the drawback of requiring that the same `--long=xx` argument be passed during decompression, which reduces compatibility somewhat. But otherwise, `zstd --long` seems like a fast and easy way to achieve compression ratios within an order of magnitude of state-of-the-art methods like MiniPhy. Just remember to first remove within-record newlines from your FASTA files.
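To make the newline removal step concrete, here is a minimal Python sketch of roughly what `seqtk seq -l 0` does to a FASTA stream: header lines pass through unchanged and each record's sequence lines are joined onto one line. It is not seqtk's implementation, and the function names `unwrap_fasta` and `zstd_long_command` and the file name `661k.fa` are hypothetical, chosen only for illustration. The second helper just assembles the `zstd` arguments discussed above (`--long=31` for compression, plus `-d` for decompression) without running anything.

```python
import io
from typing import Iterable, Iterator, List


def unwrap_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield FASTA lines with each record's sequence joined onto one line."""
    seq_parts: List[str] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if it had any sequence.
            if seq_parts:
                yield "".join(seq_parts) + "\n"
                seq_parts = []
            yield line + "\n"
        elif line:
            # Sequence line: buffer it without its trailing newline.
            seq_parts.append(line)
    if seq_parts:
        yield "".join(seq_parts) + "\n"


def zstd_long_command(path: str, window_log: int = 31,
                      decompress: bool = False) -> List[str]:
    """Build (but do not run) a zstd command line using long range mode."""
    cmd = ["zstd", f"--long={window_log}"]
    if decompress:
        # The same --long value is passed again when decompressing.
        cmd.append("-d")
    cmd.append(path)
    return cmd


if __name__ == "__main__":
    wrapped = io.StringIO(">genome1\nACGT\nACGT\n>genome2\nTTGA\nCC\n")
    print("".join(unwrap_fasta(wrapped)), end="")
    # "661k.fa" is a placeholder file name.
    print(" ".join(zstd_long_command("661k.fa")))
    print(" ".join(zstd_long_command("661k.fa.zst", decompress=True)))
```

For a dataset of this size, a streaming tool such as seqtk piped straight into `zstd` is likely the more practical choice; the sketch is only meant to show what "removing within-record newlines" means.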
661k, single FASTA file
Compression | Line length | Size (GiB) | Ratio |
---|---|---|---|
Uncompressed | 60 | 2460 | 1 |
Gzip (pigz) | 60 | 751 | 3.3 |
Zstandard | 60 | 777 | 3.2 |
Zstandard --long | 60 | 641 | 3.8 |
Zstandard --long | 0 (infinite) | 232 | 11 |
Zstandard --long=31 | 0 (infinite) | 80 | 31 |
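The Ratio column is simply the uncompressed size divided by the compressed size. As a quick check, this short Python sketch recomputes the ratios from the sizes in the table above; the variable and function names are illustrative only.

```python
UNCOMPRESSED_GIB = 2460  # uncompressed 661k FASTA size from the table

# Compressed sizes in GiB, copied from the table above.
sizes_gib = {
    "Gzip (pigz), 60": 751,
    "Zstandard, 60": 777,
    "Zstandard --long, 60": 641,
    "Zstandard --long, single line": 232,
    "Zstandard --long=31, single line": 80,
}


def compression_ratio(compressed_gib: float,
                      uncompressed_gib: float = UNCOMPRESSED_GIB) -> float:
    """Compression ratio = uncompressed size / compressed size."""
    return uncompressed_gib / compressed_gib


if __name__ == "__main__":
    for label, size in sizes_gib.items():
        print(f"{label}: CR {compression_ratio(size):.1f}")
```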
Zstandard's --long range mode works wonders for assemblies, but needs uninterrupted single line sequences. *AllTheBacteria 661k, multiline fasta* gzip (pigz): 751GB zstandard --long: 641GB (30% original size) *Single line fasta* gzip (pigz): 700GB zstandard --long: 232GB (10% original size)
— Bede Constantinides (@bedec.bsky.social) Sep 9, 2025 at 11:27