Zstandard's long range mode works wonders for genome sequences without newlines

September 12, 2025

First released with Zstandard 1.3.2 in 2017, the --long range match finder increases the compressor’s search window to at least 128MiB, improving deduplication inside large files. This optional feature had substantial performance overheads at launch, but various optimisations have since brought its performance within shooting distance of Zstandard’s fast defaults. As a fan of Zstandard’s speed and efficiency, I hoped that --long might improve genome compression and bridge the chasm between fast general-purpose compressors with low compression ratios (CRs), and much slower specialist DNA sequence compressors capable of far higher CRs.

Grace Blackwell’s 2.6Tbp 661k dataset is a classic choice for benchmarking methods in microbial genomics. Comprising many similar DNA sequences, its 661,405 bacterial genome assemblies in FASTA text format are very compressible. Karel Břinda’s specialist MiniPhy approach takes this dataset from 2.46TiB to just 27GiB (CR: 91) by clustering and compressing similar genomes together. By comparison, naive Zstandard with default parameters compresses an order of magnitude faster, but achieves a CR of just 3.

I was initially underwhelmed by --long’s modest reduction of the 661k dataset from 777GiB (Zstandard default) to 641GiB (CR: 4). I speculated that this poor performance might be caused by the cosmetic newlines (0x0A) punctuating every 60 characters of sequence, which change the hashes of otherwise identical subsequences and so break long range pattern matching. Indeed, removing these non-semantic newlines using seqtk seq -l 0 tripled zstd --long’s CR to 11, yielding a 232GiB file while increasing compression time by only ~20% over Zstandard defaults. Increasing the window size to the 2GiB maximum on 64-bit systems using --long=31 nearly tripled CR again to 31, yielding an 80GiB file at the cost of ~80% more compression time than Zstandard defaults. Using larger-than-default window sizes has the drawback that the same --long=xx argument must be passed during decompression, which reduces compatibility somewhat. Results naturally vary between datasets, but given its low overheads, using --long often seems worthwhile. In this case zstd --long=31 achieved a compression ratio within an order of magnitude of slower state-of-the-art methods, representing a useful compromise. Just remember to remove any cosmetic whitespace from your files.
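For readers without seqtk to hand, the newline removal step can be sketched in plain Python. This is a minimal illustration rather than seqtk's implementation: the function name `unwrap_fasta` is hypothetical, and it assumes well-formed FASTA text in which every record begins with a `>` header line.

```python
def unwrap_fasta(lines):
    """Yield FASTA lines with each record's sequence joined onto one line.

    `lines` is any iterable of text lines, such as an open file handle.
    Assumption: every record starts with a '>' header; sequence lines that
    appear before the first header are discarded.
    """
    header = None
    chunks = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                yield header
                yield "".join(chunks)
            header = line
            chunks = []
        elif line:
            # Collect wrapped sequence lines, dropping the cosmetic newlines
            chunks.append(line)
    # Emit the final record
    if header is not None:
        yield header
        yield "".join(chunks)
```

Writing each yielded line back out followed by a single newline gives one header line and one sequence line per record, which is the layout that zstd --long benefited from in the results above.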

Edit: emphasised the non-semantic nature of the removed newlines, as highlighted by HN user jefftk. See the Hacker News discussion.

661k, single FASTA file

Compression	Line length	Size (GiB)	Ratio
Uncompressed	60	2460	1
Gzip (pigz)	60	751	3.3
Zstandard	60	777	3.2
Zstandard `--long`	60	641	3.8
Zstandard `--long`	0 (infinite)	232	11
Zstandard `--long=31`	0 (infinite)	80	31
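
The Ratio column is the uncompressed size divided by the compressed size, shown to two significant figures. A small sketch that reproduces it from the sizes in the table; the helper name `compression_ratio` and the dictionary labels are illustrative only.

```python
def compression_ratio(uncompressed_gib, compressed_gib):
    """Return uncompressed size divided by compressed size."""
    return uncompressed_gib / compressed_gib


UNCOMPRESSED_GIB = 2460  # the 661k dataset as a single FASTA file

# Compressed sizes in GiB, copied from the table above
sizes_gib = {
    "Gzip (pigz), line length 60": 751,
    "Zstandard, line length 60": 777,
    "Zstandard --long, line length 60": 641,
    "Zstandard --long, single line": 232,
    "Zstandard --long=31, single line": 80,
}

for label, size in sizes_gib.items():
    # .2g formats the ratio to two significant figures
    print(f"{label}: {compression_ratio(UNCOMPRESSED_GIB, size):.2g}")
```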

Zstandard's --long range mode works wonders for assemblies, but needs uninterrupted single line sequences.
*AllTheBacteria 661k, multiline fasta*
gzip (pigz): 751GB
zstandard --long: 641GB (30% original size)
*Single line fasta*
gzip (pigz): 700GB
zstandard --long: 232GB (10% original size)
— Bede Constantinides (@bedec.bsky.social) Sep 9, 2025 at 11:27