Left: False positive rate (FPR) was evaluated using reads simulated at 10x depth with dwgsim from 985 complete FDA-ARGOS bacterial genomes. Hostile 0.1.0 (default unmasked index) had the lowest median FPR, retaining the most bacterial reads, followed by Kraken2 (standard, version 2.1.3, index version 2023-06-05), Kraken2 (standard 8), and finally NCBI Scrubber (2.2.1). The worst-affected taxa included the important human pathogens Clostridioides difficile, Neisseria gonorrhoeae and Haemophilus influenzae.
Right: True positive rate (TPR) was evaluated using 27 real Illumina samples from the 1000 Genomes Project. Because these samples are real data from lymphoblastoid cell lines, many of them are heavily contaminated with Epstein-Barr virus, which has not been adjusted for here. Kraken2 (standard) had the highest median TPR, removing the most human reads, followed by Hostile, NCBI Scrubber, and finally Kraken2 (standard 8).
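To make the two metrics concrete, here is a minimal sketch of how a per-sample rate and its median could be computed from read counts before and after decontamination. It assumes FPR is the fraction of reads removed from a purely bacterial sample and TPR is the fraction of reads removed from a human sample (treating every input read as human, which the Epstein-Barr virus caveat above shows is only approximately true). The function names and the read counts are hypothetical illustrations, not the benchmark's actual code or data.

```python
from statistics import median


def removed_fraction(reads_in: int, reads_out: int) -> float:
    """Fraction of input reads that the tool removed."""
    if reads_in <= 0:
        raise ValueError("reads_in must be positive")
    if not 0 <= reads_out <= reads_in:
        raise ValueError("reads_out must be between 0 and reads_in")
    return (reads_in - reads_out) / reads_in


def median_rate(counts):
    """Median removed fraction over (reads_in, reads_out) pairs."""
    return median(removed_fraction(i, o) for i, o in counts)


# Hypothetical counts, for illustration only.
# Bacterial samples: every removed read is a false positive.
bacterial = [(100_000, 99_990), (100_000, 99_950), (100_000, 100_000)]
# Human samples: every removed read is assumed to be a true positive.
human = [(1_000_000, 2_000), (1_000_000, 500), (1_000_000, 10_000)]

median_fpr = median_rate(bacterial)
median_tpr = median_rate(human)
print(f"median FPR: {median_fpr:.6f}")
print(f"median TPR: {median_tpr:.6f}")
```

The same function serves both metrics because only the composition of the input differs: removal is an error on bacterial input and the desired outcome on human input.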
Removing host reads involves a trade-off between sensitivity and precision. If you wish to remove as many human reads as possible and don’t mind losing a few microbial reads along the way, Kraken2 with the 67 GB standard index is a great choice. If you either value precision or lack the RAM needed to run Kraken2, consider an alignment-based approach like Hostile. Note that prebuilt masked databases are available for Hostile, which increase precision even further, and custom masked databases can be easily created. Please read the preprint, and cite it if you find it useful. Thanks for reading!