The input string may comprise only unambiguous IUPAC amino acid letters (ARNDCQEGHILKMFPSTWYV). Input sequences are stripped of whitespace and asterisks and converted to upper case. The sequence is then hashed with SHA-1 and the digest is encoded in base32. The phoneme is derived from the first 8 characters of the same hash, encoded in base 10, with 10 consonants and 5 vowels mapped directly onto decimal numbers. The scheme is available as a command-line tool and a Python library in addition to the web service. Both implementations can also translate genomic sequences containing an in-frame, unambiguous SARS-CoV-2 spike sequence, which makes the scheme easy to adopt in practice for SARS-CoV-2. Naturally, this or a similar scheme could be used for any sequence from any organism, at the scale of a single gene or many genes.
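The normalisation and hashing steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of the published tool: the function names `normalize` and `sequence_id` are hypothetical, and the sketch assumes that the full 20-byte SHA-1 digest is encoded with the standard RFC 4648 base32 alphabet. The phoneme step is left out because the exact letter-to-digit mapping is not given here.

```python
import base64
import hashlib

# Unambiguous IUPAC amino acid letters accepted as input.
AMINO_ACIDS = set("ARNDCQEGHILKMFPSTWYV")


def normalize(seq: str) -> str:
    """Strip whitespace and asterisks, upper-case, and validate the alphabet."""
    cleaned = "".join(seq.split()).replace("*", "").upper()
    invalid = set(cleaned) - AMINO_ACIDS
    if invalid:
        raise ValueError(f"Unsupported characters: {sorted(invalid)}")
    return cleaned


def sequence_id(seq: str) -> str:
    """SHA-1 hash the normalised sequence and base32-encode the digest.

    Assumption: the whole digest is encoded; the real tool may differ.
    """
    digest = hashlib.sha1(normalize(seq).encode("ascii")).digest()
    return base64.b32encode(digest).decode("ascii")


if __name__ == "__main__":
    # Formatting differences do not change the identifier.
    print(sequence_id("mfvf lvllp*"))
    print(sequence_id("MFVFLVLLP"))
```

Because a SHA-1 digest is 160 bits long, its base32 encoding is always 32 characters with no padding, regardless of the length of the input sequence.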
A next step will be to generate and make available the identifiers for all known spike sequences, and allow retrieval of spike sequences by ID.
2021-01-20: Added phoneme description