
ViraLingo: Predicting Viral Variants with Pan-Viral LLMs

What if a language model trained on viral genomes could predict dangerous mutations before they become dominant in the population? That is the question behind ViraLingo — a pan-viral large language model we developed in collaboration with the University of Florida.

The Variant Prediction Challenge

The COVID-19 pandemic made the world acutely aware of what virologists had long known: variants matter. SARS-CoV-2's Alpha, Delta, and Omicron variants each required revised vaccines, revised clinical protocols, and revised public health responses. Being even a few weeks ahead of a variant's emergence would have been enormously valuable. The same challenge exists for HIV, Hepatitis B and C, and influenza — viruses that mutate rapidly and where variant surveillance directly affects treatment decisions.

Traditional phylogenetic approaches to variant surveillance are powerful but slow and computationally expensive. They require annotation by domain experts, do not generalise across virus families, and struggle to keep pace with the rate at which novel sequences are deposited in databases such as NCBI GenBank.

The Pan-Viral Language Modelling Approach

ViraLingo treats viral genome sequences as language. Just as a transformer language model learns the statistical structure of English text, ViraLingo learns the statistical structure of viral nucleotide and amino acid sequences — across multiple virus families simultaneously. The pan-viral training objective is key: by training on HIV, Hepatitis B, Hepatitis C, and Coronavirus sequences together, the model learns representations that generalise across viral biology rather than overfitting to a single family.

This cross-family generalisation is what makes ViraLingo unusual. Most existing viral language models (such as those built on ESM-2 for proteins) are trained on single families. A model that has seen HIV's mutation landscape may transfer useful representations when reasoning about a novel Coronavirus variant — because the underlying evolutionary pressures on the viral proteome share structural similarities.

"Language models have already transformed how we understand protein structure. ViraLingo extends this approach to viral evolution — treating mutations as a language with grammar, syntax, and semantics."

— Tommaso Buonocore, CTO of Bilobe

Architecture and Training

ViraLingo is built on a transformer encoder architecture adapted for genomic sequences. Key design choices include:

  • Tokenisation: We use a k-mer tokenisation strategy (k=6) rather than individual nucleotides, which captures local codon context and reduces sequence length while preserving biological signal.
  • Multi-task pre-training: The model is pre-trained on masked sequence modelling across all four virus families simultaneously, with family-specific prompts to allow the model to condition on taxonomic context.
  • Fitness prediction head: A fine-tuned regression head predicts viral fitness from sequence — the probability that a given variant will expand in the population.
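To make the tokenisation step concrete, here is a minimal sketch of non-overlapping 6-mer tokenisation. The function name and the choice of non-overlapping windows are illustrative assumptions; ViraLingo's production tokeniser may differ, for example by using overlapping windows or a learned vocabulary.

```python
def kmer_tokenise(sequence: str, k: int = 6, stride: int = 6) -> list[str]:
    """Split a nucleotide sequence into k-mer tokens.

    Illustrative sketch only: with stride == k the windows are
    non-overlapping, so an 18-nt sequence becomes three 6-mer tokens.
    """
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]


tokens = kmer_tokenise("ATGGTGAGCAAGGGCGAG")  # three 6-mers
```

With k = 6 and non-overlapping windows, the token sequence is six times shorter than the raw nucleotide sequence, which is what makes long genomes tractable for a transformer's context window.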

Training was performed on the HiPerGator supercomputer at the University of Florida, one of the top-ranked academic supercomputers in the United States. Access to HiPerGator was critical — the pre-training corpus contains over 8 million viral sequences spanning four decades of genomic surveillance data.

Figure: Schematic of ViraLingo's training pipeline. Sequences from NCBI GenBank are tokenised, embedded, and used in a masked modelling objective across multiple virus families.
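As a rough illustration of the family-conditioned masked modelling objective, the sketch below prepends a taxonomic prompt token to a k-mer sequence and masks a fraction of tokens for the model to reconstruct. Everything here (the `[HIV]`-style prompt tokens, the 15% mask rate, the `make_masked_example` helper) is a hypothetical stand-in, not ViraLingo's actual pipeline.

```python
import random

FAMILY_TOKENS = ["[HIV]", "[HBV]", "[HCV]", "[CoV]"]  # illustrative taxonomy prompts
MASK = "[MASK]"


def make_masked_example(kmers, family, mask_prob=0.15, seed=0):
    """Build one masked-modelling training example.

    Prepends a family prompt token so the model can condition on
    taxonomic context, then masks roughly mask_prob of the k-mer
    tokens. labels keeps the original token at masked positions and
    None elsewhere. Hypothetical sketch, not production code.
    """
    rng = random.Random(seed)
    tokens, labels = [family], [None]
    for km in kmers:
        if rng.random() < mask_prob:
            tokens.append(MASK)
            labels.append(km)
        else:
            tokens.append(km)
            labels.append(None)
    return tokens, labels
```

During pre-training, the model would be asked to recover the original k-mer at each `[MASK]` position, with the family prompt supplying the taxonomic context mentioned above.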

Results on HIV and Coronavirus

Early results are promising. On a held-out test set of HIV envelope protein variants, ViraLingo's fitness predictions correlate strongly with experimentally measured replication capacity (Spearman ρ = 0.71), outperforming single-family baselines. On SARS-CoV-2 spike protein sequences, the model correctly ranks Omicron sub-variants by their eventual epidemiological dominance in 78% of pairwise comparisons — predicting, in essence, which variants would win.

For Hepatitis C, where experimental fitness data is scarcer, the cross-family transfer from HIV and Coronavirus representations provides a substantial lift over models trained from scratch on HCV alone — a promising signal that pan-viral pre-training captures transferable biological knowledge.
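The two headline metrics, Spearman ρ and pairwise ranking accuracy, are straightforward to compute. The sketch below implements both from scratch, assuming no tied values for brevity; in practice a library routine such as scipy.stats.spearmanr would handle ties properly.

```python
def rank(values):
    """1-based ranks, assuming no tied values (for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks


def spearman_rho(x, y):
    """Spearman correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank(x), rank(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


def pairwise_accuracy(pred, true):
    """Fraction of variant pairs that pred and true rank in the same order."""
    pairs = correct = 0
    for i in range(len(pred)):
        for j in range(i + 1, len(pred)):
            pairs += 1
            if (pred[i] - pred[j]) * (true[i] - true[j]) > 0:
                correct += 1
    return correct / pairs
```

Pairwise accuracy is the metric behind the 78% figure above: each pair of Omicron sub-variants counts as correct if the model's fitness scores order them the same way their eventual epidemiological dominance did.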

Key Takeaways

  • ViraLingo is a pan-viral LLM trained on HIV, HBV, HCV, and Coronavirus genomes simultaneously
  • Cross-family pre-training enables generalisation to novel variants and underrepresented virus families
  • Fitness prediction head achieves Spearman ρ = 0.71 on HIV envelope protein variants
  • Training required HiPerGator supercomputer access via University of Florida collaboration
  • Future work targets real-time integration with GenBank deposition pipelines for prospective surveillance

Future Directions

The next phase of ViraLingo development focuses on three areas. First, real-time inference: integrating ViraLingo into an automated pipeline that scores new GenBank depositions within 24 hours of submission, enabling near-real-time variant surveillance. Second, expanding the virus family coverage to include influenza and dengue, which together account for hundreds of millions of infections annually. Third, developing interpretability tools specific to genomic sequences — understanding which amino acid positions drive the model's fitness predictions, and whether those positions correspond to known functional sites in viral biology.
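One simple interpretability baseline along these lines is occlusion: mask each position in turn and measure how much the predicted fitness drops. The sketch below illustrates the idea; `fitness_fn` is a hypothetical stand-in for a trained fitness head, not ViraLingo's actual API.

```python
def position_importance(kmers, fitness_fn, mask="[MASK]"):
    """Occlusion-style attribution over token positions.

    Masks each k-mer token in turn and records the drop in predicted
    fitness; large drops flag positions the model relies on, which can
    then be checked against known functional sites. Illustrative
    sketch: fitness_fn stands in for a trained model's fitness head.
    """
    base = fitness_fn(kmers)
    scores = []
    for i in range(len(kmers)):
        occluded = kmers[:i] + [mask] + kmers[i + 1:]
        scores.append(base - fitness_fn(occluded))
    return scores
```

Gradient-based attribution methods would give finer-grained signals, but occlusion has the advantage of being model-agnostic and easy to sanity-check against biological ground truth.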

We are actively seeking clinical and public health partners who want to pilot ViraLingo in a surveillance context. If your organisation monitors viral evolution and wants to explore what a machine learning layer could add to your existing workflow, we would love to hear from you.