Last November, more than 70 researchers, developers and data scientists gathered at Argonne National Laboratory (US) for the first AI Codeathon organised by the NIAID Bioinformatics Resource Centers, which include the Pathogen Data Network (PDN). Over three days, 12 teams developed AI prototypes to address bottlenecks in infectious disease research. Members of PDN contributed to several projects, including two that produced working prototypes closely aligned with the initiative’s aims for resource interoperability and innovative surveillance methods.

Making pathogen data easier to query: towards a ‘PDN GPT’

One of the teams explored how large language models could help researchers query pathogen data more easily and from a single entry point. The prototype tested the use of AI to translate natural-language questions into structured queries, which can then search a knowledge graph linking multiple biological data resources, including a key source of pathogen datasets for PDN (ENA) and complementary resources from the SIB Swiss Institute of Bioinformatics, such as those available through Expasy.

“This codeathon showed how quickly domain experts and AI specialists can move from concept to a working prototype,” said PDN’s Imane Lboukili, Data Scientist (SIB). “By experimenting with RDF structuring and automated SPARQL query generation, we created a tangible proof of feasibility for AI-assisted interoperability within PDN.”

Future work involves incorporating core PDN resources and turning the pipeline into a service that AI tools can query directly.

Watch the team’s presentation (PDN members involved: Imane Lboukili and Panayiotis Smeros (SIB)).

New AI approaches for studying novel viruses

A second prototype explored how protein language models could support the analysis of viral evolution, including that of newly discovered or emerging viruses. Because few viral protein structures have been experimentally determined, analysing viral evolution at the structural level remains challenging. The team therefore tested whether recently expanded datasets of AI-predicted viral protein structures (e.g. from AlphaFold) could improve AI models that predict coarse representations of viral structures, enabling large-scale comparisons of viral proteins.

“Access to large numbers of AI-predicted viral protein structures opens new ways to study viral evolution,” said David Moi, post-doctoral scientist (SIB and University of Lausanne). “It could help link newly discovered viruses to knowledge bases such as PDN’s ViralZone, making it easier to understand how newly sequenced viruses relate to the viral taxonomy.”

The approach could eventually help researchers compare large collections of viral sequences from a structural perspective and better understand long-term viral evolution, as well as the functions of the proteins within novel viruses.

Future work involves refining and fine-tuning existing protein language models to output structural tokens for viral sequences or to focus on specific viral clades, as well as creating models to segment viral polyproteins. All of these outputs will be coupled with structure-focused analyses of proteins across viruses to derive evolutionary relationships between viruses and compare their protein contents.

Watch the team’s presentation (PDN members involved: Dave Moi and Dongwook Kim (SIB)).

Other projects to accelerate pathogen research and disease monitoring

PDN researchers also contributed to two additional transversal codeathon efforts to: 

  • speed up the interpretation of genetic sequences by automatically producing concise research summaries from multiple databases – watch the team’s presentation (PDN member involved: Jason Williams (CSHL)).
  • explore automated collection and summarisation of outbreak information from online sources to support infectious disease monitoring – watch the team’s presentation (PDN member involved: Alexander Taepper (SIB)).

While the prototypes developed during the codeathon were experimental, they demonstrate practical approaches that support PDN’s broader goal of making pathogen data more connected, accessible and reusable for the research community.

The event was also a great platform to foster collaboration across all three NIAID-BRCs.

See the full list of teams and the event summary on the NIAID-BRC AI Codeathon website.