CROssBARv2: A Unified Biomedical Knowledge Graph for Heterogeneous Data Representation and LLM-Driven Exploration

1Biological Data Science Lab, Dept. of Computer Engineering & AI Engineering, Hacettepe University, Ankara, Turkey

2Dept. of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, Ankara, Turkey

The Challenge

Developing effective therapeutics for prevalent diseases requires deep insight into molecular, genetic, and cellular factors, yet this knowledge is scattered across diverse sources, which makes data integration and analysis difficult.

What's inside:

Here, we present CROssBARv2, a heterogeneous knowledge graph (KG) based system to facilitate systems biology and drug discovery/repurposing. CROssBARv2 collects large-scale biological data from 34 data sources and stores them in a Neo4j-based graph database.

Our Solution: CROssBARv2

CROssBARv2 consists of 2,709,502 nodes and 12,688,124 relationships between 14 node types (i.e., protein, gene, organism, domain, biological process, molecular function, cellular component, drug, compound, disease, pathway, phenotype, EC number, and side effect).
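To make the idea of a heterogeneous graph concrete, here is a minimal in-memory sketch in Python. It is not the CROssBARv2 implementation, which is stored in Neo4j. Only the 14 node type names come from the text above; the class names, node identifiers, and relationship names (`targets`, `is_related_to`) are hypothetical and chosen for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# The 14 node types listed in the text.
NODE_TYPES = frozenset({
    "protein", "gene", "organism", "domain", "biological process",
    "molecular function", "cellular component", "drug", "compound",
    "disease", "pathway", "phenotype", "EC number", "side effect",
})


@dataclass(frozen=True)
class Node:
    node_id: str    # hypothetical identifier, for illustration only
    node_type: str  # expected to be one of NODE_TYPES


@dataclass(frozen=True)
class Edge:
    source: Node
    rel_type: str   # hypothetical relationship name
    target: Node


class HeteroGraph:
    """Toy heterogeneous graph: every node carries a type, every edge a relationship name."""

    def __init__(self):
        self.nodes = set()
        self.edges = []

    def add_edge(self, source, rel_type, target):
        # Validate both endpoints before changing any state.
        for node in (source, target):
            if node.node_type not in NODE_TYPES:
                raise ValueError(f"unknown node type: {node.node_type!r}")
        self.nodes.update((source, target))
        self.edges.append(Edge(source, rel_type, target))

    def node_type_counts(self):
        """Count nodes per node type."""
        return Counter(node.node_type for node in self.nodes)

    def neighbors(self, node, target_type=None):
        """Targets of outgoing edges from `node`, optionally filtered by node type."""
        return [
            edge.target
            for edge in self.edges
            if edge.source == node
            and (target_type is None or edge.target.node_type == target_type)
        ]


# Placeholder nodes; these are not records from the KG.
graph = HeteroGraph()
drug = Node("drug:example", "drug")
protein = Node("protein:example", "protein")
disease = Node("disease:example", "disease")
graph.add_edge(drug, "targets", protein)
graph.add_edge(protein, "is_related_to", disease)
```

Filtering neighbors by node type, as in `graph.neighbors(drug, "protein")`, is the kind of typed traversal that a graph database performs at scale.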

Smart Access

We developed a large language model (LLM) interface that translates natural language questions into Neo4j's Cypher query language and converts the query results back into natural language, so that user questions are answered from information within the KG without LLM hallucinations. CROssBARv2 can also be accessed through our GraphQL API and the Neo4j browser.
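As a rough illustration of the query side of such an interface, the Python sketch below builds a parameterized Cypher query and applies a conservative read-only check before a query would be sent to the database. It does not reproduce the CROssBARv2 interface. The labels (`Drug`, `Protein`, `Disease`, `Gene`), the `name` property, and both function names are assumptions made for this example, and the actual schema may differ.

```python
import re

# Hypothetical labels; the actual CROssBARv2 schema may name these differently.
ALLOWED_LABELS = {"Drug", "Protein", "Disease", "Gene"}

# Clauses that modify a graph. Matching on keywords is deliberately conservative
# and can reject harmless queries that merely contain one of these words.
_WRITE_CLAUSES = re.compile(
    r"\b(CREATE|MERGE|DELETE|DETACH|SET|REMOVE|DROP)\b", re.IGNORECASE
)


def build_neighbor_query(source_label, target_label, name, limit=10):
    """Return (query, params) for neighbors of one named node.

    Labels cannot be passed as Cypher parameters, so they are checked against
    an allow-list before being placed in the query text. Values such as the
    node name and the row limit are passed as parameters.
    """
    for label in (source_label, target_label):
        if label not in ALLOWED_LABELS:
            raise ValueError(f"label not allowed: {label!r}")
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    query = (
        f"MATCH (s:{source_label} {{name: $name}})--(t:{target_label}) "
        "RETURN DISTINCT t.name AS name LIMIT $limit"
    )
    return query, {"name": name, "limit": limit}


def is_read_only(query):
    """True if the query text contains none of the write clauses listed above."""
    return _WRITE_CLAUSES.search(query) is None


query, params = build_neighbor_query("Drug", "Protein", "example drug", limit=5)
print(query)
print(params, is_read_only(query))
```

A guard like `is_read_only` is one possible safeguard when query text is generated by a model rather than written by hand; parameterizing user-supplied values keeps them out of the query string itself.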

CROssBARv2 is expected to contribute to life sciences research in terms of (i) the discovery of biological mechanisms at the molecular level and (ii) the development of effective therapeutic strategies.

Unique Databases: 34
Node Types: 14 (e.g., genes, proteins, diseases)
Relationship Types: 13 (e.g., associations, interactions)
Nodes: 2,709,502
Relationships: 12,688,124