ChemWorldModel

An open knowledge graph for chemistry. Query reactions, explore molecules, and plan synthesis routes using natural language — backed by 1.8 million reactions from the Open Reaction Database and enriched with data from PubChem, ChEBI, ChEMBL, and GHS hazard classifications.

What this project does

Ask questions in plain language — “What catalysts are used in Suzuki coupling reactions with yields above 80%?” — and get answers grounded in real experimental data, with full provenance back to the source.
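As a rough illustration of what "grounded with provenance" can mean, the example question above might reduce to a structured filter like the one below. This is a minimal sketch, not ChemWorldModel's actual query pipeline: the `ReactionRecord` fields, the `catalysts_above_yield` helper, and the records themselves are hypothetical and made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactionRecord:
    reaction_id: str      # hypothetical identifier pointing back to the source record
    reaction_class: str
    catalyst: str
    yield_percent: float


def catalysts_above_yield(records, reaction_class, min_yield):
    """Group matching reaction ids by catalyst so each answer keeps its provenance."""
    hits = {}
    for rec in records:
        if rec.reaction_class == reaction_class and rec.yield_percent > min_yield:
            hits.setdefault(rec.catalyst, []).append(rec.reaction_id)
    return hits


# Made-up records for illustration only; not taken from the Open Reaction Database.
toy_records = [
    ReactionRecord("rxn-001", "Suzuki coupling", "Pd(PPh3)4", 91.0),
    ReactionRecord("rxn-002", "Suzuki coupling", "Pd(dppf)Cl2", 76.5),
    ReactionRecord("rxn-003", "Suzuki coupling", "Pd(PPh3)4", 84.2),
    ReactionRecord("rxn-004", "Buchwald-Hartwig amination", "Pd2(dba)3", 88.0),
]

result = catalysts_above_yield(toy_records, "Suzuki coupling", 80.0)
print(result)
```

Returning the supporting reaction ids alongside each catalyst, rather than the catalyst names alone, is what lets an answer be traced back to individual source records.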

Search molecules by structure, substructure, or fingerprint similarity. Every molecule is identified by InChIKey and enriched with properties, biological roles, bioactivity data, and safety classifications from multiple public databases.
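Fingerprint similarity is commonly measured with the Tanimoto coefficient: the number of shared on-bits divided by the number of bits set in either fingerprint. The sketch below assumes fingerprints are already available as sets of on-bit indices (in practice a cheminformatics toolkit would compute them). The `tanimoto` and `most_similar` helpers, the placeholder keys, and the bit sets are hypothetical, and the project may use a different metric.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def most_similar(query_fp, library, threshold=0.5):
    """Return (key, score) pairs scoring at or above threshold, best first."""
    scored = [(key, tanimoto(query_fp, fp)) for key, fp in library.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Toy library: placeholder keys stand in for InChIKeys; bit sets are made up.
library = {
    "KEY-A": {1, 2, 3, 4},
    "KEY-B": {1, 2, 3, 9},
    "KEY-C": {7, 8},
}

matches = most_similar({1, 2, 3, 4}, library, threshold=0.5)
print(matches)
```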

Plan synthesis routes by traversing the reaction graph. Given a target molecule, find multi-step paths scored by yield, step count, and commercial availability of starting materials.
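One way to picture this traversal is a depth-limited backward search from the target to purchasable starting materials. The sketch below is an assumption about how such scoring could work, not the project's planner: `best_route`, the `reactions` mapping, and the toy molecules `A` to `D` are hypothetical. It scores a route by the product of its step yields (a simplification for convergent routes) and prefers fewer steps on ties.

```python
def best_route(target, reactions, purchasable, max_depth=5, _seen=frozenset()):
    """Return (overall_yield, steps) for the best route to target, or None.

    reactions maps a product to a list of (reactants, yield_fraction) options.
    A route is scored by the product of its step yields; ties go to fewer steps.
    """
    if target in purchasable:
        return 1.0, []
    if max_depth == 0 or target in _seen:
        return None  # too deep, or a cycle back to a molecule already being expanded
    best = None
    for reactants, step_yield in reactions.get(target, []):
        total, steps = step_yield, []
        for reactant in reactants:
            sub = best_route(reactant, reactions, purchasable,
                             max_depth - 1, _seen | {target})
            if sub is None:
                break  # this option needs a reactant that cannot be made
            total *= sub[0]
            steps += sub[1]
        else:
            candidate = (total, steps + [(reactants, target)])
            if best is None or (candidate[0], -len(candidate[1])) > (best[0], -len(best[1])):
                best = candidate
    return best


# Toy reaction graph with made-up yields: D can be made directly or via C.
reactions = {
    "D": [(("C",), 0.9), (("A", "B"), 0.5)],
    "C": [(("A", "B"), 0.8)],
}
purchasable = {"A", "B"}

route = best_route("D", reactions, purchasable)
print(route)
```

Here the two-step route through `C` wins because its combined yield (0.9 × 0.8) beats the 0.5 yield of the direct one-step route. A scoring function that weights step count more heavily could reverse that choice.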

Data Sources

Every record traces back to a public, open-access data source with full provenance tracking.

Open Reaction Database
A structured, open-access repository of chemical reaction data including experimental conditions, yields, and selectivity. The primary dataset powering ChemWorldModel's reaction network.
Coverage: ~1.8M reactions
License: CC BY-SA 4.0
Provides: reaction SMILES; experimental conditions; yields and selectivity; reactant/product/catalyst roles
Format: Protobuf (.pb.gz)

PubChem
The world's largest open chemistry database, maintained by NCBI at the National Institutes of Health. Used to enrich molecules with computed and curated properties.
Coverage: 110M+ compounds
License: Public domain
Provides: IUPAC names; exact mass; XLogP; molecular complexity; PubChem CID
Format: PUG REST API

ChEBI
Chemical Entities of Biological Interest, a freely available ontology of molecular entities focused on small chemical compounds. Provides the role and classification hierarchy for molecules.
Coverage: ~130K terms
License: CC BY 4.0
Provides: molecular roles; is_a hierarchy; biological classifications; cross-references
Format: OBO ontology

ChEMBL
A manually curated database of bioactive molecules with drug-like properties, maintained by EMBL-EBI. Provides compound–target interaction data from medicinal chemistry literature.
Coverage: 2.4M+ compounds
License: CC BY-SA 3.0
Provides: bioactivity data (IC₅₀, Kᵢ); target proteins; assay information; drug mechanism of action
Format: SQLite dump

GHS
Globally Harmonized System of Classification and Labelling of Chemicals. Hazard data sourced via PubChem to provide safety information for molecules in the knowledge graph.
Coverage: per-molecule lookup
License: Public domain
Provides: H-codes (hazard statements); signal words; GHS pictograms; precautionary statements
Format: PUG VIEW API
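For the two PubChem-backed sources, the lookups could be built along the lines below. This is a sketch under assumptions, not the project's loader code: the helper names (`property_url`, `ghs_url`, `fetch_json`) are hypothetical. The URL patterns follow PubChem's public PUG REST and PUG View conventions and should be checked against PubChem's documentation before use. The example uses aspirin (InChIKey `BSYNRYMUTXBXSQ-UHFFFAOYSA-N`, CID 2244).

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PUG_VIEW = "https://pubchem.ncbi.nlm.nih.gov/rest/pug_view"

# PUG REST property names matching the fields listed for PubChem above.
PROPERTIES = ("IUPACName", "ExactMass", "XLogP", "Complexity")


def property_url(inchikey, properties=PROPERTIES):
    """Build a PUG REST URL that looks up computed properties by InChIKey."""
    return (f"{PUG_REST}/compound/inchikey/{quote(inchikey)}"
            f"/property/{','.join(properties)}/JSON")


def ghs_url(cid):
    """Build a PUG View URL restricted to the GHS Classification section."""
    query = urlencode({"heading": "GHS Classification"})
    return f"{PUG_VIEW}/data/compound/{int(cid)}/JSON?{query}"


def fetch_json(url, timeout=30):
    """Fetch and decode a JSON response (performs a network request)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


aspirin_props = property_url("BSYNRYMUTXBXSQ-UHFFFAOYSA-N")
aspirin_ghs = ghs_url(2244)
print(aspirin_props)
print(aspirin_ghs)
```

Calling `fetch_json(aspirin_props)` would perform the actual request. A real loader would also need rate limiting and error handling, which are omitted here.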

Contributing

ChemWorldModel is open source. The codebase, schema migrations, and loader implementations are all available on GitHub. Contributions welcome — whether that’s adding a new data source, improving the query pipeline, or fixing a bug in the frontend.