Description
Dataset Description
This dataset is a Knowledge Graph (KG) projection of a subset of RB-Tnseq experimental data and associated biological and experimental metadata from the Fitness Browser. The KG is hosted as a Neo4j graph database and contains three trained neural network models:
- A multi-layer perceptron (MLP) with learned gene embeddings, which predicts fitness from a gene-experiment-media triplet.
- An MLP + Graph Attention Transformer (GAT), which injects graph structure into the MLP by operating over gene–protein–function and media–chemical subgraphs, integrated via a gated residual mechanism. The objective function is link regression, which predicts the fitness effect of a gene-experiment-media relationship.
- An MLP + GraphSAGE encoder model using an architecture similar to that of the MLP + GAT, where the objective function is also link regression.
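The gated residual integration mentioned for the graph-augmented models can be illustrated with a minimal sketch. This is not the released implementation: the function name gated_residual, the scalar sigmoid gate, and the example vectors are assumptions chosen only to show the general idea of letting a learned gate decide how much graph-derived signal is added to the MLP representation.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_residual(h_mlp, h_graph, gate_weights, gate_bias=0.0):
    """Add graph features to MLP features, scaled by a scalar gate.

    Hypothetical formulation (an assumption, not the authors' code):
        gate = sigmoid(w . [h_mlp ; h_graph] + b)
        out  = h_mlp + gate * h_graph
    """
    if len(h_mlp) != len(h_graph):
        raise ValueError("h_mlp and h_graph must have the same length")
    concat = list(h_mlp) + list(h_graph)
    if len(gate_weights) != len(concat):
        raise ValueError("gate_weights must match the concatenated length")
    # The gate is computed from both representations.
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, concat)) + gate_bias)
    # Residual form: the MLP features pass through unchanged when gate -> 0.
    fused = [m + gate * g for m, g in zip(h_mlp, h_graph)]
    return fused, gate


# Toy vectors standing in for an MLP embedding and a graph-encoder embedding.
fused, gate = gated_residual([1.0, 2.0], [2.0, -2.0], [0.0, 0.0, 0.0, 0.0])
```

With all gate weights set to zero the gate is 0.5, so half of the graph signal is added. A strongly negative bias closes the gate and the output falls back to the plain MLP features, which is the usual motivation for a residual design of this kind.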
The resulting graph supports two complementary inference modes: (1) symbolic graph traversal to surface candidate gene–environment and gene–chemical associations, and (2) learned inference using heterogeneous graph neural networks that propagate information across biological and environmental neighborhoods.
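As a rough illustration of the symbolic traversal mode, the sketch below builds a parameterized Cypher query and runs it with the official Neo4j Python driver. The graph schema is not described here, so the node labels Gene and Experiment, the relationship type HAS_FITNESS, and the properties name and fitness are hypothetical placeholders; the function names are also invented for this example. Consult the Readme and the loaded database for the actual schema.

```python
def build_fitness_query(min_abs_fitness, limit=25):
    """Return a parameterized Cypher query and its parameter dictionary.

    Gene, Experiment, HAS_FITNESS, name, and fitness are hypothetical
    placeholders, not the confirmed schema of this knowledge graph.
    """
    query = (
        "MATCH (g:Gene)-[r:HAS_FITNESS]->(e:Experiment) "
        "WHERE abs(r.fitness) >= $threshold "
        "RETURN g.name AS gene, e.name AS experiment, r.fitness AS fitness "
        "ORDER BY abs(r.fitness) DESC "
        "LIMIT $limit"
    )
    params = {"threshold": float(min_abs_fitness), "limit": int(limit)}
    return query, params


def run_query(uri, user, password, query, params):
    """Run a query against a running Neo4j instance and return rows as dicts."""
    # Imported lazily so the query builder works without the neo4j package.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            # Materialize the records before the session closes.
            return [record.data() for record in session.run(query, params)]


query, params = build_fitness_query(2.0, limit=10)
```

Assuming a local instance with the dump loaded, a call such as run_query("bolt://localhost:7687", "neo4j", "<password>", query, params) would return the strongest fitness relationships under the placeholder schema. Passing the threshold as a query parameter rather than formatting it into the string keeps the query reusable and avoids injection issues.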
Data Download Reference Citation:
Winston, Anthony; Donald, Sam; Purohit, Sumit; Patel, Kaizad; Egbert, Robert; & Waters, Katrina M (2026). Knowledge Graph of RB-Tnseq Data from Fitness Browser (KP-DP1).
Accessible Digital Data Downloads
The repository contains the following files:
- neo4j.dump: subset of the entire knowledge graph (built from the 10 target organisms)
- Readme: a walkthrough of the installation steps
Total Download Size: <to do>
Linked Primary Data
The Fitness Browser dataset can be found here: https://fit.genomics.lbl.gov/cgi-bin/myFrontPage.cgi
Funding Acknowledgments
The research data described here was funded in whole or in part by the Predictive Phenomics Initiative (PPI) at Pacific Northwest National Laboratory (PNNL). This work was conducted under the Laboratory Directed Research and Development Program at PNNL. PNNL is a multiprogram national laboratory operated by Battelle for the DOE under Contract No. DE-AC05-76RL01830.
Citation Policy
In an effort to enable discovery, reproducibility, and reuse of PPI-funded project dataset citations in accordance with best practices (as outlined by the FORCE11 Data Citation Principles), we ask that all reuse of project data and metadata download materials acknowledge all primary and secondary dataset citations and corresponding journal articles, where applicable.
Data Licensing