ChemicalX Documentation¶
ChemicalX is a deep learning library for drug-drug interaction, polypharmacy side effect, and synergy prediction. The library consists of data loaders and integrated benchmark datasets. It also includes state-of-the-art deep neural network architectures that solve the drug pair scoring task. Implemented methods cover traditional SMILES string-based techniques and neural message-passing based models.
@article{chemicalx,
arxivId = {2202.05240},
author = {Rozemberczki, Benedek and Hoyt, Charles Tapley and Gogleva, Anna and Grabowski, Piotr and Karis, Klas and Lamov, Andrej and Nikolov, Andriy and Nilsson, Sebastian and Ughetto, Michael and Wang, Yu and Derr, Tyler and Gyori, Benjamin M},
month = {feb},
title = {{ChemicalX: A Deep Learning Library for Drug Pair Scoring}},
url = {http://arxiv.org/abs/2202.05240},
year = {2022}
}
ChemicalX¶
Data Structures and Loaders¶
Datasets and utilities.
Classes¶
- Generator to create batches of drug pairs efficiently.
- Context feature set for biological/chemical context feature vectors.
- Drug feature set for compounds.
- A data class to store a labeled drug pair batch.
- Labeled triples for drug pair scoring.
- A generic dataset.
- A dataset loader for remote data.
- A dataset loader that processes and caches data locally.
- A dataset loader for Drugbank DDI.
- A dataset loader for a sample of TWOSIDES.
- A dataset loader for DrugComb.
- A dataset loader for DrugCombDB.
- A large-scale oncology screen of drug-drug synergy from [oneil2016].
Pair Scoring Models¶
Models for ChemicalX.
Classes¶
- The base class for ChemicalX models.
- The base class for unimplemented ChemicalX models.
- An implementation of the CASTER model from [huang2020].
- An implementation of the DeepDDI model from [ryu2018].
- An implementation of the DeepDDS model from [wang2021].
- An implementation of the DeepDrug model from [cao2020].
- An implementation of the DeepSynergy model from [preuer2018].
- An implementation of the EPGCN-DS model from [sun2020].
- An implementation of the GCN-BMP model from [chen2020].
- An implementation of the MatchMaker model from [kuru2021].
- An implementation of the MHCADDI model from [deac2019].
- An implementation of the MR-GNN model from [xu2019].
- An implementation of the SSI-DDI model from [nyamabo2021].
Pipeline¶
A collection of full training and evaluation pipelines.
- class Result(model, predictions, losses, train_time, evaluation_time, metrics)
A result package.
- pipeline(*, dataset, model, model_kwargs=None, optimizer_cls=<class 'torch.optim.adam.Adam'>, optimizer_kwargs=None, loss_cls=<class 'torch.nn.modules.loss.BCELoss'>, loss_kwargs=None, batch_size=512, epochs, context_features, drug_features, drug_molecules, train_size=None, random_state=None, metrics=None, device=None)
Run the training and evaluation pipeline.
- Parameters
  - dataset (Union[str, DatasetLoader, Type[DatasetLoader], None]) – The dataset can be specified in one of three ways:
    - The name of the dataset
    - A subclass of chemicalx.DatasetLoader
    - An instance of a chemicalx.DatasetLoader
  - model (Union[str, Model, Type[Model], None]) – The model can be specified in one of three ways:
    - The name of the model
    - A subclass of chemicalx.Model
    - An instance of a chemicalx.Model
  - model_kwargs (Optional[Mapping[str, Any]]) – Keyword arguments to pass through to the model constructor. Relevant if passing model by string or class.
  - optimizer_cls (Type[Optimizer]) – The class for the optimizer to use. Currently defaults to torch.optim.Adam.
  - optimizer_kwargs (Optional[Mapping[str, Any]]) – Keyword arguments to pass through to the optimizer construction.
  - loss_cls (Type[_Loss]) – The loss to use. If none given, uses torch.nn.BCELoss.
  - loss_kwargs (Optional[Mapping[str, Any]]) – Keyword arguments to pass through to the loss construction.
  - batch_size (int) – The batch size.
  - epochs (int) – The number of epochs to train.
  - context_features (bool) – Indicator whether the batch should include biological context features.
  - drug_features (bool) – Indicator whether the batch should include drug features.
  - drug_molecules (bool) – Indicator whether the batch should include drug molecules.
  - train_size (Optional[float]) – The ratio of training triples. Default is 0.8 if None is passed.
  - random_state (Optional[int]) – The random seed for splitting the triples. Default is 42. Set to none for no fixed seed.
  - metrics (Optional[Sequence[str]]) – The list of metrics to use.
- Return type
  Result
- Returns
  A result object with the trained model and evaluation results
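The three ways of specifying dataset and model (by name, by class, or by instance) follow a common resolver pattern. The sketch below illustrates that pattern with toy classes only; ToyLoader, REGISTRY, and resolve are hypothetical names chosen for illustration and are not part of ChemicalX.

```python
from typing import Any, Mapping, Optional, Type, Union


class ToyLoader:
    """Hypothetical stand-in for a dataset loader (not a ChemicalX class)."""

    def __init__(self, scale: int = 1) -> None:
        self.scale = scale


# Hypothetical name-to-class registry used for string lookups.
REGISTRY = {"toyloader": ToyLoader}


def resolve(
    query: Union[str, ToyLoader, Type[ToyLoader]],
    kwargs: Optional[Mapping[str, Any]] = None,
) -> ToyLoader:
    """Turn a name, a class, or an instance into an instance."""
    if isinstance(query, ToyLoader):
        # Already an instance: keyword arguments are not needed.
        return query
    if isinstance(query, str):
        # Look the class up by its case-insensitive name.
        query = REGISTRY[query.lower()]
    # At this point the query is a class, so construct it.
    return query(**(kwargs or {}))


by_name = resolve("ToyLoader")
by_class = resolve(ToyLoader, {"scale": 3})
by_instance = resolve(ToyLoader(scale=5))
```

In this sketch the keyword arguments only matter for the name and class cases, which matches the note that model_kwargs is relevant when passing the model by string or class.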
Installation¶
The installation of ChemicalX requires certain prerequisites. These are described in detail in the installation instructions of PyTorch Geometric; please follow those instructions first. You might also take a look at the readme file of the ChemicalX repository. The torch-scatter binaries are provided for Python versions <= 3.9.
PyTorch 1.10.0
To install the binaries for PyTorch 1.10.0, simply run
$ pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.10.0+${CUDA}.html
$ pip install torchdrug
$ pip install chemicalx
where ${CUDA} should be replaced by either cpu, cu102, or cu111 depending on your PyTorch installation.
Updating the Library
The package itself can be installed via pip:
$ pip install chemicalx
Upgrade your outdated ChemicalX version by using:
$ pip install chemicalx --upgrade
To check your current package version, run:
$ pip freeze | grep chemicalx
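The same check can be done from inside Python. The following is a minimal sketch using the standard library; installed_version is a hypothetical helper name, and the function simply returns None when the package is not installed.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if chemicalx is installed, otherwise None.
print(installed_version("chemicalx"))
```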
Introduction¶
ChemicalX is a deep learning library for drug-drug interaction, polypharmacy side effect, and synergy prediction. The library consists of data loaders and integrated benchmark datasets. It also includes state-of-the-art deep neural network architectures that solve the drug pair scoring task. Implemented methods cover traditional SMILES string-based techniques and neural message-passing based models.
Overview¶
We briefly overview the fundamental concepts and features of ChemicalX through simple examples.
Design Philosophy¶
When ChemicalX was created we wanted to reuse the high-level architectural elements of torch and torchdrug. We also wanted to conceptualize the ideas outlined in A Unified View of Relational Deep Learning for Drug Pair Scoring.
Drug Feature Set¶
Drug feature sets are custom UserDict objects that allow the fast retrieval of the molecular graph and the drug-level features, such as the Morgan fingerprint of the drug. The get_feature_matrix and get_molecules methods allow the batching of drugs and molecular graphs using the drug identifiers. Molecule-level features are returned as a torch.FloatTensor matrix while the molecular graphs are PackedGraph objects generated by torchdrug.
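To make the UserDict idea concrete, here is a minimal sketch of a dictionary-backed feature set with a batching method. ToyDrugFeatureSet and its record layout are hypothetical and are not the ChemicalX implementation; plain Python lists stand in for torch.FloatTensor so that the example runs without extra dependencies.

```python
from collections import UserDict
from typing import Iterable, List


class ToyDrugFeatureSet(UserDict):
    """Hypothetical feature set mapping a drug identifier to its record."""

    def get_feature_matrix(self, drugs: Iterable[str]) -> List[List[float]]:
        """Stack the feature vectors of the requested drugs row by row."""
        return [self.data[drug]["features"] for drug in drugs]


drug_set = ToyDrugFeatureSet(
    {
        "drug_a": {"features": [1.0, 0.0, 1.0]},
        "drug_b": {"features": [0.0, 1.0, 1.0]},
    }
)

# Identifiers may repeat within a batch; each lookup is a dictionary access.
matrix = drug_set.get_feature_matrix(["drug_b", "drug_a", "drug_b"])
```

Because the class behaves like a dictionary, individual drugs can still be retrieved with drug_set["drug_a"].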
Context Feature Set¶
Similarly to the DrugFeatureSet, the ContextFeatureSet is a custom UserDict object that allows the storage of biological or chemical context-specific feature vectors. These features are stored as torch.FloatTensor instances for each context identifier key.
Labeled Triples¶
Labeled triples contain labeled drug pairs where the label is specific to a context. The LabeledTriples class is a wrapper around a pandas DataFrame that allows shuffling the triples and generating training and test splits with the train_test_split method. This class also provides basic descriptive statistics about the number of negatively labeled instances and the number of labeled triples.
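The shuffle-and-split idea can be sketched with the standard library alone. The split_triples helper below is hypothetical and operates on plain tuples rather than on a LabeledTriples object; its default values of 0.8 and 42 mirror the defaults described for the pipeline parameters.

```python
import random
from typing import List, Optional, Tuple

Triple = Tuple[str, str, str, float]  # (drug_1, drug_2, context, label)


def split_triples(
    triples: List[Triple],
    train_size: float = 0.8,
    random_state: Optional[int] = 42,
) -> Tuple[List[Triple], List[Triple]]:
    """Shuffle a copy of the triples and cut it into train and test parts."""
    shuffled = list(triples)
    # A fixed seed makes the split reproducible; None gives a fresh shuffle.
    random.Random(random_state).shuffle(shuffled)
    cut = int(round(train_size * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


triples = [(f"d{i}", f"d{i + 1}", "ctx", float(i % 2)) for i in range(10)]
train, test = split_triples(triples, train_size=0.5)

# A simple descriptive statistic: the number of negatively labeled triples.
negatives = sum(1 for triple in triples if triple[3] == 0.0)
```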
Dataset Loaders¶
Dataset loaders allow the prompt retrieval of integrated datasets. After a loader is initialized, its methods return the respective DrugFeatureSet, ContextFeatureSet and LabeledTriples.
from chemicalx.data import DrugCombDB
loader = DrugCombDB()
context_set = loader.get_context_features()
drug_set = loader.get_drug_features()
triples = loader.get_labeled_triples()
Batch Generators and Drug Pair Batches¶
Using instances of the DrugFeatureSet, ContextFeatureSet, and LabeledTriples classes one can initialize a BatchGenerator instance. This class allows the generation of DrugPairBatch instances, which contain the drug and context features for the drugs in the batch. In the training and evaluation of deep drug pair scoring models the DrugPairBatch acts as a custom data class.
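The relationship between the feature sets, the triples, and the batches can be sketched as follows. ToyBatch and generate_batches are hypothetical names, plain dictionaries and lists stand in for the feature sets and tensors, and the attribute names are only borrowed from the tutorial code for readability; this is not the ChemicalX implementation.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str, float]  # (drug_1, drug_2, context, label)


@dataclass
class ToyBatch:
    """Hypothetical batch holding aligned feature rows and labels."""

    drug_features_left: List[List[float]]
    drug_features_right: List[List[float]]
    context_features: List[List[float]]
    labels: List[float]


def generate_batches(
    triples: List[Triple],
    drug_features: Dict[str, List[float]],
    context_features: Dict[str, List[float]],
    batch_size: int,
) -> Iterator[ToyBatch]:
    """Yield consecutive batches with features looked up per triple."""
    for start in range(0, len(triples), batch_size):
        chunk = triples[start:start + batch_size]
        yield ToyBatch(
            drug_features_left=[drug_features[t[0]] for t in chunk],
            drug_features_right=[drug_features[t[1]] for t in chunk],
            context_features=[context_features[t[2]] for t in chunk],
            labels=[t[3] for t in chunk],
        )


drugs = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
contexts = {"cell_1": [1.0]}
triples = [
    ("a", "b", "cell_1", 1.0),
    ("b", "a", "cell_1", 0.0),
    ("a", "a", "cell_1", 0.0),
]
batches = list(generate_batches(triples, drugs, contexts, batch_size=2))
```

The last batch is smaller than batch_size when the number of triples is not a multiple of it.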
Models and Pipelines¶
Model Layers¶
Drug pair scoring models in ChemicalX inherit from torch neural network modules. Each of the models provides an unpack and a forward method; the first helps with unpacking the drug pair batch while the second makes a forward pass to make predictions and return propensities for the drug pairs in the batch. Models have sensible default parameters for the non-dataset-dependent hyperparameters.
Pipelines¶
Pipelines provide high-level abstractions for the end-to-end training and evaluation of ChemicalX models. Given a dataset and a model, a pipeline trains the model on the dataset and generates scores and evaluation metrics.
from chemicalx import pipeline
from chemicalx.models import DeepSynergy
from chemicalx.data import DrugCombDB

# The channel sizes are set to be compatible with the DrugCombDB features.
model = DeepSynergy(context_channels=112,
                    drug_channels=256)
dataset = DrugCombDB()

# Train and evaluate the model end to end.
results = pipeline(dataset=dataset,
                   model=model,
                   batch_size=1024,
                   context_features=True,
                   drug_features=True,
                   drug_molecules=False,
                   epochs=100)

results.summarize()
results.save("~/test_results/")
Tutorial¶
In this lightweight tutorial, we walk through an oncology use case. The same use case is discussed in detail in the ChemicalX design paper that accompanies the library. We recommend reading the design paper alongside the tutorial, as it connects the concepts discussed here and in the introduction.
Data Loading and Generator Definition¶
We import the DrugCombDB dataset loader and the BatchGenerator from the data namespace of ChemicalX. In the first step a DrugCombDB loader is instantiated; it is an oncology-related synergy scoring dataset. The task related to the dataset is to predict the synergistic nature of drug pair combinations. Using the get_context_features(), get_drug_features() and get_labeled_triples() methods we load the context features, drug features, and the triples used for training. Using the triples we generate training and test sets by using 50% of the triples for training. Finally, we create a BatchGenerator instance that generates drug pair batches of size 1024 and returns the drug and context features for each labeled triple.
from chemicalx.data import DrugCombDB, BatchGenerator

loader = DrugCombDB()
context_set = loader.get_context_features()
drug_set = loader.get_drug_features()
triples = loader.get_labeled_triples()
train, test = triples.train_test_split(train_size=0.5)

generator = BatchGenerator(batch_size=1024,
                           context_features=True,
                           drug_features=True,
                           drug_molecules=False,
                           context_feature_set=context_set,
                           drug_feature_set=drug_set,
                           labeled_triples=train)
Model Training¶
We already have a generator to create batches of data. Now we need a model, an optimizer, and an appropriate loss function. We import the torch library and DeepSynergy from the models namespace of the library. We create a DeepSynergy model instance and set the number of input channels to be compatible with the DrugCombDB dataset. We define an Adam optimizer and a binary cross-entropy loss instance to compute the loss values. Using these, the model is trained for a single epoch. In each step, we reset the gradients to zero, make predictions, calculate the loss value, backpropagate, and take a step with the optimizer.
import torch
from chemicalx.models import DeepSynergy

model = DeepSynergy(context_channels=112,
                    drug_channels=256)
optimizer = torch.optim.Adam(model.parameters())
model.train()
loss = torch.nn.BCELoss()

for batch in generator:
    optimizer.zero_grad()  # reset the gradients to zero
    prediction = model(batch.context_features,
                       batch.drug_features_left,
                       batch.drug_features_right)
    loss_value = loss(prediction, batch.labels)
    loss_value.backward()  # backpropagate
    optimizer.step()  # update the model parameters
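For readers who want to see what the binary cross-entropy value represents, the sketch below computes the batch mean of the per-pair loss in plain Python. The helper binary_cross_entropy is hypothetical and only illustrates the formula; it assumes that every prediction lies strictly between 0 and 1.

```python
import math
from typing import Sequence


def binary_cross_entropy(
    predictions: Sequence[float], labels: Sequence[float]
) -> float:
    """Mean of -[y * log(p) + (1 - y) * log(1 - p)] over the batch."""
    total = 0.0
    for p, y in zip(predictions, labels):
        # Assumes 0 < p < 1 so that both logarithms are defined.
        total -= y * math.log(p) + (1.0 - y) * math.log(1.0 - p)
    return total / len(predictions)


# Three drug pairs: predicted propensities and their binary labels.
loss_value = binary_cross_entropy([0.9, 0.2, 0.6], [1.0, 0.0, 1.0])
```

Confident correct predictions contribute little to the mean, while confident wrong predictions are penalized heavily.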
Model Scoring¶
We will store the predictions in a pandas DataFrame, so we import the pandas library. We set the model to evaluation mode and assign the test set triples to the generator. We accumulate the predictions in a list while iterating over the batches in the generator. In each step we make predictions for the drug pairs in the batch, detach them from the computation graph, and add them as a column to the batch identifiers DataFrame, which is then appended to the list. Finally, the accumulated frames are concatenated into a single DataFrame.
import pandas as pd

model.eval()
generator.labeled_triples = test
predictions = []

for batch in generator:
    prediction = model(batch.context_features,
                       batch.drug_features_left,
                       batch.drug_features_right)
    # Detach from the computation graph and convert to a NumPy array.
    prediction = prediction.detach().cpu().numpy()
    identifiers = batch.identifiers
    identifiers["prediction"] = prediction
    predictions.append(identifiers)

predictions = pd.concat(predictions)
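Once predictions and labels are available, they can be summarized with an evaluation metric. As an illustration, the sketch below computes the area under the ROC curve, one commonly used metric for binary scoring tasks, with a simple pairwise comparison. The roc_auc helper is hypothetical, works on plain lists rather than on the DataFrame above, and assumes that both classes are present.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    # Assumes at least one positive and one negative label.
    return wins / (len(positives) * len(negatives))


auc = roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.3])
```

The pairwise loop is quadratic in the number of samples, so it is meant for small illustrative inputs only.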
Data Cleaning¶
ChemicalX comes with benchmark datasets that we pre-processed. In this section of the documentation we discuss how we obtained the raw data. We also discuss what pre-processing steps have been taken. We do this for each of the datasets in the framework.
Drugbank DDI¶
We used the cleaned dataset from the Therapeutic Data Commons.
Drug identifiers are represented by the DrugBank identifier.
Contexts are represented by drug-drug interaction identifiers from DrugBank.
Using RDKit 2021.09.03, we generated 256-dimensional Morgan fingerprints.
Labels represent the presence of a specific drug-drug interaction.
Context features are one-hot encoded binary vectors.
We generated an equal number of negative samples as positives.
Negative samples do not contain collisions.
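The one-hot context encoding and the collision-free negative sampling can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to build the datasets: one_hot and sample_negatives are hypothetical helpers, a collision is assumed to mean a sampled triple that coincides with a positive triple, self-pairs are rejected, and drug order is treated as significant.

```python
import random
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]  # (drug_1, drug_2, context)


def one_hot(index: int, size: int) -> List[float]:
    """Binary vector with a single 1.0 at the given position."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def sample_negatives(
    positives: Set[Triple],
    drugs: List[str],
    contexts: List[str],
    seed: int = 42,
) -> Set[Triple]:
    """Draw as many random triples as positives, skipping known positives."""
    rng = random.Random(seed)
    negatives: Set[Triple] = set()
    # Assumes there are enough non-positive triples to sample from.
    while len(negatives) < len(positives):
        candidate = (rng.choice(drugs), rng.choice(drugs), rng.choice(contexts))
        # Reject self-pairs and collisions with positive triples.
        if candidate[0] != candidate[1] and candidate not in positives:
            negatives.add(candidate)
    return negatives


drugs = ["drug_a", "drug_b", "drug_c", "drug_d"]
contexts = ["ctx_0", "ctx_1"]
positives = {
    ("drug_a", "drug_b", "ctx_0"),
    ("drug_b", "drug_c", "ctx_1"),
    ("drug_a", "drug_d", "ctx_0"),
}
negatives = sample_negatives(positives, drugs, contexts)
context_vectors = {c: one_hot(i, len(contexts)) for i, c in enumerate(contexts)}
```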
TwoSides¶
This dataset is a subsample of TwoSides.
We only included the 100 most common side effects.
We used the cleaned dataset from the Therapeutic Data Commons.
Drug identifiers are represented by the DrugBank identifier.
Contexts are represented by the top 10 most common side effects in TwoSides.
Using RDKit 2021.09.03, we generated 256-dimensional Morgan fingerprints.
Labels represent the presence of a specific drug-drug interaction.
Context features are one-hot encoded binary vectors.
We generated an equal number of negative samples as positives.
Negative samples do not contain collisions.
External resources¶
Model Architectures¶
Sunyoung Kwon, Sungroh Yoon: DeepCCI: End-to-end Deep Learning for Chemical-Chemical Interaction Prediction
Jae Yong Ryu, Hyun Uk Kim, Sang Yup Lee: Deep Learning Improves Prediction of Drug–Drug and Drug–Food Interactions, Code (kaistsystemsbiology/deepddi)
Kristina Preuer, Richard Lewis, Sepp Hochreiter, Andreas Bender, Krishna C Bulusu, Günter Klambauer: DeepSynergy: Predicting Anti-Cancer Drug Synergy with Deep Learning, Code (KristinaPreuer/DeepSynergy)
Nuo Xu, Pinghui Wang, Long Chen, Jing Tao, Junzhou Zhao: MR-GNN: Multi-Resolution and Dual Graph Neural Network for Predicting Structured Entity Interactions, Code
Andreea Deac, Yu-Hsiang Huang, Petar Veličković, Pietro Liò, Jian Tang: Drug-Drug Adverse Effect Prediction with Graph Co-Attention
Kexin Huang, Cao Xiao, Trong Nghia Hoang, Lucas M. Glass, Jimeng Sun: CASTER: Predicting Drug Interactions with Chemical Substructure Representation, Code (kexinhuang12345/CASTER)
Arnold K Nyamabo, Hui Yu, Jian-Yu Shi: SSI–DDI: Substructure–Substructure Interactions for Drug–Drug Interaction Prediction, Code (kanz76/SSI-DD)
Mengying Sun, Fei Wang, Olivier Elemento, Jiayu Zhou: Structure-Based Drug-Drug Interaction Detection via Expressive Graph Convolutional Networks and Deep Sets
Tianyu Zhang, Liwei Zhang, Philip Payne, Fuhai Li: Synergistic Drug Combination Prediction by Integrating Multiomics Data in Deep Learning Models
Xusheng Cao, Rui Fan, Wanwen Zeng: DeepDrug: A General Graph-Based Deep Learning Framework for Drug Relation Prediction, Code (wanwenzeng/deepdrug)
Xin Chen, Xien Liu, Ji Wu: GCN-BMP: Investigating Graph Representation Learning for DDI Prediction Task
Yue-Hua Feng, Shao-Wu Zhang, Jian-Yu Shi: DPDDI: a Deep Predictor for Drug-Drug Interactions, Code (NWPU-903PR/DPDDI)
Jinxian Wang, Xuejun Liu, Siyuan Shen, Lei Deng, Hui Liu: DeepDDS: Deep Graph Neural Network with Attention Mechanism to Predict Synergistic Drug Combinations, Code (Sinwang404/DeepDDS)
Halil Ibrahim Kuru, Oznur Tastan, Ercument Cicek: MatchMaker: A Deep Learning Framework for Drug Synergy Prediction, Code (tastanlab/matchmaker)
Hui Yu, ShiYu Zhao, JianYu Shi: STNN-DDI: A Substructure-aware Tensor Neural Network to Predict Drug-Drug Interactions, Code (zsy-9/STNN-DDI)