Introduction

ChemicalX is a deep learning library for drug-drug interaction, polypharmacy side effect, and synergy prediction. The library consists of data loaders and integrated benchmark datasets. It also includes state-of-the-art deep neural network architectures that solve the drug pair scoring task. Implemented methods cover traditional SMILES string-based techniques and neural message-passing based models.

@article{chemicalx,
          arxivId = {2202.05240},
          author = {Rozemberczki, Benedek and Hoyt, Charles Tapley and Gogleva, Anna and Grabowski, Piotr and Karis, Klas and Lamov, Andrej and Nikolov, Andriy and Nilsson, Sebastian and Ughetto, Michael and Wang, Yu and Derr, Tyler and Gyori, Benjamin M},
          month = {feb},
          title = {{ChemicalX: A Deep Learning Library for Drug Pair Scoring}},
          url = {http://arxiv.org/abs/2202.05240},
          year = {2022}
}

Overview

We briefly overview the fundamental concepts and features of ChemicalX through simple examples, covering the design philosophy, the data classes and loaders, and the models and pipelines.

Design Philosophy

When we created ChemicalX, we wanted to reuse the high-level architectural elements of torch and torchdrug. We also wanted to put into practice the ideas outlined in A Unified View of Relational Deep Learning for Drug Pair Scoring.

Drug Feature Set

Drug feature sets are custom UserDict objects that allow fast retrieval of the molecular graph and of drug-level features such as the Morgan fingerprint of the drug. The get_feature_matrix and get_molecules methods batch drug features and molecular graphs by drug identifier. Molecule-level features are returned as a torch.FloatTensor matrix, while the molecular graphs are returned as PackedGraph objects generated by torchdrug.
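The lookup-by-identifier idea can be illustrated with a minimal sketch. This is not the ChemicalX implementation: ToyDrugFeatureSet, get_smiles, and the drug identifiers are hypothetical, plain Python lists stand in for torch.FloatTensor, and SMILES strings stand in for the PackedGraph objects.

```python
from collections import UserDict
from typing import Iterable, List


class ToyDrugFeatureSet(UserDict):
    """Hypothetical stand-in: maps a drug identifier to its feature data."""

    def get_feature_matrix(self, drugs: Iterable[str]) -> List[List[float]]:
        # Stack the per-drug feature vectors in the order the identifiers are given.
        return [list(self.data[drug]["features"]) for drug in drugs]

    def get_smiles(self, drugs: Iterable[str]) -> List[str]:
        # The real class returns packed molecular graphs; SMILES strings stand in here.
        return [self.data[drug]["smiles"] for drug in drugs]


# Made-up identifiers and features, used for illustration only.
drug_set = ToyDrugFeatureSet(
    {
        "drug_a": {"smiles": "CCO", "features": [1.0, 0.0, 1.0]},
        "drug_b": {"smiles": "CC(=O)O", "features": [0.0, 1.0, 1.0]},
    }
)

# Rows follow the order of the requested identifiers, not the insertion order.
matrix = drug_set.get_feature_matrix(["drug_b", "drug_a"])
smiles = drug_set.get_smiles(["drug_b", "drug_a"])
```

Subclassing UserDict keeps ordinary dictionary access (drug_set["drug_a"]) available next to the batching helpers.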

Context Feature Set

Like DrugFeatureSet objects, ContextFeatureSet objects are custom UserDict objects; they store biological or chemical context-specific feature vectors. These features are stored as torch.FloatTensor instances, one for each context identifier key.

Labeled Triples

Labeled triples contain labeled drug pairs where the label is specific to a context. The LabeledTriples class is a wrapper around a pandas dataframe that allows shuffling the triples and generating training and test splits with the train_test_split method. This class also provides basic descriptive statistics about the number of negatively labeled instances and the number of labeled triples.
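The shuffle-and-split behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the LabeledTriples code: plain tuples replace the pandas dataframe, the standalone train_test_split function, its parameters, and all identifiers are hypothetical.

```python
import random
from typing import List, Tuple

# (left drug, right drug, context, label); all identifiers are made up.
Triple = Tuple[str, str, str, float]


def train_test_split(
    triples: List[Triple], train_size: float = 0.8, seed: int = 42
) -> Tuple[List[Triple], List[Triple]]:
    """Shuffle a copy of the triples and cut it into train and test parts."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_size)
    return shuffled[:cut], shuffled[cut:]


triples = [
    ("drug_a", "drug_b", "cell_line_1", 1.0),
    ("drug_a", "drug_c", "cell_line_1", 0.0),
    ("drug_b", "drug_c", "cell_line_2", 0.0),
    ("drug_a", "drug_b", "cell_line_2", 1.0),
    ("drug_c", "drug_d", "cell_line_1", 0.0),
]

train, test = train_test_split(triples, train_size=0.8)

# Basic descriptive statistics: total triples and negatively labeled triples.
n_triples = len(triples)
n_negative = sum(1 for *_, label in triples if label == 0.0)
```

Note that the same drug pair can appear with different labels in different contexts, which is why the context is part of each triple.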

Dataset Loaders

Dataset loaders allow the prompt retrieval of integrated datasets. After a loader is initialized, its methods return the respective DrugFeatureSet, ContextFeatureSet, and LabeledTriples.

from chemicalx.data import DrugCombDB

loader = DrugCombDB()

context_set = loader.get_context_features()
drug_set = loader.get_drug_features()
triples = loader.get_labeled_triples()

Batch Generators and Drug Pair Batches

Using instances of the DrugFeatureSet, ContextFeatureSet, and LabeledTriples classes, one can initialize a BatchGenerator instance. This class generates DrugPairBatch instances, which contain the drug and context features for the drugs in the batch. In the training and evaluation of deep drug pair scoring models, the DrugPairBatch acts as a custom data class.
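How feature sets and triples combine into batches can be sketched roughly as below. This is a simplified illustration, not the BatchGenerator implementation: ToyDrugPairBatch, generate_batches, and the data are hypothetical, plain dictionaries and lists replace the feature set classes and tensors, and molecular graphs are left out.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple

# (left drug, right drug, context, label); all identifiers are made up.
Triple = Tuple[str, str, str, float]
Features = Dict[str, List[float]]


@dataclass
class ToyDrugPairBatch:
    """Hypothetical container holding everything a model needs for one batch."""

    context_features: List[List[float]]
    drug_features_left: List[List[float]]
    drug_features_right: List[List[float]]
    labels: List[float]


def generate_batches(
    drug_features: Features,
    context_features: Features,
    triples: List[Triple],
    batch_size: int,
) -> Iterator[ToyDrugPairBatch]:
    """Yield batches by looking up features for each triple in a chunk."""
    for start in range(0, len(triples), batch_size):
        chunk = triples[start : start + batch_size]
        yield ToyDrugPairBatch(
            context_features=[context_features[c] for _, _, c, _ in chunk],
            drug_features_left=[drug_features[left] for left, _, _, _ in chunk],
            drug_features_right=[drug_features[right] for _, right, _, _ in chunk],
            labels=[label for _, _, _, label in chunk],
        )


drug_features = {"drug_a": [1.0, 0.0], "drug_b": [0.0, 1.0], "drug_c": [1.0, 1.0]}
context_features = {"cell_line_1": [0.5], "cell_line_2": [0.25]}
triples = [
    ("drug_a", "drug_b", "cell_line_1", 1.0),
    ("drug_a", "drug_c", "cell_line_2", 0.0),
    ("drug_b", "drug_c", "cell_line_1", 0.0),
]

batches = list(generate_batches(drug_features, context_features, triples, batch_size=2))
```

Each row of a batch lines up across fields: row i of the left features, right features, context features, and labels all belong to the same triple.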

Models and Pipelines

Model Layers

Drug pair scoring models in ChemicalX inherit from torch neural network modules. Each of the models provides an unpack and forward method; the first helps with unpacking the drug pair batch while the second makes a forward pass to make predictions and return propensities for the drug pairs in the batch. Models have sensible default parameters for the non-dataset-dependent hyperparameters.
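The division of labor between unpack and forward can be illustrated with a framework-free sketch. This is not a ChemicalX model: ToyPairScorer is hypothetical, it does not inherit from a torch module, the batch is a plain dictionary, and the scoring rule (a sigmoid over the dot product of the two drug feature vectors) is chosen only to keep the example short.

```python
import math
from typing import Dict, List, Tuple

Matrix = List[List[float]]


class ToyPairScorer:
    """Hypothetical scorer showing the unpack/forward pattern without torch."""

    def unpack(self, batch: Dict[str, Matrix]) -> Tuple[Matrix, Matrix]:
        # Select only the batch fields that this particular model consumes.
        return batch["drug_features_left"], batch["drug_features_right"]

    def forward(self, left: Matrix, right: Matrix) -> List[float]:
        # Score each pair with a sigmoid over the dot product of its features.
        scores = []
        for left_row, right_row in zip(left, right):
            dot = sum(a * b for a, b in zip(left_row, right_row))
            scores.append(1.0 / (1.0 + math.exp(-dot)))
        return scores


batch = {
    "drug_features_left": [[1.0, 0.0], [0.0, 0.0]],
    "drug_features_right": [[1.0, 1.0], [1.0, 1.0]],
    "context_features": [[0.5], [0.25]],  # present in the batch, unused by this model
}

model = ToyPairScorer()
scores = model.forward(*model.unpack(batch))
```

Keeping unpack separate from forward lets each model declare which parts of a batch it needs, so the same batch object can feed models with different inputs.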

Pipelines

Pipelines provide high-level abstractions for the end-to-end training and evaluation of ChemicalX models. Given a dataset and a model, a pipeline can train the model on the dataset and generate scores and evaluation metrics.

from chemicalx import pipeline
from chemicalx.models import DeepSynergy
from chemicalx.data import DrugCombDB

model = DeepSynergy(context_channels=112,
                    drug_channels=256)

dataset = DrugCombDB()

results = pipeline(dataset=dataset,
                   model=model,
                   batch_size=1024,
                   context_features=True,
                   drug_features=True,
                   drug_molecules=False,
                   labels=True,
                   epochs=100)

results.summarize()

results.save("~/test_results/")