Tutorial

In this lightweight tutorial, we walk through an oncology use case. The same use case is discussed in detail in the ChemicalX design paper that accompanies the library. We recommend reading the design paper alongside the tutorial: it deepens the reader's understanding and connects the concepts discussed in this tutorial with those in the introduction.

Data Loading and Generator Definition

We import the DrugCombDB dataset loader and the BatchGenerator from the data namespace of ChemicalX. In the first step a DrugCombDB DatasetLoader is instantiated; it loads an oncology-related synergy scoring dataset. The task associated with this dataset is to predict whether drug pair combinations are synergistic. Using the get_context_features(), get_drug_features(), and get_labeled_triples() methods we load the context features, drug features, and the labeled triples used for training. From the triples we generate training and test sets, using 50% of the triples for training. Finally, we create a BatchGenerator instance that generates drug pair batches of size 1024 and returns the drug and context features for each labeled triple.

from chemicalx.data import DrugCombDB, BatchGenerator

loader = DrugCombDB()

context_set = loader.get_context_features()
drug_set = loader.get_drug_features()
triples = loader.get_labeled_triples()

train, test = triples.train_test_split(train_size=0.5)

generator = BatchGenerator(batch_size=1024,
                           context_features=True,
                           drug_features=True,
                           drug_molecules=False,
                           context_feature_set=context_set,
                           drug_feature_set=drug_set,
                           labeled_triples=train)
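The 50/50 split performed by train_test_split can be illustrated with plain pandas on a toy table of labeled triples. This is only a sketch of the splitting idea; the column names below are illustrative and not the exact DrugCombDB schema:

```python
import pandas as pd

# A toy table standing in for the labeled (drug, drug, context) triples.
triples_df = pd.DataFrame({
    "drug_1": [f"d{i}" for i in range(10)],
    "drug_2": [f"d{i + 10}" for i in range(10)],
    "context": ["cell_line_a"] * 5 + ["cell_line_b"] * 5,
    "label": [1, 0] * 5,
})

# Sample half of the rows for training; the remaining rows form the test set.
train_df = triples_df.sample(frac=0.5, random_state=42)
test_df = triples_df.drop(train_df.index)
```

The two resulting frames are disjoint and together cover all of the original triples, which is the property the training and test sets above rely on.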

Model Training

We already have a generator to create batches of data. Now we need a model, an optimizer, and an appropriate loss function. We import the torch library and DeepSynergy from the models namespace of the library. We create a DeepSynergy model instance and set the number of input channels to be compatible with the DrugCombDB dataset. We define an Adam optimizer and a binary cross-entropy loss instance to compute the loss values. Using these, the model is trained for a single epoch. In each step we reset the gradients to zero, make predictions, calculate the loss value, backpropagate, and take a step with the optimizer.

import torch
from chemicalx.models import DeepSynergy

model = DeepSynergy(context_channels=112,
                    drug_channels=256)

optimizer = torch.optim.Adam(model.parameters())
model.train()
loss = torch.nn.BCELoss()

for batch in generator:
    optimizer.zero_grad()
    prediction = model(batch.context_features,
                       batch.drug_features_left,
                       batch.drug_features_right)
    loss_value = loss(prediction, batch.labels)
    loss_value.backward()
    optimizer.step()
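The binary cross-entropy criterion used above averages the per-example term -(y·log(p) + (1 - y)·log(1 - p)) over the batch. A minimal standard-library sketch of that arithmetic (a hand-rolled illustration, not the torch implementation):

```python
import math

def binary_cross_entropy(predictions, labels):
    """Mean binary cross-entropy over paired probabilities and 0/1 labels."""
    terms = [
        -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y in zip(predictions, labels)
    ]
    return sum(terms) / len(terms)

# Confident correct predictions contribute little loss,
# confident wrong ones contribute a lot.
low_loss = binary_cross_entropy([0.9, 0.2], [1, 0])
high_loss = binary_cross_entropy([0.1, 0.8], [1, 0])
```

Minimizing this quantity pushes the predicted probabilities toward the observed synergy labels, which is exactly what the optimizer step in the loop above does batch by batch.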

Model Scoring

We will store the predictions in a pandas DataFrame, so we import the pandas library. We set the model to evaluation mode and assign the test set triples to the generator. We create a list to accumulate the predictions and iterate over the batches in the generator. In each step we make predictions for the drug pairs in the batch; these are detached from the computation graph and added to the batch identifiers DataFrame, which is appended to the list of predictions. Finally, the predictions are concatenated into a single DataFrame.

import pandas as pd

model.eval()
generator.labeled_triples = test

predictions = []
for batch in generator:
    prediction = model(batch.context_features,
                       batch.drug_features_left,
                       batch.drug_features_right)
    prediction = prediction.detach().cpu().numpy()
    identifiers = batch.identifiers
    identifiers["prediction"] = prediction
    predictions.append(identifiers)

predictions = pd.concat(predictions)
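With the concatenated predictions in hand, a simple downstream check is thresholded accuracy. The sketch below assumes the predictions DataFrame carries a ground-truth label column alongside the scores; the column names are illustrative, not a guaranteed schema:

```python
import pandas as pd

# A toy stand-in for the concatenated predictions DataFrame.
scored = pd.DataFrame({
    "drug_1": ["a", "b", "c", "d"],
    "drug_2": ["e", "f", "g", "h"],
    "label": [1, 0, 1, 0],
    "prediction": [0.8, 0.3, 0.4, 0.6],
})

# Binarize the scores at 0.5 and compare with the ground-truth labels.
accuracy = ((scored["prediction"] > 0.5).astype(int) == scored["label"]).mean()
```

Keeping the scores as a DataFrame also makes it easy to hand them to ranking metrics or to group them by context before evaluation.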