Data Cleaning¶
ChemicalX comes with benchmark datasets that we pre-processed. In this section of the documentation we discuss how we obtained the raw data. We also discuss hat pre-processing steps have been taken. We do this for each of the datasets in the framework.
Drugbank DDI¶
We used the cleaned dataset from the Therapeutic Data Commons.
Drug identifiers are represented by the DrugBank identifier.
Contexts are represented by drug-drug interaction identifiers from DrugBank.
Using RDKit 2021.09.03. we generated 256-dimensional Morgan fingerprints.
Labels represent the presence of a specific drug-drug interaction.
Context features are one-hot encoded binary vectors.
We generated an equal number of negative samples as positives.
Negative samples do not contain collisions.
TwoSides¶
This datasets is a subsample of TwoSides.
We only included the 100 most common side effects.
We used the cleaned dataset from the Therapeutic Data Commons.
Drug identifiers are represented by the DrugBank identifier.
Contexts are represented by the top 10 most common side effects in TwoSides.
Using RDKit 2021.09.03. we generated 256-dimensional Morgan fingerprints.
Labels represent the presence of a specific drug-drug interaction.
Context features are one-hot encoded binary vectors.
We generated an equal number of negative samples as positives.
Negative samples do not contain collisions.