Bringing data science and AI/ML tools to infectious disease research

H3D Foundation and Ersilia present

Event Sponsors

Session 2: Building a machine learning model

Skills Development workshop

Tools

Google Colaboratory

Python Packages

Therapeutics Data Commons

Getting Started

https://github.com/ersilia-os/event-fund-ai-drug-discovery

Classification tasks

A classifier predicts the category of a new data point. In the case of bioactivity data, we usually have two categories (binary classification):

  • Active (1)
  • Inactive (0)

Wang, Shuangquan et al, Molecular Pharmaceutics 2016

To learn more about classification, we will use the hERG blockade dataset prepared by the Therapeutics Data Commons (part of its Toxicity tasks):

  • Actives (1):  block hERG
  • Inactives (0): do not block hERG

Load the data from TDC

# import the hERG dataset, part of the Toxicity data available at TDC
!pip install PyTDC

from tdc.single_pred import Tox
data = Tox(name = "hERG")

split = data.get_split()

#we can now separate the dataset into its train, validation and test splits
train = split["train"]
validation = split["valid"]
test = split["test"]

train.to_csv("drive/MyDrive/h3d_ersilia_ai_workshop/data/session2/tdc_herg_train.csv", index=False)
validation.to_csv("drive/MyDrive/h3d_ersilia_ai_workshop/data/session2/tdc_herg_validation.csv", index=False)
test.to_csv("drive/MyDrive/h3d_ersilia_ai_workshop/data/session2/tdc_herg_test.csv", index=False)
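The paths above write to Google Drive; in Colab this requires mounting Drive first. A minimal sketch using the standard Colab mount call (the workshop folder structure itself is assumed to already exist in your Drive):

#mount Google Drive in Colab so that the drive/MyDrive/... paths are accessible
from google.colab import drive
drive.mount("/content/drive")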

Load the data from TDC

#we can check how many molecules we have in each set
print(len(train))
print(len(validation))
print(len(test))
458, 66, 131

  • The train set is used to train the ML model
  • The validation set is used to improve the model (e.g. by tuning its hyperparameters)
  • The test set is used to evaluate the final model performance

In this example we will not focus on optimizing the model parameters, therefore we will only use the train and test splits


#inspect the first rows of the training set
train.head()

#check the class balance: number of inactives (0) and actives (1)
print(len(train[train["Y"]==0]))
print(len(train[train["Y"]==1]))

Molecular Representation

!pip install rdkit  #cheminformatics package
from rdkit import Chem
from rdkit.Chem import Draw

smiles = train["Drug"][449:] # select a few SMILES for representation
#convert the SMILES into RDKit molecule objects (not readable by humans)
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
mols[0]
<rdkit.Chem.rdchem.Mol at 0x7fa16c9435d0>
Draw.MolsToGridImage(mols)

Molecular Featurization

Featurization is the conversion of SMILES into vectors or images that can be passed to the ML model

Duran-Frigola et al, Nat Biotech 2020

!pip install signaturizer
from signaturizer import Signaturizer
sign = Signaturizer("GLOBAL")
X_train = sign.predict(train["Drug"]).signature
X_test = sign.predict(test["Drug"]).signature

#each molecule is now a numerical vector
X_train[0]
array([-0.09308645,  0.09305048, -0.09294817, ..., -0.05613168,
       -0.1379203 ,  0.05799835], dtype=float32)

#the outputs (Y) are the 0/1 activity labels from the original tables
Y_train = list(train["Y"])
Y_test = list(test["Y"])
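As a point of comparison with the Chemical Checker signatures used above, a simpler and widely used featurization is the Morgan (circular) fingerprint from RDKit; a minimal sketch on a toy molecule, not part of the workshop code:

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
import numpy as np

#featurize one toy molecule (ethanol) as a 2048-bit Morgan fingerprint of radius 2
mol = Chem.MolFromSmiles("CCO")
fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)

#convert the bit vector into a numpy array that an ML model can take as input
features = np.zeros((0,), dtype=np.int8)
DataStructs.ConvertToNumpyArray(fp, features)
print(features.shape, features.sum())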

Data we have so far:

Y datasets: lists of 0/1 labels corresponding to the molecules in the train and test sets

  • Y_train, Y_test

X datasets: vector-like representation of the molecules calculated using the Chemical Checker --> model input

  • X_train, X_test

Original datasets: tables with the SMILES and their associated hERG blocking activity as 0 or 1

  • train, test

Supervised Machine Learning

In summary, we now have two datasets (train, test):

  • X (input): molecular signature
  • Y (output): activity against hERG, 1 or 0

We will use the Random Forest Classifier algorithm implemented in scikit-learn.

!pip install scikit-learn
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, Y_train)

Classification outputs

Once the model is fitted, we can predict the category of each molecule in the test set. A classifier outputs two numbers per prediction:

  • Probability of 0
  • Probability of 1
y_pred = clf.predict_proba(X_test)

#first ten predictions: [probability of 0, probability of 1]
y_pred[:10]
array([[0.31, 0.69],
       [0.37, 0.63],
       [0.69, 0.31],
       [0.14, 0.86],
       [0.58, 0.42],
       [0.84, 0.16],
       [0.163, 0.837],
       [0.257, 0.743],
       [0.111, 0.889],
       [0.303, 0.697]])

#the probability of being active (1) is the second column
y_pred[:,1]


Classification outputs

In a classification, the output is a probability. We need to define a cut-off or threshold to transform the results into a binary output (0 or 1) again. The threshold is typically set at 0.5 by default.

Proba 0   Proba 1   Cut-off 0.5   Cut-off 0.7   Cut-off 0.3
0.39      0.61      1             0             1
0.30      0.70      1             1             1
0.69      0.31      0             0             1
0.21      0.79      1             1             1
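A minimal sketch of how this binarization can be done in code at different cut-offs, using the y_pred probabilities computed above (whether the boundary uses > or >= is a convention choice):

import numpy as np

proba_active = np.array(y_pred)[:, 1]  #probability of being active (1)
for cutoff in (0.3, 0.5, 0.7):
    y_bin = (proba_active >= cutoff).astype(int)
    print(cutoff, y_bin[:10])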

Model Evaluation

Confusion Matrix

                        Predicted
                        Inactives          Actives
True     Inactives      True Negative      False Positive
         Actives        False Negative     True Positive

Precision: how many positives are actually positive --> TP / (TP + FP)

Recall: how many positives are we able to identify --> TP / (TP + FN)

Model Evaluation

A higher, more restrictive threshold on proba 1 favours precision (fewer false positives); a lower threshold favours recall (fewer false negatives).


In which scenarios might we want high precision?

In which scenarios might we want high recall?

Model Evaluation

#plot the confusion matrix for the test set
from sklearn.metrics import ConfusionMatrixDisplay as cdm
cdm.from_estimator(clf, X_test, Y_test)
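If the raw counts are preferred over the plot, they can be obtained directly; a sketch, noting that clf.predict applies the default 0.5 cut-off:

from sklearn.metrics import confusion_matrix

#rows are the true classes, columns the predicted classes
tn, fp, fn, tp = confusion_matrix(Y_test, clf.predict(X_test)).ravel()
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)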

Model Evaluation

Precision: how many positives are actually positive

Precision = TP / (TP + FP) = 93 / (93 + 17) ≈ 0.85

Recall: how many positives are we able to identify

Recall = TP / (TP + FN) = 93 / (93 + 5) ≈ 0.95

Model Evaluation

from sklearn.metrics import precision_score, recall_score
#precision and recall need binary predictions at a specific threshold

y_pred_bin = []
for x in y_pred[:,1]:
    if x > 0.5:
        y_pred_bin += [1]
    else:
        y_pred_bin += [0]

precision = precision_score(Y_test, y_pred_bin)
recall = recall_score(Y_test, y_pred_bin)
print(precision, recall)
0.8454545454545455 0.9489795918367347

Data we have so far:

Original datasets: train, test

X datasets: X_train, X_test

Y datasets: Y_train, Y_test

Predictions:

  • y_pred: output of the classifier, expressing the probability that a molecule is inactive (0) or active (1).
  • y_pred_bin: binarized activity based on the predicted probability, the cut-off is set by the researcher.
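To see how the choice of cut-off trades precision against recall, scikit-learn can compute both at every candidate threshold; a minimal sketch using Y_test and y_pred from above:

from sklearn.metrics import precision_recall_curve

#precision and recall at every candidate cut-off on the probability of class 1
precisions, recalls, thresholds = precision_recall_curve(Y_test, y_pred[:, 1])
for p, r, t in list(zip(precisions, recalls, thresholds))[::20]:
    print(f"cut-off {t:.2f}: precision {p:.2f}, recall {r:.2f}")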

Model Evaluation

True Positive Rate (sensitivity or recall): proportion of positive observations that are correctly predicted as positive (TP/(TP+FN))

False Positive Rate (1 - specificity): proportion of negative observations that are incorrectly predicted as positive (FP/(TN+FP))

ROC Curve: performance of the model at all classification thresholds (0-1)


Model Evaluation

ROC Curve: performance of the model at all classification thresholds (0-1)

Area Under the Curve (AUC): aggregate measure of model performance. It does not depend on the threshold.

[ROC curve plot: FPR (x axis, 0 to 1) versus TPR (y axis, 0 to 1), showing the perfect model, the real curve and the diagonal line of no value]

Model Evaluation

#plot the ROC curve for the test set
from sklearn.metrics import RocCurveDisplay as rdc
rdc.from_estimator(clf, X_test, Y_test)
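The AUC shown in the plot can also be computed as a single number; a sketch using the predicted probabilities from above:

from sklearn.metrics import roc_auc_score

#AUC computed from the probability of the active class
auc = roc_auc_score(Y_test, y_pred[:, 1])
print(auc)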

Model Evaluation

#for comparison, we can evaluate the model on the training set as well
from sklearn.metrics import ConfusionMatrixDisplay as cdm
cdm.from_estimator(clf, X_train, Y_train)

from sklearn.metrics import RocCurveDisplay as rdc
rdc.from_estimator(clf, X_train, Y_train)

Classification tasks: summary

  • We have used a binary classification task for hERG activity and trained a random forest classifier
  • The molecules were featurized using Chemical Checker signatures, which combine structural and bioactivity data of the molecules
  • The model performance was good (AUC above 0.8) on both the training and test sets
  • How could we try to improve the model? (One option is sketched below.)
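One possible improvement is to use the validation split to choose hyperparameters before the final evaluation on the test set. A minimal sketch with illustrative parameter values, reusing the sign featurizer and the validation split loaded earlier:

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

#featurize the validation split, which we have not used so far
X_valid = sign.predict(validation["Drug"]).signature
Y_valid = list(validation["Y"])

#try a few forest sizes and keep the one with the best validation AUC
for n_estimators in (50, 100, 200):
    candidate = RandomForestClassifier(n_estimators=n_estimators, random_state=42)
    candidate.fit(X_train, Y_train)
    auc = roc_auc_score(Y_valid, candidate.predict_proba(X_valid)[:, 1])
    print(n_estimators, auc)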

Regression tasks

A regressor predicts continuous numerical values, for example:

  • IC50
  • % of inhibition

To learn more about regression, we will use the Therapeutics Data Commons Acute Toxicity dataset:

  • Lethal Dose 50 (LD50): amount of drug (mg/kg) that kills 50% of the tested population

Zhu, Hao et al, Chemical Research in Toxicology 2009

Load the data from TDC
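The loading step mirrors the hERG example; a minimal sketch, assuming the TDC identifier for the acute toxicity dataset is "LD50_Zhu":

from tdc.single_pred import Tox

#acute toxicity (LD50) dataset from TDC; the name "LD50_Zhu" is assumed
data = Tox(name = "LD50_Zhu")
split = data.get_split()

train = split["train"]
valid = split["valid"]
test = split["test"]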


#we can check how many molecules we have in each set
print(len(train))
print(len(valid))
print(len(test))

Train set: 5170 molecules

Validation set: 738 molecules

Test set: 1477 molecules

Molecular Representation

train.head()
import matplotlib.pyplot as plt

#plot the distribution of LD50 values in the training set
plt.hist(train["Y"], bins=50, color = "#50285a")
plt.xlabel("Lethal Dose 50")
plt.ylabel("Number of molecules")

Molecular Featurization

Chemical Checker Signatures

Duran-Frigola et al, Nat Biotech 2020

from signaturizer import Signaturizer
sign = Signaturizer("GLOBAL")
X_train = sign.predict(train["Drug"]).signature
X_valid = sign.predict(valid["Drug"]).signature
X_test = sign.predict(test["Drug"]).signature

#the outputs (Y) are now continuous LD50 values
Y_train = list(train["Y"])
Y_valid = list(valid["Y"])
Y_test = list(test["Y"])
Y_train[0]
2.343

Supervised Machine Learning

In summary, we now have three datasets (train, validation, test):

  • X (input): molecular signature
  • Y (output): LD50 values

We will use a simple linear regression from scikit-learn to train a model.

from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, Y_train)

Regression outputs

The regressor gives back the predicted value, in our case the LD50.

y_pred = reg.predict(X_test)

#if we check, the prediction is simply a list of LD50
y_pred[:4]
array([3.3221316, 2.0938659, 2.874437 , 4.17527  ], dtype=float32)

Regression model evaluation

The easiest evaluation is to plot a scatter of the real versus the predicted values.

We use the Matplotlib package for the plotting functions.

fig, ax = plt.subplots(1,1, figsize=(5,5))
ax.scatter(x=y_pred, y=Y_test, color="#dca0dc")
ax.set_xlabel("Predicted LD50")
ax.set_ylabel("Real LD50")

Regression Model Evaluation

  • Mean Absolute Error (MAE): average of the absolute difference between actual and predicted values
  • Mean Squared Error (MSE): average of the squared difference between actual and predicted values
  • R-squared (R2): proportion of the variance explained by the regression model. It is scale-free

[Diagram illustrating the MAE as the average deviation between predicted values and real values]
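As a worked toy example of these formulas (illustrative numbers, not the LD50 data):

import numpy as np

y_real = np.array([2.0, 3.0, 4.0])   #"real" values
y_hat  = np.array([2.5, 2.5, 4.5])   #"predicted" values

mae = np.mean(np.abs(y_real - y_hat))                                              # 0.5
mse = np.mean((y_real - y_hat) ** 2)                                               # 0.25
r2 = 1 - np.sum((y_real - y_hat) ** 2) / np.sum((y_real - np.mean(y_real)) ** 2)   # 0.625
print(mae, mse, r2)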

Regression model evaluation

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

#predict the validation set and calculate the metrics
valid_pred = reg.predict(X_valid)

mae = mean_absolute_error(Y_valid, valid_pred)
mse = mean_squared_error(Y_valid, valid_pred)
r2 = r2_score(Y_valid, valid_pred)

print("MAE: {}".format(mae))
print("MSE: {}".format(mse))
print("R2: {}".format(r2))
MAE: 0.8202228248986185
MSE: 1.1181765214543737
R2: -0.25158525929946496

What do we think of these results?

Improving regression models

We can test other models for regression to improve the performance on our data

from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor(n_estimators=10) #we use only 10 trees to keep training fast
reg.fit(X_train, Y_train)

test_pred = reg.predict(X_test)

fig, ax = plt.subplots(1,1, figsize=(5,5))
ax.scatter(x=test_pred, y=Y_test, color="#dca0dc")
ax.set_xlabel("Predicted LD50")
ax.set_ylabel("Real LD50")

Improving regression models

We can test other models for regression to improve the performance on our data

#recompute the validation predictions with the new model
valid_pred = reg.predict(X_valid)

mae = mean_absolute_error(Y_valid, valid_pred)
mse = mean_squared_error(Y_valid, valid_pred)
r2 = r2_score(Y_valid, valid_pred)

print("MAE: {}".format(mae))
print("MSE: {}".format(mse))
print("R2: {}".format(r2))
MAE: 0.5450362715297566
MSE: 0.5291154421053693
R2: 0.40775622175874715

Conclusions

  • We have used the Therapeutics Data Commons datasets to build simple classification and regression QSAR models:
    • Random Forest Classifier
    • Linear Regression
    • Random Forest Regressor
  • We have evaluated model performance using the validation and test sets, and the basic metrics for:
    • Classification: precision, recall and AUC
    • Regression: MAE, MSE, R2

EventFund_Session2_Skills

By Gemma Turon
