# Bringing data science and AI/ML tools to infectious disease research

# H3D Foundation and Ersilia present


# Session 2: Building a machine learning model

Skills Development workshop

## Tools

Google Colaboratory

Python Packages

Therapeutics Data Commons

## Getting Started

https://github.com/ersilia-os/event-fund-ai-drug-discovery

## Classification tasks

A classifier identifies the category of a new data point. In the case of bioactivity data, we usually find two categories (binary classification):

- Active (1)
- Inactive (0)

Wang, Shuangquan et al., Molecular Pharmaceutics, 2016

To learn more about classification, we will use the dataset for blockade of hERG activity (Tox) prepared by the Therapeutics Data Commons:

- Actives (1): block hERG
- Inactives (0): do not block hERG

## Load the data from TDC

```
# import the hERG dataset, part of the Toxicity data available at TDC
!pip install PyTDC
from tdc.single_pred import Tox
data = Tox(name = "hERG")
split = data.get_split()
# we can now separate the dataset into its three splits
train = split["train"]
validation = split["valid"]
test = split["test"]
train.to_csv("drive/MyDrive/h3d_ersilia_ai_workshop/data/session2/tdc_herg_train.csv", index=False)
validation.to_csv("drive/MyDrive/h3d_ersilia_ai_workshop/data/session2/tdc_herg_validation.csv", index=False)
test.to_csv("drive/MyDrive/h3d_ersilia_ai_workshop/data/session2/tdc_herg_test.csv", index=False)
```

## Load the data from TDC

```
# we can check how many molecules we have in each set
print(len(train))        # 458
print(len(validation))   # 66
print(len(test))         # 131
```

The train set is used to train the ML model.

The validation set is used to improve the model.

The test set is used to evaluate the final model performance.

In this example we will not focus on optimizing the model parameters, therefore we will only use the train and test splits.
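The idea behind the three splits can be sketched with plain Python on toy stand-ins (an illustration only, not the TDC loader used above; the 70/10/20 proportions mirror the 458/66/131 split):

```python
import random

# toy stand-ins for 100 molecules
molecules = list(range(100))
random.seed(42)
random.shuffle(molecules)

# 70% train / 10% validation / 20% test
n = len(molecules)
train_set = molecules[: int(0.7 * n)]
valid_set = molecules[int(0.7 * n): int(0.8 * n)]
test_set = molecules[int(0.8 * n):]
print(len(train_set), len(valid_set), len(test_set))  # 70 10 20
```

Shuffling before slicing avoids any ordering bias in the original list; TDC performs an analogous partition for us via `get_split()`.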


```
# inspect the data and check the class balance
train.head()
print(len(train[train["Y"]==0]))  # inactives
print(len(train[train["Y"]==1]))  # actives
```

## Molecular Representation

```
!pip install rdkit  # cheminformatics package
from rdkit import Chem
from rdkit.Chem import Draw
smiles = train["Drug"][449:]  # select a few SMILES for visualization
# convert the SMILES strings into RDKit molecule objects
mols = [Chem.MolFromSmiles(smi) for smi in smiles]
```

Each element of `mols` is an RDKit molecule object: `<rdkit.Chem.rdchem.Mol at 0x7fa16c9435d0>`

```
from rdkit.Chem import Draw
Draw.MolsToGridImage(mols)
```

## Molecular Featurization

Featurization is the conversion of SMILES into vectors or images that can be passed to the ML model

Duran-Frigola et al, Nat Biotech 2020
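To illustrate the concept before using the real featurizer below, here is a toy bag-of-characters featurizer we made up for this sketch (far cruder than Chemical Checker signatures, which also encode bioactivity information):

```python
# Toy featurization: turn a SMILES string into a fixed-length count
# vector over a small alphabet of common SMILES characters.
# Note this naive scheme splits two-letter atoms like "Br" and "Cl".
ALPHABET = "CNOSPFclBrI=#()123456"

def featurize(smiles):
    return [smiles.count(ch) for ch in ALPHABET]

print(featurize("CCO"))        # ethanol
print(featurize("c1ccccc1O"))  # phenol
```

Every molecule, whatever its size, becomes a vector of the same length, which is the property the ML model needs from any featurization.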

```
!pip install signaturizer
from signaturizer import Signaturizer
sign = Signaturizer("GLOBAL")
X_train = sign.predict(train["Drug"]).signature
X_test = sign.predict(test["Drug"]).signature
# keep the activity labels as the model output
Y_train = list(train["Y"])
Y_test = list(test["Y"])
```

`X_train[0]`

```
array([-0.09308645,  0.09305048, -0.09294817, ..., -0.05613168, -0.1379203 ,  0.05799835], dtype=float32)
```

## Data we have so far:

Y_datasets: list of 0 and 1 results that correspond to the list of molecules in the train and test sets

- Y_train, Y_test

X datasets: vector-like representation of the molecules calculated using the Chemical Checker --> model input

- X_train, X_test

Original datasets: table with SMILES and its associated hERG blocking activity as 0 or 1

- train, test

## Supervised Machine Learning

In summary, we now have two datasets (train, test):

- X (input): molecular signature
- Y (output): activity against hERG, 1 or 0

We will use the Random Forest Classifier algorithm implemented in scikit-learn.

```
!pip install scikit-learn
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier()
clf.fit(X_train, Y_train)
```

## Classification outputs

Once the model is fitted, we can predict the category of each molecule in the test set. A classifier outputs two numbers for each prediction:

- Probability of 0
- Probability of 1

```
y_pred = clf.predict_proba(X_test)
y_pred[:10]
```

```
array([[0.31 , 0.69 ], [0.37 , 0.63 ], [0.69 , 0.31 ], [0.14 , 0.86 ],
       [0.58 , 0.42 ], [0.84 , 0.16 ], [0.163, 0.837], [0.257, 0.743],
       [0.111, 0.889], [0.303, 0.697]])
```

The probability of the active class alone is given by the second column: `y_pred[:,1]`

## Classification outputs

In a classification, the output is a probability. We need to define a cut-off or threshold to transform the results back into a binary output (0 or 1). The threshold is typically set at 0.5 by default.

| Proba 0 | Proba 1 | Cut-off 0.5 | Cut-off 0.7 | Cut-off 0.3 |
|---|---|---|---|---|
| 0.39 | 0.61 | 1 | 0 | 1 |
| 0.3 | 0.7 | 1 | 1 | 1 |
| 0.69 | 0.31 | 0 | 0 | 1 |
| 0.21 | 0.79 | 1 | 1 | 1 |
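The binarization step can be reproduced in a few lines of plain Python (a sketch; the `binarize` helper is ours, and we label a molecule active when its class-1 probability reaches the cut-off):

```python
# Binarize class-1 probabilities at different cut-offs:
# a molecule is labelled active (1) when its probability of
# class 1 reaches the cut-off, inactive (0) otherwise.
proba_1 = [0.61, 0.7, 0.31, 0.79]  # values from the table above

def binarize(probs, cutoff):
    return [1 if p >= cutoff else 0 for p in probs]

for cutoff in (0.5, 0.7, 0.3):
    print(cutoff, binarize(proba_1, cutoff))
```

Each printed list reproduces one cut-off column of the table: stricter cut-offs turn borderline actives into inactives.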

## Model Evaluation

Confusion Matrix: a table that compares the true labels (Inactives/Actives) against the predicted labels.

| | Predicted Inactive | Predicted Active |
|---|---|---|
| True Inactive | True Negative (TN) | False Positive (FP) |
| True Active | False Negative (FN) | True Positive (TP) |

Precision: how many predicted positives are actually positive. Precision = TP / (TP + FP)

Recall: how many positives are we able to identify. Recall = TP / (TP + FN)
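A quick arithmetic sketch of these two formulas, using the test-set confusion-matrix counts that appear later in this deck (TP = 93, FP = 17, FN = 5):

```python
# Precision and recall from raw confusion-matrix counts
TP, FP, FN = 93, 17, 5

precision = TP / (TP + FP)  # how many predicted actives are truly active
recall = TP / (TP + FN)     # how many true actives we recover

print(round(precision, 3))  # 0.845
print(round(recall, 3))     # 0.949
```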

## Model Evaluation

The threshold on the class-1 probability (proba 1) controls the balance between precision and recall:

- A higher, more restrictive threshold favours precision.
- A lower threshold favours recall.

## Model Evaluation

- In which scenarios might we want a high precision?
- In which scenarios might we want a high recall?

## Model Evaluation

```
from sklearn.metrics import ConfusionMatrixDisplay as cdm
cdm.from_estimator(clf, X_test, Y_test)
```

## Model Evaluation

Precision: how many positives are actually positive. Precision = TP / (TP + FP) = 93 / (93 + 17) ≈ 0.85

Recall: how many positives are we able to identify. Recall = TP / (TP + FN) = 93 / (93 + 5) ≈ 0.95

## Model Evaluation

```
from sklearn.metrics import precision_score, recall_score
# precision and recall scores need a specific threshold
y_pred_bin = []
for x in y_pred[:, 1]:
    if x > 0.5:
        y_pred_bin += [1]
    else:
        y_pred_bin += [0]
precision = precision_score(Y_test, y_pred_bin)
recall = recall_score(Y_test, y_pred_bin)
print(precision, recall)
```

0.8454545454545455 0.9489795918367347

## Data we have so far:

Original datasets: train, test

X datasets: X_train, X_test

Y_datasets: Y_train, Y_test

Predictions:

- y_pred: output of the classifier, expressing the probability that a molecule is inactive (0) or active (1).
- y_pred_bin: binarized activity based on the predicted probability, the cut-off is set by the researcher.

## Model Evaluation

True Positive Rate (sensitivity or recall): proportion of correctly predicted positives out of all positive observations (TP/(TP+FN))

False Positive Rate (1 - specificity): proportion of incorrectly predicted positives out of all negative observations (FP/(TN+FP))

ROC Curve: performance of the model at all classification thresholds (0-1)


## Model Evaluation

ROC Curve: performance of the model at all classification thresholds (0-1)

Area Under the Curve (AUC): aggregate measure of model performance. It does not depend on the threshold.

[Figure: ROC curve plotting TPR (0 to 1) against FPR (0 to 1); a perfect model reaches the top-left corner, a model with no value follows the diagonal, and a real curve lies between the two.]
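The AUC can also be read as the probability that a randomly chosen active scores higher than a randomly chosen inactive, which makes its threshold-independence explicit. A minimal sketch on made-up scores (the `auc` helper and the toy probabilities are ours):

```python
# AUC as the probability that a random positive outscores a random negative
def auc(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count half
    return wins / (len(pos_scores) * len(neg_scores))

# toy class-1 probabilities for actives and inactives
actives = [0.9, 0.8, 0.6]
inactives = [0.7, 0.4, 0.2]
print(auc(actives, inactives))  # 8/9 ≈ 0.889
```

No cut-off appears anywhere in the computation, only the ranking of the scores, which is why changing the threshold never changes the AUC.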

## Model Evaluation

```
from sklearn.metrics import RocCurveDisplay as rdc
rdc.from_estimator(clf, X_test, Y_test)
```

## Model Evaluation

```
# re-evaluate on the training set to compare with the test performance
from sklearn.metrics import ConfusionMatrixDisplay as cdm
cdm.from_estimator(clf, X_train, Y_train)
from sklearn.metrics import RocCurveDisplay as rdc
rdc.from_estimator(clf, X_train, Y_train)
```

## Classification tasks: summary

- We have used a binary classification for hERG activity and trained a random forest classifier
- The molecules were featurized using Chemical Checker signatures, which combine structural and bioactivity data of the molecules
- The model performance was good (AUC above 0.8) on both the validation and test sets
- How could we try to improve the model?

## Regression tasks

A regressor predicts continuous numerical values, for example:

- IC50
- % of inhibition

To learn more about regression, we will use the Therapeutics Data Commons Acute Toxicity dataset:

- Lethal Dose 50 (LD50): amount of drug (mg/kg) that kills 50% of the tested population

Zhu, Hao et al., Chemical Research in Toxicology, 2009

## Load the data from TDC

```
# load the Acute Toxicity (LD50) dataset and check how many molecules are in each set
from tdc.single_pred import Tox
data = Tox(name = "LD50_Zhu")
split = data.get_split()
train = split["train"]
validation = split["valid"]
test = split["test"]
print(len(train))
print(len(validation))
print(len(test))
```

Train set: 5170 molecules

Validation set: 738 molecules

Test set: 1477 molecules

## Molecular Representation

```
train.head()
```

```
import matplotlib.pyplot as plt
plt.hist(train["Y"], bins=50, color = "#50285a")
plt.xlabel("Lethal Dose 50")
plt.ylabel("Number of molecules")
```

## Molecular Featurization

Chemical Checker Signatures

Duran-Frigola et al, Nat Biotech 2020

```
from signaturizer import Signaturizer
sign = Signaturizer("GLOBAL")
X_train = sign.predict(train["Drug"]).signature
X_test = sign.predict(test["Drug"]).signature
```

array([-0.09308645, 0.09305048, -0.09294817, ..., -0.05613168, -0.1379203 , 0.05799835], dtype=float32)

```
Y_train = list(train["Y"])
Y_test = list(test["Y"])
```


## Supervised Machine Learning

In summary, we now have two datasets (train, test):

- X (input): molecular signature
- Y (output): LD50 values

We will use a simple linear regression from scikit-learn to train a model.

```
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
reg.fit(X_train, Y_train)
```

## Regression outputs

The regression gives back the predicted value, in our case, LD50.

```
y_pred = reg.predict(X_test)
#if we check, the prediction is simply a list of LD50
y_pred[:4]
```

```
array([3.3221316, 2.0938659, 2.874437 , 4.17527 ], dtype=float32)
```

## Regression model evaluation

The easiest evaluation is to plot a scatter distribution of the real versus the predicted values.

We use the Matplotlib package for plotting functions

```
fig, ax = plt.subplots(1,1, figsize=(5,5))
ax.scatter(x=y_pred, y=Y_test, color="#dca0dc")
ax.set_xlabel("Predicted LD50")
ax.set_ylabel("Real LD50")
```

## Regression Model Evaluation

- Mean Absolute Error (MAE): average of the absolute difference between actual and predicted values
- Mean Squared Error (MSE): average of the squared difference
- R-square: proportion of the variance explained by the regression model. It is scale free

[Figure: scatter of predicted values versus real values; the MAE corresponds to the average distance of the points from the identity line.]
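The three metrics can be computed by hand from their definitions on a toy example (the numbers below are made up for illustration):

```python
# MAE, MSE and R2 computed from their definitions
y_true = [2.0, 3.0, 4.0, 5.0]
y_hat = [2.5, 2.5, 4.5, 4.5]

n = len(y_true)
mae = sum(abs(t - p) for t, p in zip(y_true, y_hat)) / n
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_hat)) / n

# R2 = 1 - (residual sum of squares / total sum of squares)
mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_hat))
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, r2)  # 0.5 0.25 0.8
```

Note that R2 can go negative when the model does worse than simply predicting the mean, which is exactly what we will see with the linear regression below.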

## Regression model evaluation

```
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
mae = mean_absolute_error(Y_test, y_pred)
mse = mean_squared_error(Y_test, y_pred)
r2 = r2_score(Y_test, y_pred)
print("MAE: {}".format(mae))
print("MSE: {}".format(mse))
print("R2: {}".format(r2))
```
```

```
MAE: 0.8202228248986185
MSE: 1.1181765214543737
R2: -0.25158525929946496
```

What do we think of these results?

## Improving regression models

We can test other models for regression to improve the performance on our data

```
from sklearn.ensemble import RandomForestRegressor
reg = RandomForestRegressor(n_estimators=10)  # we use only 10 trees for speed
reg.fit(X_train, Y_train)
test_pred = reg.predict(X_test)
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
ax.scatter(x=test_pred, y=Y_test, color="#dca0dc")
ax.set_xlabel("Predicted LD50")
ax.set_ylabel("Real LD50")
```

## Improving regression models

We can test other models for regression to improve the performance on our data

```
mae = mean_absolute_error(Y_valid, valid_pred)
mse = mean_squared_error(Y_valid, valid_pred)
r2 = r2_score(Y_valid, valid_pred)
print("MAE: {}".format(mae))
print("MSE: {}".format(mse))
print("R2: {}".format(r2))
```

```
MAE: 0.5450362715297566
MSE: 0.5291154421053693
R2: 0.40775622175874715
```

## Conclusions

- We have used the Therapeutics Data Commons datasets to build simple classification and regression QSAR models:
- Random Forest Classifier
- Linear Regression
- Random Forest Regressor

- We have evaluated model performance using the validation and test sets, and the basic metrics for:
- Classification: precision, recall and AUC
- Regression: MAE, MSE, R2

#### EventFund_Session2_Skills

By Gemma Turon
