Validation of read across predictions for nanoparticle toxicities

Christoph Helma, Micha Rautenberg, Denis Gebele

in silico toxicology gmbh, Basel, Switzerland

Objectives

Validate lazar read across models for nanoparticles
Compare regression algorithms
- Local weighted average
- Local weighted partial least squares
- Local weighted random forests
Compare nanoparticle descriptors
- Nanoparticle properties (physchem, size, shape, ...)
- Interaction with human serum proteins
Provide an example for reproducible research

Relevant features

Features that correlate significantly with toxicity (Pearson correlation p-value < 0.05)
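
The selection step above can be sketched as follows. This is a minimal illustration, not lazar's implementation: the function names and toy data are hypothetical, and the p-value uses the Fisher z approximation rather than the exact t-test, so values near the 0.05 threshold may differ slightly from those of a statistics package.

```python
import math
from statistics import NormalDist


def pearson_r(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for constant input."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Two-sided p-value via the Fisher z approximation (an assumption)."""
    if n <= 3:
        return 1.0  # too few samples for the approximation
    if abs(r) >= 1.0:
        return 0.0  # perfect correlation; atanh would be undefined
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def relevant_features(features, toxicities, alpha=0.05):
    """Return {name: r} for features whose p-value is below alpha."""
    n = len(toxicities)
    selected = {}
    for name, values in features.items():
        r = pearson_r(values, toxicities)
        if approx_p_value(r, n) < alpha:
            selected[name] = r
    return selected


# Hypothetical toy data: one informative and one uninformative descriptor.
toxicities = [1.0, 2.1, 2.9, 4.2, 5.1, 5.8, 7.1, 8.0]
features = {
    "informative": [0.9, 2.0, 3.2, 3.9, 5.0, 6.1, 6.9, 8.2],
    "noise": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0],
}
selected = relevant_features(features, toxicities)
```

Returning the correlation coefficients along with the feature names keeps them available for later steps, for example as similarity weights.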

Weighted cosine similarity
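
One possible form of a weighted cosine similarity is sketched below. The source does not state how the weights are chosen; using non-negative per-feature weights (for instance the absolute correlation of each feature with toxicity) is an assumption made here for illustration, as are the function name and example vectors.

```python
import math


def weighted_cosine_similarity(a, b, weights):
    """Cosine similarity in which feature i contributes with weights[i].

    Weights are assumed to be non-negative.
    """
    dot = sum(w * x * y for w, x, y in zip(weights, a, b))
    norm_a = math.sqrt(sum(w * x * x for w, x in zip(weights, a)))
    norm_b = math.sqrt(sum(w * y * y for w, y in zip(weights, b)))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector
    return dot / (norm_a * norm_b)


# Hypothetical scaled descriptor vectors and weights.
weights = [0.8, 0.3]
same = weighted_cosine_similarity([1.0, 2.0], [1.0, 2.0], weights)
orthogonal = weighted_cosine_similarity([1.0, 0.0], [0.0, 1.0], weights)
opposite = weighted_cosine_similarity([1.0, 2.0], [-1.0, -2.0], weights)
```

With all weights equal, this reduces to the ordinary cosine similarity.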

Partial least squares and random forest models use the caret R package with default settings

Prediction intervals: 1.96*RMSE of caret's bootstrapped model predictions
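
As a small worked example, the interval can be computed from a point prediction and a model RMSE. The sketch assumes the RMSE has already been obtained from the resampled model predictions; the function name and numbers are hypothetical.

```python
def prediction_interval(prediction, model_rmse, z=1.96):
    """Return (lower, upper) bounds: prediction -/+ z * RMSE."""
    half_width = z * model_rmse
    return prediction - half_width, prediction + half_width


# Hypothetical prediction of 5.0 with a resampling RMSE of 1.5.
lower, upper = prediction_interval(5.0, 1.5)
```

The factor 1.96 corresponds to a 95% interval under the assumption of normally distributed prediction errors.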

If PLS/RF modelling or prediction fails, lazar resorts to using the weighted average method.
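
The fallback behaviour might look roughly like the sketch below. It assumes that the weighted average is a similarity-weighted mean of the neighbors' measured values; the function names, the `(similarity, value)` neighbor format and the `local_model` callable are hypothetical and do not reflect lazar's actual API.

```python
def weighted_average(neighbors):
    """Similarity-weighted mean of (similarity, measured_value) pairs."""
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no usable neighbors, no prediction
    return sum(sim * value for sim, value in neighbors) / total


def predict_with_fallback(local_model, query, neighbors):
    """Try a local PLS/RF-style model; fall back to the weighted average."""
    try:
        prediction = local_model(query, neighbors)
        if prediction is None:
            raise ValueError("local model returned no prediction")
        return prediction, "local model"
    except Exception:
        # Any failure during model building or prediction triggers the fallback.
        return weighted_average(neighbors), "weighted average"


def failing_model(query, neighbors):
    """Hypothetical stand-in for a model that cannot be fitted."""
    raise RuntimeError("model fitting failed")


neighbors = [(0.9, 2.0), (0.1, 4.0)]
prediction, method = predict_with_fallback(failing_model, None, neighbors)
```

Returning the method name alongside the prediction makes it possible to report how often the fallback was used.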

Three repeated 10-fold cross-validations with independent training/test set splits
No fixed random seed for training/test set splits, to avoid overfitting and to demonstrate the variability of validation results due to random training/test splits.
Separate feature selection for each training dataset to avoid overfitting
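
The validation scheme above can be sketched as a generator of training/test splits. This is an illustration under assumptions, not the lazar code: the function name is hypothetical, and feature selection and model building would be rerun on each `train` index list before predicting the corresponding `test` list.

```python
import random


def repeated_kfold(n_samples, k=10, repetitions=3, rng=None):
    """Yield (repetition, fold, train_indices, test_indices)."""
    if rng is None:
        rng = random.Random()  # deliberately unseeded: splits differ per run
    for rep in range(repetitions):
        indices = list(range(n_samples))
        rng.shuffle(indices)  # independent shuffle for each repetition
        folds = [indices[i::k] for i in range(k)]
        for fold, test in enumerate(folds):
            test_set = set(test)
            train = [i for i in indices if i not in test_set]
            yield rep, fold, train, test


# 121 particles as in the protein corona dataset: 3 x 10 = 30 splits.
splits = list(repeated_kfold(121))
```

Within one repetition every sample appears in exactly one test fold, so each repetition yields one prediction per sample.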

At least 100 examples per toxicity endpoint for statistically meaningful validation results
Non-empty intersection of descriptors required for calculating similarities

Net cell association endpoint of the Protein corona dataset (121 gold and silver particles)

Descriptors	Algorithm	r²	RMSE
Physchem	WA	`0.42, 0.46, 0.48`	`2.02, 1.94, 1.92`
Physchem	PLS	`0.53, 0.54, 0.49`	`1.83, 1.8, 1.9`
Physchem	RF	`0.53, 0.52, 0.54`	`1.82, 1.84, 1.79`
Proteomics	WA	`0.66, 0.63, 0.63 *`	`1.58, 1.62, 1.66 *`
Proteomics	PLS	`0.59, 0.66, 0.63 *`	`1.74, 1.56, 1.65 *`
Proteomics	RF	`0.66, 0.65, 0.63 *`	`1.56, 1.59, 1.64 *`
All	WA	`0.73, 0.66, 0.66 *`	`1.41, 1.57, 1.58 *`
All	PLS	`0.67, 0.64, 0.69 *`	`1.53, 1.63, 1.5 *`
All	RF	`0.69, 0.69, 0.7 **`	`1.51, 1.5, 1.46 **`
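
For reference, the two statistics in the table can be computed from measured and cross-validated predicted values as sketched below. Treating r² as the squared Pearson correlation is an assumption here, and the function names and toy values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between measurements and predictions."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation; returns 0.0 for constant input."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    sop = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    soo = sum((o - mean_o) ** 2 for o in observed)
    spp = sum((p - mean_p) ** 2 for p in predicted)
    if soo == 0 or spp == 0:
        return 0.0
    return sop ** 2 / (soo * spp)


# Hypothetical values: predictions are systematically 0.5 too high.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]
error = rmse(observed, predicted)
r2 = r_squared(observed, predicted)
```

In this toy case r² is 1 although every prediction is off by 0.5, which is why the table reports RMSE alongside r².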

Gold and silver particles included!

Correlation of log2-transformed net cell association measurements with random forest predictions using physchem properties and protein corona data.

Manuscript and presentation, including figures and tables, are built directly from experimental results
Custom pandoc filter (similar to knitr for R)
Simple Makefile (`make clean; make` re-runs all experiments and creates an updated manuscript)

More aggressive parameter optimization and feature selection (danger of overfitting a relatively large dataset)
Mechanistic interpretation of relevant features (nanoparticle properties and proteins)