Honglei Xie

DREAM Big Data Challenge

December 04, 2015 | 3 Minute Read

Last summer, my colleagues at the Ontario Institute for Cancer Research and I teamed up as the Chipmunks and participated in one of the DREAM big data challenges: the Acute Myeloid Leukemia (AML) Outcome Prediction Challenge. Similar to Kaggle, we were asked to develop predictive models based on the AML dataset that physicians could use to better assess a patient’s potential response to treatment. Model performance was assessed on a weekly basis, and it was a great hands-on experience digging into real big data.

Problem

Determine the best model to predict which AML patients will achieve Complete Remission (CR) or be Primary Resistant (PR). The resp.simple target variable in the training data contains this information. Instead of predicting a binary value, our model reports a decimal value between 0 and 1 expressing the confidence that a patient will respond to treatment and achieve complete remission.

Data Description

The training data provided consists of measurements from 191 patients diagnosed with AML who were treated at M.D. Anderson Cancer Center and received chemotherapy. These measurements include 40 covariates describing patient demographics, cytogenetics and mutation status, and the results of several standard blood tests.

Model Summary

Balanced random forest algorithm with categorized clinical features.

Methods

Data Pre-processing

Considering that only a small proportion of values were missing (the highest missing rate occurred in FIBRINOGEN: 7.33% in the training data and 6% in the test data), we replaced missing values with medians.
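
As a minimal sketch, a median imputation step along these lines (the impute_median helper and the train/test data frame names are hypothetical, not from our original code) might look like:

```r
# Minimal sketch of median imputation; 'impute_median', 'train', and 'test'
# are hypothetical names, not from the original code.
impute_median <- function(df, medians = NULL) {
  num <- vapply(df, is.numeric, logical(1))
  if (is.null(medians)) {
    medians <- vapply(df[num], median, numeric(1), na.rm = TRUE)
  }
  for (v in names(medians)) {
    df[[v]][is.na(df[[v]])] <- medians[[v]]
  }
  list(data = df, medians = medians)
}

imp   <- impute_median(train)                   # learn medians on training data
train <- imp$data
test  <- impute_median(test, imp$medians)$data  # reuse training medians on test data
```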

Cytogenetics is usually regarded as an important prognostic factor in acute myeloid leukemia, so it should be processed carefully. We re-categorized cyto.cat into three new risk categories with reference to Grimwade et al. (1998).
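
As an illustration, the re-categorization could be coded as below. The level names are hypothetical examples of a Grimwade-style favorable/intermediate/adverse split; the actual cyto.cat levels in the AML data may differ:

```r
# Hypothetical re-categorization of cyto.cat into three Grimwade-style
# risk groups; the level names below are illustrative, not the actual data.
favorable <- c("t(8;21)", "inv(16)", "t(15;17)")
adverse   <- c("-5", "-7", "del(5q)", "complex")

train$cyto_risk <- factor(
  ifelse(train$cyto.cat %in% favorable, "favorable",
  ifelse(train$cyto.cat %in% adverse,   "adverse", "intermediate")),
  levels = c("favorable", "intermediate", "adverse")
)
```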

Training, Training and Training

According to Breiman (2001), mtry, the number of variables randomly sampled at each split, is essentially the only tuning parameter in a random forest. We wanted to find an optimal mtry that properly balances the correlation between trees against the strength of each individual tree. As for the number of trees, growing more is never harmful, so taking the sample size into account, we chose 100,000.
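
As a lightweight sketch (we actually tuned via cross-validation, as described next), one could also scan mtry values using out-of-bag error; x and y are assumed to hold the imputed covariates and the resp.simple labels, and a smaller ntree keeps the scan fast:

```r
library(randomForest)

# Hypothetical scan of mtry using out-of-bag error; x and y are assumed
# to hold the imputed covariates and the resp.simple factor labels.
set.seed(2015)
oob_err <- sapply(1:10, function(m) {
  rf <- randomForest(x, y, mtry = m, ntree = 2000)
  rf$err.rate[2000, "OOB"]   # OOB error once the full forest is grown
})
which.min(oob_err)           # candidate mtry
```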

We used 10-fold cross-validation and random splitting to tune mtry, and were surprised to find that mtry = 2 yielded the best model. Another important technique we used was down-sampling, which addressed the class imbalance problem. Specifically, when growing each individual tree, we drew a bootstrap sample from the majority class of the same size as the minority class. This corrected the classifier’s tendency to favor the majority class. Fortunately, this is easily accomplished with the randomForest::randomForest function in R.
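
A sketch of the balanced fit, using randomForest’s strata and sampsize arguments for the stratified down-sampling (x and y are assumed as above; x_test and the "CR" level name are placeholders):

```r
library(randomForest)

# Sketch of the balanced random forest fit; x and y are assumed as above.
n_min <- min(table(y))   # size of the minority class

fit <- randomForest(
  x, y,
  ntree    = 100000,                  # large forest, as chosen above
  mtry     = 2,                       # value selected by cross-validation
  strata   = y,                       # stratify the bootstrap by class
  sampsize = rep(n_min, nlevels(y)),  # down-sample the majority class per tree
  replace  = TRUE
)

# Confidence that a patient achieves complete remission;
# "CR" is assumed to be one of the resp.simple levels.
p_cr <- predict(fit, newdata = x_test, type = "prob")[, "CR"]
```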

Model Validation

Model performance is evaluated by the simple average of the Balanced Accuracy Rate (BAC) and the Area Under the ROC Curve (AUC). By 10-fold CV, we obtained the following internal validation results (they look really promising!):

CV Performance   BAC      AUC
Median           0.8310   0.7864
Mean             0.8106   0.7785
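
For reference, the challenge score can be computed roughly as follows. Here truth is assumed to be a 0/1 vector (1 = complete remission) and prob the predicted confidence; the pROC package is one of several options for the AUC:

```r
library(pROC)   # one of several packages that compute AUC

# Balanced Accuracy Rate at a given cutoff: mean of sensitivity and specificity.
bac <- function(truth, prob, cutoff = 0.5) {
  pred <- as.integer(prob >= cutoff)
  sens <- sum(pred == 1 & truth == 1) / sum(truth == 1)
  spec <- sum(pred == 0 & truth == 0) / sum(truth == 0)
  (sens + spec) / 2
}

auc_val <- as.numeric(auc(truth, prob))      # Area Under the ROC Curve
score   <- (bac(truth, prob) + auc_val) / 2  # challenge score: mean of BAC and AUC
```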

A Little Trick

Note that the threshold that maximizes BAC is not necessarily 0.5; in fact, it varies across different cuts of the training data. What if people simply use 0.5 to judge your model? Here is our little trick for attaining the maximum BAC when the threshold is fixed at 0.5. We applied a simple piece-wise linear transformation \(f(x)\) to the prediction probabilities such that (a code sketch follows the list):

  • \(f(x)\) is monotonically increasing;
  • \(f(0)=0\);
  • \(f(c)=0.5\) where \(c\) is the threshold that produces the maximum BAC.
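
A minimal implementation of such a transformation, adding the natural extra choice \(f(1)=1\); the cutoff value used below is purely illustrative:

```r
# Piece-wise linear recalibration: monotone increasing, f(0) = 0, f(c) = 0.5,
# and (a natural extra choice) f(1) = 1, where c is the BAC-optimal cutoff.
recalibrate <- function(x, c) {
  ifelse(x <= c,
         0.5 * x / c,
         0.5 + 0.5 * (x - c) / (1 - c))
}

p_adj <- recalibrate(p_cr, c = 0.42)   # 0.42 is an illustrative cutoff, not ours
```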

Please check out the source R code on my GitHub.