In many real-world applications, a common problem is the occurrence of imbalanced data, where one class outcome (the majority class, 0) is observed far more frequently than the other (the minority class, 1):
Business analytics: customer churn data
Software and text recognition: spam filtering, detecting coding errors
Fault detection and quality control in industrial systems
Biomedical data when the prevalence of the target group is relatively small, or when the event is measured over a short interval
Example from causal analysis
An investigator studying the value of adjuvant treatment versus surgery in esophageal cancer patients needed to apply causal techniques for their analysis. This required predicting the treatment group using patient features (treatment overlap analysis). There are \(N_0=6649\) surgery patients and \(N_1=988\) adjuvant treatment patients, giving an imbalanced ratio of \(6649/988 \approx 6.7\).
Running a RF classifier on the data, they obtained the following confusion matrix (from the OOB ensemble):
                predicted
observed     surgery   adjuvant   error
surgery         6549        100    .015
adjuvant         694        294    .702

Overall error rate: 10.4%
The TNR (true negative rate) and TPR (true positive rate) are \[
\text{TNR} = \frac{6549}{6549+100}=98.5\%,\hskip10pt
\text{TPR} = \frac{294}{694+294}=29.8\%
\] The geometric mean (gmean) is \[
\text{gmean}=\sqrt{.985\times .298} = 54.2\%
\] This is very poor; for comparison, random guessing gives a gmean of 0.5
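The metrics above can be reproduced directly from the confusion matrix in base R; the counts below are copied from the esophageal table:

```r
## recompute the OOB metrics from the esophageal confusion matrix
cm <- matrix(c(6549, 100,
                694, 294), nrow = 2, byrow = TRUE,
             dimnames = list(observed = c("surgery", "adjuvant"),
                             predicted = c("surgery", "adjuvant")))

TNR   <- cm[1, 1] / sum(cm[1, ])          # true negative rate
TPR   <- cm[2, 2] / sum(cm[2, ])          # true positive rate
gmean <- sqrt(TNR * TPR)                  # geometric mean
err   <- (cm[1, 2] + cm[2, 1]) / sum(cm)  # overall misclassification

round(c(TNR = TNR, TPR = TPR, gmean = gmean, error = err), 3)
```

Note that gmean computed from the unrounded rates is 54.1%; the 54.2% above comes from multiplying the already-rounded rates .985 and .298.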
Metrics for evaluating class imbalanced performance
Avoid using AUC, misclassification error and other standard metrics and instead use:
gmean
PR-AUC (precision recall, AUC)
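PR-AUC can be computed without any package. The sketch below is a minimal base-R implementation in the average-precision style (it ignores threshold ties); `prob` and `y` are toy stand-ins, not the esophageal data:

```r
## minimal sketch of PR-AUC (average precision) in base R
pr_auc <- function(prob, y) {
  ord <- order(prob, decreasing = TRUE)   # sort cases by predicted probability
  y <- y[ord]
  tp <- cumsum(y)                         # true positives at each cutoff
  fp <- cumsum(1 - y)                     # false positives at each cutoff
  precision <- tp / (tp + fp)
  recall <- tp / sum(y)
  ## step-wise integral of precision over recall
  sum(diff(c(0, recall)) * precision)
}

## perfect separation gives PR-AUC of 1
pr_auc(prob = c(.9, .8, .2, .1), y = c(1, 1, 0, 0))
```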
ROC curve and PR-AUC curve for RF classifier from esophageal study of adjuvant treatment versus surgery
Why does the classifier have such poor performance?
The problem here (and for other ML methods) is the equal misclassification costs used by the Bayes decision rule \[
\delta(\mathbf{x})=1 \hskip5pt\text{if}\hskip5pt P\{Y = 1 \mid \mathbf{X} = \mathbf{x}\} \ge \frac{1}{2}
\]
In imbalanced data, the probability of an event is very small, so the classifier tends to classify all cases as the majority class label 0
RFQ (random forest quantile) classifier
Use a different threshold. In place of 0.5 (the median) we use \(\pi=P\{Y=1\}\)\[
\delta(\mathbf{x})=1
\hskip5pt\text{if}\hskip5pt
P\{Y = 1 \mid \mathbf{X} = \mathbf{x}\}
\ge
\pi
\approx \frac{N_1}{N_0+N_1}
\]
This has a dual optimality property: it is cost-weighted Bayes optimal and it maximizes TNR \(+\) TPR
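The effect of swapping the median threshold for \(\pi\) can be seen with a few assumed toy probabilities (the values below are illustrative, not from the esophageal fit):

```r
## sketch of the median (Bayes) versus quantile (RFQ) decision rule
p.hat  <- c(.05, .10, .20, .45, .60)  # toy values of P(Y = 1 | X = x)
pi.hat <- .15                         # minority prevalence, approx N1 / (N0 + N1)

bayes.rule <- as.integer(p.hat >= 0.5)     # classifies nearly everything as 0
rfq.rule   <- as.integer(p.hat >= pi.hat)  # recovers more minority cases

rbind(bayes = bayes.rule, rfq = rfq.rule)
```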
General call to imbalanced
Classification example: Glioma
Imbalanced class
Tip
imbalanced() is useful if you have two imbalanced class labels
data(glioma, package = "varPro")
table(glioma$y)
>     Classic-like            Codel      G-CIMP-high       G-CIMP-low 
>              148              174              249               25 
>         LGm6-GBM Mesenchymal-like          PA-like 
>               41              215               26 

## make a super majority class
class.combine <- c("Classic-like", "Codel", "G-CIMP-high", "Mesenchymal-like")
ynew <- factor(1 * !is.element(glioma$y, class.combine))

## replace y with the new super class
glioma2 <- glioma
glioma2$y <- ynew
table(glioma2$y)
>   0   1 
> 786  94 
Imbalanced classification: Glioma
standard RF classifier
## standard RF classifier
o1 <- rfsrc(y ~ ., glioma2)
print(o1)
                         Sample size: 880
           Frequency of class labels: 786, 94
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 29.726
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 556
                            Analysis: RF-C
                              Family: class
                      Splitting rule: gini *random*
       Number of random split points: 10
                    Imbalanced ratio: 8.3617
                   (OOB) Brier score: 0.03639723
        (OOB) Normalized Brier score: 0.1455889
                           (OOB) AUC: 0.98621488
                      (OOB) Log-loss: 0.1298745
                        (OOB) PR-AUC: 0.9170852
                        (OOB) G-mean: 0.72932496
   (OOB) Requested performance error: 0.05, 0, 0.46808511

Confusion matrix:

          predicted
  observed   0  1 class.error
         0 786  0      0.0000
         1  44 50      0.4681

        (OOB) Misclassification rate: 0.05
Imbalanced classification: Glioma
RFQ classifier
## RFQ classifier
o2 <- imbalanced(y ~ ., glioma2)
print(o2)
                         Sample size: 880
           Frequency of class labels: 786, 94
                     Number of trees: 3000
           Forest terminal node size: 1
       Average no. of terminal nodes: 26.557
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 556
                            Analysis: RFQ
                              Family: class
                      Splitting rule: auc *random*
       Number of random split points: 10
                    Imbalanced ratio: 8.3617
                   (OOB) Brier score: 0.03362459
        (OOB) Normalized Brier score: 0.13449834
                           (OOB) AUC: 0.99057983
                      (OOB) Log-loss: 0.12037099
                        (OOB) PR-AUC: 0.94065642
                        (OOB) G-mean: 0.93830013
   (OOB) Requested performance error: 0.06169987

Confusion matrix:

          predicted
  observed   0  1 class.error
         0 692 94      0.1196
         1   0 94      0.0000

        (OOB) Misclassification rate: 0.1068182
Imbalanced classification: Glioma
Balanced RF (BRF)
BRF under-samples the majority class to balance the frequencies
This is an example of a data-level method
Surprisingly, BRF is actually an example of a quantile classifier and is theoretically equivalent to RFQ
However, in finite samples it is less efficient
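The data-level idea behind BRF can be sketched in a few lines of base R on assumed toy labels (this illustrates only the under-sampling step, not the full forest):

```r
## sketch of BRF's under-sampling step: draw a majority-class
## subsample the size of the minority class before fitting
set.seed(1)
y <- factor(c(rep(0, 900), rep(1, 100)))   # toy imbalanced labels

minority.n <- min(table(y))                # size of the minority class
keep <- c(sample(which(y == 0), minority.n),  # down-sampled majority
          which(y == 1))                      # all minority cases

table(y[keep])   # both classes now have 100 cases
```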
Imbalanced classification: Glioma
## BRF classifier
o3 <- imbalanced(y ~ ., glioma2, method = "brf")
print(o3)
                         Sample size: 880
           Frequency of class labels: 786, 94
                     Number of trees: 3000
           Forest terminal node size: 1
       Average no. of terminal nodes: 15.1263
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swr
    Resample size used to grow trees: 188
                            Analysis: RF-C
                              Family: class
                      Splitting rule: auc *random*
       Number of random split points: 10
                    Imbalanced ratio: 8.3617
                   (OOB) Brier score: 0.04844594
        (OOB) Normalized Brier score: 0.19378376
                           (OOB) AUC: 0.98658708
                      (OOB) Log-loss: 0.19773628
                        (OOB) PR-AUC: 0.92307335
                        (OOB) G-mean: 0.91219607
   (OOB) Requested performance error: 0.08780393

Confusion matrix:

          predicted
  observed   0  1 class.error
         0 759 27      0.0344
         1  13 81      0.1383

        (OOB) Misclassification rate: 0.04545455
Imbalanced classification: Glioma
## BRF - brute force
o4 <- rfsrc(y ~ ., glioma2,
            case.wt = randomForestSRC:::make.wt(glioma2$y),
            sampsize = randomForestSRC:::make.size(glioma2$y))
print(o4)
                         Sample size: 880
           Frequency of class labels: 786, 94
                     Number of trees: 500
           Forest terminal node size: 1
       Average no. of terminal nodes: 19.234
No. of variables tried at each split: 36
              Total no. of variables: 1241
       Resampling used to grow trees: swor
    Resample size used to grow trees: 188
                            Analysis: RF-C
                              Family: class
                      Splitting rule: gini *random*
       Number of random split points: 10
                    Imbalanced ratio: 8.3617
                   (OOB) Brier score: 0.05301977
        (OOB) Normalized Brier score: 0.21207909
                           (OOB) AUC: 0.98412376
                      (OOB) Log-loss: 0.20939848
                        (OOB) PR-AUC: 0.90805735
                        (OOB) G-mean: 0.90192259
   (OOB) Requested performance error: 0.06477273, 0.05597964, 0.13829787

Confusion matrix:

          predicted
  observed   0  1 class.error
         0 742 44      0.0560
         1  13 81      0.1383

        (OOB) Misclassification rate: 0.06477273
Imbalanced classification: Glioma
## gmean variable importance with ci
o2 <- imbalanced(y ~ ., glioma2, importance = "permute", block.size = 20)
oo2 <- subsample(o2)
plot.subsample(oo2)
Competing risk
Family                 Example Grow Call with Formula Specification
Regression             rfsrc(Ozone~., data = airquality)
Quantile Regression    quantreg(mpg~., data = mtcars)
Classification         rfsrc(Species~., data = iris)
Imbalanced Two-Class   imbalanced(status~., data = breast)
This is a specialized survival (time to event) analysis setting:
In competing risk the individual is subject to more than one event.
For example, a patient listed to receive a heart transplant can die while on the waitlist, or they can receive the heart. The patient is subject to two competing risks: “death” and “transplant”
Right-censoring also occurs just like in standard survival analysis
Competing risk quantities of interest
There are two primary quantities of interest:
1. The cumulative incidence function (CIF) equal to the probability of observing an event of interest by time \(t\)
Use the splitrule logrankCR (the default)
Plot the values for obj$cif
Use minimal depth for importance
2. The cause-specific hazard function equal to the instantaneous risk of event of interest \(j\) occurring at time \(t\)
Use the splitrule logrank and option cause=j
Use option importance='permute' to obtain importance and extract column \(j\) of obj$importance
Competing risk example: PBC Mayo Clinic
First 10 rows of the pbc data:

time  status  trt       age  sex  ascites  hepato  spiders  edema  bili  chol  albumin  copper  alk.phos     ast  trig  platelet  protime
 400       2    1  58.76523    f        1       1        1    1.0  14.5   261     2.60     156    1718.0  137.95   172       190     12.2
4500       0    1  56.44627    f        0       1        1    0.0   1.1   302     4.14      54    7394.8  113.52    88       221     10.6
1012       2    1  70.07255    m        0       0        0    0.5   1.4   176     3.48     210     516.0   96.10    55       151     12.0
1925       2    1  54.74059    f        0       1        1    0.5   1.8   244     2.54      64    6121.8   60.63    92       183     10.3
1504       1    2  38.10541    f        0       1        1    0.0   3.4   279     3.53     143     671.0  113.15    72       136     10.9
2503       2    2  66.25873    f        0       1        0    0.0   0.8   248     3.98      50     944.0   93.00    63        NA     11.0
1832       0    2  55.53457    f        0       1        0    0.0   1.0   322     4.09      52     824.0   60.45   213       204      9.7
2466       2    2  53.05681    f        0       0        0    0.0   0.3   280     4.00      52    4651.2   28.38   189       373     11.0
2400       2    1  42.50787    f        0       0        1    0.0   3.2   562     3.08      79    2276.0  144.15    88       251     11.0
  51       2    2  70.55989    f        1       0        1    1.0  12.6   200     2.74     140     918.0  147.25   143       302     11.5
data(pbc, package = "survival")
pbc$id <- NULL  ## remove the ID

## status at endpoint, 0/1/2 for censored, transplant, dead
pbc.cr <- pbc
Competing risk example: PBC Mayo Clinic
## canonical example: default Gray's splitting-rule
o <- rfsrc(Surv(time, status) ~ ., pbc)
o
                         Sample size: 276
                    Number of events: 18, 111
                     Number of trees: 500
           Forest terminal node size: 15
       Average no. of terminal nodes: 13.118
No. of variables tried at each split: 5
              Total no. of variables: 17
       Resampling used to grow trees: swor
    Resample size used to grow trees: 174
                            Analysis: RSF
                              Family: surv-CR
                      Splitting rule: logrankCR *random*
       Number of random split points: 10
   (OOB) Requested performance error: 0.18408881, 0.1666207
Tip
rfsrc automatically recognizes a competing risk problem when using the survival formula
Event-specific and non-event-specific variable selection
## canonical example: default Gray's splitting-rule
## status: 0/1/2 for censored, transplant, dead
o <- rfsrc(Surv(time, status) ~ ., pbc)

## log-rank splitting where each event type is treated as the event of interest
## log-rank cause-1 specific splitting and targeted VIMP for cause 1 = "transplant"
o.log1 <- rfsrc(Surv(time, status) ~ ., pbc,
                splitrule = "logrank", cause = 1, importance = "permute")

## log-rank cause-2 specific splitting and targeted VIMP for cause 2 = "death"
o.log2 <- rfsrc(Surv(time, status) ~ ., pbc,
                splitrule = "logrank", cause = 2, importance = "permute")

## extract minimal depth from the Gray split forest: non-event specific
## extract VIMP from the log-rank forests: event-specific
var.perf <- data.frame(md = max.subtree(o)$order[, 1],
                       vimp1 = 100 * o.log1$importance[, 1],
                       vimp2 = 100 * o.log2$importance[, 2])
print(var.perf[order(var.perf$md), ], digits = 2)
## cumulative incidence function (CIF) for transplant and dead stratified by edema
cif <- o$cif.oob
Time <- o$time.interest
edema <- o$xvar$edema
cif.transplant <- cbind(apply(cif[,,1][edema == 0, ], 2, mean),
                        apply(cif[,,1][edema > 0, ], 2, mean))
cif.dead <- cbind(apply(cif[,,2][edema == 0, ], 2, mean),
                  apply(cif[,,2][edema > 0, ], 2, mean))
matplot(Time, cbind(cif.transplant, cif.dead), type = "l",
        lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3, ylab = "CIF")
legend("topleft",
       legend = c("Transplant (No Edema)", "Transplant (Edema)",
                  "Dead (No Edema)", "Dead (Edema)"),
       lty = c(1,2,1,2), col = c(4, 4, 2, 2), lwd = 3, cex = 1.1)
The run.rfsrc function for an overview
## tip!!! perform the same analysis as above with one line
run.rfsrc(Surv(time, status) ~ ., pbc, ntree = 1000, alpha = .10)
Partial plot
using “target” to specify the event of interest
pbc.cr$time <- pbc.cr$time / 365
o <- rfsrc(Surv(time, status) ~ ., pbc.cr)

## target = an integer value between 1 and J indicating the event of interest
plot.variable(o, target = 1, xvar.names = "age", partial = TRUE, smooth.lines = TRUE)
## target = an integer value between 1 and J indicating the event of interest
## where J is the number of event types
plot.variable(o, target = 1, xvar.names = "age", partial = TRUE, smooth.lines = TRUE)
plot.variable(o, target = 2, xvar.names = "age", partial = TRUE, smooth.lines = TRUE)
Multivariate analysis
Family                 Example Grow Call with Formula Specification
Regression             rfsrc(Ozone~., data = airquality)
Quantile Regression    quantreg(mpg~., data = mtcars)
Classification         rfsrc(Species~., data = iris)
Imbalanced Two-Class   imbalanced(status~., data = breast)
Study the effects of five diet treatments on 21 liver lipids and 120 hepatic gene expressions in wild-type and PPAR-alpha deficient mice. We regress the lipid expressions (multivariate y-responses) on gene expression, diet, and genotype (x-variables) [4]
## diet and genotype are factors
head(nutrigenomic$diet)
> [1] lin  sun  sun  fish ref  coc 
> Levels: coc fish lin ref sun
head(nutrigenomic$genotype)
> [1] wt wt wt wt wt wt
> Levels: ppar wt
Multivariate example: Nutrigenomic Study
## parse into y and x data
ydta <- nutrigenomic$lipids
xdta <- data.frame(nutrigenomic$genes,
                   diet = nutrigenomic$diet,
                   genotype = nutrigenomic$genotype)

## multivariate regression forest call
o <- rfsrc(get.mv.formula(colnames(ydta)),
           data.frame(ydta, xdta),
           importance = TRUE, nsplit = 10,
           splitrule = "mahalanobis")
> print(o)
                         Sample size: 40
                     Number of trees: 500
           Forest terminal node size: 5
       Average no. of terminal nodes: 4.776
No. of variables tried at each split: 41
              Total no. of variables: 122
              Total no. of responses: 21
         User has requested response: C14.0
       Resampling used to grow trees: swor
    Resample size used to grow trees: 25
                            Analysis: mRF-R
                              Family: regr+
                      Splitting rule: mahalanobis *random*
       Number of random split points: 10
                     (OOB) R squared: 0.13227364
   (OOB) Requested performance error: 0.55613339
Multivariate example: Nutrigenomic Study
## acquire the error rate for each of the 21 coordinates
## standardize to allow for comparison across coordinates
serr <- get.mv.error(o, standardize = TRUE)
Split rules
default composite (independence) splitting
o2 <- rfsrc(get.mv.formula(colnames(ydta)),
            data.frame(ydta, xdta),
            importance = TRUE, nsplit = 10)
> print(o2)
                         Sample size: 40
                     Number of trees: 500
           Forest terminal node size: 5
       Average no. of terminal nodes: 4.45
No. of variables tried at each split: 41
              Total no. of variables: 122
              Total no. of responses: 21
         User has requested response: C14.0
       Resampling used to grow trees: swor
    Resample size used to grow trees: 25
                            Analysis: mRF-R
                              Family: regr+
                      Splitting rule: mv *random*
       Number of random split points: 10
                     (OOB) R squared: 0.38963538
   (OOB) Requested performance error: 0.39118801
Split rules
compare VIMP under the two split rules
## compare standardized VIMP for top 25 variables
imp <- data.frame(mahalanobis = rowMeans(get.mv.vimp(o, standardize = TRUE)),
                  default = rowMeans(get.mv.vimp(o2, standardize = TRUE)))
Missing data can be imputed at various stages of the analysis
During training using na.action="na.impute" in rfsrc
When predicting data using na.action="na.impute" in predict
Prior to any analysis using impute
There are two methods used for imputing missing data
OTFI (on-the-fly imputation): simultaneously impute data while growing the tree
missForest: grow a forest using each variable in turn as the \(y\) outcome and predict its missing values
General call to impute
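A minimal sketch of the three entry points, assuming the randomForestSRC interface (the `mf.q` option for the multivariate missForest-style cycles is an assumption here; check the package documentation for its exact form):

```r
## three ways to handle missing values with randomForestSRC
library(randomForestSRC)
data(pbc, package = "survival")

## (1) impute while training
o <- rfsrc(Surv(time, status) ~ ., pbc, na.action = "na.impute")

## (2) stand-alone imputation prior to analysis (OTFI)
pbc.otfi <- impute(data = pbc)

## (3) missForest-style: mf.q groups variables into multivariate responses
pbc.mf <- impute(data = pbc, mf.q = 0.25)
```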
OTFI
Only non-missing data are used to compute split statistics. Daughter node assignment for cases with missing data is made by sampling from the non-missing in-bag data
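The daughter-assignment step can be illustrated with a toy split variable (the values and split point below are made up for illustration):

```r
## toy illustration of the OTFI daughter-assignment step: a case with a
## missing split variable is sent left or right by drawing a replacement
## value from the non-missing values of that variable
set.seed(1)
x <- c(2.1, 3.5, NA, 1.2, 4.8, NA)   # split variable with missing values
split.value <- 3.0                   # assumed split point

imputed <- x
miss <- is.na(x)
imputed[miss] <- sample(x[!miss], sum(miss), replace = TRUE)

goes.left <- imputed <= split.value  # every case now has a daughter node
```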
For datasets with a large number of variables, instead of running \(p\) separate regressions in each missForest cycle, variables can be grouped and imputed jointly via multivariate regression
4. Martin PG, Guillou H, Lasserre F, Déjean S, Lan A, Pascussi J-M, et al. Novel aspects of PPAR\(\alpha\)-mediated regulation of lipid and xenobiotic metabolism revealed through a nutrigenomic study. Hepatology. 2007;45:767–77.