ePCR is an R package intended for the survival analysis of advanced prostate cancer. This document is a basic introduction to the functionality of ePCR and a general overview of the possible analysis workflows for clinical trial or hospital registry cohorts. The approach builds ensembles of penalized Cox regression models; this ensemble method, named ePCR, was the top-performing approach in the DREAM 9.5 Prostate Cancer Challenge (Guinney et al., 2017).
The latest version of ePCR is available in the Comprehensive R Archive Network (CRAN). CRAN mirrors are available by default in an R installation, and the ePCR package can be installed using the R terminal command:
install.packages("ePCR")
This should prompt the user to select a nearby CRAN mirror, after which ePCR and its dependencies are installed automatically. After the install.packages call, the ePCR package can be loaded with the command:
library("ePCR")
The following notation is used in the document: R commands, package names and function names are written in typewriter font. The notation pckgName::funcName indicates that the function funcName is called from the package pckgName; this form is used prominently in the underlying R code because of package namespaces. This document, as well as other useful PDFs, can be inspected using the browseVignettes function, which is available for any package in R.
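As a small illustration of the pckgName::funcName notation, the sketch below calls two functions through their namespaces. It uses only packages shipped with base R, and the input vector is made up for the example.

```r
# Namespace-qualified calls: package name, two colons, function name.
x <- c(2, 4, 4, 10)   # made-up input

m <- stats::median(x)   # 'median' taken explicitly from the 'stats' package
s <- base::sum(x)       # 'sum' taken explicitly from the 'base' package

# The qualified form refers to the same function object as the bare name,
# as long as no other attached package masks that name.
same <- identical(stats::median, median)

print(c(median = m, sum = s))
print(same)
```

The qualified form is mainly useful when two attached packages export the same name, as happens with plot when ePCR is attached.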
The ePCR package is provided with two example hospital registry datasets. They are derived from confidential hospital registry cohorts, to which kernel density estimates were fitted. Illustrative virtual patients were then generated from the kernel estimates and are provided here in the example datasets. Please see the accompanying ePCR publication for further details on the two Turku University Hospital cohorts (Laajala et al., 2018), and the Synapse site for DREAM 9.5 PCC for accessing the original DREAM data (Guinney, Wang, Laajala et al., 2017). The example datasets can be loaded into an R session using:
##
## Attaching package: 'ePCR'
## The following object is masked from 'package:graphics':
##
## plot
## The following object is masked from 'package:base':
##
## plot
# Kernel density simulated patients from Turku University Hospital (TYKS)
# Data consists of TEXT cohort (text-search found patients)
# and MEDI (patients identified using medication and few keywords)
data(TYKSSIMU)
# The following data matrices x and survival responses y become available
head(xTEXTSIMU); head(yTEXTSIMU)
## BMI HEIGHTBL WEIGHTBL ALP ALT AST CA CREAT
## TEXTSIMU1 27.16556 172 83.0 4.852030 3.044522 3.401197 2.305 3.951244
## TEXTSIMU2 27.16556 176 83.0 4.442651 3.258097 3.401197 2.310 4.644391
## TEXTSIMU3 29.35235 168 91.2 4.304065 2.708050 3.401197 2.305 4.394449
## TEXTSIMU4 24.80000 176 83.0 4.442651 2.944439 3.218876 2.330 4.465908
## TEXTSIMU5 27.20000 176 83.0 5.129899 2.944439 3.401197 2.310 3.891820
## TEXTSIMU6 27.16556 176 83.0 4.564348 1.609438 3.401197 2.305 4.204693
## HB LDH NEU PLT PSA TBILI TESTO WBC
## TEXTSIMU1 11.3 5.265247 1.128171 323 3.4657359 2.197225 -0.1743534 2.001480
## TEXTSIMU2 12.6 5.265247 1.329710 216 4.6051702 2.197225 -0.1743534 2.332144
## TEXTSIMU3 13.5 5.265247 2.187174 83 3.8712010 2.197225 -0.1743534 1.856298
## TEXTSIMU4 12.7 5.273000 2.551006 189 0.3364722 3.135494 0.3364722 2.186051
## TEXTSIMU5 12.3 5.265247 1.329710 298 6.6720329 2.197225 -0.1743534 2.041220
## TEXTSIMU6 15.4 5.265247 1.329710 237 3.6505739 2.197225 -0.1743534 1.435085
## CREACL NA. MG PHOS ALB TPRO RBC LYM BUN
## TEXTSIMU1 3.549617 137 -0.210721 0.1397619 34.8 67 4.830 0.3364722 2.475973
## TEXTSIMU2 3.549617 141 -0.210721 0.1397619 34.8 67 4.830 0.3364722 2.475973
## TEXTSIMU3 3.549617 135 -0.210721 0.1397619 29.5 67 4.185 0.3364722 2.397895
## TEXTSIMU4 3.549617 140 -0.210721 0.1397619 34.8 67 3.620 0.3364722 2.475973
## TEXTSIMU5 3.549617 140 -0.210721 0.1397619 34.8 67 4.120 0.3364722 2.475973
## TEXTSIMU6 3.549617 142 -0.210721 0.1397619 34.8 67 3.780 0.3364722 2.475973
## CCRC GLU SYSTOLICBP DIASTOLICBP PULSE HEMAT SPEGRA LYMperLEU
## TEXTSIMU1 3.703478 1.824549 136 76 72 0.43 0 24
## TEXTSIMU2 3.703478 1.840550 142 64 72 0.45 0 22
## TEXTSIMU3 3.703478 1.856298 111 76 72 0.38 0 22
## TEXTSIMU4 3.703478 1.856298 128 76 72 0.38 0 22
## TEXTSIMU5 3.703478 1.757858 142 76 69 0.38 0 22
## TEXTSIMU6 3.703478 1.856298 151 76 72 0.38 0 22
## MONO MONOperLEU NEUperLEU POT BASOperLEU EOS EOSperLEU TARGET
## TEXTSIMU1 0.62 9 63 4.1 1 0.17 0 0
## TEXTSIMU2 0.62 9 63 4.1 0 0.17 1 0
## TEXTSIMU3 0.62 9 63 4.1 0 0.19 2 0
## TEXTSIMU4 0.62 9 63 4.9 0 0.17 2 0
## TEXTSIMU5 0.62 9 63 3.7 0 0.17 2 0
## TEXTSIMU6 0.62 9 63 3.7 0 0.17 2 0
## LYMPH_NODES KIDNEYS LUNGS LIVER PLEURA OTHER PROSTATE ORCHIDECTOMY
## TEXTSIMU1 0 0 0 0 0 0 0 1
## TEXTSIMU2 0 0 0 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0 0 0 0
## TEXTSIMU4 0 0 0 0 0 1 0 0
## TEXTSIMU5 0 0 0 1 0 0 0 0
## TEXTSIMU6 1 0 0 0 0 1 0 0
## PROSTATECTOMY LYMPHADENECTOMY BILATERAL_ORCHIDECTOMY
## TEXTSIMU1 1 0 1
## TEXTSIMU2 0 0 0
## TEXTSIMU3 0 0 0
## TEXTSIMU4 0 0 0
## TEXTSIMU5 0 0 0
## TEXTSIMU6 0 0 0
## PRIOR_RADIOTHERAPY ANALGESICS ANTI_ANDROGENS GLUCOCORTICOID
## TEXTSIMU1 1 0 0 0
## TEXTSIMU2 1 1 0 1
## TEXTSIMU3 1 0 0 0
## TEXTSIMU4 0 0 0 0
## TEXTSIMU5 0 0 0 1
## TEXTSIMU6 1 0 0 0
## GONADOTROPIN BISPHOSPHONATE CORTICOSTEROID IMIDAZOLE ACE_INHIBITORS
## TEXTSIMU1 0 0 0 0 0
## TEXTSIMU2 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0
## TEXTSIMU4 0 0 0 0 0
## TEXTSIMU5 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0
## BETA_BLOCKING HMG_COA_REDUCT ESTROGENS ANTI_ESTROGENS CEREBACC CHF
## TEXTSIMU1 0 0 0 0 0 0
## TEXTSIMU2 0 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0 0
## TEXTSIMU4 0 0 0 0 0 1
## TEXTSIMU5 0 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0 0
## DVT DIAB MI PULMEMB SPINCOMP COPD MHBLOOD MHCARD MHCONGEN MHEAR
## TEXTSIMU1 0 0 0 0 0 0 0 1 0 0
## TEXTSIMU2 0 0 0 0 0 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0 0 0 0 0 0
## TEXTSIMU4 0 1 0 0 0 0 0 0 0 0
## TEXTSIMU5 0 0 0 0 0 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0 0 0 0 0 1
## MHENDO MHGASTRO MHHEPATO MHIMMUNE MHINFECT MHINJURY MHINVEST MHMETAB
## TEXTSIMU1 0 0 0 0 0 1 0 0
## TEXTSIMU2 0 1 0 0 0 0 0 0
## TEXTSIMU3 1 1 0 0 1 0 0 0
## TEXTSIMU4 0 1 0 0 0 0 0 0
## TEXTSIMU5 0 0 0 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0 0 0 0
## MHPSYCH MHRENAL MHRESP MHSKIN MHVASC ECOG_C AGEGRP2 RaceAsian
## TEXTSIMU1 0 0 0 0 0 0 2 0
## TEXTSIMU2 0 0 0 0 1 0 0 0
## TEXTSIMU3 0 0 0 0 0 0 1 0
## TEXTSIMU4 0 0 0 0 0 0 1 0
## TEXTSIMU5 0 0 0 0 0 0 1 0
## TEXTSIMU6 0 0 0 0 0 0 2 0
## RaceBlack RaceOther RaceWhite RegionAsia RegionEastEuro
## TEXTSIMU1 0 0 0 0 0
## TEXTSIMU2 0 0 0 0 0
## TEXTSIMU3 0 0 0 0 0
## TEXTSIMU4 0 0 0 0 0
## TEXTSIMU5 0 0 0 0 0
## TEXTSIMU6 0 0 0 0 0
## RegionNorthAmer RegionSouthAmer RegionWestEuro
## TEXTSIMU1 0 0 0
## TEXTSIMU2 0 0 0
## TEXTSIMU3 0 0 0
## TEXTSIMU4 0 0 0
## TEXTSIMU5 0 0 0
## TEXTSIMU6 0 0 0
## DEATH LKADT_P surv
## TEXTSIMU1 1 342 342
## TEXTSIMU2 0 360 360+
## TEXTSIMU3 1 682 682
## TEXTSIMU4 0 1067 1067+
## TEXTSIMU5 1 113 113
## TEXTSIMU6 0 1246 1246+
## BMI HEIGHTBL WEIGHTBL ALP ALT AST CA CREAT
## MEDISIMU1 28.04282 175 90 5.093750 2.708050 3.349750 1.99 4.488636
## MEDISIMU2 26.57313 176 60 5.017280 3.091042 3.258097 2.41 4.174387
## MEDISIMU3 28.39506 165 65 4.418841 3.332205 3.349750 2.41 4.077537
## MEDISIMU4 24.57787 176 107 5.003946 3.295837 3.349750 2.33 4.634729
## MEDISIMU5 30.58581 188 73 4.158883 2.484907 3.367296 2.34 4.234107
## MEDISIMU6 25.18079 174 86 4.564348 4.882802 3.349750 2.33 4.499810
## HB LDH NEU PLT PSA TBILI TESTO WBC
## MEDISIMU1 10.9 5.327876 1.2149127 186 6.194405 1.386294 -0.08338161 1.609438
## MEDISIMU2 13.3 5.327876 0.7030975 156 2.163323 1.609438 -0.08338161 2.041220
## MEDISIMU3 11.8 5.327876 1.0952734 126 3.713572 1.609438 -0.99425227 1.871802
## MEDISIMU4 13.1 5.327876 0.4946962 217 3.555348 1.791759 -0.08338161 1.568616
## MEDISIMU5 15.3 5.327876 1.1939225 221 3.367296 2.079442 0.78845736 1.704748
## MEDISIMU6 12.8 5.327876 1.9892433 386 3.610918 1.791759 -1.56064775 1.824549
## CREACL NA. MG PHOS ALB TPRO RBC LYM BUN
## MEDISIMU1 0 140 -0.1923903 0.09531018 36.65 68.5 3.91 0.1823216 1.722767
## MEDISIMU2 0 142 -0.1923903 -0.02020271 33.60 68.5 4.28 0.8878913 1.722767
## MEDISIMU3 0 144 -0.1923903 -0.02020271 36.65 68.5 4.62 0.4946962 1.722767
## MEDISIMU4 0 142 -0.1923903 -0.02020271 36.65 69.0 4.35 -0.2744368 1.722767
## MEDISIMU5 0 143 -0.1923903 -0.06187540 36.65 68.5 4.05 0.4946962 1.722767
## MEDISIMU6 0 137 -0.1923903 -0.02020271 36.65 68.5 4.78 0.5128236 1.722767
## CCRC GLU SYSTOLICBP DIASTOLICBP PULSE HEMAT SPEGRA LYMperLEU
## MEDISIMU1 3.800105 1.435085 141.5 77 68 0.37 0 29
## MEDISIMU2 3.746038 1.871802 107.0 77 58 0.34 0 29
## MEDISIMU3 3.800105 1.916923 141.5 77 71 0.43 0 29
## MEDISIMU4 3.800105 1.871802 126.0 90 71 0.40 0 29
## MEDISIMU5 3.800105 1.589235 141.5 77 71 0.35 0 28
## MEDISIMU6 3.800105 1.791759 188.0 77 88 0.38 0 29
## MONO MONOperLEU NEUperLEU POT BASOperLEU EOS EOSperLEU TARGET
## MEDISIMU1 0.60 11 56.5 4.4 0 0.17 3 0
## MEDISIMU2 0.60 11 56.5 4.5 1 0.17 3 0
## MEDISIMU3 0.60 11 56.5 3.7 0 0.17 7 0
## MEDISIMU4 0.88 11 56.5 4.1 0 0.17 3 0
## MEDISIMU5 0.60 11 56.5 4.6 0 0.17 3 0
## MEDISIMU6 0.60 11 56.5 4.0 0 0.17 3 0
## LYMPH_NODES KIDNEYS LUNGS LIVER PLEURA OTHER PROSTATE ORCHIDECTOMY
## MEDISIMU1 0 0 0 0 0 1 0 0
## MEDISIMU2 0 0 0 0 0 0 0 0
## MEDISIMU3 0 0 0 0 0 0 0 0
## MEDISIMU4 1 0 0 0 0 0 0 1
## MEDISIMU5 0 0 0 0 0 1 0 0
## MEDISIMU6 0 0 0 0 0 0 0 0
## PROSTATECTOMY LYMPHADENECTOMY BILATERAL_ORCHIDECTOMY
## MEDISIMU1 0 0 0
## MEDISIMU2 0 0 0
## MEDISIMU3 1 0 0
## MEDISIMU4 0 0 0
## MEDISIMU5 0 0 0
## MEDISIMU6 0 0 0
## PRIOR_RADIOTHERAPY ANALGESICS ANTI_ANDROGENS GLUCOCORTICOID
## MEDISIMU1 1 0 0 0
## MEDISIMU2 1 1 0 0
## MEDISIMU3 1 0 1 1
## MEDISIMU4 0 1 1 1
## MEDISIMU5 1 0 1 1
## MEDISIMU6 1 0 1 1
## GONADOTROPIN BISPHOSPHONATE CORTICOSTEROID IMIDAZOLE ACE_INHIBITORS
## MEDISIMU1 0 1 1 0 0
## MEDISIMU2 0 0 1 0 0
## MEDISIMU3 0 0 1 0 0
## MEDISIMU4 0 0 1 0 0
## MEDISIMU5 0 0 0 0 0
## MEDISIMU6 0 0 1 0 0
## BETA_BLOCKING HMG_COA_REDUCT ESTROGENS ANTI_ESTROGENS CEREBACC CHF
## MEDISIMU1 0 1 0 0 0 0
## MEDISIMU2 0 0 0 0 0 0
## MEDISIMU3 1 0 0 0 0 0
## MEDISIMU4 0 0 0 0 0 0
## MEDISIMU5 0 0 0 0 0 0
## MEDISIMU6 1 0 0 0 0 0
## DVT DIAB MI PULMEMB SPINCOMP COPD MHBLOOD MHCARD MHCONGEN MHEAR
## MEDISIMU1 0 0 0 0 0 0 0 1 0 0
## MEDISIMU2 0 0 0 0 0 0 0 1 0 1
## MEDISIMU3 1 0 0 0 0 0 0 1 0 1
## MEDISIMU4 0 1 0 0 0 0 0 0 0 0
## MEDISIMU5 0 0 0 0 0 0 0 0 0 0
## MEDISIMU6 0 1 0 0 0 0 0 0 0 0
## MHENDO MHGASTRO MHHEPATO MHIMMUNE MHINFECT MHINJURY MHINVEST MHMETAB
## MEDISIMU1 0 1 0 0 1 0 0 0
## MEDISIMU2 0 0 0 0 0 1 0 1
## MEDISIMU3 0 0 0 0 0 0 0 0
## MEDISIMU4 0 0 0 0 0 0 0 1
## MEDISIMU5 0 0 0 0 0 0 0 0
## MEDISIMU6 0 0 0 0 0 0 0 0
## MHPSYCH MHRENAL MHRESP MHSKIN MHVASC ECOG_C AGEGRP2 RaceAsian
## MEDISIMU1 0 0 0 0 0 0 2 0
## MEDISIMU2 0 0 0 0 0 0 1 0
## MEDISIMU3 0 0 0 0 0 0 2 0
## MEDISIMU4 0 0 0 1 0 0 2 0
## MEDISIMU5 0 1 0 0 0 0 0 0
## MEDISIMU6 0 0 0 0 0 0 2 0
## RaceBlack RaceOther RaceWhite RegionAsia RegionEastEuro
## MEDISIMU1 0 0 0 0 0
## MEDISIMU2 0 0 0 0 0
## MEDISIMU3 0 0 0 0 0
## MEDISIMU4 0 0 0 0 0
## MEDISIMU5 0 0 0 0 0
## MEDISIMU6 0 0 0 0 0
## RegionNorthAmer RegionSouthAmer RegionWestEuro
## MEDISIMU1 0 0 0
## MEDISIMU2 0 0 0
## MEDISIMU3 0 0 0
## MEDISIMU4 0 0 0
## MEDISIMU5 0 0 0
## MEDISIMU6 0 0 0
## DEATH LKADT_P surv
## MEDISIMU1 0 89 89+
## MEDISIMU2 1 754 754
## MEDISIMU3 1 783 783
## MEDISIMU4 0 159 159+
## MEDISIMU5 0 1322 1322+
## MEDISIMU6 1 200 200
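The example datasets were described above as virtual patients drawn from kernel density estimates. The following is a minimal univariate sketch of that general idea, not the procedure actually used to create TYKSSIMU: the helper simulate_from_kde, the Gaussian kernel, the default bandwidth rule and the input values are all assumptions made for illustration.

```r
# Draw virtual observations from a Gaussian kernel density estimate.
# Sampling from a Gaussian KDE is equivalent to picking an observed value
# at random and adding normal noise whose sd equals the bandwidth.
simulate_from_kde <- function(x, n, bw = stats::bw.nrd0(x)) {
  centers <- sample(x, size = n, replace = TRUE)
  centers + stats::rnorm(n, mean = 0, sd = bw)
}

set.seed(1)
# Made-up stand-in for a confidential measurement (e.g. a log-scale lab value)
observed <- stats::rnorm(200, mean = 3.5, sd = 0.8)
virtual  <- simulate_from_kde(observed, n = 200)

# The virtual sample should resemble the original in location and spread
print(round(c(mean_obs = mean(observed), mean_sim = mean(virtual),
              sd_obs = stats::sd(observed), sd_sim = stats::sd(virtual)), 2))
```

A realistic multivariate dataset with binary and continuous columns would need a joint treatment of the variables, which this sketch does not attempt.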
It is important to distinguish between the PSP and PEP objects, which represent a single penalized Cox regression model and an ensemble of Cox regression models, respectively. PSP objects are penalized/regularized Cox regression models fitted to a particular dataset by exploring its {λ, α} parameter space. Notice that the sequence of λ depends on α ∈ [0, 1]. The regularized/penalized fitting procedure in ePCR is provided by the glmnet package (Simon et al., 2011), although custom cross-validation and other supporting functionality are provided independently.
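For reference, the elastic net objective explored over {λ, α} can be written as below. The symbols ℓ (the Cox log partial likelihood), β (the coefficient vector) and n (the number of observations) are notation introduced here, and the constant scaling the likelihood term is an assumption that may differ from the implementation.

```latex
\hat{\beta}(\lambda, \alpha) = \arg\min_{\beta}
\left\{ -\frac{1}{n}\,\ell(\beta)
+ \lambda \left[ \frac{1-\alpha}{2}\,\lVert \beta \rVert_2^2
+ \alpha\,\lVert \beta \rVert_1 \right] \right\}
```

Because the L1 term is weighted by α, the smallest λ that shrinks every coefficient to zero grows as α decreases, which is one reason the λ sequence has to be chosen separately for each α.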
After fitting suitable candidate PSP objects (Penalized Single Predictors), these are aggregated into the ensemble structure PEP (Penalized Ensemble Predictor). The key input to the PEP constructor is the set of PSP objects intended for use in the ensemble. We will start off by introducing the fine-tuning and fitting of PSPs. For this purpose, the generic S4 class constructor new is called with its main parameter indicating that we wish to construct a PSP object.
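For readers unfamiliar with S4, the sketch below shows the general pattern of constructing an object with new and reading its slots. The class ToyPredictor and its slot values are hypothetical; it is not part of ePCR and only mimics the mechanics.

```r
library(methods)

# Hypothetical toy class, NOT part of ePCR: two numeric slots
setClass("ToyPredictor",
         representation(alphaseq = "numeric", optimum = "numeric"))

# new() takes the class name first, followed by values for the slots
toy <- new("ToyPredictor",
           alphaseq = seq(from = 0, to = 1, length.out = 6),
           optimum  = c(Alpha = 1, Lambda = 0.25))

print(slotNames(toy))   # names of the slots defined for the class
print(toy@optimum)      # individual slots are accessed with the @ operator
```

The same mechanics apply to PSP and PEP objects: slotNames lists what is stored, and @ retrieves a slot.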
The key attributes provided for the PSP constructor are the following parameters (see ?'PSP-class' in R for further documentation):
- x: The input data matrix, where rows correspond to patients and columns to potential predictors.
- y: The Surv-class response vector as required by Cox regression and glmnet in survival prediction.
- seeds: An integer vector or a single value for setting the random seed for cross-validation. Setting this is highly recommended for reproducibility. If multiple seed integers are provided, the cross-validation is conducted separately for each. This smooths the cross-validation surface, but multiplies the computational time required to fit a model.
- score: The scoring function utilized in evaluating the generalization ability of the fitted model in cross-validation; readily implemented scoring functions include score.iAUC and score.cindex, but custom scoring functions are also allowed.
- alphaseq: Sequence of α values. At the extreme ends, α = 1 is LASSO regression and α = 0 is Ridge Regression; α ∈ ]0, 1[ is generally referred to as Elastic Net. Notice that LASSO and Ridge Regression have noticeably different characteristics, as they utilize only the L1 and L2 norms, respectively; for example, a Ridge Regression model will never have its coefficients exactly zero. Furthermore, for co-linear predictors LASSO tends to pick a single one, while Ridge Regression picks multiple ones and spreads the overall effect over these predictors. Depending on the ultimate prediction purpose, one may prefer one or the other and can tailor alphaseq to suit their needs. By default we suggest utilizing an evenly spaced alphaseq over [0, 1], at least for a preliminary search.
- nlambda: Number of λ values tested for each corresponding α. By default glmnet suggests 100 values, which are picked from a feasible range between a model including all coefficients and a converged model where no further penalization is possible.
- folds: Number of folds in the cross-validation (minimum 3; the maximum, equal to the number of observations, corresponds to leave-one-out CV).

For the sake of the example, we will construct an ePCR model ensemble that consists of two PSP objects: one from the medication-curated cohort and the other from the text-search cohort. We will leave out a small portion of the medication and text-search patients as a small test set, to later evaluate the generalization ability of the ensemble. Notice, however, that this is not a proper evaluation, as the test patients are not from an independent source and therefore give an optimistic view of the generalization capability of the model(s).
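The qualitative difference between LASSO and Ridge Regression noted in the parameter list can be seen in a simplified setting. The sketch below uses the closed-form shrinkage rules for a least-squares problem with an orthonormal design, not Cox regression; the coefficient values and the penalty strength are made up.

```r
# Closed-form shrinkage for an orthonormal design (least-squares setting,
# used only to illustrate the qualitative difference; not Cox regression)
lasso_shrink <- function(b, lambda) sign(b) * pmax(abs(b) - lambda, 0)
ridge_shrink <- function(b, lambda) b / (1 + lambda)

b_ols  <- c(2.0, -0.8, 0.3, -0.1)   # made-up unpenalized coefficients
lambda <- 0.5                        # made-up penalty strength

res <- rbind(ols   = b_ols,
             lasso = lasso_shrink(b_ols, lambda),
             ridge = ridge_shrink(b_ols, lambda))
print(round(res, 3))

# An evenly spaced alpha grid between Ridge (0) and LASSO (1)
alphaseq <- seq(from = 0, to = 1, length.out = 6)
print(alphaseq)
```

In this toy case LASSO sets the two smallest coefficients exactly to zero, while Ridge Regression scales every coefficient down but leaves all of them non-zero.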
testset <- 1:30
# Medication cohort fit
# Leaving out patients into a separate test set using negative indices
psp_medi <- new("PSP",
    # Input data matrix x (example data loaded previously)
    x = xMEDISIMU[-testset,],
    # Response vector, 'surv'-object
    y = yMEDISIMU[-testset,"surv"],
    # Seeds for reproducibility
    seeds = c(1,2),
    # If user wishes to run the CV binning multiple times,
    # this is possible by averaging over them for smoother CV heatmap.
    cvrepeat = 2,
    # Using the concordance-index as prediction accuracy in CV
    score = score.cindex,
    # Alpha sequence
    alphaseq = seq(from=0, to=1, length.out=6),
    # Using glmnet's default nlambda of 100
    nlambda = 100,
    # Running the nominal 10-fold cross-validation
    folds = 10,
    # x.expand slot is a function that would allow interaction terms
    # For the sake of the simplicity we will consider identity function
    x.expand = function(x) { as.matrix(x) }
)
## --- Initializing new PSP object ---
##
## --- Cross-validation ( 10 -folds) repeat run 1 of 2 ---
##
## [1] "alpha 0"
## [1] "alpha 0.2"
## [1] "alpha 0.4"
## [1] "alpha 0.6"
## [1] "alpha 0.8"
## [1] "alpha 1"
## --- Cross-validation ( 10 -folds) repeat run 2 of 2 ---
##
## [1] "alpha 0"
## [1] "alpha 0.2"
## [1] "alpha 0.4"
## [1] "alpha 0.6"
## [1] "alpha 0.8"
## [1] "alpha 1"
## --- Computing AUCs for regularization curves for coefficients ---
##
## --- Generating feature list and dictionary ---
##
## --- New PSP object successfully created ---
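The log above shows two repeats of 10-fold cross-validation. The sketch below illustrates how a seed can make such fold assignments reproducible; assign_folds is a hypothetical helper and not ePCR's internal binning, and the pairing of one seed with each repeat is an assumption.

```r
# Assign n observations to k roughly equal folds, reproducibly via a seed
assign_folds <- function(n, k, seed) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

n     <- 120        # e.g. 120 training patients
seeds <- c(1, 2)    # assumed here: one seed per cross-validation repeat
fold_runs <- lapply(seeds, function(s) assign_folds(n, k = 10, seed = s))

print(table(fold_runs[[1]]))   # fold sizes in the first repeat

# The same seed reproduces the same split; a different seed gives another one
print(identical(fold_runs[[1]], assign_folds(n, k = 10, seed = 1)))
print(identical(fold_runs[[1]], fold_runs[[2]]))
```

Averaging cross-validation scores over several such splits reduces the dependence of the result on any single random partition, at the cost of proportionally more model fits.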
The parameters for the second PSP are similar to those above. Notice that the user can tailor multiple parameters separately for each PSP member to best suit its data.
# Text run similar to above
# Leaving out patients into a separate test set using negative indices
psp_text <- new("PSP",
    x = xTEXTSIMU[-testset,],
    y = yTEXTSIMU[-testset,"surv"],
    seeds = c(3,4),
    cvrepeat = 2,
    score = score.cindex,
    alphaseq = seq(from=0, to=1, length.out=6),
    nlambda = 100,
    folds = 10,
    x.expand = function(x) { as.matrix(x) }
)
## --- Initializing new PSP object ---
##
## --- Cross-validation ( 10 -folds) repeat run 1 of 2 ---
##
## [1] "alpha 0"
## [1] "alpha 0.2"
## [1] "alpha 0.4"
## [1] "alpha 0.6"
## [1] "alpha 0.8"
## [1] "alpha 1"
## --- Cross-validation ( 10 -folds) repeat run 2 of 2 ---
##
## [1] "alpha 0"
## [1] "alpha 0.2"
## [1] "alpha 0.4"
## [1] "alpha 0.6"
## [1] "alpha 0.8"
## [1] "alpha 1"
## --- Computing AUCs for regularization curves for coefficients ---
##
## --- Generating feature list and dictionary ---
##
## --- New PSP object successfully created ---
## PSP ePCR object
## N observations: 120
## Optimal alpha: 1
## Optimal lambda: 0.2578574
## Optimal lambda index: 1
# Plot the CV-surface of the fitted PSP:
plot(psp_medi,
# Showing only every 10th row and column name (propagated to heatcv-function)
by.rownames=10, by.colnames=10,
# Adjust main title and tilt the bias of the color key legend (see ?heatcv)
main="C-index CV for psp_medi", bias=0.2)
Notably, the cross-validation surfaces suggest different optimal penalization parameters for the two ensemble members. This most likely stems from systematic differences between the two cohorts; the ePCR methodology offers an ensemble-driven approach to account for such differences between patient substrata.
plot(psp_text,
# Showing only every 10th row and column name (propagated to heatcv-function)
by.rownames=10, by.colnames=10,
# Adjust main title and tilt the bias of the color key legend (see ?heatcv)
main="C-index CV for psp_text", bias=0.2)
In addition to providing the CV-grid, the identified optimal parameters are available for downstream analyses:
## Alpha AlphaIndex Lambda LambdaIndex
## 1.0000000 6.0000000 0.2578574 1.0000000
## Alpha AlphaIndex Lambda LambdaIndex
## 1.0000000 6.0000000 0.4396716 1.0000000
## [1] "description" "features" "strata" "alphaseq" "cvfolds"
## [6] "nlambda" "cvmean" "cvmedian" "cvstdev" "cvmin"
## [11] "cvmax" "score" "cvrepeat" "impute" "optimum"
## [16] "seed" "x" "x.expand" "y" "fit"
## [21] "criterion" "dictionary" "regAUC"
Once the PSP objects have been constructed, they are aggregated into the corresponding Penalized Ensemble Predictor (PEP). PEP objects aggregate PSP objects from various data slices or optimization criteria, and create an ensemble predictor that averages over the provided single predictors. As such, the most important input is the list of desired PSP objects:
pep_tyks <- new("PEP",
# The main input is the list of PSP objects
PSPs = list(psp_medi, psp_text)
)
# These PSPs were constructed using the example code above.
pep_tyks
## Penalized Ensemble Predictor
## Count of PSPs: 2
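To make the idea of averaging over single predictors concrete, the sketch below combines made-up risk scores from two models. The rank-based rescaling and the helper rank_scale are assumptions for illustration; the aggregation actually performed by a PEP object may differ.

```r
# Made-up risk scores from two single models for the same five patients
pred_medi <- c(0.2, 1.1, -0.4, 0.9, 0.0)
pred_text <- c(0.5, 0.7, -0.1, 1.3, -0.6)

# Put each model's scores on a common scale before combining; ranks are one
# simple choice that is insensitive to differences in coefficient magnitude
rank_scale <- function(p) rank(p) / length(p)

# The ensemble score is the per-patient mean over the rescaled predictions
ensemble <- rowMeans(cbind(rank_scale(pred_medi), rank_scale(pred_text)))
print(ensemble)
```

Rescaling before averaging keeps a model with large coefficients from dominating the ensemble simply because its scores span a wider range.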
# Conduct naive test set evaluation
xtest <- rbind(xMEDISIMU[testset,], xTEXTSIMU[testset,])
ytest <- rbind(yMEDISIMU[testset,], yTEXTSIMU[testset,])
# Perform survival prediction based on the PEP-ensemble we've created
xpred <- predict(pep_tyks, newx=as.matrix(xtest), type="ensemble")
# Construct a survival object using the Surv-class
ytrue <- Surv(time = ytest[,"surv"][,"time"], event = ytest[,"surv"][,"status"])
# Test c-index between our constructed ensemble prediction and true response
tyksscore <- score.cindex(pred = xpred, real = ytrue)
print(paste("TYKS example c-index:", round(tyksscore, 4)))
## [1] "TYKS example c-index: 0.5"
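The c-index used above measures how often the predicted risk orders pairs of patients correctly. The following naive base R sketch spells out Harrell's definition for right-censored data; cindex_naive is a hypothetical helper written for illustration and is not the implementation behind score.cindex. The six follow-up times and event indicators are the first rows of yTEXTSIMU shown earlier.

```r
# Harrell's concordance index for right-censored data, written out naively.
# pred: risk score (higher = shorter expected survival)
# time, status: follow-up time and event indicator (1 = event, 0 = censored)
cindex_naive <- function(pred, time, status) {
  concordant <- 0
  comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      # A pair is comparable when patient i has an observed event
      # strictly before patient j's follow-up time
      if (status[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (pred[i] > pred[j]) {
          concordant <- concordant + 1
        } else if (pred[i] == pred[j]) {
          concordant <- concordant + 0.5   # ties in prediction count as half
        }
      }
    }
  }
  concordant / comparable
}

time   <- c(342, 360, 682, 1067, 113, 1246)
status <- c(1, 0, 1, 0, 1, 0)

c_perfect  <- cindex_naive(pred = -time, time, status)      # ideal ordering
c_constant <- cindex_naive(pred = rep(0, 6), time, status)  # no information
print(c(perfect = c_perfect, constant = c_constant))
```

Under this tie convention a constant prediction scores exactly 0.5, so a c-index of 0.5 can arise when a heavily penalized model assigns every patient the same risk.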
The ePCR R package comes with readily fitted ePCR ensembles from the work by Guinney, Wang, Laajala et al. (2017) as well as from hospital registry cohorts. Due to data confidentiality issues, the original data matrices and responses are not provided in the S4 objects (although normally they would be in the slots @x and @y, respectively).
To gain access to the original data by Guinney et al., the processed data can be accessed as raw .csv files or R workspaces at the corresponding Synapse workspace.
Accessing the Turku University Hospital registry cohort requires a research permit and users are encouraged to contact the Center for Clinical Informatics ([email protected]) for further information.
Although the original data matrices are not provided, the ensemble model fits and their coefficients as a function of {λ, α} are fully functional. They are therefore suitable for conducting predictions for future patients or for studying effects within the estimated models/ensembles. These model objects can be loaded in ePCR using:
## [1] "PEP"
## attr(,"package")
## [1] ".GlobalEnv"
## [1] "PEP"
## attr(,"package")
## [1] "ePCR"
The DREAM S4 object is the top-performing mCRPC OS-predicting ensemble from Guinney et al., while the TYKS models are fitted to the original Turku University Hospital cohorts. These model objects can be used for prediction similarly to the novel S4 PEP object created in the sections above. As an example, if we apply the DREAM model, trained on controlled clinical trials, to the TYKS hospital registry patients, the OS prediction can be conducted using:
# Create a DREAM-matching data input matrix from our xtest and the full data matrix
xtemp <- conforminput(DREAM, xtest)
# Predict survival for our hospital registry example dataset
dreampred <- predict(DREAM,
# Providing full new data and average prediction over the ensemble members
newx=xtemp, type="ensemble",
# Defining that we don't want any further data matrix feature extraction
# The call to conforminput above already formatted the input data
x.expand = as.matrix
)
Notice that we utilize the helper function conforminput for feature extraction/creation, as multiple interaction variables were introduced in the original DREAM data matrix and the dimensions would otherwise not match in the regression task.
The following error message is quite commonly encountered when first applying pre-built models to new data:
Error in newx %*% nbeta : Cholmod error ‘X and/or Y have wrong dimensions’ at file ../MatrixOps/cholmod_sdmult.c, line 90
It is prompted by the glmnet package’s C/Fortran implementation when the β coefficients do not conform to the dimensions of the provided new data matrix X. To avoid it, the new data should be processed to have the same number of columns (variables) as the model expects, using functions such as conforminput or the S4 slot x.expand of a PEP object.
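The dimension requirement can be illustrated with a small base R sketch. The helper align_columns and the feature names passed to it are hypothetical; conforminput does more than this, for example creating the interaction variables mentioned above.

```r
# Hypothetical helper, not part of ePCR: make a new data matrix carry exactly
# the columns a fitted model expects, in the same order
align_columns <- function(newx, expected) {
  newx <- as.matrix(newx)
  missing <- setdiff(expected, colnames(newx))
  if (length(missing) > 0) {
    # Columns the model needs but the new data lacks are filled with NA,
    # so that they stay visible and can be handled explicitly
    filler <- matrix(NA_real_, nrow = nrow(newx), ncol = length(missing),
                     dimnames = list(rownames(newx), missing))
    newx <- cbind(newx, filler)
  }
  # Drop extra columns and reorder to the expected layout
  newx[, expected, drop = FALSE]
}

expected <- c("ALP", "HB", "PSA", "LDH")   # made-up model feature list
newx <- cbind(PSA = c(3.4, 4.6), HB = c(11.3, 12.6), BMI = c(27.2, 29.4))

aligned <- align_columns(newx, expected)
print(dim(aligned))
print(colnames(aligned))
```

Checking ncol of the new data against the number of model coefficients before calling predict is a quick way to catch this kind of mismatch early.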
# Test c-index between the DREAM ensemble prediction and TYKS true response
dreamscore <- score.cindex(pred = dreampred, real = ytrue)
print(paste("DREAM example c-index:", round(dreamscore, 4)))
## [1] "DREAM example c-index: 0.389"
## R version 4.4.2 (2024-10-31)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] survival_3.8-3 ePCR_0.11.0 rmarkdown_2.29
##
## loaded via a namespace (and not attached):
## [1] Matrix_1.7-2 glmnet_4.1-8 future.apply_1.11.3
## [4] jsonlite_1.8.9 compiler_4.4.2 Rcpp_1.0.14
## [7] parallel_4.4.2 jquerylib_0.1.4 globals_0.16.3
## [10] splines_4.4.2 yaml_2.3.10 fastmap_1.2.0
## [13] lattice_0.22-6 prodlim_2024.06.25 impute_1.81.0
## [16] Bolstad2_1.0-29 R6_2.6.0 shape_1.4.6.1
## [19] knitr_1.49 iterators_1.0.14 pec_2023.04.12
## [22] future_1.34.0 maketools_1.3.2 bslib_0.9.0
## [25] rlang_1.1.5 cachem_1.1.0 xfun_0.50
## [28] sass_0.4.9 sys_3.4.3 cli_3.6.3
## [31] hamlet_0.9.7 digest_0.6.37 foreach_1.5.2
## [34] grid_4.4.2 mvtnorm_1.3-3 lifecycle_1.0.4
## [37] lava_1.8.1 timereg_2.0.6 timeROC_0.4
## [40] evaluate_1.0.3 pracma_2.4.4 data.table_1.16.4
## [43] numDeriv_2016.8-1.1 listenv_0.9.1 codetools_0.2-20
## [46] buildtools_1.0.0 parallelly_1.42.0 tools_4.4.2
## [49] htmltools_0.5.8.1