A standard way to turn a continuous variable into a binary or categorical one is to dichotomize it at an appropriate breakpoint or threshold. Since the range and distribution of gene expression profiles vary, especially for methods applied transcriptome-wide, quantile-based dichotomization schemes are commonly used. A natural choice of breakpoint among data-derived quantiles is the median. This breakpoint is often used in situations that involve identifying diagnostic markers for clinical tests. An alternative is to use quartiles, such as the 25th percentile and 75th percentile. An advantage of this approach is that patient groups are defined by more extreme changes in gene expression, and therefore it may be easier to detect changes in survival time. However, 50% of patient samples are not included in the derivation of the predictive biomarker, so for datasets with small sample sizes this form of dichotomization may be more unstable. In either case, when relying on the median or quartiles as a threshold for dichotomization, one of the key concerns is the relative stability of these statistics, especially when highly variable data such as RNA-seq data are used.
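The two quantile-based schemes described above can be sketched as follows. This is a minimal illustration rather than the implementation used in this study: the function names, the `expression` values, and the choice of the "inclusive" quantile method are all assumptions made for the example.

```python
import statistics


def dichotomize_median(values):
    """Label each sample 'high' if it lies above the median, else 'low'."""
    cutoff = statistics.median(values)
    return ["high" if v > cutoff else "low" for v in values]


def dichotomize_quartiles(values):
    """Label samples at or below Q1 'low' and at or above Q3 'high'.

    Samples between Q1 and Q3 are labelled None, i.e. excluded.
    """
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    labels = []
    for v in values:
        if v <= q1:
            labels.append("low")
        elif v >= q3:
            labels.append("high")
        else:
            labels.append(None)
    return labels


# Hypothetical expression values for one gene across eight patients.
expression = [2.1, 3.4, 0.7, 5.9, 4.2, 1.5, 6.3, 2.8]

median_groups = dichotomize_median(expression)
quartile_groups = dichotomize_quartiles(expression)

print(median_groups)
print(quartile_groups)
print("excluded by quartile scheme:", quartile_groups.count(None))
```

In this toy example the quartile scheme leaves half of the samples unlabelled, which is the loss of sample size noted above.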
To investigate this issue in the context of this study, a gene was selected at random from TCGA as a representative example, and 500 bootstrap sets of 50 samples (gene expression measures) were generated for each of the four cancer types. The median, first quartile (Q1) and third quartile (Q3) were calculated from these bootstrapped datasets as a surrogate for having multiple datasets from which to investigate the stability of these threshold-based statistics (Supplemental Fig. 2). The variability of these threshold statistics differed significantly across the four cancer types (Levene’s test, P-value < 10⁻²⁵). These results demonstrate that thresholds based on these statistics are not always robust between different types of cancers. In particular, the variances of the median, Q1 and Q3 were larger in the head and neck cancer and ovarian cancer datasets than in the prostate and kidney cancer datasets. These results provide further support for the poor performance observed with the threshold-based dichotomization methods and suggest that it is potentially erroneous to apply them without thoroughly investigating the data a priori.
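The bootstrap procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the two cohorts are synthetic log-normal draws standing in for real expression data, and the function names, seeds, and quantile method are hypothetical choices, not those of the study.

```python
import random
import statistics


def bootstrap_thresholds(values, n_boot=500, sample_size=50, seed=0):
    """Resample with replacement and record Q1, median and Q3 of each resample."""
    rng = random.Random(seed)
    estimates = {"Q1": [], "median": [], "Q3": []}
    for _ in range(n_boot):
        resample = rng.choices(values, k=sample_size)
        q1, median, q3 = statistics.quantiles(resample, n=4, method="inclusive")
        estimates["Q1"].append(q1)
        estimates["median"].append(median)
        estimates["Q3"].append(q3)
    return estimates


def threshold_variance(estimates):
    """Sample variance of each threshold statistic across the bootstrap sets."""
    return {name: statistics.variance(vals) for name, vals in estimates.items()}


# Synthetic (hypothetical) expression values for two cohorts that differ
# only in dispersion; real data would come from the expression matrix.
rng = random.Random(42)
cohorts = {
    "low_dispersion": [rng.lognormvariate(2.0, 0.3) for _ in range(200)],
    "high_dispersion": [rng.lognormvariate(2.0, 1.2) for _ in range(200)],
}

results = {
    name: threshold_variance(bootstrap_thresholds(vals))
    for name, vals in cohorts.items()
}
for name, variances in results.items():
    print(name, {k: round(v, 3) for k, v in variances.items()})
```

To compare the spread formally, the per-cohort bootstrap estimates of a given statistic could be passed to a test of equal variances such as Levene's test (for example `scipy.stats.levene`, which is not used here so that the sketch needs only the standard library).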
This study set out to determine which method had the best performance for identifying gene expression-based prognostic biomarkers from cancer RNA-sequencing data. Out of eight methods, and using three different sets of assessments for accuracy, reliability, and robustness, the Cox regression had the best overall performance. For accuracy, the Cox regression, k-means, C-index and D-index had the strongest performance. The Cox regression had the most reliable performance, followed in second place by the k-means method. For robustness, the D-index had the strongest performance and the Cox regression method was the second most robust. A conclusion of this study is the recommendation against the use of methods that involve dichotomizing the gene expression data based on quantiles or the Kaplan-Scan method, as both performed poorly on our tests. It should be highlighted that the testing framework designed for our study was motivated by the task of performing an unbiased discovery or identification of candidate biomarkers from large-scale datasets. If a marker is identified from non-bioinformatics evidence, it may certainly be reasonable to then determine an appropriate cutoff to guide binary treatment decisions in a clinical setting. We have also shown that the number of highly differentially expressed genes in a cancer greatly influences the ability to predict markers, independent of the method employed. In the future, we hope to investigate robust methods and more sensitive approaches to handle the problem of identifying markers of survival in the presence of heterogeneous cancer data. Improving these kinds of techniques will hopefully pave the way to develop personalized and more accurate cancer diagnostic tests that have widespread generalizability to other patient populations.