TY - JOUR
T1 - A new feature selection algorithm for two-class classification problems and application to endometrial cancer
AU - Ahsen, M. Eren
AU - Singh, Nitin K.
AU - Boren, Todd
AU - Vidyasagar, M.
AU - White, Michael A.
PY - 2012
Y1 - 2012
AB - In this paper, we introduce a new algorithm for feature selection for two-class classification problems, called ℓ1-StaR. The algorithm consists of first extracting the statistically relevant features using the Student t-test, and then passing the reduced feature set to an ℓ1-norm support vector machine (SVM) with recursive feature elimination (RFE). The final number of features chosen by the ℓ1-StaR algorithm can be smaller than the number of samples, unlike with ℓ1-norm regression where the final number of features is bounded below by the number of samples. The algorithm is illustrated by applying it to the problem of determining which endometrial cancer patients are at risk of having the cancer spreading to their lymph nodes. The data consisted of 1,428 micro-RNAs measured on a data set of 94 patient samples (divided evenly between those with lymph node metastasis and those without). Using the algorithm, we identified a subset of just 15 micro-RNAs and a linear classifier based on these, that achieved two-fold cross validation accuracies in excess of 80%, and combined accuracy, sensitivity and specificity in excess of 93%.
UR - http://www.scopus.com/inward/record.url?scp=84874285158&partnerID=8YFLogxK
U2 - 10.1109/CDC.2012.6426819
DO - 10.1109/CDC.2012.6426819
M3 - Conference article
AN - SCOPUS:84874285158
SN - 0743-1546
SP - 2976
EP - 2982
JO - Proceedings of the IEEE Conference on Decision and Control
JF - Proceedings of the IEEE Conference on Decision and Control
M1 - 6426819
T2 - 51st IEEE Conference on Decision and Control, CDC 2012
Y2 - 10 December 2012 through 13 December 2012
ER -