TY - GEN
T1 - Enabling fast prediction for ensemble models on data streams
AU - Zhang, Peng
AU - Li, Jun
AU - Wang, Peng
AU - Gao, Byron J.
AU - Zhu, Xingquan
AU - Guo, Li
PY - 2011
Y1 - 2011
N2 - Ensemble learning has become a common tool for data stream classification, being able to handle large volumes of stream data and concept drifting. Previous studies focus on building accurate prediction models from stream data. However, a linear scan of a large number of base classifiers in the ensemble during prediction incurs significant costs in response time, preventing ensemble learning from being practical for many real world time-critical data stream applications, such as Web traffic stream monitoring, spam detection, and intrusion detection. In these applications, data streams usually arrive at a speed of GB/second, and it is necessary to classify each stream record in a timely manner. To address this problem, we propose a novel Ensemble-tree (E-tree for short) indexing structure to organize all base classifiers in an ensemble for fast prediction. On one hand, E-trees treat ensembles as spatial databases and employ an R-tree like height-balanced structure to reduce the expected prediction time from linear to sub-linear complexity. On the other hand, E-trees can automatically update themselves by continuously integrating new classifiers and discarding outdated ones, well adapting to new trends and patterns underneath data streams. Experiments on both synthetic and real-world data streams demonstrate the performance of our approach.
AB - Ensemble learning has become a common tool for data stream classification, being able to handle large volumes of stream data and concept drifting. Previous studies focus on building accurate prediction models from stream data. However, a linear scan of a large number of base classifiers in the ensemble during prediction incurs significant costs in response time, preventing ensemble learning from being practical for many real world time-critical data stream applications, such as Web traffic stream monitoring, spam detection, and intrusion detection. In these applications, data streams usually arrive at a speed of GB/second, and it is necessary to classify each stream record in a timely manner. To address this problem, we propose a novel Ensemble-tree (E-tree for short) indexing structure to organize all base classifiers in an ensemble for fast prediction. On one hand, E-trees treat ensembles as spatial databases and employ an R-tree like height-balanced structure to reduce the expected prediction time from linear to sub-linear complexity. On the other hand, E-trees can automatically update themselves by continuously integrating new classifiers and discarding outdated ones, well adapting to new trends and patterns underneath data streams. Experiments on both synthetic and real-world data streams demonstrate the performance of our approach.
KW - Concept drifting
KW - Ensemble learning
KW - Spatial indexing
KW - Stream classification
KW - Stream data mining
UR - http://www.scopus.com/inward/record.url?scp=80052653210&partnerID=8YFLogxK
U2 - 10.1145/2020408.2020442
DO - 10.1145/2020408.2020442
M3 - Conference contribution
AN - SCOPUS:80052653210
SN - 9781450308137
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 177
EP - 185
BT - Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'11
PB - Association for Computing Machinery
T2 - 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2011
Y2 - 21 August 2011 through 24 August 2011
ER -