Categorical Data Sets for Outlier Detection
A lack of benchmark data sets is a major bottleneck for outlier detection research. Some efforts have been made to provide widely used outlier detection data sets, e.g., the collections at http://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/ and http://odds.cs.stonybrook.edu/ . However, these collections contain little or no categorical data. To complement them, we provide the categorical data sets used in our previous papers [1,2].
A summary of the 15 data sets and their complexity evaluation results is presented as follows. BM, APAS, AD, CMC, SF, R10, CT and LINK are acronyms for Bank Marketing, aPascal, Internet Advertisements, Contraceptive Method Choice, Solar Flare, Reuters10, CoverType and Linkage, respectively. The data sets are ordered by their average rank, which is given in the last column.
Data sources: aPascal, CelebA and Reuters10 are available at http://vision.cs.uiuc.edu/attributes/, http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html and http://sci2s.ugr.es/keel/, respectively. w7a is available at https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/. The other 11 data sets are available at http://archive.ics.uci.edu/ml/.
These data sets are also used in the journal extension of our IJCAI-16 paper [1], which will be made available soon. The four data complexity indicators are defined in that extension.
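As a rough illustration of how a labelled categorical data set of this kind might be used to benchmark a detector, the sketch below scores each object by the rarity of its feature values and then measures how well those scores separate outliers from inliers with AUC. Everything in it is an assumption made for illustration: the toy rows, the 0/1 label layout, and the function names `frequency_scores` and `roc_auc` are hypothetical, and the frequency-based baseline is a generic stand-in, not the method of [1] or [2].

```python
from collections import Counter


def frequency_scores(rows):
    """Score each row by the mean rarity (1 - relative frequency) of its values."""
    n = len(rows)
    n_features = len(rows[0])
    # One value-frequency table per feature (column).
    counts = [Counter(row[j] for row in rows) for j in range(n_features)]
    return [
        sum(1.0 - counts[j][row[j]] / n for j in range(n_features)) / n_features
        for row in rows
    ]


def roc_auc(scores, labels):
    """AUC as the probability that a random outlier outscores a random inlier.

    Ties between an outlier and an inlier count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: three categorical features, label 1 marks an outlier.
rows = [
    ("red", "small", "yes"),
    ("red", "small", "yes"),
    ("red", "large", "yes"),
    ("red", "small", "no"),
    ("blue", "large", "no"),
]
labels = [0, 0, 0, 0, 1]

scores = frequency_scores(rows)
auc = roc_auc(scores, labels)
print(scores, auc)
```

In practice the rows and labels would be read from one of the downloaded files instead of being written inline; the file layout is not described here, so the loading step is left out.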
All 15 data sets can be downloaded by clicking here. Please cite the following two papers if you use these data sets in your work.
References:
[1] Pang, G., Cao, L., & Chen, L. (2016). Outlier detection in complex categorical data by modelling the feature value couplings. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (Vol. 2016, pp. 9-15).
[2] Pang, G., Cao, L., Chen, L., & Liu, H. (2016). Unsupervised feature selection for outlier detection by modelling hierarchical value-feature couplings. In Proceedings of the 2016 IEEE 16th International Conference on Data Mining (ICDM) (pp. 410-419). IEEE.