ZERO++: Harnessing the Power of Zero Appearances to Detect Anomalies in Large-scale Data Sets

Abstract: ZERO++ is a new unsupervised anomaly detector which employs the number of zero appearances in subspaces to detect anomalies in categorical data. It is unique in that it works in regions of subspaces that are not occupied by data; whereas existing methods work in regions occupied by data. ZERO++ examines only a small number of low dimensional subspaces to successfully identify anomalies. Unlike existing frequency-based algorithms, ZERO++ does not involve subspace pattern searching. We show that ZERO++ is better than or comparable with the state-of-the-art anomaly detection methods over a wide range of real-world categorical and numeric data sets; and it is efficient with linear time complexity and constant space complexity which make it a suitable candidate for large-scale data sets.

Intuition: Here we provide two intuitive examples that zero appearances can be used to detect anomalies. The first example shows the fact that anomalies are more likely to have zero appearances in a subsample than normal instances. The second example intuits that, using zero appearances, only subspaces need to be examined to detect anomalies. 

Example 1: Given an univariate categorical data set with 1,000 instances, and there exists anomalies with at most 10 appearances and normal instances with at least 100 appearances. In a subset of 8 instances, randomly sampled without replacement from the data set, the expected probability of any anomaly having zero appearances in this subset is at least {1000-10 \choose 8}/{1000 \choose 8} \approx 0.9225; whereas that of any normal instance is at most {1000-100 \choose 8}/{1000 \choose 8} \approx 0.4291. As a result, having sampled multiple such subsets, the anomaly is likely to have zero appearances in significantly more subsets than normal instances.

Example 2: In a multivariate data set, the rarity characteristic of anomalies is usually manifested as zero appearances in subspaces. In other words, test instances, which appear in regions of subspaces that are not occupied by training instances, are likely to be anomalies.

Figure: A two-dimensional data set with 30 instances, where the size of the circle indicates the number of instances in a region; and the number in [ ] indicates the number of instances in one-dimensional subspace. The categorical values in A_{1} are {i,j,k,l} and the values in A_{2} are {q,r,s,t}.

An example of the zero appearances in subspaces for a simple two-dimensional subspace is shown in the above figure. Given the data distribution shown in the figure, regions of zero appearances in one-dimensional subspaces are S_{A_{1}l} and S_{A_{2}s} only, where S_{A_{x}y} denotes region A_{x}=y in the one-dimensional subspace of attribute A_{x}. Any test instance having either A_{1}=l or A_{2}=s is likely to be an anomaly. Since instances must have zero appearances in higher-dimensional subspaces of either A_{1}=l or A_{2}=s, we only need to examine regions in these one-dimensional subspaces.

ZERO++ harnesses the power of zero appearances in both subspaces and subsamples, described above, to detect anomalies. 

Source Code:

ZERO++ is implemented in WEKA (a widely used JAVA-based open-source data mining tool). Please download the code by clicking here.