Modern statistical data analysis is predominantly model-driven, seeking to decompose an observed data distribution in terms of major underlying descriptive features modified by some stochastic variation. A large part of data mining is also concerned with this exercise. However, another
fundamental part of data mining is concerned with detecting anomalies amongst the vast mass of the data: the small deviations, unusual observations, unexpected clusters of observations, or surprising blips in the data, which the model does not explain. We call such anomalies patterns. For
sound reasons, which are outlined in the paper, the data mining community has tended to focus on the algorithmic aspects of pattern discovery, and has not developed any general underlying theoretical base. However, such a base is important for any technology: it helps to steer the direction
in which the technology develops, as well as serving to provide a basis from which algorithms can be compared, and to indicate which problems are the important ones waiting to be solved. This paper attempts to provide such a theoretical base, linking the ideas to statistical work in spatial
epidemiology, scan statistics, outlier detection, and other areas. One of the striking characteristics of work on pattern discovery is that the ideas have been developed in several theoretical arenas, and also in several application domains, with little apparent awareness of the fundamentally
common nature of the problem. Like model building, pattern discovery is fundamentally an inferential activity, and is an area in which statisticians can make very significant contributions.
No Reference information available - sign in for access.
No Citation information available - sign in for access.
No Supplementary Data.
No Article Media