Detecting changes in high frequency data streams, with applications
In recent years, problems relating to the analysis of data streams have become widespread. A data stream is a collection of time-ordered observations x1, x2, ... generated from the random variables X1, X2, .... It is assumed that the observations are univariate and independent, and that they arrive in discrete time. Unlike traditional sequential analysis problems considered by statisticians, the size of a data stream is not assumed to be fixed, and new observations may be received over time.
The rate at which these observations are received can be very high, perhaps several thousand every second. Computational efficiency is therefore very important, and methods used for analysis must be able to cope with potentially huge data sets. This paper is concerned with the task of detecting whether a data stream contains a change point, and extends traditional methods for sequential change detection to the streaming context. We focus on two different settings of the change point problem. The first is nonparametric change detection where, in contrast to most of the existing literature, we assume that nothing is known about either the pre- or post-change stream distribution. The task is then to detect a change from an unknown base distribution F0 to an unknown distribution F1. Further, we impose the constraint that change detection methods must have a bounded rate of false positives, which is important when assessing the significance of discovered change points. It is this constraint that makes the nonparametric problem difficult. We present several novel methods for this problem, and compare their performance via extensive experimental analysis. The second strand of our research is Bernoulli change detection, with application to streaming classification. In this setting, we assume a parametric form for the stream distribution, but one where both the pre- and post-change parameters are unknown. The task is again to detect changes while controlling the rate of false positives. After developing two diffe [...]