Data Mining 1: New Year Analyses


Devvar−Netvar Statistic

In 2005, a data mining assessment of seven years of data around the New Year transition was undertaken to develop a new hypothesis to test on future data. Peter Bancel explored variations on previous analyses with the goal of optimizing two recipes whose structure parallels that of the original analyses done in previous years. This page shows some of the work that forms the background for the eventual choices for the Devvar−Netvar analysis.

The following two plots show several measures for 37 timezones and for the *complement* of the 14 timezones. These plots and all the preceding ones use scaled chi-squares for the cumulative deviations (cumdevs) of the netvar. The cumdevs do NOT use converted zscores. This is no problem for comparisons, but you can't read off zscores by dividing the deviations by Sqrt(Nsecs).

37 timezones. First row, Moving Averages (4-min smooth). Second row, Cumulative Deviations.

[Figure: New Year Covar Analysis]

23 timezones (Complement to 14 heavily populated zones). First row, Moving Averages (4-min smooth). Second row, Cumulative Deviations.

[Figure: New Year Covar Analysis]
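The two rows in the plots above (moving average and cumulative deviation) can be sketched as follows. This is a minimal illustration, not the code used for the plots: the function names are hypothetical, the smoothing is assumed to be a plain boxcar mean, and the per-second values are assumed to be chi-square-like with an expected value of 1 (the page does not specify the scaling).

```python
from itertools import accumulate


def moving_average(values, window):
    """Boxcar mean over each run of `window` consecutive samples."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        # Slide the window one sample to the right.
        running += values[i] - values[i - window]
        out.append(running / window)
    return out


def cumulative_deviation(values, expected=1.0):
    """Running sum of (value - expected); `expected=1.0` is an assumption."""
    return list(accumulate(v - expected for v in values))
```

For per-second data, a 4-minute smooth corresponds to `window=240`. Note that an unscaled chi-square with one degree of freedom has variance 2 rather than 1, so the standard deviation of its cumulative deviation after N seconds is Sqrt(2 N). That is one way to see why dividing by Sqrt(Nsecs) does not yield a zscore for chi-square-based cumdevs.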

The next two plots show the deviations of netvar and devvar, averaged over 7 years, for the 14-timezone set and the 13-timezone set, respectively. In either plot, the difference between the two measures would make a good test statistic on an interval from -7 to +2 minutes around midnight. The statistic is the difference of the zscores for the netvar and devvar, for each second.

[Figure: New Year Covar Analysis]

[Figure]

These may be compared with a similar plot using 37 zones.

[Figure]
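The per-second statistic described in this section can be sketched as below. This is only an illustration under stated assumptions: the function name and arguments are hypothetical, the inputs are taken to be per-second zscores that have already been computed for devvar and netvar, the sign convention (devvar minus netvar) is taken from the section title, and the window is treated as half-open, from -7 minutes up to but not including +2 minutes.

```python
def devvar_netvar_difference(z_devvar, z_netvar, midnight_index,
                             start_min=-7, end_min=2):
    """Per-second z_devvar - z_netvar inside the window around midnight.

    `midnight_index` is the position of midnight in both per-second series.
    """
    if len(z_devvar) != len(z_netvar):
        raise ValueError("series must have the same length")
    lo = midnight_index + start_min * 60
    hi = midnight_index + end_min * 60
    if lo < 0 or hi > len(z_devvar):
        raise ValueError("window extends beyond the data")
    # Assumed sign convention: devvar minus netvar.
    return [d - n for d, n in zip(z_devvar[lo:hi], z_netvar[lo:hi])]
```

With the default window the result holds 540 values, one per second. How those values are combined into a single test result is not specified here, so the sketch stops at the per-second differences.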

Smoothed Covar Statistic

The second of our two newly defined New Year analyses looks at the epoch average of the smoothed Covar statistic. This is almost exactly the same as the smoothed variance analysis used in previous years. We looked at the "data-mining" assessment of the previous seven years of data and determined that a good choice would be the same 13 timezones as used in the Devvar−Netvar analysis, with the epochs defined as in previous years, namely the 10 minutes centered on midnight, and a smoothing window of 4 minutes, as before.

The test statistic is also the same as previously used. A permutation analysis (10,000 permutations) provides a distribution for the product of the magnitude of the deviation at the minimum of the smoothed curve and the proximity of that minimum to midnight. The min*prox measure for this year is compared to that distribution. The following figure shows all eight years, including 2006, with the 10-minute epoch averaged over the 13 zones as well as the original 37 zones. A complement to the 13 zones is also shown in the middle figure.

[Figure: New Year Covar Analysis]
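A rough sketch of a min*prox permutation test is given below. Several details are assumptions rather than the procedure actually used: the smoothing is a centered boxcar mean truncated at the edges, proximity is taken as 1 at the center of the epoch falling linearly to 0 at its ends, and the permutation shuffles the order of the per-second values within the epoch. All function names are hypothetical.

```python
import random


def smooth_centered(values, window):
    """Centered boxcar mean; the window is truncated at the series edges."""
    half = window // 2
    n = len(values)
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    out = []
    for i in range(n):
        lo = max(0, i - half)
        hi = min(n, i + half + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out


def min_prox(deviations, window):
    """Depth of the smoothed minimum times its closeness to the epoch center."""
    if len(deviations) < 2:
        raise ValueError("need at least two samples")
    curve = smooth_centered(deviations, window)
    center = (len(curve) - 1) / 2
    i_min = min(range(len(curve)), key=curve.__getitem__)
    magnitude = abs(curve[i_min])
    # Assumed form of proximity: linear falloff from the center (midnight).
    proximity = 1.0 - abs(i_min - center) / center
    return magnitude * proximity


def permutation_p_value(deviations, window, n_perm=10_000, seed=0):
    """Fraction of shuffled series whose min*prox reaches the observed value."""
    rng = random.Random(seed)
    observed = min_prox(deviations, window)
    shuffled = list(deviations)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # assumed null: time order carries no structure
        if min_prox(shuffled, window) >= observed:
            hits += 1
    return observed, hits / n_perm
```

For a 10-minute epoch of per-second data, `deviations` would hold 600 values and a 4-minute smooth would use `window=240`.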

We expect to do further assessments, for example, looking at the impact of different smoothing windows and different smoothing algorithms. The next figure shows the result of smoothing with a Gaussian convolution. This has the advantage of weighting low frequencies more than high frequencies, effectively reducing the high-frequency noise. Comparisons of variations on this theme using the previous data should give us a good general prediction for subsequent years.

[Figure: New Year Covar Analysis]
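Smoothing by Gaussian convolution can be sketched as follows. The kernel width `sigma`, the truncation at four standard deviations, and the renormalization at the edges are illustrative assumptions, not the settings used for the figure, and the function names are hypothetical.

```python
import math


def gaussian_kernel(sigma, truncate=4.0):
    """Normalized Gaussian weights out to `truncate` standard deviations."""
    radius = int(truncate * sigma + 0.5)
    weights = [math.exp(-0.5 * (k / sigma) ** 2)
               for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def gaussian_smooth(values, sigma, truncate=4.0):
    """Convolve with a Gaussian; edge weights are renormalized over valid samples."""
    kernel = gaussian_kernel(sigma, truncate)
    radius = len(kernel) // 2
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        wsum = 0.0
        for k, w in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < n:
                acc += w * values[j]
                wsum += w
        out.append(acc / wsum)
    return out
```

A constant series passes through unchanged, while a rapidly alternating series is strongly attenuated away from the edges, which is the low-pass behavior described above. A boxcar moving average, by contrast, has a frequency response with side lobes, so it lets some high-frequency content through.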

