Data Mining to Develop New Year Analyses


Devvar−Netvar Statistic

In 2005, a data mining assessment of seven years of data around the New Year transition was undertaken to develop a new hypothesis to test on future data. Two recipes were devised, which differ in detail but are parallel to the original analyses done in previous years. A more detailed look at the process is available.

First we address a rough equivalent to the previous "standard analysis", which was an epoch average of the cumulative deviation of the Chi-square for 10 minutes centered on midnight in each of 37 timezones. The new statistic is a difference between two measures. One of these is the Devvar (device variance), which our retrospective analysis shows usually has a positive deviation. The second measure is the Netvar cumulative (the Stouffer Z²), which tends to have a negative deviation. (See the examples shown in the second figure below.)
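As a rough illustration of the two measures, the sketch below computes a per-second Netvar and Devvar from a table of device z-scores and accumulates their deviations from expectation. The function names, the input layout, and the normalization of Devvar by device count (so that both measures have expectation 1) are assumptions made for illustration; they are not taken from the project's actual code.

```python
import math
from itertools import accumulate

def netvar(z_scores):
    """Squared Stouffer Z across devices for one second (expectation 1 under the null)."""
    stouffer_z = sum(z_scores) / math.sqrt(len(z_scores))
    return stouffer_z ** 2

def devvar(z_scores):
    """Mean squared device z-score for one second.
    Assumption: divided by device count so its expectation is also 1."""
    return sum(z * z for z in z_scores) / len(z_scores)

def cumulative_deviation(values, expected=1.0):
    """Running sum of (value - expected)."""
    return list(accumulate(v - expected for v in values))

# Hypothetical data: 3 seconds of z-scores from 4 devices
seconds = [
    [1.0, -1.0, 1.0, -1.0],
    [2.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]
netvar_curve = cumulative_deviation([netvar(s) for s in seconds])
devvar_curve = cumulative_deviation([devvar(s) for s in seconds])
difference = [d - n for d, n in zip(devvar_curve, netvar_curve)]
```

In this toy data the devices agree strongly in the last second, so the Netvar curve rises while the Devvar curve stays flat; the data described in the text show the opposite tendency.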

We determined that an optimal recipe would be an epoch average over 13 timezones of the difference of two cumulative deviation curves (some other choices are almost as good, and differences are in the noise). The optimal time period appears to be from 7 minutes before midnight to 1 minute after. The following figure shows the Devvar−Netvar statistic for different timezone groups and for several durations, as indicated in the legend. Our "optimized" choice is simply the highest point on the graph. It is easy to see, however, that many of the options would have been equally good, given the statistical noise inherent in these data.
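The recipe can be sketched as follows, assuming one sample per second, per-timezone series aligned so that the same index corresponds to local midnight in every zone, and an expectation of 1 for both measures. The helper names and the data layout are hypothetical.

```python
from itertools import accumulate

def epoch_average(series_by_zone):
    """Average aligned per-second series across timezones (epochs)."""
    n_zones = len(series_by_zone)
    return [sum(vals) / n_zones for vals in zip(*series_by_zone)]

def window_slice(series, midnight_index, start_min=-7, end_min=1):
    """Samples from start_min to end_min relative to midnight (1 sample per second assumed)."""
    lo = midnight_index + start_min * 60
    hi = midnight_index + end_min * 60
    return series[lo:hi]

def devvar_minus_netvar(devvar_by_zone, netvar_by_zone, midnight_index, expected=1.0):
    """Difference of the two epoch-averaged cumulative deviation curves over the window."""
    dev = window_slice(epoch_average(devvar_by_zone), midnight_index)
    net = window_slice(epoch_average(netvar_by_zone), midnight_index)
    dev_cum = accumulate(d - expected for d in dev)
    net_cum = accumulate(n - expected for n in net)
    return [d - n for d, n in zip(dev_cum, net_cum)]
```

The last element of the returned curve is the terminal value of the difference over the 8-minute window; comparing that value across timezone groups and durations is one way to reproduce the kind of comparison described above.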

[Figure: New Year Covar Analysis]

The next figure shows the two measures and their difference as cumulative deviations over a half-hour period surrounding midnight, for several timezone groupings. (Two of the groupings are labeled "14 timezones"; the first includes a few incorrect zones.) In the right column, the complementary set of timezones is shown. A smooth probability envelope gives an idea of the significance of the cumulative deviations. The envelope starts at the beginning of the optimal period at −7 minutes (480 on the x-axis), and the end is at +1 minute (900 on the x-axis). Our choice for subsequent formal analyses is the difference curve, shown in black, for the period −7 to +1 minute relative to midnight in the thirteen specified zones.


Smoothed Covar Statistic

The second of our two newly defined New Year analyses looks at the epoch average of the smoothed Covar statistic. This is almost exactly the same as the smoothed variance analysis used in previous years. We reviewed the "data-mining" assessment of the previous seven years of data and determined that a good choice would be the same 13 timezones as used in the Devvar−Netvar analysis, with the epochs defined as in previous years, namely the 10 minutes centered on midnight, and a smoothing window of 4 minutes, as before. The test statistic is also the same as previously used. A permutation analysis (10,000 permutations) provides a distribution for the product of the magnitude of the deviation at the minimum of the smoothed curve and the proximity of that minimum to midnight. The min*prox measure for this year is compared to that distribution. The following figure shows all eight years, including 2006, with the 10-minute epoch averaged over 13 zones as well as the original 37 zones. A complement to the 13 zones is also shown in the middle figure.
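A minimal sketch of a min*prox permutation test is given below. The text does not spell out the exact definitions, so several choices here are assumptions: the smoothing is a plain moving average over full windows, the magnitude is the depth of the smoothed minimum below an expectation of 1, the proximity falls linearly from 1 at the center of the epoch to 0 at its edges, and the null distribution comes from shuffling the order of the samples. All function names are hypothetical.

```python
import random

def moving_average(values, window):
    """Mean of each full window, computed from prefix sums."""
    prefix = [0.0]
    for v in values:
        prefix.append(prefix[-1] + v)
    return [(prefix[i + window] - prefix[i]) / window
            for i in range(len(values) - window + 1)]

def min_prox(values, window, expected=1.0):
    """Depth of the smoothed minimum times its closeness to the epoch center."""
    smoothed = moving_average(values, window)
    i_min = min(range(len(smoothed)), key=smoothed.__getitem__)
    magnitude = expected - smoothed[i_min]          # assumed definition
    center = (len(smoothed) - 1) / 2                # midnight assumed at the center
    proximity = 1.0 - abs(i_min - center) / center  # assumed: linear, 1 at center
    return magnitude * proximity

def permutation_p_value(values, window, n_perm=10000, seed=0):
    """Fraction of shuffled series whose min*prox is at least the observed value."""
    rng = random.Random(seed)
    observed = min_prox(values, window)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if min_prox(shuffled, window) >= observed:
            hits += 1
    return observed, hits / n_perm
```

With one sample per second, the 10-minute epoch and 4-minute smoothing window described above would correspond to 600 values and `window=240`.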

[Figure: New Year Covar Analysis]

