In 2005, a data mining assessment
of seven years of data around the New Year transition
was undertaken to develop a new hypothesis to test on future data.
Peter Bancel explored variations on previous analyses with the goal of
optimizing two recipes whose structure parallels the
original analyses done in previous years. This page shows some of the
work forming the background for the eventual choices for the
Devvar−Netvar analysis.
The following two plots show several measures for 37
timezones and the *complement* of the 14 timezones.
These plots and all the preceding ones use scaled chisqrs
for the cumdevs of the netvar. The cumdevs do NOT use converted
zscores. This is no problem for comparisons, but you can't read
off zscores by dividing the deviations by Sqrt(Nsecs).
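The point about Sqrt(Nsecs) can be illustrated with a small sketch. It assumes, purely for illustration, that each per-second value is an unscaled chi-square with one degree of freedom (expectation 1, variance 2). The scaling actually used for these plots is not specified on this page, and the function names are hypothetical.

```python
import numpy as np

def cumulative_deviation(chisq, dof=1):
    """Running sum of (chi-square minus its expectation)."""
    return np.cumsum(np.asarray(chisq, dtype=float) - dof)

def approx_z_from_cumdev(cumdev, dof=1):
    """Approximate z for a chi-square cumdev (ASSUMES unscaled chi-squares).

    A chi-square with `dof` degrees of freedom has variance 2*dof, so the
    cumulative deviation after N seconds has standard deviation
    sqrt(2*dof*N), not sqrt(N).
    """
    cumdev = np.asarray(cumdev, dtype=float)
    n = np.arange(1, len(cumdev) + 1)
    return cumdev / np.sqrt(2 * dof * n)
```

Under that assumption, dividing by Sqrt(Nsecs) would overstate the deviation by a factor of Sqrt(2), and the result is only approximately normal for large N.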
37 timezones. First row, Moving Averages (4-min smooth). Second row,
Cumulative Deviations.
23 timezones (Complement to 14 heavily populated zones).
First row, Moving Averages (4-min smooth). Second row, Cumulative Deviations.
The next two plots show
the deviations of netvar and devvar, averaged over 7 years, for the
14-timezone set and the 13-timezone set,
respectively.
For either set, the difference between the two measures would make a good test
statistic on an interval from -7 to +2
minutes around midnight.
The statistic is the difference of the zscores for the
netvar and devvar, for each second.
These may be compared with a similar plot using 37 zones.
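A minimal sketch of the per-second difference statistic on the -7 to +2 minute interval follows. It assumes the per-second zscores for netvar and devvar are already available as arrays aligned to a time axis in seconds relative to midnight. The sign convention (devvar minus netvar) is inferred from the name of the analysis, and the function and parameter names are hypothetical.

```python
import numpy as np

def devvar_minus_netvar(z_devvar, z_netvar, t_seconds, start_min=-7, end_min=2):
    """Per-second zscore difference inside a window around midnight.

    t_seconds holds seconds relative to midnight (0 = midnight).
    The window is half-open: start_min <= t < end_min (in minutes).
    """
    z_devvar = np.asarray(z_devvar, dtype=float)
    z_netvar = np.asarray(z_netvar, dtype=float)
    t_seconds = np.asarray(t_seconds)
    in_window = (t_seconds >= start_min * 60) & (t_seconds < end_min * 60)
    diff = z_devvar - z_netvar  # ASSUMED sign: devvar minus netvar
    return t_seconds[in_window], diff[in_window]
```

For a time axis with one sample per second, the default window returns 540 values (9 minutes), which can then be summed or plotted as a cumulative deviation.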
Smoothed Covar Statistic
The second of our two newly defined New Year analyses
looks at the epoch
average of the smoothed Covar statistic. This is almost exactly the same
as the smoothed variance analysis used in previous years. We looked at
the "data-mining" assessment of the previous seven years of data
and determined that a good choice would be the same 13 timezones as used in the
Devvar−Netvar analysis, with the epochs defined as in previous years,
namely the 10 minutes centered on midnight, and a smoothing window of 4
minutes, as before. The test statistic is also the same as previously used. A
permutation analysis (10,000 permutations) provides a distribution for
the product of the magnitude of the deviation at the minimum of the
smoothed curve and the proximity of that minimum to midnight. The
min*prox measure for this year is compared to that distribution.
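The min*prox measure and its permutation distribution can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the input is taken to be an epoch-averaged series of deviations from expectation at one sample per second, proximity is assumed to fall linearly from 1 at midnight to 0 at the edge of the 10-minute epoch, and the null distribution is built by shuffling the order of the samples. All function names are hypothetical.

```python
import numpy as np

def moving_average(x, window):
    """Boxcar smooth; the output has the same length as the input."""
    kernel = np.ones(window) / window
    return np.convolve(x, kernel, mode="same")

def min_prox(series, t_seconds, window=240, half_epoch=300):
    """Magnitude of the smoothed minimum times its proximity to midnight.

    window=240 is the 4-minute smooth; half_epoch=300 gives the
    10 minutes centered on midnight. The series should extend at least
    half a window beyond the epoch to avoid edge effects.
    """
    smooth = moving_average(np.asarray(series, dtype=float), window)
    t_seconds = np.asarray(t_seconds)
    in_epoch = np.abs(t_seconds) <= half_epoch
    s, t = smooth[in_epoch], t_seconds[in_epoch]
    i = int(np.argmin(s))
    proximity = 1.0 - abs(t[i]) / half_epoch  # ASSUMED definition of proximity
    return abs(s[i]) * proximity

def permutation_p_value(series, t_seconds, n_perm=10_000, seed=0):
    """Fraction of shuffled series whose min*prox is at least the observed one."""
    rng = np.random.default_rng(seed)
    observed = min_prox(series, t_seconds)
    null = np.array([min_prox(rng.permutation(series), t_seconds)
                     for _ in range(n_perm)])
    return observed, float(np.mean(null >= observed))
```

A deep minimum far from midnight and a shallow minimum at midnight both score low on this measure; only a deep minimum close to midnight scores high.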
The following figure shows all eight years, including 2006, with the
10-minute epoch averaged over 13 zones as
well as the original 37 zones. A complement to the 13 zones is also
shown in the middle figure.
We expect to do further assessments, for example, looking at the impact
of different smoothing windows and different smoothing algorithms. The
next figure shows the result of smoothing with a Gaussian convolution.
It has the advantage of weighting low frequencies more than high
frequencies, effectively reducing the high frequency noise.
Comparisons of variations on this theme using the previous data should
give us a good general prediction for subsequent years.
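Smoothing with a Gaussian convolution might be sketched as below. The kernel width (sigma, in seconds) and the truncation at four standard deviations are assumptions chosen for illustration, since the parameters used for the figure are not given here. The function name is hypothetical.

```python
import numpy as np

def gaussian_smooth(x, sigma, truncate=4.0):
    """Convolve x with a normalized Gaussian kernel of width sigma (in samples).

    The kernel is cut off at truncate*sigma on each side (ASSUMED choice).
    Values within that distance of either end are affected by zero padding.
    """
    radius = int(truncate * sigma + 0.5)
    offsets = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (offsets / sigma) ** 2)
    kernel /= kernel.sum()  # unit area, so a constant input is unchanged
    return np.convolve(np.asarray(x, dtype=float), kernel, mode="same")
```

Unlike a boxcar moving average, the Gaussian kernel has no sharp edges, so its frequency response falls off smoothly and high-frequency noise is attenuated without the ripple a rectangular window introduces.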