LOOS  v2.3.2
block_average.cpp
1 /*
2  Compute block-averaged standard error for a time series
3  using the Flyvbjerg & Petersen approach. Outputs the block averaged
4  standard error as a function of block size -- you'll need to plot it
5  and estimate the plateau value.
6 
7  References:
8 
9  Flyvbjerg, H. & Petersen, H. G. Error estimates on averages
10  of correlated data J. Chem. Phys., 1989, 91, 461-466
11 
12  Grossfield, A., and Zuckerman, D. M. Quantifying uncertainty and sampling
13  quality in biomolecular simulations, Ann. Reports in Comp. Chem., 2009,
14  5, 23-48
15 
16 
17  Alan Grossfield
18  Grossfield Lab
19  Department of Biochemistry and Biophysics
20  University of Rochester Medical School
21 
22  This file is part of LOOS.
23 
24  LOOS (Lightweight Object-Oriented Structure library)
25  Copyright (c) 2008, Alan Grossfield
26  Department of Biochemistry and Biophysics
27  School of Medicine & Dentistry, University of Rochester
28 
29  This package (LOOS) is free software: you can redistribute it and/or modify
30  it under the terms of the GNU General Public License as published by
31  the Free Software Foundation under version 3 of the License.
32 
33  This package is distributed in the hope that it will be useful,
34  but WITHOUT ANY WARRANTY; without even the implied warranty of
35  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
36  GNU General Public License for more details.
37 
38  You should have received a copy of the GNU General Public License
39  along with this program. If not, see <http://www.gnu.org/licenses/>.
40 
41 */
42 
43 #include <loos.hpp>
44 
45 using namespace std;
46 using namespace loos;
47 
48 string fullHelpMessage(void)
49  {
50  string s =
51 "\n"
52 "SYNOPSIS\n"
53 "\n"
54 "Apply block averaging to estimate standard error of timeseries data\n"
55 "\n"
56 "DESCRIPTION\n"
57 "\n"
58 "This tool performs a block averaging analysis in order to estimate the \n"
59 "standard error of a set of time series data. It takes as input a text\n"
60 "file with white-space delimited data in columns (each time point is a \n"
61 "row), and returns the estimated standard error as a function of block \n"
62 "size. \n"
63 "\n"
64 "The command line arguments are as follows:\n"
65 "\n"
66 "block_average TimeSeriesFile column max_blocks skip\n"
67 "\n"
68 "TimeSeriesFile columnated text file (blank lines and lines starting \n"
69 " with \"#\" are ignored) containing the time series data\n"
70 "column which column to use for analysis (1-based)\n"
71 "max_blocks maximum number of blocks to use in the analysis\n"
72 "skip number of frames to skip from the beginning of the \n"
73 " trajectory\n"
74 " \n"
75 "The algorithm used is in essence that of Flyvbjerg and Petersen [Ref 1],\n"
76 "and is intended to estimate the standard error for a correlated time\n"
77 "series. For uncorrelated data, the standard error can be estimated as\n"
78 "\n"
79 "SE = sqrt(var(a) / N) = stdev(a) / sqrt(N)\n"
80 "\n"
81 "where \"a\" is the quantity of interest and \"N\" is the number of points.\n"
82 "When the data has correlations, as is the case for nearly all molecular\n"
83 "dynamics or Monte Carlo simulations, this formula significantly \n"
84 "underestimates the statistical uncertainty. \n"
85 "\n"
86 "The block averaging algorithm works by breaking the \"N\" data points\n"
87 "into \"M\" equal-sized contiguous blocks , computing the average within \n"
88 "each block, and then combining them to get the standard deviation in the \n"
89 "averages. By tracking how that standard dev changes as a function of block \n"
90 "size, we can estimate the standard error in the limit of inifinite\n"
91 "block size, which is an estimate of the true standard error. \n"
92 "\n"
93 "As the blocks get longer, there are fewer of them, and their variance\n"
94 "can get very noisy. If you've got really good data, the at long block \n"
95 "time will be pronounced before the curve gets noisy. If not, you can\n"
96 "estimate the plateau value by averaging the values at the last few block\n"
97 "sizes (in a plot of std err vs. block size). If there is no plateau (e.g.\n"
98 "the curve is still systematically rising), your data is sufficiently \n"
99 "unconverged that the statistical error cannot be estimated.\n"
100 "\n"
101 "It is important to note that block averaging can significantly\n"
102 "underestimate the standard error for extremely undersampled systems, \n"
103 "because it is entirely based on what has been seen in the trajectory. For\n"
104 "example, in the case of a 2-state system with different positions along\n"
105 "a reaction coordinate x, a very short simulation might only have population\n"
106 "in 1 state; block averaging this data would produce a small estimated \n"
107 "uncertainty, because the data looks very homogeneous. Basically, the\n"
108 "analysis can't know what it hasn't seen.\n"
109 "\n"
110 "Note: If the number of blocks doesn't evenly divide the number of points,\n"
111 "then the remainder will be discarded from the end of the trajectory.\n"
112 "\n"
113 "\n"
114 "See references 1 and 2 for more discussion of the block averaging algorithm.\n"
115 "\n"
116 "1. Flyvbjerg, H. & Petersen, H. G. Error estimates on averages \n"
117 " of correlated data J. Chem. Phys., 1989, 91, 461-466\n"
118 "\n"
119 "2. Grossfield, A., and Zuckerman, D. M. Quantifying uncertainty and \n"
120 " sampling quality in biomolecular simulations, Ann. Reports in Comp. \n"
121 " Chem., 2009, 5, 23-48\n"
122 "\n"
123 "\n"
124 "EXAMPLE\n"
125 "\n"
126 "block_average trj_1.dat 2 20 100\n"
127 "\n"
128 "In this case, trj_1.dat is the data file (I used NAMD's output of the box\n"
129 "dimensions), 2 means analyze the 2nd column (the dimension of the x \n"
130 "coordinate), 20 means use from 2--20 blocks, and 100 means skip the \n"
131 "first 100 time points in the file.\n"
132 "\n"
133 "The output will look like:\n"
134 "\n"
135 "# block_average 'trj_1.dat' '2' '20' '100' - alan (Fri Apr 6 14:22:45 2012) {/home/alan/projects/IBM/lipopeptides/analysis/box_area/pope_popg} [2.0.0 120406]\n"
136 "# Num_Blocks BlockLen StdErr\n"
137 "20 111 0.190357\n"
138 "19 117 0.194111\n"
139 "18 123 0.193382\n"
140 "17 131 0.19941\n"
141 "(more lines like this)\n"
142 "3 743 0.239527\n"
143 "2 1114 0.354243\n"
144 "\n"
145 "The first column is the number of blocks the data is broken into, the \n"
146 "second is the number of time points in each block, and the last column \n"
147 "is the standard error of the averages for each block. As a rule, you'll\n"
148 "want to plot this data using column 2 as the x-axis, and column 3 as the \n"
149 "y-axis.\n"
150 "\n"
151  ;
152  return(s);
153  }
154 
155 
156 void Usage()
157  {
158  cerr << "Usage: block_average TimeSeriesFile column max_blocks skip"
159  << endl;
160  cerr << endl;
161  cerr << "TimeSeriesFile is a columnated text file. Blank lines and "
162  << "lines starting with \"#\" are ignored"
163  << endl;
164  }
165 
166 int main(int argc, char*argv[])
167 {
168 
169 if ( (argc >=2) && (strncmp(argv[1], "--fullhelp", 10) == 0) )
170  {
171  cout << fullHelpMessage() << endl;
172  exit(-1);
173  }
174 
175 if ( (argc <= 1) ||
176  ( (argc >= 2) && (strncmp(argv[1], "-h", 2) == 0) ) ||
177  (argc < 5)
178  )
179  {
180  Usage();
181  exit(-1);
182  }
183 
184 // Print the command line arguments
185 cout << "# " << invocationHeader(argc, argv) << endl;
186 
187 char *datafile = argv[1];
188 int column = atoi(argv[2]);
189 unsigned int max_blocks = atoi(argv[3]);
190 unsigned int skip = atoi(argv[4]);
191 
192 // Read the TimeSeries file
193 TimeSeries<float> data = TimeSeries<float>(datafile, column);
194 unsigned int num_points = data.size();
195 
196 // validate the arguments
197 if (skip > num_points)
198  {
199  cerr << "You set skip ( " << skip << " ) greater than the number "
200  << "of points in the trajectory ( " << num_points << " )."
201  << endl
202  << "This doesn't work."
203  << endl;
204  exit(-1);
205  }
206 
207 // Remove the equilibration time
208 data.set_skip(skip);
209 num_points -= skip;
210 
211 if (max_blocks > num_points)
212  {
213  cerr << "You set max_blocks ( " << max_blocks << " ) greater than "
214  << "the number of points in the trajectory minus the number skipped "
215  << "( " << num_points << " )."
216  << endl
217  << "This doesn't work."
218  << endl;
219  exit(-1);
220  }
221 
222 // Loop over the number of blocks, computing the variance of the averages for
223 // each number of blocks
224 
225 cout << "# Num_Blocks\tBlockLen\tStdErr" << endl;
226 
227 for (int i=max_blocks; i>=2; i--)
228  {
229  int time = num_points / i;
230  float variance = data.block_var(i);
231  float std_err = sqrt(variance/i);
232  cout << i << "\t\t"
233  << time << "\t\t"
234  << std_err
235  << endl;
236  }
237 
238 
239 }
T block_var(const int num_blocks) const
Definition: TimeSeries.hpp:444
STL namespace.
void set_skip(unsigned int num_points)
Definition: TimeSeries.hpp:353
std::string invocationHeader(int argc, char *argv[])
Create an invocation header.
Definition: utils.cpp:124
Time Series Class.
Definition: TimeSeries.hpp:51
Namespace for most things not already encapsulated within a class.