Python statistical analysis
From PIMC++
Contents |
[edit] Getting Started
Here are instructions for installing on your local laptop/machine. If you are working on a lab computer, continue below.
In order to use the software for the lab, several environment variables need to be set up. This can be done by typing
source /afs/ictp/public/shared/smr1929/login.cshrcNote: you must run the above "source" line for each terminal you open. Verify that everything is working by typing
TestPlotThis should produce a simple plot of
.
Next, let's get the sample data and skeleton code into your home directory:
setup_statsThis will create a Statistics subdirectory in your current directory and copy some files into it. You can now
cd StatsLab code> and begin the tutorial.
[edit] Doing Statistics
To begin with, we will write some code that will do statistical analysis. For this section, we will work with the data set test.dat.
Once you have working code, we will use this code later to analyze results from your own variational monte carlo code.
In this section, we will use python. Start with the template in CalcStatistics.py code>.
There are a number of blank functions in this file which you will fill in.
Let's start by figuring out how to take the average of our data. (Throughout this tutorial we will use the library numpy for arrays) It turns out that numpy has an average function that we can call. (i.e. data_avg=numpy.average(myArray) code>). If you haven't used python before, though, it's a good idea to get started by writing one yourself). Let's walk through what you need to know in order to do this.
First in order to define a function we specify:
def Mean(myArray): for i in myArray: print i # this prints everything in the array. # let's change this to average things in a return 0.0
You should notice a couple of things. To begin with, a function is defined by using the "def" keyword. The paramaters to the function don't have types (in python you almost never need to specify the types of data). Also, white-space is important in python. You must indent things (typically 4 spaces) underneath a function and underneath a "for loop". The importance of white space is somewhat unnatural to begin with, but makes the code much more readable.
There are a variety of ways to "loop" in python. The one that is most like C++ and fortran is to do
for i in range(0,len(myArray)): print myArray[i]
Equivalently one can do
for i in myArray: print i
Other useful points:
- Notice, that you can access the i'th element of the array by doing
myArray[i] code>
- If it's a 2d array, you can access the (i,j)'th element by doing
myArray[i,j] code> - If you divide and one of your numbers is an integer, then you will get an integer result (this unfortunate behavior will go away in the next version of python). To avoid this behavior, add 0.0 onto an integer before dividing.
Using this information, write a function that calculates and returns the average of your array.
The file TestMean.py code> reads test.dat code>
and passes the data to your Mean code> function. It also calls Numpy's built-in
average code> function to compare. After you have filled in your Mean code> fuction, try running ./TestMean.pyto see if your function is correct.
[edit] Equilibration (transient)
Of course, in calculating the average, we have ignored the fact that some of the data exists before the system has equilibrated. Let's start by looking at a plot of our data. The script PlotData.py code> plots the first 500 points in our data set:
#!/bin/env python import numpy import pylab d = pylab.load('test.dat') pylab.plot(d[:500]) pylab.show()
pylab (aka matplotlib) is a MATLAB-like python library that is very well suited for plotting publication quality 2d plots. You should now see a plot of the first 500 points of the data (we don't plot it all because otherwise the transient will get washed out). If the data you were averaging was sufficiently short, you would notice a difference between were the average appears on the plot and the average you calculated earlier. This is because, the first part of the data is clearly not yet equilibrated. We want to throw out that part of the data. To get only a piece of an array in python, we use slice notation calling "myArray[start:end]" (this includes the point "start" but not the point "end"). If you don't include one of these (i.e. calling myArray[start:]) it will substitute the first (respectively the last+1) point for the missing parameter). Use this information, to calculate the actual average without the equilibrated data. If you're feeling especially talented, come up with a heuristic way to take a dataset and return the value it has equilibrated at.
[edit] Standard Error
Now, we want to calculate the standard error of our dataset. In order to do this, we need to figure out the variance of our dataset as well as the number of effective points. Let us start by calculating the standard deviation of our dataset. The variance is defined as

where N is the number of points. Using the variance we just calculated, the standard error can be expressed as

Write functions
def var(myArray): # do stuff return variance
and
def NaiveStandardError(myArray): # calculate the standard error by calling the variance function return standard_err
that returns the variance and "naive" standard error of your dataset. By this, I mean the standard error assuming that all points in your system were independent (i.e.
). As a check, the variance for this dataset should be 0.1401 if you calculate it on the entire dataset (which is the incorrect thing to do since you should really leave out the non-equilibrated data). (Helpful note: math.sqrt or numpy.sqrt will give you the sqrt of a number).
Recall that the standard deviation
gives, roughly speaking, the width of a normally distributed dataset. The command
./PlotHist.pywill plot a histogram of the dataset. Compare its width with your calculated value.
We would expect the standard error to be independent of the way in which samples are binned. However, we will see that this is not always the case with real data. If there is some correlation between different data points then the standard error turns out to be a function of the bin size. This is because
and
are different. Let us define kappa (
) to be
Here we will see two standard methods that we can use correct for this autocorrelation and get the correct standard error.
[edit] Autocorrelation Time
Another way to get the correct standard error is to estimate
directly. This can be done by looking at the autocorrelation function

and integrating below it to get

where
is the point where C(i) goes negative.
Write a function
def C(i,myArray,mean,var):
that returns C(i). Then using pylab, write some code
def GraphC(myArray): # just graphs C(i) # remember to use pylab.plot(myList) return None
that graphs C(i) as a function of i. You should note that when this function goes to 0, the points are totally decorrelated (the value of
will be smaller then this because we can use partially correlated points to get a full "effective point".)
Now write a function
def CalcKappa(myArray):
that returns the autocorrelation time.
Now that you can calculate the standard deviation of your dataset and the autocorrelation time,
(and hence the number of effective points), you should be able to put these two things together to calculate the actual standard error in a function
def StdError(myArray):
For the dataset we have supplied, you can call TestKappa.py code> you should get a standard error of 0.0233 and a kappa of 19.49.
[edit] Binning (optional)
The binning method is a technique that let's you calculate the actual standard error by iteratively binning together sets of data. It works in the following way. We calculate the naive standard error of our dataset. Then, we "bin" the dataset by averaging together groups of k points producing a new dataset (size n/k). We then calculate the naive standard error for this new dataset. If we then plot the standard error as a function of k, we should find that the standard error eventually becomes independent of k. This asymptotic value is the real standard error. (From here you could obviously back out the value for
as well) There are two things to note about this procedure. To begin with, establishing the point at which the standard error "levels out" is typically done by looking at the plot (as opposed to automatically) although we could come up with an automatic procedure. The second thing to note is that for a large enough k, the average is over sufficiently few points that you are going to get nonsensical results (don't go above k=N/20). Obviously you should ignore the data at such large k.
To implement this procedure, start by writing a function
def BinData(myArray,i): #bin myArray into groups of size i and put the results in binnedArray return binnedArray
Note a couple useful things. To produce a new 1d python array (of floats) do
binnedArray = numpy.zeros((numPoints),float)
Now write a function
def BinningMethod(myArray):
that returns a list where the i'th element of the list is the naive standard error where myArray is binned into groups of i.
Then by calling TestBinning.py code> (which uses these commands)
myBinnedList = BinningMethod(myArray) pylab.plot(myBinnedList)
you can see a graph of the binning size versus the naive standard error. Look at this plot and estimate the "true error" that this asymptotes to (also back out
)
Compare the results you get using this method with the results from the autocorrelation formula.
Again if you feel especially talented, write a function that uses this method to automatically find this asymptote (without actually having to graph and look at the plot
[edit] Putting it together
Now that we have python code to calculate the respective statistics, we will quickly package it together so that we can easily call it for the rest of our tutorial.
Write a function
def stats(myArray): return (mean,stdError,kappa)
that takes an array and returns a tuple of the mean and standard error. (In python a tuple is a set of values that are packaged together in parenthesis).
From another file (for example, the VMC one you are about to write), you would be able to call your function in the following way:
import CalcStatistics #because you saved it in CalcStatistics.py (should be at top of file) (myMean,myError,myAutoCorr)=CalcStatistics.stats(myArray) #wherever you want to calculate statistics of your array
