Statistics for Physicists

This is a formula cheat sheet with notes and definitions of common statistical methods used for data analysis in the physical sciences.

Definitions

Mean or Expected Value: $$\mu_x = \langle x \rangle = \frac{1}{N} \sum_{i=1}^N x_i$$

Sample Variance: $$S_x^2=\frac{1}{N-1} \sum_{i=1}^N (x_i-\langle x \rangle)^2 $$

Sample Covariance (correlated sample variance): $$S_{uv}^2 = \frac{1}{N-1}\sum_{i=1}^N (u_i-\langle u \rangle)(v_i - \langle v \rangle) $$

Standard Deviation: the large-sample limit of the sample standard deviation, $$\sigma_x = \lim_{N \rightarrow \infty} S_x $$

Degrees of Freedom: $$\text{DOF} = N(\text{Measurements}) - N(\text{Distribution Parameters})$$

"Uncertainty of Mean": $$ \frac{\sigma}{\sqrt{N}} $$

"Variance of Mean": $$ \frac{\sigma^2}{N} $$

\(\sigma\) is usually a measure of some limiting property of the measurement system, though it is often referred to as the "uncertainty". However, it is possible for a physical process to have a large \(\sigma\) while the measurement uncertainty is small. In that case, a quantity with a high probability of taking on values far from its expected value is still measured robustly (we will recover the same distribution every time), but the quantity cannot be measured with greater precision unless the measurement itself is improved (up to the uncertainty limit).
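The definitions above map directly onto NumPy; a minimal sketch with a made-up sample (the data values are illustrative only):

```python
import numpy as np

# Hypothetical sample of N repeated measurements (units arbitrary)
x = np.array([9.8, 10.1, 9.9, 10.3, 10.0, 9.7, 10.2, 10.0])
N = len(x)

mean = x.sum() / N                      # mu_x = <x>
s2 = ((x - mean) ** 2).sum() / (N - 1)  # sample variance S_x^2
s = np.sqrt(s2)                         # sample standard deviation
sigma_mean = s / np.sqrt(N)             # "uncertainty of the mean"

# np.var with ddof=1 uses the same N-1 normalization
assert np.isclose(s2, x.var(ddof=1))
```

Note the `ddof=1` argument: NumPy's default (`ddof=0`) divides by \(N\), not \(N-1\).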

Sample Distributions

Sample distributions of measurement values are used to infer the parent distribution of a physical quantity.

Binomial: Used for a small, discrete number of possible independent outcomes or states. Typically, we are measuring whether an event is observed or not (a binary outcome) in some time interval, and we would like to know how likely it is that \(x\) events are observed. The probability of \(x\) successes out of \(n\) trials, each with probability \(p\), is $$ P(x;n,p) = \frac{n!}{x!(n-x)!}p^x (1-p)^{n-x} $$ $$ \mu = np $$ $$ \sigma^2 = np(1-p) $$
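The closed form can be cross-checked against `scipy.stats.binom`; a quick sketch with arbitrary \(n\) and \(p\):

```python
from math import comb

from scipy.stats import binom

n, p = 10, 0.3
x = 3

# P(x; n, p) from the closed form vs. SciPy's pmf
P_manual = comb(n, x) * p**x * (1 - p) ** (n - x)
P_scipy = binom.pmf(x, n, p)

mu = n * p              # mean, np
var = n * p * (1 - p)   # variance, np(1-p)
```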

Poisson: Similar to the binomial, except that the average number of outcomes is much less than the possible number of outcomes (\(p \rightarrow 0\); see the Poisson Limit Theorem) and the possible number of outcomes is large. A discrete distribution used to model random, rare events such as photon counts in a PMT. The probability of \(x\) observations out of a large number, with expected value \(\lambda\), is $$ P(x;\lambda) = \frac{\lambda^x}{x!} e^{-\lambda} $$ $$\mu = \lambda$$ $$\sigma^2 = \lambda$$ If the distribution is presented as an un-normalized histogram, each bin count is itself a Poisson variable, so the standard deviation of each bin is \(\sigma_{\text{bin}} = \sqrt{n_{\text{bin}}}\).
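The pmf and the \(\sqrt{n_{\text{bin}}}\) rule can both be checked with `scipy.stats.poisson`; a sketch with an arbitrary \(\lambda\) and made-up histogram counts:

```python
import math

import numpy as np
from scipy.stats import poisson

lam = 4.0
xs = np.arange(10)

# P(x; lambda) from the closed form vs. SciPy's pmf
P_manual = np.array([lam**k / math.factorial(k) * math.exp(-lam) for k in xs])
P_scipy = poisson.pmf(xs, lam)

# For a raw (un-normalized) histogram, each bin count is itself
# Poisson-distributed, so its standard deviation is sqrt(n_bin)
counts = np.array([3, 7, 12, 9, 4])   # hypothetical bin counts
bin_errors = np.sqrt(counts)
```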

Gaussian: Continuous distribution of measurement values, commonly used to describe physical quantities that are subject to broadening due to thermal (versus collisional) processes like Doppler shifts. The probability of observing the measurement value (not the number of observations) \(x\) of a quantity with a presumed Gaussian parent distribution with mean \(\mu\) and standard deviation \(\sigma\) is $$P(x;\mu,\sigma) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{1}{2}(\frac{x-\mu}{\sigma})^2} $$ $$\text{FWHM} = 2\sqrt{2\ln 2}\,\sigma \approx 2.355\,\sigma $$ If the sample size is small, the mean and standard deviation of the parent distribution are necessarily poorly-determined and a t-distribution should be used. Typically, we replace the leading \(\sigma\)-dependent coefficient with a new fitting parameter, for a total of three parameters, since Gaussian distributions in measurements of raw data are always scaled by some number and are not normalized.
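A minimal sketch of the three-parameter fit with SciPy's `curve_fit` (the amplitude, center, width, noise level, and seed below are all made up for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

# Three-parameter Gaussian: amplitude A replaces the 1/(sigma*sqrt(2*pi))
# normalization, since raw data are not normalized
def gaussian(x, A, mu, sigma):
    return A * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 201)
y = gaussian(x, 10.0, 0.5, 1.2) + rng.normal(0, 0.1, x.size)  # simulated data

popt, pcov = curve_fit(gaussian, x, y, p0=[8, 0, 1])
perr = np.sqrt(np.diag(pcov))                 # 1-sigma parameter uncertainties
fwhm = 2 * np.sqrt(2 * np.log(2)) * abs(popt[2])
```

The square roots of the covariance-matrix diagonal give the parameter uncertainties, which is usually what we quote alongside the fitted values.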

Lorentzian: Continuous distribution of measurement values, used to describe physical quantities that are subject to broadening due to a random (Poisson) reset process such as collisions in a gas. Also known as the Cauchy distribution. The probability of observing the measurement value \(x\) of a quantity with parent distribution characterized by location (peak) parameter \(\mu\) and FWHM of \(2\gamma\) is $$P(x;\mu,\gamma) = \frac{1}{\pi \gamma} \frac{\gamma^2}{(x-\mu)^2+\gamma^2}$$ Again, we typically use a three-parameter fit as for the Gaussian distribution, since we are fitting a physical measurement and not a normalized probability distribution. Also, note that the mean and standard deviation are not defined for the Lorentzian, so we use the peak location and the FWHM instead.
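The closed form above matches SciPy's Cauchy distribution, with \(\mu\) as `loc` and \(\gamma\) as `scale`; a quick sketch (parameter values are arbitrary):

```python
import numpy as np
from scipy.stats import cauchy

mu, gamma = 1.0, 0.5
x = np.linspace(-3, 5, 9)

# Closed form vs. SciPy (cauchy uses loc = mu, scale = gamma)
P_manual = gamma / (np.pi * ((x - mu) ** 2 + gamma**2))
P_scipy = cauchy.pdf(x, loc=mu, scale=gamma)
```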

Voigt: The line shape resulting from the convolution of a Gaussian and a Lorentzian is a non-analytic function called the Voigt function. A simple approximate relationship exists among the FWHMs of the three profiles, $$\frac{f_V}{f_G} = 0.5346 \left( \frac{f_L}{f_G} \right) + \sqrt{1+0.2166 \left( \frac{f_L}{f_G} \right)^2 } $$ which can be used to estimate the ratio of the Lorentzian and Gaussian FWHMs if a measurement technique provides a knob to increase and decrease the collisional-broadening process against a roughly constant thermal-broadening process (which may come from the measurement method itself).
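SciPy exposes the Voigt line shape via `scipy.special.voigt_profile`, which makes it easy to sanity-check the FWHM approximation numerically; a sketch with arbitrary widths:

```python
import numpy as np
from scipy.special import voigt_profile

sigma, gamma = 1.0, 0.8
f_G = 2 * np.sqrt(2 * np.log(2)) * sigma   # Gaussian FWHM
f_L = 2 * gamma                            # Lorentzian FWHM

# Approximate Voigt FWHM from the relationship above
f_V = 0.5346 * f_L + np.sqrt(0.2166 * f_L**2 + f_G**2)

# Numerical check: locate where the profile falls to half its peak value
x = np.linspace(0, 10, 200001)
y = voigt_profile(x, sigma, gamma)         # peaked at x = 0, decreasing
half_x = x[np.searchsorted(-y, -y[0] / 2)]
f_V_numeric = 2 * half_x
```

The approximation is accurate to a fraction of a percent over the full range of \(f_L/f_G\).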


Fitting Data

\(\chi^2\) Test: Non-linear least-squares regressions in SciPy or gnuplot use the Levenberg-Marquardt algorithm to minimize the \(\chi^2\) statistic. A good rule of thumb is that \(\chi^2 \approx \text{DOF}\), or equivalently reduced \(\chi^2 \approx 1\). The \(\chi^2\) statistic is calculated from the data via $$\chi^2 = \sum_{i=1}^N \frac{(y_i - f(x_i))^2}{\sigma_i^2} $$ where \(f(x)\) is the optimized model function and \(\sigma_i\) is the uncertainty of the \(i\)-th measurement. A p-value is calculated from this statistic and the \(\chi^2\) distribution. A rough interpretation of the p-value: a small p-value indicates that the uncertainties may be underestimated (or that the model is poor), while a large p-value indicates that the uncertainties may be overestimated.
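A minimal sketch of the statistic and its p-value, using `scipy.stats.chi2.sf` (the data, model values, and uncertainties below are made up):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical data y_i with uncertainties sigma_i, and model values f(x_i)
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0])
f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
sigma = np.full(5, 0.1)

chisq = np.sum((y - f) ** 2 / sigma**2)
dof = len(y) - 2                 # e.g. a two-parameter model
reduced = chisq / dof
p_value = chi2.sf(chisq, dof)    # survival function = 1 - CDF
```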

F Test: A future post will address this test and ANOVA in great detail.

Maximum-Likelihood: A future post will address this in great detail.

Orthogonal Distance Regression: ODR is a very useful fitting method for data with known uncertainties in the independent as well as the dependent variables. Usually we would like the uncertainty in the independent variables to be small enough to ignore, but it never hurts to be thorough. In principle, ODR provides a more accurate estimate of the uncertainties of the model fitting parameters without the trouble of programming a Monte Carlo simulation that varies the independent variables. There is a very nice package for ODR in SciPy. In my experience, SciPy's ODR needs an initial guess nearly on top of the minimized solution, so running SciPy's curve_fit (or similar) first to provide an initial fit is almost always necessary.
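A sketch of that workflow with `scipy.odr`, seeding ODR with a `curve_fit` result (the straight-line model, noise levels, and seed are illustrative only):

```python
import numpy as np
from scipy import odr
from scipy.optimize import curve_fit

# scipy.odr wants f(beta, x); curve_fit wants f(x, *params)
def line_odr(beta, x):
    return beta[0] * x + beta[1]

def line_cf(x, m, b):
    return m * x + b

rng = np.random.default_rng(1)
x_true = np.linspace(0, 10, 30)
x = x_true + rng.normal(0, 0.1, x_true.size)            # noisy x too
y = 2.0 * x_true + 1.0 + rng.normal(0, 0.2, x_true.size)

# Rough initial fit with curve_fit, then refine with ODR
p0, _ = curve_fit(line_cf, x, y)

data = odr.RealData(x, y, sx=0.1, sy=0.2)               # known x and y errors
result = odr.ODR(data, odr.Model(line_odr), beta0=p0).run()
slope, intercept = result.beta
slope_err, intercept_err = result.sd_beta               # parameter uncertainties
```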
