
Parametric

Returning to the data set of the guys and the gals: if you have had any amount of statistical education (and if you haven't, go to the Information Theory Notes to acquire some), you will have immediately thought that the cluster of men looked very like what would be described by a bivariate normal or gaussian distribution, and that the cluster of women looked very like another. In elementary books introducing the one dimensional normal distribution, it is quite common to picture the distribution by getting people to stand with their backs to a wall, with people of the same height standing in front of each other. Then the curve passing through the people furthest from the wall is the familiar bell shaped one of Fig.1.8., with its largest value at the average height of the sample.


 
Figure 1.8: One dimensional (univariate) normal or gaussian function
\begin{figure}
\vspace{8cm}
\special {psfile=patrecfig8.ps}\end{figure}

The function family for the one dimensional (univariate) gaussian distribution has two parameters: the centre $\mu$ and the standard deviation $\sigma$. Once these are assigned values, the function is specified (so long as $\sigma$ is positive!) and of course we all know well the expression

\begin{displaymath}
g_{\mu,\sigma}(x) = \frac{1}{\sqrt{2\pi}\sigma} 
e^{-\frac{(x-\mu)^2}{2\sigma^2}} \end{displaymath}

which describes the function algebraically.

The distribution of heights of a sample of men may be modelled approximately by the gaussian function in dimension 1 for suitably chosen values of $\mu, \sigma$. The modelling process means that if you want an estimate of the proportion of the sample between, say, 170 and 190 cm. tall, it can be found by integrating the function between those values. The gaussian $g_{\mu,\sigma}$ takes only positive values, and the integral from $-\infty$ to $\infty $ is 1, so we are simply measuring the area under the curve between two vertical lines, one at 170 and the other at 190. It also follows that there is some fraction of the sample having heights between -50 and -12 cm. This should convince you of the risk of using models without due thought. In low dimensions, the thought is easy, in higher dimensions it may not be. To the philosopher, using a model known to be `wrong' is a kind of sin, but in statistics and probability modelling, we do not have the luxury of being given models which are `true', except possibly in very simple cases.
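Both points - the area between two vertical lines, and the embarrassing positive density at negative heights - can be checked numerically. Here is a short sketch in Python, with made-up values of $\mu$ and $\sigma$ standing in for a real sample of men's heights; the integral is done with the standard error function rather than by explicit quadrature.

```python
import math

def gaussian_pdf(x, mu, sigma):
    # The univariate gaussian g_{mu,sigma}(x) of the formula above.
    return math.exp(-(x - mu)**2 / (2 * sigma**2)) / (math.sqrt(2 * math.pi) * sigma)

def gaussian_cdf(x, mu, sigma):
    # Integral of g_{mu,sigma} from -infinity to x, via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

mu, sigma = 180.0, 10.0   # invented centre and spread, in cm

# Estimated proportion of the sample between 170 and 190 cm
# (one standard deviation either side of the centre):
print(gaussian_cdf(190, mu, sigma) - gaussian_cdf(170, mu, sigma))  # about 0.683

# The model cheerfully assigns positive density to a height of -12 cm:
print(gaussian_pdf(-12, mu, sigma))  # absurdly tiny, but not zero
```

The fraction of negative-height men is so small it is invisible in any finite sample, which is why the model is usable despite being, strictly speaking, wrong.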

To visualise the data of men's heights and weights as modelled by a gaussian function in two dimensions, we need to imagine a `gaussian hill' sitting over the data, as sketched rather amateurishly in Fig.1.9. Don't shoot the author, he's doing his best.


 
Figure 1.9: Two dimensional (bivariate) normal or gaussian distribution
\begin{figure}
\vspace{8cm}
\special {psfile=patrecfig9.ps}\end{figure}

This time the gaussian function is of two variables, say x and y, and its parameters are now more complicated. The centre, $\mu$, is now a point in the space ${\fam11\tenbbb R}^2$, while the $\sigma$ has changed rather more radically. Casting your mind back to your elementary linear algebra education, you will recall that quadratic functions of two variables may be conveniently represented by symmetric matrices, for example the function

\begin{displaymath}
f: {\fam11\tenbbb R}^2 \longrightarrow {\fam11\tenbbb R}\end{displaymath}

given by

\begin{displaymath}
% latex2html id marker 578
f\left(\begin{array}
{c} x \\  y \end{array} 
\right) = 3x^2 + 8xy + 7y^2 \end{displaymath}

may be represented by the matrix

\begin{displaymath}
% latex2html id marker 579
\left(\begin{array}
{cc} 3 & 4 \\  4 & 7 \end{array} 
\right) \end{displaymath}

and in general for quadratic functions of two variables we can write

\begin{displaymath}
% latex2html id marker 580
\left(\begin{array}
{cc} x & y \end{array} 
\right)
\left(\begin{array}
{cc} a & b \\  b & c \end{array} 
\right)
\left(\begin{array}
{c} x \\  y\end{array} \right) \end{displaymath}

for the function usually written $ax^2 + 2bxy + cy^2$. Multiplying out the matrices gives the correct result.
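If you distrust the claim, it takes only a few lines to multiply out the matrices numerically. The following sketch checks the particular function $3x^2 + 8xy + 7y^2$ of the text against its matrix representation:

```python
import numpy as np

# The symmetric matrix representing f(x, y) = 3x^2 + 8xy + 7y^2:
# the cross-term coefficient 8 is split in half across the off-diagonal.
A = np.array([[3.0, 4.0],
              [4.0, 7.0]])

def f_direct(x, y):
    # The quadratic function written out longhand.
    return 3*x**2 + 8*x*y + 7*y**2

def f_matrix(x, y):
    # The same function as (row vector) * (matrix) * (column vector).
    v = np.array([x, y])
    return v @ A @ v

# The two agree at any point, for instance:
print(f_direct(2.0, -1.0), f_matrix(2.0, -1.0))  # 3.0 3.0
```

The agreement holds at every point, since multiplying out $v^T A v$ recovers each term of the polynomial, the two off-diagonal entries together supplying the $8xy$.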

Since in one dimension the gaussian function exponentiates a quadratic form, it is no surprise that it does the same in two or more dimensions. The n-dimensional gaussian family is parametrised by a centre ${ {\bf m}}$, which is a point in ${\fam11\tenbbb R}^n$, and ${\bf V}$, an n by n invertible positive definite symmetric matrix representing the quadratic map which takes ${\bf x}$ to $ {\bf x}^T {\bf V}^{-1} {\bf x}$. The symbol $\ ^T$ denotes the transpose of a column matrix to a row matrix. The formula for the gaussian function is therefore

\begin{displaymath}
g_{[{\bf m},{\bf V}]} ({\bf x}) =
 \frac{1}{(\sqrt{2\pi})^n\sqrt{\det {\bf V}}}\;
 e^{-\frac{({\bf x}-{\bf m})^T {\bf V}^{-1} ({\bf x}-{\bf m}) }{2}} \end{displaymath}

and we shall refer to ${ {\bf m}}$ as the centre of the gaussian and ${\bf V}$ as its covariance matrix. The normal or gaussian function with centre ${ {\bf m}}$ and covariance matrix ${\bf V}$ is often written N(${{\bf m}}, {\bf V}$) for short. All this may be found explained and justified, to some extent, in undergraduate textbooks on statistics. See Feller, An Introduction to Probability Theory and Its Applications, volume 2, John Wiley 1971, for a rather old fashioned treatment. Go to Information Theory for a more modern explanation.
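The formula translates directly into code. Here is a sketch which evaluates the density of N(${{\bf m}}, {\bf V}$) at a point, with wholly invented values for the centre and covariance matrix playing the part of a height-weight model:

```python
import numpy as np

def gaussian(x, m, V):
    # Density of N(m, V) at x, following the formula in the text:
    # 1/((sqrt(2*pi))^n * sqrt(det V)) * exp(-(x-m)^T V^{-1} (x-m) / 2)
    n = len(m)
    d = np.asarray(x) - m
    norm = np.sqrt(2 * np.pi)**n * np.sqrt(np.linalg.det(V))
    return np.exp(-0.5 * d @ np.linalg.inv(V) @ d) / norm

# Invented centre (height cm, weight kg) and covariance matrix:
m = np.array([180.0, 75.0])
V = np.array([[100.0, 40.0],
              [ 40.0, 60.0]])

# The peak of the hill sits over the centre; the density falls off
# in every direction away from it.
print(gaussian(m, m, V))
print(gaussian(m + np.array([10.0, 0.0]), m, V))
```

At the centre itself the exponential is 1, so the peak height is just the normalising constant $1/(2\pi\sqrt{\det {\bf V}})$ in dimension 2.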

The parameters ${ {\bf m}}$ and ${\bf V}$, when given actual numerical values, determine just one gaussian hill, but we have the problem of working out which numerical values to select. The parameter space of possible values allows an infinite family of possible gaussian hills. If we believe that there is some choice of ${ {\bf m}}$ and ${\bf V}$ which gives, of all possible choices, the best fit gaussian hill to the data of Fig.1.9., then we can rely on the statisticians to have found a way of calculating it from the data. We shall go into this matter in more depth later, but indeed the statisticians have been diligent and algorithms exist for computing a suitable ${ {\bf m}}$ and ${\bf V}$. These will, in effect, give a function the graph of which is a gaussian hill sitting over the points. And the same algorithms applied to the female data points of Fig.1.2. will give a second gaussian hill sitting over the female points. The two hills will intersect in some curve, but we shall imagine each of them sitting in place over its own data points - and also over the other's. Let us call them $g_m$ and $g_f$ for the male and female gaussian functions respectively.
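The statisticians' recipe, which we shall justify later, is in the gaussian case the obvious one: use the sample mean of the points for the centre and the sample covariance matrix for ${\bf V}$. A sketch, using synthetic points drawn from invented distributions in place of the real data of Fig.1.2.:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the male and female (height, weight) points;
# the centres and covariances here are invented for illustration.
male = rng.multivariate_normal([180, 80], [[60, 25], [25, 40]], size=200)
female = rng.multivariate_normal([165, 60], [[50, 20], [20, 30]], size=200)

def fit_gaussian(points):
    # The usual estimates: sample mean for the centre,
    # sample covariance matrix for V.
    m = points.mean(axis=0)
    V = np.cov(points, rowvar=False)
    return m, V

m_m, V_m = fit_gaussian(male)
m_f, V_f = fit_gaussian(female)
print(m_m, m_f)  # close to the centres the points were drawn from
```

With a couple of hundred points the recovered centres land close to the ones the points were generated from, which is some reassurance that the recipe does what it claims.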

If a new data point ${\bf x}$ is provided, we can calculate the height of the two hills at that point, $g_m({\bf x})$ and $g_f({\bf x})$ respectively. It is intuitively appealing to argue that if the male hill is higher than the female hill at the new point, then it is more likely that the new point is male than female. Indeed, we can say how much more likely by looking at the ratio of the two numbers, the so called likelihood ratio

\begin{displaymath}
\frac{ 
g_m({\bf x})}{g_f({\bf x})}\end{displaymath}

Moreover, we can fairly easily tell if a point is a long way from any data we have seen before because both the likelihoods[*] $g_m({\bf x})$ and $g_f({\bf x})$ will be small. What `small' means is going to depend on the dimension, but not on the data.
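Putting the pieces together, the likelihood ratio rule is only a few lines of code. The centres, covariance matrices and the `smallness' threshold below are all invented for illustration:

```python
import numpy as np

def gaussian(x, m, V):
    # n-dimensional gaussian density, as in the formula of the text.
    n = len(m)
    d = np.asarray(x) - m
    norm = np.sqrt(2 * np.pi)**n * np.sqrt(np.linalg.det(V))
    return np.exp(-0.5 * d @ np.linalg.inv(V) @ d) / norm

# Made-up parameters for the two hills g_m and g_f:
m_male, V_male = np.array([180.0, 80.0]), np.array([[60.0, 25.0], [25.0, 40.0]])
m_female, V_female = np.array([165.0, 60.0]), np.array([[50.0, 20.0], [20.0, 30.0]])

def classify(x, threshold=1e-6):
    gm = gaussian(x, m_male, V_male)
    gf = gaussian(x, m_female, V_female)
    if gm < threshold and gf < threshold:
        return 'neither'   # a long way from any data we have seen before
    return 'male' if gm > gf else 'female'

print(classify([178, 78]))   # near the male centre
print(classify([166, 61]))   # near the female centre
print(classify([300, 10]))   # far from both hills
```

Note that the rule compares heights of hills, not distances to centres, so the shapes of the two covariance matrices enter into the decision; and the threshold lets us refuse to answer rather than extrapolate wildly.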

It is somewhat easier to visualise this in the one dimensional case: Fig.1.10. shows a new point, and the two gaussian functions sitting over it; the argument that says it is more likely to belong to the function giving the greater height may be quantified and made more respectable, but is intuitively appealing. The (relatively) respectable version of this is called Bayesian Decision Theory, and will be described properly later.


 
Figure 1.10: Two gaussian distributions over a point of unknown type.
\begin{figure}
\vspace{8cm}
\special {psfile=patrecfig10.ps}\end{figure}

The advantage of the parametric statistical approach is that we have an explicit (statistical) model of the process by which the data was generated. In this case, we imagine that the data points were generated by a process which can keep producing new points. In Fig.1.10. one can imagine that two darts players of different degrees of inebriation are throwing darts at a line. One is aiming at the centre a and the other, somewhat drunker, at the centre b. The two distributions tell you something about the way the players are likely to place the darts; then we ask, for the new point x, what is the probability that it was thrown by each of the two players? If the b curve is twice the height of the a curve over x, then, all other things being equal, we should be inclined to think it twice as likely that the dart was thrown by the b player as by the a player.

We do not usually believe in the existence of inebriated darts players as the source of the data, but we do suppose that the data is generated in much the same way; there is an ideal centre which is, so to speak, aimed at, and in various directions, different amounts of scatter can be expected. In the case of height and weight, we imagine that when mother nature, god, allah or the blind forces of evolution designed human beings, there is some height and weight and shape for each sex which is most likely to occur, and lots of factors of a genetic and environmental sort which militate in one direction or another for a particular individual. Seeing mother nature as throwing a drunken dart instead of casting some genetic dice is, after all, merely a more geometric metaphor.

Whenever we make a stab at guessing which is the more likely source of a given data point coding an object, or alternatively at deciding to which category an object belongs, we have some kind of tacit model of the production process, or at least of some of its properties. In the metric method, we postulate that the metric on the space is a measure of similarity of the objects; in the neural net method we postulate that at least some sort of convexity property holds for the generating process.

Note that in the case of the statistical model, something like the relevant metric to use is generated automatically, so the problem of Fig.1.4. is solved by the calculation of the two gaussians (and the X-axis gets shrunk, in effect). The rationale is rather dependent on the choice of gaussians to model the data. In the case discussed, of heights and weights of human beings, it looks fairly plausible, up to a point, but it may be rather difficult to tell if it is reasonable in higher dimensions. Also, it is not altogether clear what to do when the data does not look as if a gaussian model is appropriate. Parametric models have been used, subject to these reservations, for some centuries, and undoubtedly have their uses. There are techniques in existence for coping with the problems of non-gaussian distributions of data, and some will be discussed later. The (Bayesian) use of the likelihood ratio to select the best bet has its own rationale, which can extend to the case where we have some prior expectations about which category is most likely. Again, we shall return to this in more detail later.


Mike Alder
9/19/1997