One of the simplest methods is to find the closest point of the labelled set of points to the new point P, and assign to the new point whatever category the closest point has. So if (for the data set of guys and gals) the nearest point to P is an X, then we conclude that P should be a man. If a rationale is needed, we could argue that the measurement process is intended to extract important properties of the objects, and if we come out with values for the readings which are close together, then the objects must be similar. And if they are similar in respect of the measurements we have made, they ought, in any reasonable universe, to be similar in respect of the category they belong to as well. Of course it isn't clear that the universe we actually live in is the least bit reasonable.
Such a rationale may help us devise the algorithm in the first place, but it may also allow us to persuade ourselves that the method is a good one. Such means of persuasion are unscientific and frowned upon in all the best circles. There are better ways of ensuring that it is a good method, namely testing to see how often it gives the right answer. It is noteworthy that no matter how appealing to the intuitions a method may be, there is an ultimate test which involves trying it out on real data. Of course, rationales tend to be very appealing to the intuitions of the person who thought of them, and less appealing to others. It is, however, worth reflecting on rationales, particularly after having looked at a bit more data; sometimes one can see the flaws in the rationales, and devise alternative methods.
The metric method is easy to implement in complete
generality for n measurements, we just
have to go through the whole list of points
where we know the category and compute the distance
from the given point P.
How do we do this? Well, the usual Euclidean
distance
between the vectors
and
is simply
, which is
easy to compute. Now we find that point x for which this distance from the new point P
is a minimum. All that
remains is to note its category.
If anyone wants to know where the formula for
the euclidean distance comes from in higher
dimensions, it's a definition, and it gives the
right answers in dimensions one,
two and three. You have a better idea?
Reflection suggests some drawbacks. One is that we need to compute a comparison with all the data points in the set. This could be an awful lot. Another is, what do we do in a case such as Fig.1.3., above, where the new point P doesn't look as if it belongs to either category? An algorithm which returns `Haven't the faintest idea, probably neither' when asked if the P of Fig.1.3. is a man or a woman would have some advantages, but the metric method needs some modification before it can do this. It is true that P is a long way from the closest point of either category, but how long is a long way?
Exercise: Is P in Fig.1.3 likely to be (a) a kangaroo or (b) a pole vaulter's pole?
A more subtle objection would occur only to a geometer, a species of the genus Mathematician. It is this: why should you use the euclidean distance? What is so reasonable about taking the square root of the sum of the squares of the differences of the co-ordinates? Sure, it is what you are used to in two dimensions and three, but so what? If you had the data of Fig.1.4. for example, do you believe that the point P is, on the whole, `closer to' the X's or the O's?
There is a case for saying that the X-axis in Fig.1.4. has been stretched out by something like three times the Y-axis, and so when measuring the distance, we should not give the X and Y coordinates the same weight. If we were to divide the X co-ordinates by 3, then P would be closer to the X's, whereas using the euclidean distance it is closer to the O's.
It can come as a nasty shock to the engineer to
realise that there are an
awful lot of different metrics (ways of
measuring distances)
on
, and the old, easy one isn't necessarily
the right one to use. But it should be obvious
that if we measure weight in kilograms and
height in centimetres, we shall get different
answers from those we would obtain if we
measured height in metres and weight in grams.
Changing the measuring units in the above
example changes the metric, a matter of very practical
importance in real life. There are
much more complicated cases than this which occur
in practice, and we shall meet
some in later sections, when we go over these
ideas in detail.
Remember that this is only the mickey-mouse, simple and easy discussion on the core ideas and that the technicalities will come a little later.