For reasons which are not altogether convincing, it is not uncommon to model speech as if it were produced by IIR filtering a glottal pulse input (for voiced sounds) or white noise (for unvoiced sounds). Then, if we know what is supposed to have gone in and we know what came out, we can calculate the coefficients which give the best fit to the output over some period of time. As the vocal tract changes, these coefficients are also supposed to change in time, but relatively slowly. So we change a fast-varying quasi-periodic time series into a vector-valued time series, or a bit of one, which I have called the trajectory of an utterance. The argument for Autoregressive modelling suggested above hints at a relationship with the Fourier Transform, which emerges with more clarity after some algebra.
This approach is called Linear Predictive Coding in the Speech Recognition literature.
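To make the idea concrete, here is a minimal sketch, in Python with numpy, of fitting the coefficients for one short frame by least squares: each sample is predicted as a linear combination of the p samples before it. The frame length, the order p and the synthetic signal standing in for voiced speech are all invented for illustration; a real system would typically window the frame and use the autocorrelation method, but the fitted coefficients are the same kind of object.

    import numpy as np

    def lpc_coefficients(frame, p):
        """Least-squares fit of a[1..p] so that frame[n] ~ sum_k a[k] * frame[n-k]."""
        # Row n holds the p previous samples frame[n-1], ..., frame[n-p].
        X = np.array([frame[n - p:n][::-1] for n in range(p, len(frame))])
        y = frame[p:]
        a, *_ = np.linalg.lstsq(X, y, rcond=None)
        return a

    # A fake quasi-periodic frame standing in for 20 ms or so of voiced speech.
    t = np.arange(160)
    frame = np.sin(2 * np.pi * t / 40) + 0.5 * np.sin(2 * np.pi * t / 13)
    frame = frame + 0.05 * np.random.randn(160)

    a = lpc_coefficients(frame, p=10)
    pred = np.array([a @ frame[n - 10:n][::-1] for n in range(10, len(frame))])
    residual = frame[10:] - pred
    print("coefficients:", np.round(a, 3))
    print("residual energy / frame energy:",
          float(residual @ residual) / float(frame @ frame))

Computing such a coefficient vector for each successive frame is what turns the raw samples into the vector-valued trajectory mentioned above.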
ARMA modelling, with its assumptions plausible or not, is done extensively. A variant is to take a time series and `difference' it by deriving the time series of consecutive differences, v(n) = u(n) - u(n-1). This may be repeated several times. Having modelled the differenced time series, one can get back a model for the original time series, given some data on initial conditions. This is known as ARIMA modelling, with the I short for Integrated.
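A small sketch of the differencing and integration steps, again in Python with numpy and with a series invented purely for illustration: each pass of differencing drops one initial value, and it is exactly those values which are needed to recover the original series.

    import numpy as np

    def difference(u, d=1):
        """Apply d rounds of v(n) = u(n) - u(n-1), keeping the d dropped
        initial values so the operation can be undone."""
        initials = []
        v = np.asarray(u, dtype=float)
        for _ in range(d):
            initials.append(v[0])
            v = np.diff(v)
        return v, initials

    def integrate(v, initials):
        """Undo difference(): cumulative sums, reinstating the stored initial values."""
        u = np.asarray(v, dtype=float)
        for first in reversed(initials):
            u = np.concatenate(([first], first + np.cumsum(u)))
        return u

    u = np.array([3.0, 5.0, 4.0, 7.0, 11.0, 10.0])
    v, init = difference(u, d=2)     # the twice-differenced series one would model
    print(v)                         # [-3.  4.  1. -5.]
    print(integrate(v, init))        # [ 3.  5.  4.  7. 11. 10.]

One models v with an ARMA model; the stored initial conditions carry the rest of the information about u.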
The modelling of a stationary time series by supposing it arrives by filtering a white noise input is a staple of filtering theory. The method is perhaps surprising to the innocent, who are inclined to want to know why this rather unlikely class of models is taken seriously. Would you expect, say, the stock exchange price of pork belly futures to be a linear combination of its past values added to some white noise which has been autocorrelated? The model proposes that there is a random driving process which has short term autocorrelations of a linear sort, and that the observed process arises from this driving process by more autocorrelations, that is, dependencies on its own past. Would you believe it for pork bellies? For pork belly futures? Electroencephalograms? As a model of what is happening to determine prices or anything much else, it seems to fall short of Newtonian dynamics, but do you have a better idea? Much modelling of a statistical sort is done the way it is simply because nobody has a better idea. This approach, because it entails linear combinations of things, can be written out concisely in matrix formulation, and matrix operations can be computed and understood, more or less, by engineers. So something can be done, if not always the right thing. Which beats scratching your head until you get splinters.
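For what the model actually asserts, here is a tiny simulation, again in Python with numpy and with coefficients invented purely for illustration: white noise e(n), lightly autocorrelated by a moving-average term, drives a process x(n) which also depends linearly on its own recent past (an ARMA(2,1) in the usual jargon).

    import numpy as np

    rng = np.random.default_rng(0)

    def simulate_arma(n, ar=(0.75, -0.25), ma=(0.4,), sigma=1.0):
        e = rng.normal(0.0, sigma, size=n)   # the white-noise driving process
        x = np.zeros(n)
        for t in range(n):
            ar_part = sum(a * x[t - 1 - k] for k, a in enumerate(ar) if t - 1 - k >= 0)
            ma_part = sum(b * e[t - 1 - k] for k, b in enumerate(ma) if t - 1 - k >= 0)
            x[t] = ar_part + e[t] + ma_part
        return x

    x = simulate_arma(500)
    xc = x - x.mean()
    for lag in (1, 2, 5, 10):
        r = (xc[:-lag] @ xc[lag:]) / (xc @ xc)
        print(f"lag {lag:2d}: sample autocorrelation {r:+.3f}")

Whether pork belly futures are really generated this way is another matter; the point is only that the machinery is easy to write down and easy to compute with.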
Once the reader understands that this is desperation city, and that things are done this way because they can be rather than because there is a solid rationale, he or she may feel much more cheerful about things.
For speech, there is a theory which regards the vocal tract as a sequence of resonators made up out of something deformable, and which can, in consequence, present some sort of justification for Linear Predictive Coding. In general, the innocent beginner finds an extraordinary emphasis on linear models throughout physics, engineering and statistics, and may innocently believe that this is because life is generally linear. It is actually because we know how to do the sums in these cases. Sometimes, it more or less works.