5.5. Statistical Techniques - Glossary of Statistical Terms

This section explains some of the most important statistical terms, which play a central role in the methods we will be teaching here. The goal is to review this material and refresh your memory.
5.5.1. Variance
5.5.2. Standard Deviation
5.5.3. Standard Error
5.5.4. Correlation
5.5.5. Autocorrelation
5.5.6. Covariance
5.5.7. Expected Value
5.5.8. Standardized Value
5.5.9. Null Hypothesis
5.5.10. Analysis of Variance - ANOVA
5.5.11. Fisher's F Test
5.5.12. Fisher's F Distribution
5.5.1. Variance

The variance (the term was first used by Fisher in 1918) of a population of values is calculated as:

\sigma^2 = \sum_{i=1}^{N} (x_i - \mu)^2 / N

where:
- \mu is the population mean
- N is the population size.
The sample estimate of the variance of a finite population of size N is given by:

s^2 = \sum_{i=1}^{N} (x_i - \bar{x})^2 / (N - 1)

where:
- \bar{x} is the sample mean
- N is the sample size.
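As a quick illustration (a minimal Python sketch with made-up data; NumPy's ddof parameter selects the divisor), the two formulas differ only in whether they divide by N or by N - 1:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])  # hypothetical data
mu = x.mean()

# Population variance: divide the sum of squared deviations by N
var_pop = ((x - mu) ** 2).sum() / len(x)

# Sample estimate: divide by N - 1 (Bessel's correction)
var_sample = ((x - mu) ** 2).sum() / (len(x) - 1)

# NumPy computes the same quantities; the divisor is N - ddof
assert np.isclose(var_pop, np.var(x, ddof=0))
assert np.isclose(var_sample, np.var(x, ddof=1))
print(var_pop, var_sample)  # 4.0 and approx. 4.571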
5.5.2. Standard Deviation

In probability and statistics, the standard deviation is the most commonly used measure of statistical dispersion and a common measure of variation. The term was first used by Pearson in 1894. The standard deviation is defined as the square root of the variance. It is defined this way in order to give a measure of dispersion that is
- a positive number, and
- expressed in the same units as the data.

Important: we distinguish between the standard deviation \sigma (sigma) of a whole population or of a random variable, and the standard deviation s of a sample.
The standard deviation of a population is calculated as:

\sigma = \sqrt{\sum_{i=1}^{N} (x_i - \mu)^2 / N}

where:
- \mu is the population mean
- N is the population size.

The standard deviation of a sample is computed as:

s = \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)}

where:
- \bar{x} is the sample mean
- n is the sample size.
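Continuing the sketch above (same hypothetical numbers), the standard deviation is simply the square root of the corresponding variance and comes out in the data's own units:

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

sigma = np.std(x, ddof=0)  # population standard deviation: sqrt(4.0) = 2.0
s = np.std(x, ddof=1)      # sample standard deviation: sqrt(32/7), approx. 2.14
print(sigma, s)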
5.5.3. Standard Error

The standard error of the mean (first used by Yule, 1897) is the theoretical standard deviation of all sample means of size n drawn from a population. It depends on both the population variance (\sigma^2) and the sample size (n), as indicated below:

\sigma_{\bar{x}} = \sqrt{\sigma^2 / n}

where:
- \sigma^2 is the population variance and
- n is the sample size.
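A small simulation (a sketch with arbitrary parameters) illustrates why this is called the standard deviation of all sample means: the spread of the means of many size-n samples matches \sigma / \sqrt{n}.

import numpy as np

rng = np.random.default_rng(0)
sigma, n, trials = 3.0, 25, 100_000

# Draw many samples of size n and take the mean of each
means = rng.normal(loc=0.0, scale=sigma, size=(trials, n)).mean(axis=1)

# Empirical spread of the sample means vs. the theoretical standard error
print(means.std(), sigma / np.sqrt(n))  # both close to 0.6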
5.5.4. Correlation

In probability theory and statistics, the correlation, also called the correlation coefficient, between two random variables is found by dividing their covariance by the product of their standard deviations. (It is defined only if these standard deviations are finite.) It is a corollary of the Cauchy-Schwarz inequality that the correlation cannot exceed 1 in absolute value.
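Written as a formula, the definition above reads:

\rho_{X,Y} = cov(X, Y) / (\sigma_X \sigma_Y)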
Restating this in less mathematical terms, a correlation is a measure of the relation between two or more variables. Correlation coefficients can range from -1.00 to +1.00. A value of -1.00 represents a perfect negative correlation, while a value of +1.00 represents a perfect positive correlation. A value of 0.00 represents a lack of correlation.
- Positive Correlation. The relationship between two variables is such that as one variable's values tend to increase, the other variable's values also tend to increase. This is represented by a positive correlation coefficient.
- Negative Correlation. The relationship between two variables is such that as one variable's values tend to increase, the other variable's values tend to decrease. This is represented by a negative correlation coefficient.
The most widely used type of correlation coefficient is the Pearson r (Pearson, 1896), also called the linear or product-moment correlation (the term correlation was first used by Galton, 1888). It is the basic type of correlation. In non-technical language, one can say that the correlation coefficient determines the extent to which the values of two variables are "proportional" to each other. The value of the correlation (i.e., the correlation coefficient) does not depend on the specific measurement units used; for example, the correlation between height and weight is identical regardless of whether inches and pounds or centimeters and kilograms are used as measurement units. Proportional means linearly related; that is, the correlation is high if it can be approximated by a straight line (sloped upwards or downwards). This line is called the regression line or least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible.

Pearson correlation assumes that the two variables are measured on at least interval scales.
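The following sketch (with invented height/weight numbers) computes the Pearson r directly from its product-moment definition and checks the unit invariance mentioned above:

import numpy as np

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length samples."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

height_cm = np.array([155.0, 160.0, 165.0, 170.0, 175.0, 180.0, 185.0])
weight_kg = np.array([52.0, 58.0, 61.0, 66.0, 70.0, 77.0, 82.0])

r_metric = pearson_r(height_cm, weight_kg)
r_imperial = pearson_r(height_cm / 2.54, weight_kg * 2.2046)  # inches, pounds
print(r_metric, r_imperial)  # identical: r does not depend on the units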
Significance of a Correlation
The significance level calculated
for each correlation is a primary source of information about the reliability
of the correlation. The test of significance is based on the assumption
that the distribution of the residual values (i.e., the deviations from
the regression line) for the dependent variable y follows the normal distribution,
and that the variability of the residual values is the same for all values
of the independent variable x. However, Monte Carlo studies suggest that
meeting those assumptions closely is not absolutely crucial if your sample
size is not very large. It is impossible to formulate precise recommendations
based on those Monte Carlo results, but many researchers follow a rule
of thumb that if your sample size is 50 or more then serious biases are
unlikely, and if your sample size is over 100 then you should not be concerned
at all with the normality assumptions. There are, however, much more common
and serious threats to the validity of information that a correlation coefficient
can provide.
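As a sketch of the significance test (simulated data; SciPy's pearsonr reports the two-sided p-value, and the t statistic shown is the standard one for testing r against zero):

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=60)
y = 0.5 * x + rng.normal(size=60)   # correlated by construction

r, p = stats.pearsonr(x, y)

# The same p-value from the t statistic with n - 2 degrees of freedom
n = len(x)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p_manual = 2 * stats.t.sf(abs(t), df=n - 2)
print(r, p, p_manual)               # p and p_manual agree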
5.5.5. Autocorrelação
Autocorrelation is a mathematical
tool used frequently in signal processing for analysing functions or series
of values, such as time domain signals. Autocorrelation is useful for finding
repeating patterns in a signal, such as determining the presence of a periodic
signal which has been buried under noise, or identifying the fundamental
frequency of a signal which doesn't actually contain that frequency component,
but implies it with many harmonic frequencies.
An autocorrelation is the
cross-correlation of a signal with itself. It is the correlation of a series
of data with itself, shifted by a particular lag of k observations. The
plot of autocorrelations for various lags is a crucial tool for determining
an appropriate model for ARIMA analysis. The computation of the autocorrelation coefficients r_k follows the standard formulas, as described in most time series references (e.g., Box & Jenkins, 1976).
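A minimal sketch of the lag-k autocorrelation coefficient (the standard sample formula; the signal below is invented): a sine with period 12 buried in noise still shows clear peaks of r_k at lags 12 and 24.

import numpy as np

def autocorr(x, k):
    """Sample autocorrelation coefficient r_k at lag k."""
    d = np.asarray(x, float) - np.mean(x)
    return (d[:-k] * d[k:]).sum() / (d ** 2).sum() if k > 0 else 1.0

t = np.arange(240)
rng = np.random.default_rng(2)
signal = np.sin(2 * np.pi * t / 12) + rng.normal(scale=1.0, size=t.size)

for k in (1, 6, 12, 24):
    print(k, round(autocorr(signal, k), 3))  # large positive r_k at k = 12, 24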
5.5.6. Covariância
A covariância duas variáveis
randômicas X e
Y, com expected
values E(X) = m
e
E(Y) = n é
definida como:
-

Isto é equivalente à
fórmula abaixo, que é geralmente utilizada na realização
dos cálculos:
-

For column-vector valued random
variables X and Y with respective expected values ? and ?,
and
n and m scalar components respectively, the covariance
is defined to be the n×m matrix
-

cov(X,Y)
é também notada como CXY e denominada matriz
de covariância.
If X and Y
are independent,
then their covariance is zero. This follows because under independence,
E(X·Y) = E(X)·E(Y). The converse, however, is not true: it
is possible that X and Y are not independent, yet their covariance
is zero.
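A classic counterexample, shown here as a Python sketch: take X uniform on {-1, 0, 1} and Y = X^2. Y is a deterministic function of X, yet E(XY) = E(X^3) = 0 = E(X)E(Y), so the covariance vanishes.

import numpy as np

rng = np.random.default_rng(3)
x = rng.choice([-1.0, 0.0, 1.0], size=1_000_000)
y = x ** 2  # completely determined by x, hence not independent

print(np.cov(x, y)[0, 1])  # close to 0 despite the dependence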
The covariance matrix is
always symmetric and positive semi-definite. It is diagonal if all n variables
are independent. Its determinant is zero if linear relations exist between
variables.
If X and Y are real-valued random variables and c is a constant ("constant", in this context, means non-random), then the following facts are a consequence of the definition of covariance:

- cov(X, X) = var(X)
- cov(X, Y) = cov(Y, X)
- cov(cX, Y) = c cov(X, Y)
- cov(X + c, Y) = cov(X, Y)
For vector-valued random variables,
cov(X, Y) and cov(Y, X) are each other's transposes.
The covariance is sometimes
called a measure of "linear dependence" between the two random variables.
That phrase does not mean the same thing that it means in a more formal
linear algebraic setting (see linear
dependence), although that meaning is not unrelated.
The correlation
is a closely related concept used to measure the degree of linear dependence
between two variables.