A First Course for Students of Psychology and Education, 4th Edition. New York: West Publishing.

Chapter 4: Variability

So far we've discussed two of the three characteristics used to describe distributions; now we need to discuss the remaining one: variability. Notice in our distributions that not every score is the same, e.g., not everybody gets the same score on the exam. So what we need to do is describe these varied results, roughly by describing the width of the distribution.
In other words, variability refers to the degree of "differentness" of the scores in the distribution. High variability means that the scores differ by a lot, while low variability means that the scores are all similar (homogeneity). The simplest measure of variability is the range, which we've already mentioned in our earlier discussions.
So look at your frequency distribution table, find the highest and lowest scores, and subtract the lowest from the highest (note: if the variable is continuous, you must consider the real limits).
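The range calculation can be sketched in a few lines of Python (the exam scores here are made up for illustration):

```python
# Range: highest score minus lowest score.
# For a continuous variable, use the real limits instead:
# (highest + 0.5) - (lowest - 0.5), assuming measurement to the nearest unit.
scores = [2, 4, 7, 7, 9, 3, 5]  # hypothetical exam scores

range_discrete = max(scores) - min(scores)                      # 9 - 2 = 7
range_real_limits = (max(scores) + 0.5) - (min(scores) - 0.5)   # 9.5 - 1.5 = 8.0
print(range_discrete, range_real_limits)   # prints: 7 8.0
```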
- there are some drawbacks to using the range as the description of the variability of a distribution: it is based on only the two most extreme scores, so it ignores everything in between and is strongly affected by outliers
So think back to percentiles. The 50th percentile equals the point at which exactly half of the distribution lies on one side and the other half on the other side. Similarly, the 25th percentile (Q1, the first quartile) and the 75th percentile (Q3, the third quartile) cut off the bottom and top quarters of the distribution, and the interquartile range (IQR) is the distance between them: IQR = Q3 - Q1.
So for the above distribution (assume that it is a continuous variable)
25th percentile = Q1 = 2.5 -> the upper real limit for the interval 2
75th percentile = Q3 = 5.5 -> the upper real limit for the interval 5

So the interquartile range is Q3 - Q1 = 5.5 - 2.5 = 3.0.

Note that the interquartile range is often transformed into the semi-interquartile range, which is 0.5 of the interquartile range:

SIQR = (Q3 - Q1) / 2

So for our example the semi-interquartile range is (3.0)(0.5) = 1.5.

So the interquartile range focuses on the middle half of all of the scores in the distribution. Thus it is more representative of the distribution as a whole compared to the range, and extreme scores (i.e., outliers) will not influence the measure (it is sometimes referred to as being robust). However, this still means that half of the scores in the distribution are not represented in the measure.

The standard deviation is the most popular and most important measure of variability. It takes into account all of the individuals in the distribution. In essence, the standard deviation measures how far off all of the individuals in the distribution are from a standard, where that standard is the mean of the distribution.
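The quartile arithmetic above is simple enough to check directly; a minimal Python sketch, using the quartile values from the example:

```python
# Interquartile range and semi-interquartile range,
# using the quartiles from the worked example.
Q1 = 2.5   # 25th percentile (upper real limit of interval 2)
Q3 = 5.5   # 75th percentile (upper real limit of interval 5)

IQR = Q3 - Q1    # 3.0 -> spread of the middle half of the scores
SIQR = IQR / 2   # 1.5 -> half of the interquartile range
print(IQR, SIQR)   # prints: 3.0 1.5
```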
So to get a measure of the deviation we need to subtract the population mean from every individual in our distribution: deviation = X - μ.
Example: consider the following data set, the population of heights (in inches) for the class:

69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70, 60, 75, 73, 63, 63, 69, 65, 64, 69, 65

mean = μ = 67

Σ(X - μ) = (69 - 67) + (67 - 67) + ... + (65 - 67) = ?

Notice that if you add up all of the deviations they should/must equal 0. Think about it at a conceptual level. What you are doing is taking one side of the distribution and making it positive, and the other side negative, and adding them together. They cancel each other out.
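This cancellation is easy to demonstrate in Python with the class-height data:

```python
# Deviations from the mean always sum to zero.
heights = [69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70,
           60, 75, 73, 63, 63, 69, 65, 64, 69, 65]
mu = sum(heights) / len(heights)      # 67.0

deviations = [x - mu for x in heights]
print(sum(deviations))                # prints: 0.0
```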
So what we have to do is get rid of the negative signs. We do this by squaring the deviations (and, later, taking the square root to undo the squaring).

Sum of Squares = SS = Σ(X - μ)² = (69 - 67)² + (67 - 67)² + ... + (65 - 67)²

The equation that we just used (SS = Σ(X - μ)²) is referred to as the definitional formula for the Sum of Squares. However, there is another way to compute the SS, referred to as the computational formula. The two equations are mathematically equivalent; however, sometimes one is easier to use than the other. The advantage of the computational formula is that it works with the X values directly. The computational formula for SS is:

SS = ΣX² - (ΣX)²/N

So for our example:

SS = [(69)² + (67)² + ... + (69)² + (65)²] - (69 + 67 + ... + 69 + 65)²/21
   = 94631 - (1407)²/21 = 94631 - 94269 = 362

Now we have the sum of squares (SS), but we want the population variance, which is simply the average of the squared deviations. (We want the variance, not just the SS, because the SS depends on the number of individuals in the population, so we want the mean squared deviation.) So to get this mean, we need to divide by the number of individuals in the population:

population variance = σ² = SS/N = 362/21 ≈ 17.2
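Both formulas can be checked against each other in Python on the same height data:

```python
# Sum of squares two ways: definitional vs. computational formula.
heights = [69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70,
           60, 75, 73, 63, 63, 69, 65, 64, 69, 65]
N = len(heights)
mu = sum(heights) / N                                   # 67.0

ss_def = sum((x - mu) ** 2 for x in heights)            # definitional formula
ss_comp = sum(x ** 2 for x in heights) - sum(heights) ** 2 / N  # computational
print(ss_def, ss_comp)   # prints: 362.0 362.0
```

The two results agree, as they must; the computational version never needs the deviations, only ΣX and ΣX².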
To get back to the original units of measurement, take the square root of the variance; this gives the population standard deviation:

σ = √(σ²)
σ = √17.2 = 4.15

To review:
- compute each individual's deviation from the mean and square it
- sum the squared deviations to get the SS
- divide the SS by N to get the population variance
- take the square root of the variance to get the population standard deviation
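The whole review sequence can be sketched in Python, using the class-height data from the example:

```python
import math

# Population variance and standard deviation for the height data.
heights = [69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70,
           60, 75, 73, 63, 63, 69, 65, 64, 69, 65]
N = len(heights)                             # 21
mu = sum(heights) / N                        # 67.0

SS = sum((x - mu) ** 2 for x in heights)     # sum of squares: 362.0
variance = SS / N                            # sigma^2 = 362/21
sigma = math.sqrt(variance)                  # sigma
print(round(variance, 1), round(sigma, 2))   # prints: 17.2 4.15
```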
Sample variance and sample standard deviation:
- use the sample mean (X̄) instead of μ in the computation of SS
- need to adjust the computation to take into account that a sample will typically be less variable than the corresponding population
- if you have a good, representative sample, then your sample and population means should be very similar, and the overall shape of the two distributions should be similar; however, the variability of the sample is typically smaller than the variability of the population
- to account for this, the sample variance is divided by n - 1 rather than just n:

sample variance = s² = SS/(n - 1)

- and the same is true for the sample standard deviation:

s = √(SS/(n - 1))
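A small sketch of the n - 1 adjustment (the four scores here are made up for illustration):

```python
# Sample variance: divide SS by n - 1, not n.
sample = [3, 5, 7, 9]            # hypothetical sample
n = len(sample)
x_bar = sum(sample) / n          # 6.0, the sample mean

SS = sum((x - x_bar) ** 2 for x in sample)   # 9 + 1 + 1 + 9 = 20.0
biased = SS / n                  # 5.0  -> tends to underestimate sigma^2
s_squared = SS / (n - 1)         # 20/3 -> unbiased estimate of sigma^2
print(biased, round(s_squared, 2))   # prints: 5.0 6.67
```

Dividing by n - 1 makes the estimate larger, compensating for the sample's tendency to be less variable than the population.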
So what we're doing when we subtract 1 from n is using degrees of freedom to adjust our sample deviations to make an unbiased estimate of the population values. What are degrees of freedom? Think of it this way. You know what the sample mean is ahead of time (you've got to, in order to figure out the deviations). So you can vary all but one item in the distribution, but the last item is fixed: there is only one value for that item that makes the mean come out as it does. So n - 1 means all the values but one are free to vary. Example:
Suppose five scores have a mean of 5, so they must sum to 25:

5 + 4 + 6 + 2 + X = 25

There is only one value of X that'll make this work: X = 8.

Okay, so let's do an example of computing the standard deviation of a sample.
step 1: compute the SS. The sample is 1, 2, 3, 4, 4, 5, 6, 7, so n = 8 and the sample mean is X̄ = 32/8 = 4.
SS = Σ(X - X̄)² = (1 - 4)² + (2 - 4)² + (3 - 4)² + (4 - 4)² + (4 - 4)² + (5 - 4)² + (6 - 4)² + (7 - 4)²
   = 9 + 4 + 1 + 0 + 0 + 1 + 4 + 9 = 28

-- OR --

You can still use the computational formula to get SS:

SS = ΣX² - (ΣX)²/n
   = (1 + 4 + 9 + 16 + 16 + 25 + 36 + 49) - (1 + 2 + 3 + 4 + 4 + 5 + 6 + 7)²/8
   = 156 - (32)²/8 = 156 - 128 = 28.0

step 2: determine the variance of the sample (remember it is a sample, so we need to take this into account)

sample variance = s² = SS/(n - 1) = 28/7 = 4.0

step 3: determine the standard deviation of the sample
s = √4.0 = 2.0
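The three steps above can be verified in Python:

```python
import math

# Steps 1-3 for the sample standard deviation example.
sample = [1, 2, 3, 4, 4, 5, 6, 7]
n = len(sample)
x_bar = sum(sample) / n                      # 4.0

SS = sum((x - x_bar) ** 2 for x in sample)   # step 1: SS = 28.0
s_squared = SS / (n - 1)                     # step 2: 28/7 = 4.0
s = math.sqrt(s_squared)                     # step 3: s = 2.0
print(SS, s_squared, s)   # prints: 28.0 4.0 2.0
```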
Comparing Measures of Variability
- Sample size: the range tends to increase as n increases; the IQR and s do not.
- Stability: the range does not have stable values when you repeatedly sample from the same population, but the IQR and s are stable and tend not to fluctuate.
- Open-ended distributions: one cannot even compute the range or s, so the IQR (or SIQR) is the only option.

Under what circumstances is the computational formula preferred over the definitional formula?
The computational formula is preferred when the mean is not a whole number.
Under what circumstances is the computational formula easy to use?
The computational formula does not require the mean value; it computes the SS using the X values only. Hence, the computational formula is easy to use when only the X values are provided.
What is the difference between the definitional formula and the computational formula?
The definitional formula expresses the concept directly: for example, the definitional formula for variance states that it is the mean squared difference between a score and the mean of all of the scores. This contrasts with the computational formula, which is the mathematically equivalent equation used to calculate values for the concept more conveniently.
What is the computational formula for SS?
SS = ΣX² - ((ΣX)²/N). The mean of the sum of squares (SS) is the variance of a set of scores, and the square root of the variance is its standard deviation.
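The sample-size and stability points in the comparison above can be illustrated with a quick simulation (the normal population here is made up for illustration):

```python
import random

# Draw samples of increasing size from one hypothetical population:
# the range keeps growing with n, while s stays near the population
# standard deviation (10 in this setup).
random.seed(1)
population = [random.gauss(50, 10) for _ in range(100_000)]

def sample_range(xs):
    return max(xs) - min(xs)

def sample_sd(xs):
    m = sum(xs) / len(xs)
    return (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5

results = {n: random.sample(population, n) for n in (10, 100, 1000)}
for n, xs in results.items():
    print(n, round(sample_range(xs), 1), round(sample_sd(xs), 1))
```

The printed ranges climb steadily with n, while the printed standard deviations hover around 10.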