In which circumstance is the computational formula preferred over the definitional formula when computing SS, the sum of the squared deviations, for a population?

  • Gravetter, F. J., Wallnau, L. B. (1996). Statistics for the Behavioral Sciences:
    A First Course for Students of Psychology and Education, 4th Edition. New York: West Publishing.

    Chapter 4: Variability

    So far we've discussed two of the three characteristics used to describe distributions; now we need to discuss the remaining one - variability. Notice in our distributions that not every score is the same, e.g., not everybody gets the same score on the exam. So what we need to do is describe the varied results, roughly to describe the width of the distribution.

      Variability provides a quantitative measure of the degree to which scores in a distribution are spread out or clustered together.

      In other words, variability refers to the degree of "differentness" of the scores in the distribution. High variability means that the scores differ by a lot, while low variability means that the scores are all similar ("homogeneity").

    We'll concentrate on three measures of variability, the range, the interquartile range, and the standard deviation.

    The simplest measure of variability is the range, which we've already mentioned in our earlier discussions.

      - The range is the difference between the upper real limit of the largest (maximum) X value and the lower real limit of the smallest (minimum) X value.

      So look at your frequency distribution table, find the highest and lowest scores, and subtract the lowest from the highest (note: if the variable is continuous, you must consider the real limits).

      X	f	cf	c%
        10	2	25	100
        9	8	23	92
        8	4	15	60
        7	6	11	44
        6	4	5	20
        5	1	1	4 
      
      if X is discrete then:
        the range = 10 - 5 = 5

      if X is continuous then:

        the range = 10.5 - 4.5 = 6

      - there are some drawbacks to using the range as the description of the variability of a distribution

        - the statistic is based solely on the two most extreme values in the distribution, thus it doesn't capture all of the members of the distribution.
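The range computation above can be sketched in a few lines of Python (not from the text; the expanded score list is rebuilt from the frequency table):

```python
# Expand the frequency table into a list of raw scores.
scores = [10] * 2 + [9] * 8 + [8] * 4 + [7] * 6 + [6] * 4 + [5] * 1

# Discrete variable: highest score minus lowest score.
discrete_range = max(scores) - min(scores)                    # 10 - 5 = 5

# Continuous variable: use the real limits (each score spans +/- 0.5).
continuous_range = (max(scores) + 0.5) - (min(scores) - 0.5)  # 10.5 - 4.5 = 6
```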
    An alternative measure of variability is the interquartile range.

    So think back to percentiles. The 50th percentile is the point at which exactly half the distribution lies on one side and the other half on the other side.

      - Considering the same logic, what does the 25%tile represent? - The 75%?
        - So using the 25th, 50th, & 75%tiles we can break the distribution into 4 quarters, or quartiles
        X	f	%	c%
         7	4	12.5	100
         6	4	12.5	87.5
         5	4	12.5	75
         4	8	25	62.5
         3	4	12.5	37.5
         2	4	12.5	25
         1	4	12.5	12.5 
        
    The interquartile range is the distance between the first quartile and the third quartile. So this corresponds to the middle 50% of the scores of our distribution.

    So for the above distribution (assume that it is a continuous variable)

      median = Q2 = 4.0 -> using interpolation (notice exactly halfway between 62.5 & 37.5)
      25%tile = Q1 = 2.5 -> the upper real limit for the interval 2
      75%tile = Q3 = 5.5 -> the upper real limit for the interval 5
    So the interquartile range (IQR) = 5.5 - 2.5 = 3.0

    Note that the interquartile range is often transformed into the semi-interquartile range which is 0.5 of the interquartile range.

    		SIQR = (Q3 - Q1) / 2
    
    So for our example the semi-interquartile range is (3.0)(0.5) = 1.5
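The interpolation method used to find the quartiles can be sketched in Python; `percentile_point` is a hypothetical helper name and the (score, frequency) table layout is an assumption for illustration:

```python
def percentile_point(freq_table, pct):
    """Locate a percentile by linear interpolation within the real limits
    of the one-unit interval that contains it (grouped-data method)."""
    n = sum(f for _, f in freq_table)
    target = pct / 100 * n            # how many scores fall below the point
    cum = 0
    for score, f in freq_table:
        if cum + f >= target:
            lower_real_limit = score - 0.5
            return lower_real_limit + (target - cum) / f  # interval width = 1
        cum += f

# (score, frequency) pairs from the table above, lowest score first.
table = [(1, 4), (2, 4), (3, 4), (4, 8), (5, 4), (6, 4), (7, 4)]
q1 = percentile_point(table, 25)   # 2.5
q3 = percentile_point(table, 75)   # 5.5
iqr = q3 - q1                      # 3.0
siqr = iqr / 2                     # 1.5
```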

    So the interquartile range focuses on the middle half of all of the scores in the distribution. Thus it is more representative of the distribution as a whole compared to the range, and extreme scores (i.e., outliers) will not influence the measure (it is sometimes referred to as being robust). However, this still means that 1/2 of the scores in the distribution are not represented in the measure.

    The standard deviation is the most popular and most important measure of variability. It takes into account all of the individuals in the distribution.

    In essence, the standard deviation measures how far off all of the individuals in the distribution are from a standard, where that standard is the mean of the distribution.

      We will begin by discussing the standard deviation parameter, that is, the standard deviation of the population. Then we will discuss the standard deviation statistic (for the sample). They are closely related descriptive statistics, but they have some important differences.

      So to get a measure of the deviation we need to subtract the population mean from every individual in our distribution.

        X - μ = deviation score
        - if the score is a value above the mean, the deviation score will be positive
        - if the score is a value below the mean, the deviation score will be negative

    Example: consider the following data set: the population of heights (in inches) for the class

    69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70, 60, 75, 73, 63, 63, 69, 65, 64, 69, 65

    mean = μ = 67

    Σ(X - μ) = (69 - 67) + (67 - 67) + .... + (65 - 67)
    = 2 + 0 + 5 + 7 + -4 + 0 + -3 + -6 + 2 + -2 + 3 + -7 + 8 + 6 + -4 + -4 + 2 + -2 + -3 + 2 + -2
    = 0

    Notice that if you add up all of the deviations they should/must equal 0. Think about it at a conceptual level. What you are doing is taking one side of the distribution and making it positive, and the other side negative and adding them together. They should cancel each other out.
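A short sketch confirming that the deviations for the height data cancel out:

```python
heights = [69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70,
           60, 75, 73, 63, 63, 69, 65, 64, 69, 65]
mu = sum(heights) / len(heights)          # 67.0
deviations = [x - mu for x in heights]
total = sum(deviations)                   # 0.0 -- positives and negatives cancel
```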

    So what we have to do is get rid of the negative signs. We do this by squaring each of the deviations; summing the squared deviations then gives us the Sum of Squares (SS).

    Sum of Squares = SS = Σ(X - μ)² = (69 - 67)² + (67 - 67)² + .... + (65 - 67)²
    SS = 4 + 0 + 25 + 49 + 16 + 0 + 9 + 36 + 4 + 4 + 9 + 49 + 64 + 36 + 16 + 16 + 4 + 4 + 9 + 4 + 4
    SS = 362

    The equation that we just used (SS = Σ(X - μ)²) is referred to as the definitional formula for the Sum of Squares. However, there is another way to compute the SS, referred to as the computational formula. The two equations are mathematically equivalent; however, sometimes one is easier to use than the other. The advantage of the computational formula is that it works with the X values directly.

    The computational formula for SS is:

    	SS = ΣX² - (ΣX)²/N

    So for our example:

    	SS = [(69)² + (67)² + ..... + (69)² + (65)²] - (69 + 67 + ... + 69 + 65)²/21

    	   = 94631 - (1407)²/21 = 94631 - 94269 = 362
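Assuming the same height data, a brief sketch showing that the definitional and computational formulas agree:

```python
heights = [69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70,
           60, 75, 73, 63, 63, 69, 65, 64, 69, 65]
mu = sum(heights) / len(heights)  # 67.0

# Definitional formula: sum the squared deviations from the mean.
ss_def = sum((x - mu) ** 2 for x in heights)

# Computational formula: works with the X values directly.
ss_comp = sum(x ** 2 for x in heights) - sum(heights) ** 2 / len(heights)

# Both give 362.0.
```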

    Now we have the sum of squares (SS), but what we actually want is the Population Variance, which is simply the average of the squared deviations. (We want the variance rather than just the SS because the SS depends on the number of individuals in the population, so we take the mean of the squared deviations.) To get that mean, we divide the SS by the number of individuals in the population.

      Population variance = σ² = SS/N

    However, the population variance isn't exactly what we want; we want the standard deviation from the mean of the population. To get this we need to take the square root of the population variance.

      standard deviation = σ = sqroot(variance) = sqroot(SS/N)

    So for our example:
      σ² = 362 / 21 = 17.24
      σ = sqroot(17.24) = 4.15
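The full chain from SS to the population standard deviation, sketched in Python for the height data:

```python
import math

heights = [69, 67, 72, 74, 63, 67, 64, 61, 69, 65, 70,
           60, 75, 73, 63, 63, 69, 65, 64, 69, 65]
N = len(heights)                              # 21
mu = sum(heights) / N                         # 67.0
ss = sum((x - mu) ** 2 for x in heights)      # 362.0
variance = ss / N                             # about 17.24
sigma = math.sqrt(variance)                   # about 4.15
```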

    To review:

      step 1: compute the SS
        - either by using definitional formula or the computational formula
      step 2: determine the variance
        - take the average of the squared deviations
        - divide the SS by the N
      step 3: determine the standard deviation
        - take the square root of the variance
    Now let's move onto the Standard Deviation of a Sample
      - the computations are pretty much the same here
        - different notation:
          s = sample standard deviation
          use the sample mean, X̄, instead of μ in the computation of SS

        - need to adjust the computation to take into account that a sample will typically be less variable than the corresponding population.

        - if you have a good, representative sample, then your sample and population means should be very similar, and the overall shape of the two distributions should be similar. However, notice that the variability of the sample is smaller than the variability of the population.

        - to account for this the sample variance is divided by n - 1 rather than just n

        	sample variance = s² = SS/(n - 1)
        

        - and the same is true for sample standard deviation

          sample standard deviation = s = sqroot(SS/(n - 1))
        What we're really doing here is trying to use a sample to make estimates about the nature of the population. But since we don't know things like what is the mean of the population, we really can't measure our deviances from the population standard. So what we use is our best estimate of what the population mean is, and that is the sample mean.

        So what we're doing when we subtract 1 from n is using degrees of freedom to adjust our sample deviations to make an unbiased estimation of the population values.

      What are degrees of freedom? Think of it this way. You know what the sample mean is ahead of time (you need it to figure out the deviations). So you can vary all but one item in the distribution. But the last item is fixed. There will be only one value for that item that makes the mean equal what it does. So n - 1 means all the values but one can vary.

    Example:

      suppose that you know that the mean of your sample = 5
        if your first 4 items are:
          5, 4, 6, 2 then what must the final number be?
          5 + 4 + 6 + 2 + X = 25
          there will be only one value of X that'll make this work. X = 8
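The degrees-of-freedom example can be sketched directly: once the mean and the first n - 1 scores are fixed, the last score is forced.

```python
mean, n = 5, 5
free_scores = [5, 4, 6, 2]             # these four can be anything
last = mean * n - sum(free_scores)     # the fifth must bring the total to 25
```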

    Okay, so let's do an example of computing the standard deviation of a sample

      data: 1, 2, 3, 4, 4, 5, 6, 7

      step 1: compute the SS

        SS = Σ(X - X̄)²
        = (1 - 4)² + (2 - 4)² + (3 - 4)² + (4 - 4)² + (4 - 4)² + (5 - 4)² + (6 - 4)² + (7 - 4)²
        = 9 + 4 + 1 + 0 + 0 + 1 + 4 + 9 = 28

        -- OR --

        You can still use the computational formula to get SS

        	SS = ΣX² - (ΣX)²/N

        	   = (1 + 4 + 9 + 16 + 16 + 25 + 36 + 49) - (1 + 2 + 3 + 4 + 4 + 5 + 6 + 7)²/8

        	   = 156 - (32)²/8 = 156 - 128 = 28.0
        
      step 2: determine the variance of the sample (remember it is a sample, so we need to take this into account)
        	sample variance = s² = SS/(n - 1)
        

        = 28/(8-1) = 28/7 = 4.0

      step 3: determine the standard deviation of the sample
            standard deviation = sqroot(SS/(n - 1))
                = sqroot(28/(8 - 1))

                = sqroot(4.0) = 2.0
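The three steps above, sketched in Python. Note that `statistics.stdev` in the standard library also uses the n - 1 denominator, so it matches the hand computation:

```python
import math
import statistics

data = [1, 2, 3, 4, 4, 5, 6, 7]
n = len(data)
xbar = sum(data) / n                       # 4.0

# step 1: the sum of squared deviations from the sample mean
ss = sum((x - xbar) ** 2 for x in data)    # 28.0

# step 2: sample variance divides by n - 1, not n
s2 = ss / (n - 1)                          # 4.0

# step 3: the sample standard deviation is the square root
s = math.sqrt(s2)                          # 2.0

# statistics.stdev divides by n - 1 as well, so it agrees.
builtin = statistics.stdev(data)           # 2.0
```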

    Properties of the standard deviation (Transformations)
      1) Adding a constant to each score in the distribution will not change the standard deviation.

    So if you add 2 to every score in the distribution, the mean changes (by 2), but the variance stays the same (notice that none of the deviations would change because you add 2 to each score and the mean changes by 2).
      2) Multiplying each score by a constant causes the standard deviation to be multiplied by the same constant.
    This one is easier to think of with numbers. Suppose that your mean is 20, and that two of the individuals in your distribution are 21 and 23. If you multiply 21 and 23 by 2 you get 42 and 46, and your mean also changes by a factor of 2 and is now 40. Before your deviations were (21 - 20 = 1) & (23 - 20 = 3). But now, your deviations are (42 - 40 = 2) & (46 - 40 = 6). So your deviations are getting twice as big as well.
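Both transformation properties are easy to verify numerically; a minimal sketch using `statistics.pstdev` (the population standard deviation):

```python
import statistics

scores = [1, 2, 3, 4, 4, 5, 6, 7]
sd = statistics.pstdev(scores)

sd_shifted = statistics.pstdev([x + 2 for x in scores])  # unchanged by adding a constant
sd_scaled = statistics.pstdev([x * 2 for x in scores])   # doubled by multiplying by 2
```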

    Comparing Measures of Variability

      - Extreme scores: range is most affected, IQR is least affected
      - Sample size: the range tends to increase as n increases; the IQR & s do not
      - The range does not have stable values when you repeatedly sample from the same population, but the IQR & s are stable and tend not to fluctuate.
      - With open-ended distributions, one cannot even compute the range or s, so the IQR (or SIQR) is the only option
  • Under what circumstances is the computational formula preferred over the definitional?

    The computational formula is preferred when the mean is not a whole number.

    Under what circumstances is the computational formula easy to use?

    The computational formula does not require the mean value; it computes the SS by using the X values only. Hence, the computational formula is easy to use when only the X values are provided.

    What is the difference between definitional formula and computational formula?

    For example, the definitional formula of variance states that it is the mean squared difference between a score and the mean of all of the scores. This contrasts with the computational formula, which is the equation used to calculate values for the concept.

    What is the computational formula for SS?

    The mean of the sum of squares (SS) is the variance of a set of scores, and the square root of the variance is its standard deviation. The computational formula, SS = ΣX² - ((ΣX)² / N), calculates the sum of squares for a single set of scores.