Which of the following is true of the generalizability of several personnel selection methods?

Generalizability Theory

R.L. Brennan, in International Encyclopedia of Education (Third Edition), 2010

Generalizability theory offers an extensive conceptual framework and a powerful set of statistical procedures for addressing numerous measurement issues. In a sense, generalizability theory liberalizes classical theory by employing analysis of variance methods that allow an investigator to disentangle the multiple sources of error that contribute to the undifferentiated error in classical theory. In this article, principal consideration is given to the conceptual and computational details of univariate generalizability theory. In addition, an introduction to multivariate generalizability theory is provided. Throughout the article, a real-data example illustrates results. Finally, other issues are briefly considered, including computer programs for generalizability theory.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947002463

Generalizability Theory

Richard J. Shavelson, Noreen M. Webb, in Encyclopedia of Social Measurement, 2005

Random and Fixed Facets

Generalizability theory is essentially a random effects theory. Typically a random facet is created by randomly sampling conditions of a measurement procedure (e.g., tasks from a job in observations of job performance). When the conditions of a facet have not been sampled randomly from the universe of admissible observations but the intended universe of generalization is infinitely large, the concept of exchangeability may be invoked to consider the facet as random.

A fixed facet (cf. fixed factor in analysis of variance) arises when (1) the decision maker purposely selects certain conditions and is not interested in generalizing beyond them, (2) it is unreasonable to generalize beyond conditions, or (3) the entire universe of conditions is small and all conditions are included in the measurement design. G theory typically treats fixed facets by averaging over the conditions of the fixed facet and examining the generalizability of the average over the random facets. When it does not make conceptual sense to average over the conditions of a fixed facet, a separate G study may be conducted within each condition of the fixed facet or a full multivariate analysis may be performed.

G theory recognizes that the universe of admissible observations encompassed by a G study may be broader than the universe to which a decision maker wishes to generalize in a D study, the universe of generalization. The decision maker may reduce the levels of a facet (creating a fixed facet), select (and thereby control) one level of a facet, or ignore a facet. A facet is fixed in a D study when n′ = N′, where n′ is the number of conditions for a facet in the D study and N′ is the total number of conditions for a facet in the universe of generalization. From a random-effects G study with design p × i × o in which the universe of admissible observations is defined by facets i and o of infinite size, fixing facet i in the D study and averaging over the ni conditions of facet i in the G study (ni = n′i) yields the following estimated universe-score variance:

(13) στ² = σp² + σpI² = σp² + σpi²/ni′

where στ² denotes estimated universe-score variance in generic terms. στ² in Eq. (13) is an unbiased estimator of universe-score variance for the mixed model only when the same levels of facet i are used in the G and D studies. Estimates of relative and absolute error variance, respectively, are:

(14) σδ² = σpO² + σpIO² = σpo²/no′ + σpio,e²/(ni′no′)

(15) σΔ² = σO² + σpO² + σIO² + σpIO² = σo²/no′ + σpo²/no′ + σio²/(ni′no′) + σpio,e²/(ni′no′)
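To make Eqs. (13)-(15) concrete, the following minimal sketch (an illustration, not code from the chapter) plugs hypothetical G-study variance-component estimates for a p × i × o design into the universe-score, relative-error, and absolute-error expressions, and then forms the usual generalizability (G) and dependability (Phi) coefficients from them. All names and numbers are placeholders.

```python
# Sketch of Eqs. (13)-(15): D-study variances for a p x i x o design with facet i
# fixed (averaged over) and facet o random. Variance components are hypothetical.

g_study_components = {
    "p": 0.50,        # persons
    "pi": 0.12,       # person x item interaction
    "po": 0.08,       # person x occasion interaction
    "o": 0.02,        # occasion main effect
    "io": 0.01,       # item x occasion interaction
    "pio,e": 0.20,    # p x i x o interaction confounded with residual error
}

def d_study(vc, n_i, n_o):
    """Return universe-score, relative-error, and absolute-error variances."""
    tau = vc["p"] + vc["pi"] / n_i                                   # Eq. (13)
    delta = vc["po"] / n_o + vc["pio,e"] / (n_i * n_o)               # Eq. (14)
    Delta = (vc["o"] / n_o + vc["po"] / n_o
             + vc["io"] / (n_i * n_o) + vc["pio,e"] / (n_i * n_o))   # Eq. (15)
    return tau, delta, Delta

tau, delta, Delta = d_study(g_study_components, n_i=5, n_o=2)
print(f"universe-score variance: {tau:.3f}")
print(f"relative error variance: {delta:.3f}   G coefficient: {tau / (tau + delta):.3f}")
print(f"absolute error variance: {Delta:.3f}   Phi coefficient: {tau / (tau + Delta):.3f}")
```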

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985001936

Split-Half Reliability

Robert L. Johnson, James Penny, in Encyclopedia of Social Measurement, 2005

G-theory and Intraclass Correlations

The G- and D-coefficients of generalizability theory and Cronbach's alpha are special cases of intraclass correlations. Hoyt applied analysis of variance techniques to the estimation of reliability as the ratio of

(9) r = (MSpersons − MSpi) / MSpersons

where MSpersons is the mean square for persons from an analysis of variance and MSpi is the mean square for the person-by-item interaction. Hoyt's derivation, according to Cronbach, results in a formula equivalent to alpha. As such, intraclass correlations are linked to split-half reliability.
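As an illustration of Eq. (9), the sketch below (not from the original chapter) computes Hoyt's ratio from a small, invented persons × items score matrix using the two-way ANOVA mean squares; for such data the result coincides with coefficient alpha.

```python
import numpy as np

# Hoyt's reliability, Eq. (9): r = (MS_persons - MS_pi) / MS_persons,
# computed from a persons x items score matrix. The scores are a toy example.
scores = np.array([
    [4, 5, 4, 3],
    [2, 3, 2, 2],
    [5, 5, 4, 5],
    [3, 4, 3, 3],
    [1, 2, 2, 1],
], dtype=float)

n_persons, n_items = scores.shape
grand_mean = scores.mean()

ss_persons = n_items * ((scores.mean(axis=1) - grand_mean) ** 2).sum()
ss_items = n_persons * ((scores.mean(axis=0) - grand_mean) ** 2).sum()
ss_total = ((scores - grand_mean) ** 2).sum()
ss_pi = ss_total - ss_persons - ss_items      # person-by-item interaction (residual)

ms_persons = ss_persons / (n_persons - 1)
ms_pi = ss_pi / ((n_persons - 1) * (n_items - 1))

hoyt_r = (ms_persons - ms_pi) / ms_persons
print(f"Hoyt reliability (equivalent to coefficient alpha): {hoyt_r:.3f}")
```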

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985000967

The International System for Teacher Observation and Feedback: A Theoretical Framework for Developing International Instruments

L. Kyriakides, ... D. Muijs, in International Encyclopedia of Education (Third Edition), 2010

Technique 4: Using Generalizability Theory to Analyze Data

Since the findings of the ISTOF project are based on the responses of the panel of experts in each country, it is important to evaluate the dependability (reliability) of the behavior of each member and/or interest group in the panel of each country. At the same time, the issue of reaching consensus among the panel of experts is very important for generating the common part of the ISTOF. Therefore, the analysis of quantitative data must identify the extent to which such consensus was established in each step of the process. Generalizability theory provides answers to these two issues (dependability and degree of consensus) and it is for this reason that it was used for developing ISTOF.

The conceptual framework underlying generalizability theory involves an investigator asking about the precision or reliability of a measure because she/he wishes to generalize from the observation in hand to some class of observations to which it belongs (Shavelson et al., 1989). In the case of the ISTOF project, it was considered important to determine the extent to which the responses of each member of the panel of experts to the questionnaire items depend on his/her professional status (e.g., researcher, teacher, teacher-educator, policymaker, or evaluator) and/or on his/her origin from a particular country. Specifically, this analysis helped us to identify the extent to which experts from different countries agreed among themselves about the appropriateness of: (1) the components of effective teaching, (2) their indicators, and (3) the items designed to measure the set of indicators of teacher effectiveness. Therefore, the use of generalizability theory gave answers to questions concerning the extent to which the conceptual map of the ISTOF is in line with the opinions of experts in different countries who belong to different groups of stakeholders.

Taking into consideration the rationale given above, the following procedure was used to develop ISTOF: first generating components of effective teaching, then producing indicators for each component, and finally creating items associated with each indicator.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947016365

An Overview of Statistics in Education

S. Sinharay, in International Encyclopedia of Education (Third Edition), 2010

Analysis of Variance, Analysis of Covariance, and Multivariate Analysis of Variance

Analysis of variance (ANOVA) is the statistical procedure of comparing the means of a variable across several groups of individuals. For example, ANOVA may be used to compare the average SAT critical reading scores of several schools. The name of the technique arises from the fact that the first step in an ANOVA is to partition the variance present in the observations into several components. The ANOVA method was the second most frequently used data-analysis procedure in a survey of articles published between 1971 and 1998 in three reputed educational-research journals (Hsu, 2005). Generalizability theory (Cronbach et al., 1963), which is a competitor to the classical theory of reliability of tests, usually applies ANOVA procedures to test scores.

Analysis of covariance (ANCOVA) is used when, like in ANOVA, the interest is in comparing several means, but the investigator also has the values of an additional variable that influences the variable of interest. For example, ANCOVA may be used to compare the average SAT critical reading scores of several schools where the preliminary scholastic aptitude test/national merit scholarship qualifying test (PSAT/NMSQT) critical reading score of each examinee is available in addition to the SAT critical reading score. (The PSAT/NMSQT is supposed to provide firsthand practice for the SAT.)

Multivariate analysis of variance (MANOVA) is used to compare means of several variables simultaneously across several groups of individuals. For example, one could apply MANOVA to simultaneously compare the average scores on several subjects across several schools. Longford (1990) provides such an example.
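For instance, a one-way ANOVA comparing mean scores across three schools might look like the sketch below; the score samples are invented, and the use of scipy.stats.f_oneway is simply one convenient way to carry out the comparison, not part of the article.

```python
from scipy import stats

# One-way ANOVA: do the three schools differ in mean score?
# The samples are illustrative only.
school_a = [510, 540, 495, 560, 530]
school_b = [480, 500, 470, 515, 490]
school_c = [545, 570, 555, 530, 560]

f_stat, p_value = stats.f_oneway(school_a, school_b, school_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```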

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B978008044894701719X

Test–Retest Reliability

Chong Ho Yu, in Encyclopedia of Social Measurement, 2005

Subjective Scoring

If the test format is subjective, for example consisting of essay-type questions, subjective scoring can be a source of measurement error. To be specific, errors from subjective scoring can come from two different sources: one type of error results from different raters (interrater), and the other is caused by the same rater grading on different occasions (intrarater). This issue can be quite complicated because it involves multiple sources of error, including an interaction effect between the time factor (test–retest) and the rater factor. Generalizability theory, which is discussed later, has been proposed as an effective strategy for addressing multiple sources of error. In addition, the intraclass correlation coefficient (ICC) can be computed to estimate intrarater and interrater reliability. This estimate is based on mean squares obtained from ANOVA models. To be specific, in an ANOVA model the rater effect is indicated by the mean square (MS) of the between-subject factor, while the multiple measures taken on different occasions are reflected in the mean square of the between-measure factor. The reliability estimate is calculated by

r = (MSbetween-measure − MSresidual) / (MSbetween-measure + dfbetween-subject × MSresidual)

In the context of ANOVA, there are separate coefficients for three models:

1. One-way random effects model: Each subject is rated by a different set of raters, drawn at random from the population of possible raters; the subjects themselves are a random sample.

2. Two-way random effects model: The same set of raters, conceived as a random sample from a larger population of raters, rates all subjects in the random sample of subjects, so both raters and subjects are treated as random effects.

3. Two-way mixed model: All raters rate all subjects, which are a random sample. This is a mixed model because the raters are a fixed effect and the subjects are a random effect.
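As a concrete instance of the third model, the sketch below computes a single-rater consistency ICC from the subject and residual mean squares of a two-way layout. The rating matrix is invented, and the particular formula used, ICC(3,1) in the Shrout and Fleiss notation, is an assumption about which variant of the general mean-square approach is intended.

```python
import numpy as np

# Two-way mixed-model consistency ICC, ICC(3,1): the same raters rate every
# subject; raters are fixed, subjects are random. Rows = subjects, columns = raters.
ratings = np.array([
    [7, 5, 6, 6],
    [3, 2, 4, 3],
    [8, 7, 7, 9],
    [5, 4, 5, 4],
    [9, 8, 9, 8],
    [4, 3, 3, 4],
], dtype=float)

n, k = ratings.shape
grand_mean = ratings.mean()
ss_subjects = k * ((ratings.mean(axis=1) - grand_mean) ** 2).sum()
ss_raters = n * ((ratings.mean(axis=0) - grand_mean) ** 2).sum()
ss_residual = ((ratings - grand_mean) ** 2).sum() - ss_subjects - ss_raters

ms_subjects = ss_subjects / (n - 1)
ms_residual = ss_residual / ((n - 1) * (k - 1))

icc_3_1 = (ms_subjects - ms_residual) / (ms_subjects + (k - 1) * ms_residual)
print(f"ICC(3,1) = {icc_3_1:.3f}")
```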

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985000943

Diagnosis and Classification

M. Sapp, in International Encyclopedia of Education (Third Edition), 2010

Theories of Testing

The purpose of this article is to discuss the positive and negative roles of diagnosis and classification in special education. Before students with disabilities can be diagnosed and assessed, reliable and valid assessment measures must be employed. To illustrate, African American and Latino students are not adequately represented in most standardized assessments, and this is why great care must be used when attempting to diagnose and assess all students. In contrast to the general perception within education, reliability and validity are based on educational test-score theories, not facts. Classical true-score theory, generalizability theory, and item response theory are the three dominant theories within educational assessment (Allen and Yen, 1979). Classical true-score theory is the dominant theory within education, and it states that a student's observed score (X) is the sum of the student's true score (T) and some error (E). The E is often misunderstood by educators; it is not a mistake but a theoretical construct that takes into account the inability to measure concepts perfectly. Generalizability theory extends classical true-score measurement by showing that it does not have to be restricted to the two-component linear model of a true score and an error score. Essentially, generalizability theory is concerned with the reliability of generalizing from a student's observed score on a test to his/her average score across all acceptable conditions of measurement. Implicit in this assumption is that the student's measured attributes are in a steady state, and that changes in the student's scores are not the result of maturation, learning, or development; rather, such changes reflect multiple sources of error such as occasions, different test forms, different test administrators, and so on (Sapp, 2006). Item response theory is a form of item analysis, and unlike the group assessment model of true-score theory, it is an individualistic model. For example, true-score theory is based on the proportion of students who pass a test item (item difficulty) and the extent to which test items discriminate among students (discrimination index), and these statistics depend on the norms of the standardization sample. In contrast, item response theory does not rely on group norms; instead, the probability that a student answers an item correctly is modeled as a function of the student's ability and the item's difficulty. Rather than comparison with group norms, the greater a student's ability, the greater the probability that the student will get an item correct.
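As a rough illustration of the contrast drawn above (not from the article), the sketch below places the classical decomposition X = T + E next to a one-parameter, Rasch-type item response function, in which the probability of a correct answer grows with the difference between ability and item difficulty. Every number is invented.

```python
import math

# Classical true-score view: an observed score is a true score plus random error.
true_score = 25.0
error = -1.3                      # illustrative random error for one occasion
observed = true_score + error     # X = T + E

# Item-response view (one-parameter logistic): P(correct) depends on
# the difference between the student's ability and the item's difficulty.
def p_correct(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

print(f"observed score X = {observed}")
for ability in (-1.0, 0.0, 1.0, 2.0):
    print(f"ability {ability:+.1f}: P(correct | difficulty 0.5) = "
          f"{p_correct(ability, 0.5):.2f}")
```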

Test items are said to be valid when they measure what they purport to measure, and validity takes several forms, including face, content, criterion, and construct validity. Face validity is a nonstatistical aspect of validity: on their face, the items appear to measure what they purport to measure. Content validity is another nonstatistical aspect of validity; it is the degree to which items assess relevant facets of a conceptual domain. For example, if an educator were interested in assessing word knowledge, he/she would want test items that sample the domain of word knowledge. Criterion validity, also called empirical validity, is a statistical form of validity with at least three variants: predictive, concurrent, and retrospective. Simply stated, criterion validity assesses the degree to which test scores correlate with a criterion measure. Construct validity helps determine whether a given educational measurement instrument actually measures the underlying conceptual construct that it was designed to measure. Construct validity must be updated constantly; with minority students, construct validity often has not been assessed.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947011362

Classical Test Theory Reliability

M.D. Miller, in International Encyclopedia of Education (Third Edition), 2010

Classical Test Theory

Classical test theory is defined such that any observed test score, X, is the sum of a true score, T, and a random error, E. That is,

X=T+E

This model assumes that the expected value of the random error is zero and that the random errors are uncorrelated with the true score, with errors on alternate forms of the test, and with other measures such as another test or grades in school. Validity focuses on the definition (i.e., use and interpretation) of the true score, while reliability focuses on the random error. The error is the sum of all random effects, while the true score is the sum of all consistent effects. Thus, unlike in generalizability theory, the error is undifferentiated with respect to different sources of randomness. Broadly, two indices of reliability are commonly reported: the reliability coefficient and the standard error of measurement (SEM).

The reliability coefficient (ρXX′) is defined as the ratio of the true score variance (σT2) to the observed score variance (σX2), or the ratio of the true score variance to the sum of the true score variance and the error variance (σE2). That is,

ρXX′=σT2/σX2=σT2/(σT2+σE2)

Hence, the value of the reliability coefficient is the proportion of variation in test scores that can be attributed to consistent measurement (i.e., the true score). The reliability coefficient ranges from 0.0 to 1.0 with higher values being preferred. At ρXX′ = 0.0, there is no consistency in the measurement procedure and the observed score is equal to the random error (X = E). At ρXX′ = 1.0, the observed score has no error and the observed score is equal to the true score (X = T). In practice, the reliability coefficient will be somewhere between the two extremes.

The SEM is defined as the standard deviation of the errors of measurement, σE. The SEM ranges from 0.0 to the standard deviation of the observed scores, σX. When the SEM = σX, there is no consistency in the measurement procedures, the reliability coefficient is equal to 0.0, and the observed score is equal to the random error. When the SEM = 0.0, there is perfect consistency in the test scores, the observed score is equal to the true score, and the reliability coefficient is equal to 1.0. In practice, the SEM will fall somewhere between the two extremes.

The reliability coefficient is an easily interpreted index of the consistency of the test scores, since it lies in the same standard range for all tests. The SEM is initially more difficult to interpret, but it is in the metric of the test scores, which allows individual test scores to be interpreted via confidence intervals. Another advantage of the SEM is that it does not depend on the true-score variance; consequently, its estimate is largely unaffected by the sampling of examinees. The reliability coefficient will be underestimated when the sample range of scores is restricted, whereas the SEM will be largely uninfluenced by such sampling fluctuations.
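The relationships described above can be illustrated with a short sketch (hypothetical variances, not data from the chapter): the reliability coefficient is computed from the true-score and error variances, the SEM from the error variance, and the SEM is then used to form an approximate confidence interval around an individual observed score. The identity SEM = σX √(1 − ρXX′), used as a check, follows directly from σX² = σT² + σE².

```python
import math

# Reliability coefficient and SEM from hypothetical variance components.
var_true = 80.0                        # sigma_T^2
var_error = 20.0                       # sigma_E^2
var_observed = var_true + var_error    # sigma_X^2 = sigma_T^2 + sigma_E^2

reliability = var_true / var_observed  # rho_XX' = sigma_T^2 / sigma_X^2
sem = math.sqrt(var_error)             # SEM = sigma_E
sem_check = math.sqrt(var_observed) * math.sqrt(1.0 - reliability)

observed_score = 105.0
ci_95 = (observed_score - 1.96 * sem, observed_score + 1.96 * sem)

print(f"reliability = {reliability:.2f}, SEM = {sem:.2f} (check: {sem_check:.2f})")
print(f"approximate 95% confidence interval around {observed_score}: "
      f"({ci_95[0]:.1f}, {ci_95[1]:.1f})")
```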

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B9780080448947002359

Psychometrics of Intelligence

Peter H. Schonemann, in Encyclopedia of Social Measurement, 2005

1950s–1960s: Apogee—Louis Guttman and Lee Cronbach

Psychometrics reached its apogee in the 1950s under the leadership of Guttman and Cronbach. Both had strong commitments to both branches of psychometrics and were heavily engaged in empirical research.

Louis Guttman

Guttman was without question technically the most accomplished and creative psychometrician in the history of psychometrics. Early in his career he addressed fundamental problems in (unidimensional) scaling. He is most widely known for the joint scale that bears his name. In test theory, he subjected the seemingly innocuous notion of reliability inherited from classical test theory (CTT) to searing logical criticism, anticipating later developments by Cronbach and others.

The Importance of Falsifiability

In marked contrast to Thurstone, Guttman never tired of stressing the need to test strong assumptions (e.g., simple structure). Turning his attention to factor analysis, he emphasized that neither of the parsimony pillars supporting Thurstone's edifice, small rank and simple structure, can be taken for granted. These hypotheses are not just strong but most likely false. A recurrent theme in Guttman's papers is the need to replicate empirical findings instead of simply relying on facile significance tests ("star gazers").

With his radex theory, he proposed a family of alternative structural models unconstrained by these assumptions. In 1955, he also revived interest in the dormant factor indeterminacy issue, recognizing it as a fundamental problem that undermines the very purpose of factor analysis. Thurstone, in contrast, had never faced up to it.

Facet Theory

Later in his career, Guttman returned to Spearman's question, What is intelligence?, which he sought to answer with his facet theory. This research strategy starts with a definition of the domain of the intended variable. Only after this has been done does it become feasible to determine empirically, with a minimum of untested auxiliary assumptions (such as linearity, normality, and the like), whether the staked out domain is indeed unidimensional as Spearman had claimed. If it is not, then one has to lower one's aim and concede that intelligence, whatever else it may be, cannot be measured.

Lee Cronbach

Cronbach also went his own way. As author of a popular text on testing, he was familiar with the problems practicing psychometricians had to face and not easily dazzled by mathematical pyrotechnics devoid of empirical substance. Just as Guttman before him, he also questioned the simplistic assumptions of CTT. Searching for alternatives, he developed a reliability theory that recognized several different types of measurement error—thus yielding several different types of reliability—to be analyzed within the framework of analysis of variance (generalizability theory).

Mental Tests and Personnel Decisions

Most important, in a short book he wrote with Goldine Gleser in 1957, the authors radically departed from the traditional correlational approach for gauging the merits of a test. Instead of asking the conventional questions in correlational terms, they asked a different question in probabilistic terms: How well does the test perform in terms of misclassification rates?

It is surprising that this elementary question had not received more attention earlier. In hindsight, it seems rather obvious that, since use of a test always entails a decision problem, its merit cannot be judged solely in terms of its validity. How useful a test will be in practice also depends on prior knowledge, including the percentage of qualified applicants in the total pool of testees (the base rate) and the stringency of the admission criterion used (the admission quota).

Base Rate Problem

That knowledge of the correlation between the test and the criterion alone cannot possibly suffice to judge the worth of a test is most easily seen in the context of clinical decisions, which often involve very skewed base rates, with the preponderance of subjects being assessed as “normal.”

For example, assume the actual incidence (base rate) of normal is 0.90, and that for “pathological” it is 0.10. Suppose further that the test cutoff is adjusted so that 90% of the testees are classified as normal on the basis of their test scores and 10% as pathological. If the joint probability of actually being normal and of being correctly classified as such is 0.85, then one finds that the so-called phi coefficient (as an index of validity) is 0.44. This is quite respectable for a mental test. However, on using it, one finds the following for the probability of total correct classifications (i.e., the joint probability of actually being normal and also being so diagnosed plus the joint probability of actually being pathological and being so diagnosed): 0.85 + 0.05 = 0.90. This value exactly equals the assumed base rate. Thus, the proportion of total correct classifications achieved on using the test could also have been achieved without it by simply classifying all testees as normal.
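The arithmetic of this example is easy to verify; the sketch below (an illustration, not Schonemann's code) rebuilds the 2 × 2 table of joint probabilities and recomputes the phi coefficient and the proportion of correct classifications.

```python
import math

# Joint probabilities implied by the example: base rate of "normal" = 0.90,
# the test classifies 90% of testees as normal, and
# P(actually normal and classified normal) = 0.85.
p_normal_test_normal = 0.85
p_normal_test_path = 0.90 - 0.85    # normal row margin minus the (normal, normal) cell
p_path_test_normal = 0.90 - 0.85    # test-normal column margin minus the same cell
p_path_test_path = 0.10 - 0.05      # pathological row margin minus (path, normal)

# Phi coefficient for a 2 x 2 table: (ad - bc) / sqrt(product of the four margins).
a, b = p_normal_test_normal, p_normal_test_path
c, d = p_path_test_normal, p_path_test_path
phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))

total_correct = p_normal_test_normal + p_path_test_path
print(f"phi = {phi:.2f}")                                     # about 0.44
print(f"total correct classifications = {total_correct:.2f}")  # 0.90, the base rate
```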

The moral of this tale is that the practical utility of a test is a joint function of, among other things, validity, base rate, and admission quota. Validity by itself tells us nothing about the worth of a test. Meehl and Rosen had made much the same point a few years earlier. Cronbach and Gleser expanded on it systematically, tracing out the consequences such a decision-theoretic perspective implies.

To my knowledge, the only currently available complete tables for hit rates (the conditional probability that a qualified testee passes the test) and total percentage correct classifications, as joint functions of validity, base rate, and admission quota, are those published in Schonemann (1997b), in which it is also shown that no test with realistic validity (<0.5) improves over random admission in terms of total percentage correct if either base rate exceeds 0.7.

Notwithstanding the elementary nature of this basic problem and its transparent social relevance, the Social Sciences Citation Index records few, if any, references to it in Psychometrika. This is surprising considering that much of modern test theory, with its narrow focus on item analysis, may well become obsolete once one adopts Cronbach's broader perspective. In contrast, some more applied journals did pay attention to the work of Meehl, Rosen, Cronbach, and Gleser.

Read full chapter

URL: https://www.sciencedirect.com/science/article/pii/B0123693985005247

Which of the following personnel selection methods has a high generalizability?

Which of the following is true of the generalizability of several personnel selection methods? Cognitive ability tests have high generalizability.

Which term refers to the degree to which the information provided by selection methods enhances the bottom-line effectiveness of the organization?

Utility. Utility is the degree to which the information provided by selection methods enhances the bottom-line effectiveness of the organization.

Which term refers to the knowledge of one's strengths and weaknesses?

Self-awareness is the knowledge of one's strengths and weaknesses.

Which method for assessing validity involves giving a measure to applicants and then correlating it with some criterion measured at a later time?

Predictive validation. A predictive validation assesses the validity of a test by administering it to job applicants and then correlating their test scores with a criterion measure of performance obtained at a later time. (By contrast, a concurrent validation administers the test to people already on the job and correlates scores with existing measures of each person's performance.)