Reliability, Validity, and Other Measurement Issues


Reliability and Validity

Reliability refers to the consistency and stability of "scores." These scores are assumed to result from some "measurement process." Consistency is most often taken to mean repeatability; stability is most often taken to mean the consistency of the obtained scores over time. In a more technical sense, reliability refers to the strength of the systematic variance, usually conceptualized as some statistical association, shared between a theorized entity and an overt indicator of that entity. Psychometrics often describes this shared variance through the theories of true and error scores, domain sampling, or "true scores and parallel tests." Coefficient alpha usually assesses the internal consistency of scores, and test-retest coefficients usually assess the scores' stability over time.

Coefficient alpha is a function of internal consistency (Cortina, 1993). Internal consistency refers to the degree of interrelatedness among the items, whereas homogeneity refers to unidimensionality. Internal consistency is a necessary but not sufficient condition for homogeneity: a set of items can be relatively interrelated and still multidimensional. Alpha is therefore a function of the extent to which the items in a test have high communalities and thus low uniquenesses. It is also a function of interrelatedness, although one must remember that this does not imply unidimensionality or homogeneity. It has been said that alpha is appropriately computed only when there is a single common factor (Cotton, Campbell, & Malone, 1957), because alpha is affected by the number of items, the item intercorrelations, and dimensionality.
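As a rough illustration of how alpha depends on item variances and on the number of items, the sketch below computes alpha from a small response matrix and also evaluates the standardized form of alpha for a given mean inter-item correlation. The data, the function names, and the chosen values of k and the mean correlation are hypothetical and are not taken from the cited sources.

```python
from statistics import variance

def cronbach_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])                      # number of items
    items = list(zip(*responses))              # one tuple of scores per item
    item_variances = [variance(col) for col in items]
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

def standardized_alpha(k, mean_r):
    """Alpha implied by k items with mean inter-item correlation mean_r."""
    return (k * mean_r) / (1 + (k - 1) * mean_r)

# Hypothetical data: 5 respondents answering 4 items on a 1-5 scale.
responses = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

print(round(cronbach_alpha(responses), 3))

# Same mean inter-item correlation, different test lengths:
# alpha rises with the number of items even though interrelatedness is unchanged.
for k in (6, 18):
    print(k, round(standardized_alpha(k, 0.30), 3))
```

The second loop is the relevant point for interpretation: a long test can reach a high alpha with only modest item intercorrelations, so a high alpha by itself says little about unidimensionality.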

With regard to dimensionality, Child (1969) suggests how to decide the number of factors to be extracted. Kaiser's criterion holds that only the factors having latent roots (eigenvalues) greater than one are considered common factors. The criterion is probably most reliable when the number of variables is between 20 and 50. In Cattell's scree test, the latent roots are plotted against the factor number, and the shape of the resulting curve is used to judge the cut-off point. Interpretability should also inform the judgement.
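A minimal sketch of Kaiser's criterion follows, assuming the latent roots are the eigenvalues of the item correlation matrix. To stay free of external libraries it uses a small Jacobi rotation routine written for this example; the correlation matrix is hypothetical, and only four variables are used for readability even though the criterion is said above to work best with 20 to 50.

```python
import math

def symmetric_eigenvalues(matrix, tol=1e-20, max_sweeps=100):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    a = [row[:] for row in matrix]             # work on a copy
    n = len(a)
    for _ in range(max_sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:                          # off-diagonal part is negligible
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(a[p][q]) < 1e-15:
                    continue
                diff = a[q][q] - a[p][p]
                # Angle that zeroes a[p][q]: tan(2*theta) = 2*a_pq / (a_qq - a_pp)
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):             # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):             # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

# Hypothetical correlation matrix: items 1-2 and items 3-4 form two clusters.
R = [
    [1.0, 0.6, 0.1, 0.1],
    [0.6, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.6],
    [0.1, 0.1, 0.6, 1.0],
]

latent_roots = symmetric_eigenvalues(R)
n_factors = sum(1 for root in latent_roots if root > 1.0)   # Kaiser's criterion

# Listing the roots in descending order gives the values a scree plot would show.
for number, root in enumerate(latent_roots, start=1):
    print(number, round(root, 3))
print("factors retained:", n_factors)
```

For this matrix the roots are 1.8, 1.4, 0.4, and 0.4, so two factors pass the criterion; the sharp drop after the second root is also where a scree plot would suggest the cut-off.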

Validity is often defined as the shared "true" variance between an underlying concept and its empirical scores. In particular, true variance is defined to exclude systematic bias and to include only the "theoretically meaningful" systematic variance between an underlying idea and its overt representations. "Construct validity" is a continuous, ongoing process of accumulating evidence that suggests scores from a measurement procedure reflect its intended construct. As such, construct validity subsumes all the forms of validity described below.

Evidence of "criterion-related validity" may be established if a statistical association is demonstrated between a predictor variable and some meaningful criterion. The statistical association is usually indexed by a Pearson product-moment correlation coefficient, but other statistics could be applied just as well. It is assumed that the criterion itself is already reliably measured and meaningful, and that the predictor variable can be measured either before or simultaneously with the criterion. The former case refers to predictive validity and the latter to concurrent validity.

Evidence of "content validity" or "content-oriented test construction" may be established if the procedures followed in constructing a measure are judged to derive "clearly and in a compelling fashion" from a meaningful conceptual domain. The inference of content validity is based on the qualitative judgement that (a) the conceptual domain, (b) the test plan designed to map that conceptual domain, and (c) the resulting measurement instrument overlap substantially.

Evidence for "convergent validity" may be established if scores from several measurement procedures that purport to measure the same (or very similar) concept are "highly" correlated. That is, scores from multiple measures of the same (or very similar) concept should converge. Evidence for "discriminant validity" may be established if scores from several measurement procedures that purport to measure different concepts are uncorrelated. Scores from different measurement procedures for different concepts should diverge even more.

"Internal validity" refers to the judgement that an experiment's procedures are sufficient to justify rejection or provisional acceptance of its hypothesis. "External validity" refers to the judgement that an experiment's results can be generalized to a larger population or to an alternative population. "Ecological validity" refers to the judgement that a study's or an experiment's features include or reflect the major features of the context in which the phenomenon of interest is found.
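The convergent and discriminant patterns can be illustrated with plain Pearson correlations. The sketch below is only an illustration: the three score lists and their labels (two anxiety measures and one verbal ability measure) are invented, and no particular cut-off for a "high" or "low" correlation is implied by the sources.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for the same eight people.
anxiety_scale_a = [10, 14, 9, 20, 16, 12, 18, 7]     # measure 1 of anxiety
anxiety_scale_b = [11, 15, 10, 19, 17, 11, 18, 8]    # measure 2 of anxiety
verbal_ability = [30, 25, 28, 29, 31, 24, 26, 27]    # a different concept

convergent_r = pearson_r(anxiety_scale_a, anxiety_scale_b)
discriminant_r = pearson_r(anxiety_scale_a, verbal_ability)

print("same concept, two measures:", round(convergent_r, 2))
print("different concepts:", round(discriminant_r, 2))
```

With these invented numbers the two anxiety measures correlate strongly while anxiety and verbal ability are nearly uncorrelated, which is the pattern that would count as evidence for both kinds of validity.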

In the modern world or natural science model, there is an objective world and there are objective truths, and traditional positivist notions of science should lead to the discovery and understanding of those truths. In this modern world, scientific validity can best be understood by assessing the truthfulness of its statements using the correspondence criterion (accuracy or strength of relationship between the scientific statements and the objective world), the coherence criterion (internal logic or consistency among scientific statements), and the pragmatic criterion (practical outcomes of the scientific statements). In the postmodern world, which rejects the natural science model, empirical reality is seen as a social entity, and truth is therefore socially constructed (e.g., through dialogue). In these socially constructed worlds, in which there are many possible worldviews, truths, and criteria for truth, traditional criteria for truth may be more or less correct, depending upon one's point of view.

Kvale (1996) identifies three forms of validity. Validity as craftsmanship involves the trustworthiness of a study's results, and it is most directly conceptualized by the strength of a study's falsification attempts. To enhance judgements about validity as craftsmanship, (1) the researcher must adopt a critical outlook during the analysis, (2) the purpose and content of the study must precede its method, and (3) the research must be tightly connected to theory creation or testing. Validity as communication reflects hermeneutic traditions. Judgements about validity as communication rest on three points: (1) communication and truth testing involve persuasion through rational discourse, (2) the criteria or purpose of the discourse should be clear, and (3) the interests of the debating parties should be made clear as well. Pragmatic validity involves the real-world changes that occur as a result of a researcher's theory, propositions, or actions. A theory, proposition, or action can induce verbal changes and behavioral changes as well.

Random and Nonrandom Errors

For variables in the social sciences, there almost always are discrepancies between the conceptual variables and the measures that operationalize them. These discrepancies make it critical to accurately partition reliable variance into true score common variance and true score unique variance. True score common variance is related to the theoretical constructs of interest to researchers and is the part that researchers want to isolate and keep. This component is part of the reliable variance of the measure. True score unique variance is the difference between the reliability of the measure and its relation to the constructs of interest. This component is not error and will consistently appear each time the measure is used. Traditional error variance is the unreliable variance that is part of the measure. A given measure could contain method variance, have measure-specific variance, or even assess a second theoretical variable, as well as contain measurement error. For path models, it would be ideal to partition variance in a way that leaves only the true score common variance as reliable and lumps together the true score unique variance with measurement error. Such partitioning cannot be done until multiple measures of constructs are introduced and the logic of factor analysis is used.
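The partition described above can be written compactly. The notation is an assumption introduced here for illustration, not taken from the sources: X is the observed measure, and C, U, and E stand for the true score common, true score unique, and error components, taken to be mutually uncorrelated.

```latex
% Total variance of the observed measure
\sigma^2_X = \sigma^2_C + \sigma^2_U + \sigma^2_E

% Reliability: the proportion of variance that is reliable (common + unique)
r_{XX} = \frac{\sigma^2_C + \sigma^2_U}{\sigma^2_X}

% True score unique variance as the gap between reliability and
% the proportion of variance related to the construct of interest
\frac{\sigma^2_U}{\sigma^2_X} = r_{XX} - \frac{\sigma^2_C}{\sigma^2_X}
```

Under this notation, the ideal partition for path models treats only the common component as reliable and groups the unique and error components together as the residual of the measure.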

Random measurement error is error that actually meets the desired properties of error variance, namely, that it is unrelated to the predictor variables, criterion variables, and errors of other measures. It exists independently of other measures. In multivariate instances, random error produces neither an ideal case nor a predictable one. For dependent variables, error gets absorbed into the residual. On the other hand, slopes, the unstandardized coefficients, remain unaffected, and this is a persuasive reason for working with covariances if there is concern about error in one's dependent variables. For independent variables, the new error term biases the regression weight and therefore cannot readily be dismissed. Nonrandom error is error variance that is related in some systematic way to a variable or other error term. The most common types of nonrandom error are those that result from two measures having more than one underlying dimension (construct) in common. Models containing nonrandom error cannot be solved by using traditional regression techniques.
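A small simulation can make the asymmetry between error in dependent and independent variables concrete. This is a sketch under stated assumptions: a single predictor, a true slope of 2.0, normally distributed scores and errors with unit variance, and a fixed random seed. None of these values come from the sources.

```python
import math
import random

def ols_slope(x, y):
    """Unstandardized slope from regressing y on a single predictor x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

def correlation(x, y):
    """Standardized coefficient (Pearson r) for the same simple regression."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)        # fixed seed so the run is repeatable
n = 20000
true_slope = 2.0              # assumed value for the illustration

x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * xi + rng.gauss(0, 1) for xi in x]
y_with_error = [yi + rng.gauss(0, 1) for yi in y]   # random error in the dependent variable
x_with_error = [xi + rng.gauss(0, 1) for xi in x]   # random error in the independent variable

slope_clean = ols_slope(x, y)
slope_dv_error = ols_slope(x, y_with_error)         # stays near the true slope
slope_iv_error = ols_slope(x_with_error, y)         # biased toward zero
r_clean = correlation(x, y)
r_dv_error = correlation(x, y_with_error)           # standardized coefficient drops

print(round(slope_clean, 2), round(slope_dv_error, 2), round(slope_iv_error, 2))
print(round(r_clean, 2), round(r_dv_error, 2))
```

With equal true and error variance in the predictor, the expected slope is halved, whereas error added to the dependent variable leaves the slope close to 2.0 and only lowers the standardized coefficient.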

Describing Individual Differences

Qualitative description is termed classification, and quantitative description is termed measurement. Measurement involves the use of numbers. All living creatures possess the property of life, so that property does not distinguish among them. A property in which individuals differ among themselves is termed a variable. Individuals may differ qualitatively or quantitatively. In some cases quantitative variables are discontinuous and show breaks or steps; in other cases they are continuous. Every variable is either qualitative or quantitative. Quantitative variables can be subdivided into ranked variables and scalar variables. Ranked variables provide a series of discrete, separate categories that are ordered. Scalar variables can be further subdivided into discontinuous and continuous scales. They can also be classified another way, into ratio scales and interval scales.

The process of defining our variables and devising operations that we can use in the description of individual differences is a never-ending one. The clearer and more specific the definition of a variable, the more useful it is. The way in which we define a variable is a function of the theories and knowledge we have about the property and the individuals for whom it is a variable. After defining our variable, we are in a position to develop operations that permit us to observe individual differences in the property. The quantitative description of an individual given by the operation is termed the raw score. Scores are transformed to make the numerical values more convenient for recording or for arithmetic manipulation, or to provide more meaningful quantitative descriptions of individuals. These new values are termed transformed scores. In developing a specific set of operations, we are in a very real sense redefining our variable, or at least further specifying its definition.

Numbers provide a means by which individuals can be classified or arranged in a systematic way, and they can be manipulated and combined by arithmetic processes to give more precise descriptions or other meanings. When we are dealing with ranked variables, the scale is an ordinal scale. When the scale is continuous, the quantitative description assigned to an individual is never precise but rather is an approximation. If the intervals between the numbers on the scale are equal, our operations are said to have provided an interval scale. Ratio scales have a natural and meaningful zero point, and this makes it possible to perform arithmetic operations directly on the numbers themselves. If we separate individuals into two categories, we have an example of a dichotomous variable. Some dichotomous variables are truly dichotomous (discontinuous dichotomous scales), whereas others are based on truly continuous or multistep discontinuous quantitative variables (dichotomous scales based on continuous or multistep variables).

Basic Aspects of Psychological Measurement and Transforming Scores

The successive categories of a discontinuous quantitative scale and the successive markers on a continuous quantitative scale must represent equal increments in the amount, frequency, or degree of the property being measured. Scores that are given in terms that refer to the scale itself convey more information than those that are not. Because the markers on a scale are established arbitrarily, one system of units can be readily changed into another system. It is possible to transform scores that are expressed in unfamiliar terms to values that are familiar and therefore meaningful. If we know where the individual stands in the property relative to other individuals of his or her kind, we have a quantitative description that is quite meaningful.

It is only when scores on the different scales are expressed in precisely the same terms that we can say they are comparable. Two scores, each assessed independently of the other and on a different variable, are comparable if they represent the same standing in the same population on those scales. Comparability is attained not through rendering the units on the various scales equal, but rather through comparing the individual's scores on different variables with the distributions of scores earned by a population of individuals in his or her same general class.

We must be sure that the individuals used in the sample are fully representative of the total group the norms are to represent. The frequency distribution is a listing of the obtained scores and the number of people obtaining each score. We can compare distributions by their central tendency (the arithmetic mean, which is the point of least squares; the mode; and the median), by the degree to which scores vary (standard deviation and variance), by the degree to which scores are symmetrical (skewness), and by kurtosis, that is, whether the distribution is leptokurtic (sharply peaked) or platykurtic (flat). The normal frequency distribution is symmetrical and bell-shaped.
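The descriptive indices named above can be computed directly. The sketch below uses a hypothetical set of ten scores and moment-based definitions of skewness and excess kurtosis (population formulas, with 3 subtracted so that a normal distribution has kurtosis 0); other definitions exist and give slightly different values.

```python
from collections import Counter
from statistics import mean, median, mode, pstdev

scores = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]     # hypothetical raw scores

# Frequency distribution: each obtained score with the number of people obtaining it.
frequency_distribution = sorted(Counter(scores).items())

n = len(scores)
m = mean(scores)
sd = pstdev(scores)                          # population standard deviation

# Moment-based indices of shape.
skewness = sum((s - m) ** 3 for s in scores) / (n * sd ** 3)
excess_kurtosis = sum((s - m) ** 4 for s in scores) / (n * sd ** 4) - 3

print(frequency_distribution)
print("mean", m, "median", median(scores), "mode", mode(scores))
print("sd", round(sd, 3), "variance", round(sd ** 2, 3))
print("skewness", round(skewness, 3), "excess kurtosis", round(excess_kurtosis, 3))
```

Here the single high score of 9 pulls the mean above the median and produces positive skewness, and the positive excess kurtosis marks the distribution as more peaked than a normal curve.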

The most common decision is to admit that the distribution of scores given by our operations of measurement does not correctly reflect the distribution of the trait. Thus we have to make up our mind about the model of the distribution of the trait, that is, the shape of the frequency distribution of scores. Then we can attempt to design operations of measurement yielding scores that distribute in the same manner as our model. Next, we can transmute our raw scores to other values of such a nature that the new transmuted scores do distribute in accordance with the model. Finally, we can say that the distribution of scores of the particular sample of individuals we have does possess the characteristics of the model even though it does not.

When raw scores are transformed into percentile ranks, the units within a test and between tests are made comparable. The units are expressed in terms of numbers of people. The frequency distribution of percentile ranks earned by the group on which they are determined necessarily is rectangular in shape. The transformation of raw scores into standard scores does not change the shape of the distribution of scores. It necessarily assumes that the units of measurement as given by the raw scores are all equal. If we know an individual's percentile score, we know the percentage of the group that falls below his or her raw score. It should be a simple matter to then use the normal curve to find the normalized standard score below which the same number of observations fall.
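The three transformations discussed here can be compared side by side. The sketch below is illustrative only: the raw scores are hypothetical, and the percentile rank counts half of the tied cases as falling below a score, which is one common convention rather than something specified in the sources. It relies on `NormalDist` from Python's standard `statistics` module for the normal curve.

```python
from statistics import NormalDist, mean, pstdev

def percentile_rank(scores, x):
    """Percentage of the group below x, counting half of the cases tied at x."""
    below = sum(1 for s in scores if s < x)
    equal = sum(1 for s in scores if s == x)
    return 100 * (below + 0.5 * equal) / len(scores)

def standard_score(scores, x):
    """Linear z score: keeps the shape of the raw-score distribution."""
    return (x - mean(scores)) / pstdev(scores)

def normalized_standard_score(scores, x):
    """z score read from the normal curve at the score's percentile rank."""
    return NormalDist().inv_cdf(percentile_rank(scores, x) / 100)

raw = [2, 3, 3, 4, 4, 4, 5, 5, 6, 9]        # hypothetical raw scores

for x in sorted(set(raw)):
    print(
        x,
        percentile_rank(raw, x),
        round(standard_score(raw, x), 2),
        round(normalized_standard_score(raw, x), 2),
    )
```

For the highest raw score the linear standard score is much larger than the normalized one, because the linear transformation preserves the skew of the raw scores while the normalized score depends only on rank within the group.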

Whether all groups are measured under the same conditions with exactly the same measuring device, or different groups are measured under different conditions and/or with different forms of the measuring device, it may be desired to combine the scores of all cases into a single distribution. The differences in means and standard deviations of the scores of the various groups are eliminated through the use of standard or standardized scores, and differences in the shapes of the distributions of scores are eliminated through normalizing procedures.
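Combining groups through standard scores can be sketched as follows. The two groups and their test forms are hypothetical, and only the standardizing step is shown; removing differences in distribution shape would require a normalizing step in addition.

```python
from statistics import mean, pstdev

def standardize(scores):
    """Convert a group's scores to z scores using that group's own mean and SD."""
    m, sd = mean(scores), pstdev(scores)
    return [(s - m) / sd for s in scores]

# Hypothetical groups measured with different forms on different raw-score scales.
group_form_a = [10, 12, 14, 16, 18]
group_form_b = [55, 60, 65, 70, 75]

# Standardize within each group, then pool into a single distribution.
combined = standardize(group_form_a) + standardize(group_form_b)

print([round(z, 2) for z in combined])
print("combined mean", round(mean(combined), 6), "combined sd", round(pstdev(combined), 6))
```

After within-group standardizing, each group contributes scores with mean 0 and standard deviation 1, so the top scorer on form A and the top scorer on form B receive the same standard score despite very different raw values.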


Cortina, J. M. (1993). What is coefficient alpha? An examination of theory and applications. Journal of Applied Psychology, 78 (1), 98-104.

Cotton, J. W., Campbell, D. T., & Malone, R. D. (1957). The relationship between factorial composition of test items and measures of test reliability. Psychometrika, 22, 347-358.

Child, D. (1969). The Essentials of Factor Analysis. London: Holt, Rinehart and Winston.

Ghiselli, E. E., Campbell, J. P., & Zedeck, S. (1981). Measurement Theory for the Behavioral Sciences. San Francisco: W. H. Freeman.

Kvale, S. (1996). InterViews: An Introduction to Qualitative Research Interviewing. Thousand Oaks, CA: Sage.

Lee, T. W. (1998). Using Qualitative Methods in Organizational Research. Thousand Oaks, CA: Sage.

Maruyama, G. M. (1997). Basics of Structural Equation Modeling. Thousand Oaks, CA: Sage.