THE HISTORY OF PSYCHOLOGICAL TESTING
The history of psychological
testing is a fascinating story and has abundant relevance to present-day
practices. After all, contemporary tests did not spring from a vacuum; they
evolved slowly from a host of precursors introduced over the last one hundred
years. Accordingly, Chapter 1 features a review of the historical roots of present-day psychological tests. In
Topic 1A, The Origins of Psychological Testing, we focus largely on the efforts
of European psychologists to measure intelligence during the late nineteenth
century and pre–World War I era. These early intelligence tests and their successors
often exerted powerful effects on the examinees who took them, so the first
topic also incorporates a brief digression documenting the pervasive importance
of psychological test results.
Topic 1B, Early Testing in the United States, catalogues the profusion of tests
developed by American psychologists in the first half of the twentieth century.
Psychological testing in its modern form originated little more than one
hundred years ago in laboratory studies of sensory discrimination, motor skills,
and reaction time. The British genius Francis Galton (1822–1911) invented the
first battery of tests, a peculiar assortment of sensory and motor measures,
which we review in what follows. The American psychologist James McKeen
Cattell (1860–1944) studied with Galton and then, in 1890, proclaimed the
modern testing agenda in his classic paper entitled “Mental Tests and
Measurements.” He was tentative and modest when describing the purposes and
applications of his instruments:
Psychology cannot attain the certainty and exactness of
the physical sciences, unless it rests on a foundation of experiment and
measurement. A step in this direction could be made by applying a series of
mental tests and measurements to a large number of individuals. The results
would be of considerable scientific value in discovering the constancy of
mental processes, their interdependence, and their variation under different
circumstances. Individuals, besides, would find their tests interesting, and,
perhaps, useful in regard to training, mode of life or indication of disease.
The scientific and practical value of such tests would be much increased should
a uniform system be adopted, so that determinations made at different times and
places could be compared and combined. (Cattell, 1890)
Cattell’s conjecture that
“perhaps” tests would be useful in “training, mode of life or indication of disease”
must certainly rank as one of the prophetic understatements of all time. Anyone
reared in the Western world knows that psychological testing has emerged from
its timid beginnings to become a big business and a cultural institution that
permeates modern society. To cite just one example, consider the number of standardized
achievement and ability tests administered in the school systems of the United
States. Although it is difficult to obtain exact data on the extent of such
testing, an estimate of 200 million per year is probably not extreme (Medina
& Neill, 1990). Of course, the total number of tests administered yearly
also includes millions of personality tests and untold numbers of the thousands of other kinds of
tests now in existence (Conoley & Kramer, 1989, 1992; Mitchell, 1985;
Sweetland & Keyser, 1987). There is no doubt that testing is pervasive. But
does it make a difference?
THE IMPORTANCE OF TESTING
Tests are used in almost every nation on earth for counseling, selection, and
placement. Testing occurs in settings as diverse as schools, civil service, industry,
medical clinics, and counseling centers. Most persons have taken dozens of
tests and thought nothing of it. Yet, by the time the typical individual reaches
retirement age, it is likely that psychological test results will help shape
his or her destiny. The deflection of the life course by psychological test
results might be subtle, such as when a prospective mathematician qualifies for
an accelerated calculus course based on tenth-grade achievement scores. More
commonly, psychological test results alter individual destiny in profound ways.
Whether a person is admitted to one college and not another, offered one job
but refused a second, diagnosed as depressed or not—all such determinations rest,
at least in part, on the meaning of test results as interpreted by persons in
authority. Put simply, psychological test
results change lives. For this reason it is prudent—indeed, almost mandatory—that
students of psychology learn about the contemporary uses and occasional abuses
of testing. In Case Exhibit 1.1, the life-altering aftermath of psychological
testing is illustrated by means of several true case history examples.
The importance of testing is also evident from historical review. Students of
psychology generally regard historical issues as dull, dry, and pedantic, and
sometimes these prejudices are well deserved. After all, many textbooks fail to
explain the relevance of historical matters and provide only vague sketches of
early developments in mental testing. As a result, students of psychology often
conclude incorrectly that historical issues are boring and irrelevant. In
reality, the history of psychological testing is a captivating story that has
substantial relevance to present-day practices. Historical developments are pertinent
to contemporary testing for the following reasons:
1. A review of the origins of psychological testing helps explain current
practices that might otherwise seem arbitrary or even peculiar. For example, why
do many current intelligence tests incorporate a seemingly nonintellective
capacity, namely, short-term memory for digits? The answer is, in part,
historical inertia: intelligence tests have always included a measure of digit span.
2. The strengths and limitations of testing also stand out better when tests are
viewed in historical context. The reader will discover, for example, that modern
intelligence tests are exceptionally good at predicting school
failure—precisely because this was the original and sole purpose of the first such
instrument developed in Paris, France, at the turn of the twentieth century.
3. Finally, the history of psychological testing contains some sad and regrettable
episodes that help remind us not to be overly zealous in our modern-day applications of
testing. For example, based on the misguided and prejudicial application of
intelligence test results, several prominent psychologists helped ensure
passage of the Immigration Restriction Act of 1924.
In later chapters, we
examine the principles of psychological testing, investigate applications in specific
fields (e.g., personality, intelligence, neuropsychology), and reflect on the
social and legal consequences of testing. However, the reader will find these
topics more comprehensible when viewed in historical context. So, for now, we
begin at the beginning by reviewing rudimentary forms of testing that existed
over four thousand years ago in imperial China.
THE CONSEQUENCES OF TEST RESULTS
The importance of psychological testing is best illustrated by example. Consider
these brief vignettes:
• A shy, withdrawn 7-year-old girl is administered an IQ test by a school
psychologist. Her score is phenomenally higher than the teacher expected. The
student is admitted to a gifted and talented program where she blossoms into
a self-confident and gregarious scholar.
• Three children in a family living near a lead smelter are exposed to the toxic
effects of lead dust and suffer neurological damage. Based in part on psychological
test results that demonstrate impaired intelligence and shortened attention
span in the children, the family receives an $8 million settlement from
the company that owns the smelter.
• A candidate for a position as police officer is administered a personality
inventory as part of the selection process. The test indicates that the candidate
tends to act before thinking and resists supervision from authority figures.
Even though he has excellent training and impresses the interviewers, the
candidate does not receive a job offer.
• A student, unsure of what career to pursue, takes a vocational interest
inventory. The test indicates that she would like the work of a pharmacist. She
signs up for a prepharmacy curriculum but finds the classes to be both difficult
and boring. After three years, she abandons pharmacy for a major in dance,
frustrated that she still faces three more years of college to earn a degree.
• An applicant to graduate school in clinical psychology takes the Minnesota
Multiphasic Personality Inventory (MMPI). His recommendations and grade
point average are superlative, yet he must clear the final hurdle posed by the
MMPI. His results are reasonably normal but slightly defensive; by a narrow
vote, the admissions committee extends him an invitation. Ironically, this is
the only graduate school to admit him; nineteen others turn him down. He
accepts the invitation and becomes enchanted with the study of psychological
assessment. Many years later, he writes this book.
RELIABILITY
Reliability refers to the
consistency of a measure. A test is considered reliable if we get the same
result repeatedly. For example, if a test is designed to measure a trait (such
as introversion), then each time the test is
administered to a subject, the results should be approximately the same.
Unfortunately, it is impossible to calculate reliability exactly, but it can be
estimated in a number of different ways.
Test-Retest Reliability
To gauge test-retest reliability,
the test is administered to the same subjects at two different points in time. This kind of
reliability is used to assess the consistency of a test across time. This type
of reliability assumes that there will be no change in the quality or construct
being measured. Test-retest reliability is best used for things that are stable
over time, such as intelligence. Generally, reliability will be higher when
little time has passed between tests.
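The text above does not name a statistic for test-retest reliability. One common choice is the Pearson correlation between the scores from the two administrations, which the following minimal sketch computes. The function name pearson_r and the scores are hypothetical, chosen only for illustration.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products of deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Hypothetical scores for five subjects who took the same test twice.
first_administration = [12, 18, 25, 31, 40]
second_administration = [14, 17, 27, 30, 41]

test_retest_r = pearson_r(first_administration, second_administration)
print(f"test-retest reliability estimate: {test_retest_r:.2f}")
```

A value near 1 suggests that subjects kept roughly the same rank order across the two sessions. A low value may reflect an unreliable test or a genuine change in the trait being measured.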
Inter-rater Reliability
This type of reliability is assessed
by having two or more independent judges score the test. The scores are then
compared to determine the consistency of the raters' estimates. One way to test
inter-rater reliability is to have each rater assign each test item a score.
For example, each rater might score items on a scale from 1 to 10. Next, you
would calculate the correlation between the two sets of ratings to determine the level
of inter-rater reliability. Another means of testing inter-rater reliability is
to have raters determine which category each observation falls into and then
calculate the percentage of agreement between the raters. So, if the raters
agree 8 out of 10 times, the test has an 80% inter-rater reliability rate.
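The percentage-of-agreement calculation can be sketched as follows. The category labels and the two raters' judgments are hypothetical; they are arranged so that the raters agree on 8 of 10 observations, matching the example in the text.

```python
def percent_agreement(ratings_a, ratings_b):
    """Percentage of observations that two raters placed in the same category."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty, equal-length lists of ratings")
    matches = sum(1 for a, b in zip(ratings_a, ratings_b) if a == b)
    return 100.0 * matches / len(ratings_a)


# Hypothetical categories assigned by two raters to the same ten observations.
rater_1 = ["calm", "anxious", "calm", "calm", "anxious",
           "calm", "anxious", "anxious", "calm", "calm"]
rater_2 = ["calm", "anxious", "calm", "anxious", "anxious",
           "calm", "anxious", "calm", "calm", "calm"]

agreement = percent_agreement(rater_1, rater_2)
print(f"inter-rater agreement: {agreement:.0f}%")
```

Simple percentage agreement does not correct for agreement that would occur by chance, so it is best read as a rough first estimate.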
Parallel-Forms Reliability
Parallel-forms reliability is gauged
by comparing two different tests that were created using the same content. This
is accomplished by creating a large pool of test items that measure the same
quality and then randomly dividing the items into two separate tests. The two
tests should then be administered to the same subjects at the same time.
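The random division of an item pool into two forms can be sketched as below. The item names, the pool size of 20, and the function name split_into_parallel_forms are assumptions made for illustration; the fixed seed is there only to make the split reproducible.

```python
import random


def split_into_parallel_forms(item_pool, seed=None):
    """Randomly divide a pool of items into two forms of (nearly) equal size."""
    rng = random.Random(seed)
    shuffled = list(item_pool)  # copy so the caller's pool is left unchanged
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Hypothetical pool of 20 items that all measure the same quality.
item_pool = [f"item_{n:02d}" for n in range(1, 21)]
form_a, form_b = split_into_parallel_forms(item_pool, seed=42)

print("Form A:", sorted(form_a))
print("Form B:", sorted(form_b))
```

Once both forms have been administered to the same subjects, the two sets of scores can be compared to estimate how consistently the forms measure the same quality.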
Internal Consistency Reliability
This form of reliability is used to
judge the consistency of results across items on the same test. Essentially,
you are comparing test items that measure the same construct to determine the
test's internal consistency. When you see a question that seems very similar to
another test question, it may indicate that the two questions are being used to
gauge reliability. Because the two questions are similar and designed to
measure the same thing, the test taker should answer both questions the same,
which would indicate that the test has internal consistency.
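The text does not name a statistic for internal consistency. One widely used index, introduced here as an assumption rather than as the method the text has in mind, is Cronbach's alpha: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. The response data below are hypothetical.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(responses[0])
    if k < 2 or len(responses) < 2:
        raise ValueError("need at least 2 items and at least 2 respondents")
    if any(len(row) != k for row in responses):
        raise ValueError("every respondent must answer every item")
    items = list(zip(*responses))  # transpose: one tuple of scores per item
    item_variance_sum = sum(variance(scores) for scores in items)
    total_variance = variance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical answers from five respondents to four items, each scored 1 to 5.
responses = [
    [4, 5, 4, 5],
    [2, 2, 3, 2],
    [3, 3, 3, 4],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha: {alpha:.2f}")
```

Values closer to 1 indicate that respondents answer the items in a consistent way, which is what the text describes for similar questions that are designed to measure the same thing.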
VALIDITY
Validity is the extent to which a
test measures what it claims to measure. It is vital for a test to be valid in
order for the results to be accurately applied and interpreted.
Validity isn’t determined by a
single statistic, but by a body of research that demonstrates the relationship
between the test and the behavior it is intended to measure. There are three
types of validity:
Content Validity:
When a test has content validity,
the items on the test represent the entire range of possible items the test
should cover. Individual test questions may be drawn from a large pool of items
that cover a broad range of topics.
In some instances where a test
measures a trait that is difficult to define, expert judges may rate each
item’s relevance. Because each judge is basing the rating on opinion, two
independent judges rate the items separately. Items that are rated as strongly
relevant by both judges will be included in the final test.
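The two-judge selection rule can be sketched as a simple filter. The 1-to-4 relevance scale, the cutoff of 4 for "strongly relevant", the item names, and the ratings are all assumptions made for illustration; the text does not specify a rating scale.

```python
# Assumption: a hypothetical 1-4 relevance scale where 4 means "strongly relevant".
STRONGLY_RELEVANT = 4


def items_endorsed_by_both(judge_a, judge_b, threshold=STRONGLY_RELEVANT):
    """Return the items that both judges rated at or above the threshold."""
    return [
        item
        for item, rating_a in judge_a.items()
        if rating_a >= threshold and judge_b.get(item, 0) >= threshold
    ]


# Hypothetical independent ratings of five candidate items.
judge_a = {"item_1": 4, "item_2": 3, "item_3": 4, "item_4": 2, "item_5": 4}
judge_b = {"item_1": 4, "item_2": 4, "item_3": 3, "item_4": 2, "item_5": 4}

final_items = items_endorsed_by_both(judge_a, judge_b)
print("Items kept for the final test:", final_items)
```

Requiring both judges to endorse an item keeps a single judge's opinion from deciding what appears on the final test.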
Criterion-related Validity:
A test is said to have
criterion-related validity when the test has demonstrated its effectiveness in
predicting a criterion or indicators of a construct. There are two different
types of criterion-related validity:
- Concurrent Validity occurs when the criterion measures are obtained at the same
time as the test scores. This indicates the extent to which the test scores accurately
estimate an individual’s current state with regard to the criterion. For
example, a test that measures levels of depression would be said to have
concurrent validity if its scores agreed with the level of depression the test
taker is currently experiencing, as indicated by a criterion measure obtained
at the same time.
- Predictive Validity occurs when the criterion measures are obtained at a time after the test.
Examples of tests with predictive validity are career or aptitude tests,
which are helpful in determining who is likely to succeed or fail in
certain subjects or occupations.
Construct Validity:
A test has construct validity if it
demonstrates an association between the test scores and the prediction of a
theoretical trait. Intelligence tests are one example of measurement
instruments that should have construct validity.