High Stakes: Testing for Tracking, Promotion, and Graduation
Jay P. Heubert and Robert M. Hauser, Editors
Committee on Appropriate Test Use
Board on Testing and Assessment
Commission on Behavioral and Social Sciences and Education
National Research Council
NATIONAL ACADEMY PRESS
Washington, D.C. 1999
NATIONAL ACADEMY PRESS 2101 Constitution Avenue, N.W. Washington, D.C. 20418
NOTICE: The project that is the subject of this report was approved by the
Governing Board of the National Research Council, whose members are drawn from the councils
of the National Academy of Sciences, the National Academy of Engineering, and the
Institute of Medicine. The members of the committee responsible for the report were
chosen for their special competences and with regard for appropriate balance.
The study was supported by Contract/Grant No. ED-98-CO-0005 between the
National Academy of Sciences and the U.S. Department of Education. Any
opinions, findings, conclusions, or recommendations expressed in this publication are those of
the author(s) and do not necessarily reflect the view of the organizations or agencies
that provided support for this project.
Library of Congress Cataloging-in-Publication Data
High stakes : testing for tracking, promotion, and graduation / Jay
P. Heubert and Robert M. Hauser, editors ; Committee on Appropriate
Test Use.
p. cm.
Includes bibliographical references and index.
ISBN 0-309-06280-2 (pbk.)
1. Educational tests and measurements--United States. 2.
Educational accountability--United States. 3. Education and
state--United States. I. Heubert, Jay Philip. II. Hauser, Robert
Mason. III. National Research Council (U.S.). Committee on
Appropriate Test Use.
LB3051 .H475 1999
371.26'0973--dc21 98-40215
Additional copies of this report are available from National Academy Press, 2101
Constitution Avenue, N.W., Washington, D.C. 20418
Call (800) 624-6242 or (202) 334-3313 (in the Washington metropolitan area)
This report is also available on line at http://www.nap.edu
Printed in the United States of America
Copyright 1999 by the National Academy of Sciences. All rights reserved.
COMMITTEE ON APPROPRIATE TEST USE
ROBERT M. HAUSER (Chair), Department of Sociology, University
of Wisconsin, Madison
LIZANNE DeSTEFANO, Department of Education, University of
Illinois, Urbana-Champaign
PASQUALE J. DeVITO, Office of Assessment and Information
Services, Rhode Island Department of Education, Providence
RICHARD P. DURÁN, Graduate School of Education, University
of California, Santa Barbara
JENNIFER L. HOCHSCHILD, Woodrow Wilson School of Public
and International Affairs, Princeton University
STEPHEN P. KLEIN, RAND Corporation, Santa Monica
SHARON LEWIS, Council of the Great City Schools, Washington, D.C.
LORRAINE M. McDONNELL, Department of Political Science,
University of California, Santa Barbara
SAMUEL MESSICK, Educational Testing Service, Princeton, New Jersey
ULRIC NEISSER, Department of Psychology, Cornell University
ANDREW C. PORTER, Wisconsin Center for Educational
Research, University of Wisconsin, Madison
AUDREY L. QUALLS, Iowa Testing Program, University of Iowa, Iowa City
PAUL R. SACKETT, Department of Psychology, University of
Minnesota, Minneapolis
CATHERINE E. SNOW, Graduate School of Education, Harvard University
WILLIAM T. TRENT, Department of Educational Policy
Studies, University of Illinois, Urbana-Champaign
ROBERT L. LINN, ex officio, Board on Testing and Assessment; School
of Education, University of Colorado, Boulder
JAY P. HEUBERT, Study Director
MICHAEL J. FEUER, Director, Board on Testing and Assessment
PATRICIA MORISON, Senior Program Officer
NAOMI CHUDOWSKY, Senior Program Officer
ALLISON M. BLACK, Research Associate
MARGUERITE CLARKE, Technical Consultant
EDWARD MILLER, Editorial Consultant
VIOLA C. HOREK, Administrative Associate
KIMBERLY D. SALDIN, Senior Project Assistant
BOARD ON TESTING AND ASSESSMENT
ROBERT L. LINN (Chair), School of Education, University of
Colorado, Boulder
CARL F. KAESTLE (Vice Chair), Department of Education,
Brown University
RICHARD C. ATKINSON, President, University of California
IRALINE BARNES, The Superior Court of the District of Columbia
PAUL J. BLACK, School of Education, King's College, London
RICHARD P. DURÁN, Graduate School of Education, University
of California, Santa Barbara
CHRISTOPHER F. EDLEY, JR., Harvard Law School, Harvard University
PAUL W. HOLLAND, Graduate School of Education, University
of California, Berkeley
MICHAEL W. KIRST, School of Education, Stanford University
ALAN M. LESGOLD, Learning Research and Development
Center, University of Pittsburgh
LORRAINE McDONNELL, Department of Political Science, University
of California, Santa Barbara
KENNETH PEARLMAN, Lucent Technologies, Inc., Warren, New Jersey
PAUL R. SACKETT, Department of Psychology, University of
Minnesota, Minneapolis
RICHARD J. SHAVELSON, School of Education, Stanford University
CATHERINE E. SNOW, Graduate School of Education, Harvard University
WILLIAM L. TAYLOR, Attorney at Law, Washington, D.C.
WILLIAM T. TRENT, Associate Chancellor, University of Illinois,
Urbana-Champaign
JACK WHALEN, Xerox Palo Alto Research Center, Palo Alto, California
KENNETH I. WOLPIN, Department of Economics, University
of Pennsylvania
MICHAEL J. FEUER, Director
VIOLA C. HOREK, Administrative Associate
Foreword
President Clinton's 1997 proposal to create voluntary national
tests in reading and mathematics catapulted testing to the top of the
national education agenda. The proposal turned up the volume on what
had already been a contentious debate and drew intense scrutiny from a
wide range of educators, parents, policy makers, and social scientists.
Recognizing the important role science could play in sorting through the
passionate and often heated exchanges in the testing debate, Congress
and the Clinton administration asked the National Research
Council, through its Board on Testing and Assessment (BOTA), to conduct
three fast-track studies over a 10-month period.
This report and its companions, Uncommon Measures:
Equivalence and Linkage Among Educational
Tests and Evaluation of the Voluntary National Tests: Phase
1, are the result of truly heroic efforts on the part
of the BOTA members, the study committee chairs and members, two
co-principal investigators, consultants, and staff, who all understood
the urgency of the mission and rose to the challenge of a unique and
daunting timeline. Michael Feuer, BOTA director, deserves the special thanks
of the board for keeping the effort on track and shepherding the
report through the review process. His dedicated effort, long hours, sage
advice, and good humor were essential to the success of this effort.
Robert Hauser deserves our deepest appreciation for his superb leadership of
the committee that produced this report.
These reports are exemplars of the Research Council's
commitment to scientific rigor in the public interest: they provide clear and
compelling statements of the underlying issues, cogent answers to nettling
questions, and highly readable findings and recommendations. These
reports will help illuminate the toughest issues in the ongoing debate over
the proposed voluntary national tests. But they will do much more as well.
The issues addressed in this and the other two reports go well beyond
the immediate national testing proposal: they have much to contribute
to knowledge about the way tests (all tests) are planned, designed,
implemented, reported, and used for a variety of education policy goals.
I know the whole board joins me in expressing our deepest
gratitude to the many people who worked so hard on this project. These
reports will advance the debate over the role of testing in American
education, and I am honored to have participated in this effort.
Robert L. Linn, Chair
Board on Testing and Assessment
Dedication
In early October 1998, after the public release of this report
but before its formal publication, we were saddened to learn of the death
of our fellow committee member, Samuel Messick. Sam spent almost all
of his career at the Educational Testing Service, and he made
legendary contributions to the science and profession of educational measurement.
Even had he not been a member of the committee, Sam would
have guided the committee's deliberations through his earlier National
Research Council work on the use of tests to make decisions about
students with mental retardation (which provided the overarching framework
of our report) and his creative reconstruction of the concept of test
validity. As it was, Sam made even greater contributions to the project
through his drafts of major sections of the text as well as his cordial, but ever
crisp, incisive, and often wryly humorous contributions to our discussions.
Sam was a wonderful scholar, intellect, and friend, and we dedicate this
book to him.
Acknowledgments
The Committee on Appropriate Test Use wishes to thank the
many people who helped make possible the preparation of this report on
an accelerated schedule.
An important part of the committee's work was to gather data
about testing research, policy, and practice in states and school districts.
Many people gave generously of their time, at meetings and workshops of
the committee, in interviews with committee staff, and by drafting
short papers to assist the committee's thinking.
Lorrie A. Shepard, University of Colorado, Boulder, provided
an excellent overview of educational issues in high-stakes testing of
individual students. Floraline Stevens, of Los Angeles, provided insights
into state and local high-stakes test policies. At a workshop on testing
of English-language learners, Jamal Abedi, University of California,
Los Angeles, shared his experimental findings on effects of question
wording and format among English-language learners. Toni Marsnik,
Language Acquisition and Bilingual Development Branch, Los Angeles
Unified School District, and Lynn Winters, assistant superintendent for
research, planning, and evaluation, Long Beach Unified School District,
offered perspectives on practices for testing English-language learners in
their districts and in California more generally.
At a committee workshop in Washington, D.C., six leading
educational policymakers offered local, state, and national perspectives on
the use of high-stakes tests for promotion or retention; the presenters
included Arlene Ackerman, superintendent of schools, Washington,
D.C.; Philip Hansen, chief accountability officer, Chicago Public
Schools; Nancy Grasmick, superintendent of schools, State of Maryland;
Jim Watts, vice president for state services, Southern Regional
Education Board; Michael Cohen, special assistant to the president for
educational policy; and Bella Rosenberg, assistant to the president, American
Federation of Teachers.
The committee also commissioned short papers to assist in
deliberations about alternate strategies for promoting appropriate test use.
Those who prepared such papers include: Tyler Cowan, George Mason
University; Ernest House, University of Colorado, Boulder; Don Kettl,
University of Wisconsin, Madison; Henry Levin, Stanford University;
Theodore Marmor, Yale University; and Anne Schneider, Arizona State
University. We are grateful to David Klahr, Carnegie Mellon University, for
his insights.
Jennifer C. Day, Population Division, U.S. Bureau of the
Census, provided access to unpublished tabulations of school enrollment
data from the October Current Population Survey. In addition, staff of
several state education agencies provided valuable information about state
retention rates: Alabama, Arizona, California, Delaware, District of
Columbia, Florida, Georgia, Indiana, Kentucky, Louisiana, Maryland,
Massachusetts, Michigan, Mississippi, New Mexico, New York, North
Carolina, Ohio, South Carolina, Tennessee, Texas, Vermont, Virginia, West
Virginia, and Wisconsin.
We are also grateful to those who served as consultants to the
committee. Marguerite Clarke, research associate at Boston College,
provided invaluable contributions during all phases of the study,
especially on psychometric issues. Edward Miller joined the project midway
as editor, and he skillfully, tirelessly pulled our bits, scraps,
and, sometimes, avalanches of text into clear, concise prose. Diane August
provided important advice and assistance on the testing of
English-language learners and prepared early drafts of Chapter 9 of the report. Susan
E. Phillips, Michigan State University, and William L. Taylor, a member
of the Board on Testing and Assessment, provided valuable advice on
legal issues in testing. Taissa S. Hauser volunteered to collect and
assemble statistical data on school retention and age-grade retardation, and
her good company and quiet advice were a source of support to all on
the project staff.
We owe an important debt of gratitude to the scientific and
professional staff of the Commission on Behavioral and Social Sciences
and Education (CBASSE), without whose guidance, support, and hard
work we could not conceivably have completed this report. Barbara B.
Torrey, executive director of the commission, and Sandy Wigdor, director of
the Division on Education, Labor, and Human Performance, have been
enthusiastic supporters of the project and a timely source of gracious
reminders that we keep our priorities in line. Michael J. Feuer, director
of the Board on Testing and Assessment (BOTA), brought our
research team together, created staff support and resources whenever we
needed them, and was our most valuable guide, sounding board, and humorist
as we pondered the complexities of educational policy analysis.
Patricia Morison made major contributions to our work on students with
disabilities and English-language learners and was a constant source of
support and thoughtful ideas. Allison Black contributed to many phases of
the project; she developed many of the background materials for the
committee, and her structured interviews with school administrators were a
key source of information about local testing policies and practices.
Naomi Chudowsky took major responsibility for the investigation of high
school graduation and also contributed to the presentation of psychometric
concepts, and Robert Rothman made important contributions to the
analysis of policy alternatives. During her summer internship, Yale
University doctoral student Marilyn Dabady was a careful and critical in-house
reader of our drafts. National Research Council (NRC) staff were always
available to pitch in when expertise or energy was called for. They were
key members of the study team, and it is hard to see how the study could
have been completed without their expert help.
Kimberly Saldin served unflappably and flawlessly as the
committee's senior project assistant. She dealt smoothly with the logistics of our
four committee meetings in five months, with our voluminous collections
and distributions of published and unpublished research materials, and with
a seemingly endless stream of text files, e-mail file attachments, and
file revisions in seemingly incompatible word-processing formats.
Other BOTA staff (Steve Baldwin, Alix Beatty, Meryl
Bertenthal, Cadelle Hemphill, Lee Jones, Karen Mitchell) offered advice, help,
and support at key stages of the process. Kimberly Saldin received
support when she needed it from other wonderful project assistants to the board:
Lisa Alston, Dorothy Majewski, Jane Phillips, and Holly Wells.
Viola Horek, administrative associate to BOTA, was always there,
instrumental in seeing that the entire project ran smoothly.
We are deeply grateful to Eugenia Grohman, associate director
for reports of CBASSE. Genie has and shares enormous knowledge
and experience in keeping a committee on track and putting a report
together from beginning to end. We also appreciate the superb work of
Christine McShane, to whom fell the responsibility for final editing of the
full report. We are indebted, also, to the whole CBASSE staff for
indulging our scheduling exigencies. Thanks also to Sally Stanfield and the
whole Audubon team at the National Academy Press for their creative
and speedy support.
Several members of the Board on Testing and Assessment were
not members of the committee but attended our meetings ex officio and
were constant sources of wisdom and encouragement: Robert L. Linn,
University of Colorado at Boulder, chair of the Board on Testing and
Assessment, and committee member ex officio; William L. Taylor, Attorney
at Law; and Carl F. Kaestle, Brown University.
Individual committee members have made outstanding
contributions to the study. Several of them drafted sections on particular topics,
prepared background materials, or helped to organize workshops and
committee discussions. Everyone contributed constructive, critical
thinking, serious concern about the difficult and complex issues that we faced,
and an open-mindedness that was essential to the success of the project.
A word of acknowledgment to the sponsors of this study. We
have benefited from supportive and collegial relations with members of
the various House and Senate committee staffs, on both sides of the
aisle, for whom the results of our work have such important implications.
We thank them all for understanding and respecting the process of the NRC.
Our contracting officer's technical representative, Holly Spurlock, of
the U.S. Department of Education, has been a most effective project
officer; we thank her for her patience and guidance throughout. Many
other officials in the department, the National Assessment Governing
Board, and in numerous private and public organizations involved in testing
also deserve our thanks and recognition for their cooperation in
providing information.
This report has been reviewed by individuals chosen for their
diverse perspectives and technical expertise, in accordance with procedures
approved by the NRC's Report Review Committee. The purpose of
this independent review is to provide candid and critical comments that
will assist the authors and the NRC in making the published report as
sound as possible and to ensure that the report meets institutional standards
for objectivity, evidence, and responsiveness to the study charge. The
content of the review comments and draft manuscript remain confidential
to protect the integrity of the deliberative process.
We wish to thank the following individuals, who are neither
officials nor employees of the NRC, for their participation in the review of
this report: Lloyd Bond, School of Education, University of North
Carolina, Greensboro; Wayne J. Camara, The College Board, New York, New
York; John Fremer, Educational Testing Service, Princeton, New Jersey;
Adam Gamoran, Wisconsin Center for Education Research, University of
Wisconsin; Arthur S. Goldberger, Department of Economics, University
of Wisconsin; Lyle V. Jones, L.L. Thurstone Psychometric Laboratory,
University of North Carolina, Chapel Hill; Jeannie Oakes, Graduate
School of Education and Information Studies, University of California, Los
Angeles; Diana Pullin, School of Education, Boston College; Henry
W. Riecken, Professor of Behavioral Sciences (emeritus), University of
Pennsylvania School of Medicine.
Although the individuals listed above have provided many
constructive comments and suggestions, responsibility for the final content of
this report rests solely with the authoring committee and the NRC.
The two of us were unacquainted when we began the project,
and, one a legal scholar and the other a demographer, we had little in
common beyond our shared belief in the importance of our mandate. Each
of us has benefited from the other's strengths, and working together
has been an unalloyed pleasure.
Jay Heubert, Study Director
Robert M. Hauser, Chair
Committee on Appropriate Test Use
The National Academy of Sciences is a private, nonprofit, self-perpetuating
society of distinguished scholars engaged in scientific and engineering research, dedicated to
the furtherance of science and technology and to their use for the general welfare. Upon
the authority of the charter granted to it by the Congress in 1863, the Academy has
a mandate that requires it to advise the federal government on scientific and
technical matters. Dr. Bruce M. Alberts is president of the National Academy of Sciences.
The National Academy of Engineering was established in 1964, under the charter
of the National Academy of Sciences, as a parallel organization of outstanding engineers.
It is autonomous in its administration and in the selection of its members, sharing with
the National Academy of Sciences the responsibility for advising the federal government.
The National Academy of Engineering also sponsors engineering programs aimed
at meeting national needs, encourages education and research, and recognizes the
superior achievements of engineers. Dr. William A. Wulf is president of the National Academy
of Engineering.
The Institute of Medicine was established in 1970 by the National Academy
of Sciences to secure the services of eminent members of appropriate professions in
the examination of policy matters pertaining to the health of the public. The Institute
acts under the responsibility given to the National Academy of Sciences by its
congressional charter to be an adviser to the federal government and, upon its own initiative, to
identify issues of medical care, research, and education. Dr. Kenneth I. Shine is president of
the Institute of Medicine.
The National Research Council was organized by the National Academy of
Sciences in 1916 to associate the broad community of science and technology with the
Academy's purposes of furthering knowledge and advising the federal government. Functioning
in accordance with general policies determined by the Academy, the Council has
become the principal operating agency of both the National Academy of Sciences and the
National Academy of Engineering in providing services to the government, the public,
and the scientific and engineering communities. The Council is administered jointly by
both Academies and the Institute of Medicine. Dr. Bruce M. Alberts and Dr. William A.
Wulf are chairman and vice chairman, respectively, of the National Research Council.
Contents
Executive Summary
PART I
BACKGROUND AND CONTEXT
1 Introduction
2 Assessment Policy and Politics
3 Legal Frameworks
4 Tests as Measurements
PART II
USES OF TESTS TO MAKE HIGH-STAKES DECISIONS
ABOUT INDIVIDUALS
5 Tracking
6 Promotion and Retention
7 Awarding or Withholding High School Diplomas
8 Students with Disabilities
9 English-Language Learners
10 Use of Voluntary National Test Scores for Tracking, Promotion, or Graduation Decisions
PART III
ENSURING APPROPRIATE USES OF TESTS
11 Potential Strategies for Promoting Appropriate Test Use
12 Findings and Recommendations
Biographical Sketches
Index
Public Law 105-78, enacted November 13, 1997
SEC. 309. (a) STUDY.--The National Academy of Sciences
shall conduct a study and make written recommendations on
appropriate methods, practices, and safeguards to ensure that
(1) existing and new tests that are used to assess student
performance are not used in a discriminatory manner or inappropriately for
student promotion, tracking or graduation; and
(2) existing and new tests adequately assess student reading
and mathematics comprehension in the form most likely to yield
accurate information regarding student achievement of reading and
mathematics skills.
(b) REPORT TO CONGRESS.--The National Academy of
Sciences shall submit a written report to the White House, the
National Assessment Governing Board, the Committee on Education and
the Workforce of the House of Representatives, the Committee on
Labor and Human Resources of the Senate, and the Committees on
Appropriations of the House and Senate not later than September 1, 1998.
Executive Summary
The use of large-scale achievement tests as instruments of
educational policy is growing. In particular, states and school districts are
using such tests in making high-stakes decisions with important
consequences for individual students. Three such high-stakes decisions involve
tracking (assigning students to schools, programs, or classes based on
their achievement levels), whether a student will be promoted to the
next grade, and whether a student will receive a high school diploma.
These policies enjoy widespread public support and are increasingly seen as
a means of raising academic standards, holding educators and students
accountable for meeting those standards, and boosting public confidence
in the schools.
Because the stakes are high, the Congress wants to ensure that
tests are used properly and fairly, and it asked the National Academy of
Sciences, through its National Research Council, to "conduct a study
and make written recommendations on appropriate methods, practices
and safeguards to ensure that
A. existing and new tests that are used to assess student
performance are not used in a discriminatory manner or inappropriately for
student promotion, tracking or graduation; and
B. existing and new tests adequately assess student reading and
mathematics comprehension in the form most likely to yield accurate
information regarding student achievement of reading and mathematics skills."
This study focuses on tests with high stakes for individual students.
The committee recognizes that accountability for students is related
in important ways to accountability for educators, schools, and school
districts. Indeed, the use of tests for accountability of educators,
schools, and school districts has significant consequences for individual
students, for example, by changing the quality of instruction or affecting
school management and budgets. Such indirect effects of large-scale
assessment are worth studying in their own right. By focusing on the
congressional interest in high-stakes decisions about individual students, this
report does not address accountability at those other levels, apart from the
issue of participation of all students in large-scale assessments.
BASIC PRINCIPLES OF TEST USE
The use of tests in decisions about student tracking, promotion,
and graduation is intended to serve educational policy goals, such as
setting high standards for student learning, raising student
achievement levels, ensuring equal educational opportunity, fostering parental
involvement in student learning, and increasing public support for the schools.
The committee recognizes that test use may have negative consequences
for individual students even while serving important social or
educational policy purposes. The development of a comprehensive testing
policy should therefore be sensitive to the balance among the individual
and collective benefits and costs of various uses of tests.
Determining whether high-stakes testing of students produces
better overall educational outcomes requires that its potential benefits
be weighed against its potential unintended negative consequences.
Thus, the value of tests should also be weighed against the use of other
information in making high-stakes decisions about students. Tracking,
promotion, and graduation decisions will be made with or without tests.
The committee adopted three principal criteria, developed from
earlier work by the National Research Council, for determining whether
a test use is appropriate:
(1) measurement validity: whether a test is valid for a
particular purpose, and whether it accurately measures the test taker's knowledge
in the content area being tested;
(2) attribution of cause: whether a student's performance on a
test reflects knowledge and skill based on appropriate instruction or is
attributable to poor instruction or to such factors as language barriers or
disabilities unrelated to the skills being tested; and
(3) effectiveness of treatment: whether test scores lead to
placements and other consequences that are educationally beneficial.
These criteria, based on established professional standards, lead
to the following basic principles of appropriate test use for educational
decisions:
The important thing about a test is not its validity in general,
but its validity when used for a specific purpose. Thus, tests that are valid
for influencing classroom practice, "leading" the curriculum, or
holding schools accountable are not appropriate for making high-stakes
decisions about individual student mastery unless the curriculum, the
teaching, and the test(s) are aligned.
Tests are not perfect. Test questions are a sample of
possible questions that could be asked in a given area. Moreover, a test score
is not an exact measure of a student's knowledge or skills. A student's
score can be expected to vary across different versions of a test (within
a margin of error determined by the reliability of the test) as a function
of the particular sample of questions asked and/or transitory factors, such
as the student's health on the day of the test. Thus, no single test score
can be considered a definitive measure of a student's knowledge.
An educational decision that will have a major impact on a
test taker should not be made solely or automatically on the basis of a
single test score. Other relevant information about the student's
knowledge and skills should also be taken into account.
Neither a test score nor any other kind of information can justify
a bad decision. Research shows that students are typically hurt by
simple retention and repetition of a grade in school without remedial and
other instructional support services. In the absence of effective services
for low-performing students, better tests will not lead to better
educational outcomes.
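The second principle above, that a student's score varies within a margin of error determined by the reliability of the test, can be illustrated with the standard error of measurement from classical test theory, SEM = SD x sqrt(1 - reliability). The sketch below is an illustration only and is not drawn from the committee's report. The function names, the score scale (standard deviation of 15), the reliability of 0.91, the observed score of 98, and the cut score of 100 are all hypothetical values chosen for the example.

```python
import math


def standard_error_of_measurement(score_sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie in [0, 1]")
    return score_sd * math.sqrt(1.0 - reliability)


def score_band(observed, score_sd, reliability, z=1.96):
    """Approximate band around an observed score (z=1.96 gives roughly 95%)."""
    sem = standard_error_of_measurement(score_sd, reliability)
    return observed - z * sem, observed + z * sem


# Hypothetical values, assumed purely for illustration.
SCORE_SD = 15.0      # standard deviation of the score scale
RELIABILITY = 0.91   # reliability coefficient of the test
OBSERVED = 98.0      # one student's observed score
CUT_SCORE = 100.0    # passing score for a high-stakes decision

sem = standard_error_of_measurement(SCORE_SD, RELIABILITY)
low, high = score_band(OBSERVED, SCORE_SD, RELIABILITY)
cut_inside_band = low < CUT_SCORE < high
```

Under these assumed values the standard error of measurement is about 4.5 points, so the band around an observed score of 98 runs from roughly 89 to 107 and includes the cut score of 100. In such a case the single observed score cannot settle whether the student's knowledge falls above or below the cut, which is consistent with the principle that other relevant information should also be taken into account.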
The committee has considered how these principles apply to
the appropriate use of tests in decisions about tracking, promotion, and
graduation, to increasing the participation of students with disabilities
and English-language learners in large-scale assessments, and to possible
uses of the proposed voluntary national tests in making high-stakes
decisions about individual students. The committee has also examined
existing and potential strategies for promoting appropriate test use.
USES AND MISUSES OF TESTS
Blanket criticisms of tests are not justified. When tests are used
in ways that meet relevant psychometric, legal, and educational
standards, students' scores provide important information that, combined with
information from other sources, can lead to decisions that promote
student learning and equality of opportunity. For example, tests can
identify learning differences among students that the education system needs
to address. Because decisions about tracking, promotion, and
graduation will be made with or without testing, proposed alternatives to the use
of test scores should be at least equally accurate, efficient, and fair.
It is also a mistake to accept observed test scores as either infallible
or immutable. When test use is inappropriate, especially in making
high-stakes decisions about individuals, it can undermine the quality of
education and equality of opportunity. For example, the lower
achievement test scores of racial and ethnic minorities and students from
low-income families reflect persistent inequalities in American society and its
schools, not inalterable realities about those groups of students. The improper
use of test scores can reinforce these inequalities. This lends special
urgency to the requirement that test use with high-stakes consequences for
individual students be appropriate and fair.
Decisions about tracking, promotion, and graduation differ from
one another in important ways. They differ most importantly in the role
that mastery of past material and readiness for new material play. Thus,
the committee has considered the role of large-scale high-stakes testing
in relation to each type of decision separately in this report. But
tracking, promotion, and graduation decisions also share common features
that pertain both to appropriate test use and to their educational and
social consequences.
Members of some minority groups, English-language learners,
and students from low socioeconomic backgrounds are overrepresented
in lower-track classes and among those denied promotion or graduation
on the basis of test scores. Moreover, these same groups of students
are underrepresented in high-track classes, "exam" schools, and "gifted
and talented" programs. In some cases, such as courses for
English-language learners, such disproportions are logical: one would not expect to
find native English speakers in classes designed to teach English to
English-language learners. In other circumstances, such disproportions raise
serious questions. For example, grade retardation among children
cumulates rapidly after age 6, and it occurs disproportionately among males
and minority group members. These disproportions are especially
disturbing in view of other evidence that, as typically practiced, grade retention
and assignment to low tracks have little educational value. For
example, assignment to low tracks is typically associated with an
impoverished curriculum, poor teaching, and low expectations. It is also important
to note that group differences in test performance do not necessarily
indicate problems in a test, because test scores may reflect real differences
in achievement. These, in turn, may be due to a lack of access to a
high-quality curriculum and instruction. Thus, a finding of group
differences calls for a careful effort to determine their cause.
RECOMMENDATIONS
The committee offers more detailed recommendations in Chapter
12 about the appropriate uses of tests. Those recommendations cover
cross-cutting issues that affect testing generally; specific issues and
problems pertaining to the uses of tests in tracking, promotion, and graduation;
and the inclusion of students with disabilities and students who are
English-language learners. The organization of the recommendations in
Chapter 12 follows the logic of the chapters in this report. In this
executive summary, we present overarching recommendations and discuss the
possible use of the proposed voluntary national tests for high-stakes
decisions about individual students.
Accountability for educational outcomes should be a shared
responsibility of states, school districts, public officials, educators,
parents, and students. High standards cannot be established and
maintained merely by imposing them on students. Moreover, if parents,
educators, public officials, and others who share responsibility for educational
outcomes are to discharge their responsibility effectively, they should
have access to information about the nature and interpretation of tests and
test scores. Such information should be freely available to the public
and should be incorporated into teacher education and into educational
programs for principals, administrators, public officials, and others.
Tests should be used for high-stakes decisions about
individual mastery only after implementing changes in teaching and
curriculum that ensure that students have been taught the knowledge and skills
on which they will be tested. Some school systems are already doing this
by planning a gap of several years between the introduction of new tests
and the attachment of high stakes to individual student performance,
during which schools may achieve the necessary alignment among tests,
curriculum, and instruction. But others may see attaching high stakes to
individual student test scores as a way of leading curricular reform, not
recognizing the danger that such uses of tests may lack the
"instructional validity" required by law, that is, a close correspondence between
test content and instructional content.
The consequences of high-stakes testing for individual
students are often posed as either-or propositions, but this need not be the case.
For example, "social promotion" and repetition of a grade are really
only two of many educational strategies available to educators when test
scores and other information indicate that students are experiencing
serious academic difficulty. But neither social promotion nor retention alone
is an effective treatment for low achievement, and schools can use a
number of other possible strategies to reduce the need for these
either-or choices, for example, by coupling early identification of such
students with effective remedial education.
Some large-scale assessments are used to make high-stakes
decisions about individual students, but most often in combination with
other information, as recommended by the major professional and
scientific organizations concerned with testing. For example, most school
districts say they base promotion decisions on a combination of grades,
achievement test scores, developmental factors, attendance, and teacher
recommendations. As our study has shown, however, a number of
jurisdictions have adopted policies that rely exclusively on achievement test scores
to make high-stakes decisions. A test score, like other sources of
information, is not exact. It is an estimate of the student's understanding
or mastery at a particular time. Therefore, high-stakes educational
decisions should not be made solely or automatically on the basis of a
single test score but should also take other relevant information into account.
The preparation of students plays a key role in appropriate test use.
It is not proper to expose students ahead of time to items that will
actually be used on their test or to give students the answers to those
questions. Test results may also be invalidated by teaching so narrowly to
the objectives of a particular test that scores are raised without actually
improving the broader set of academic skills that the test is intended
to measure. The desirability of "teaching to the test" is affected by
test design. For example, it is entirely appropriate to prepare students
by covering all the objectives of a test that represents the full range of
the intended curriculum. We therefore recommend that test users
respect the distinction between genuine remedial education and teaching
narrowly to the specific content of a test. At the same time, all
students should receive sufficient preparation for the specific test so their
performance will not be adversely affected by unfamiliarity with its format or
by ignorance of appropriate test-taking strategies.
Accurate assessment of students with disabilities and
English-language learners presents complex technical and policy challenges, in
part because these students are particularly vulnerable to potential
negative consequences when high-stakes decisions are based on tests. We
recommend that policymakers pursue two key policy objectives in
modifying tests and testing procedures for these special populations:
(1) to increase such students' participation in large-scale
assessments, in part so that school systems can be held accountable for
their educational progress; and
(2) to test each such student in a manner that provides
appropriate accommodation for the effect of a disability or of limited English
proficiency on the subject matter being tested, while maintaining the
validity and comparability of test results among all students.
These objectives are sometimes in tension, and the goals of full
participation and valid measurement thus present serious technical and
operational challenges to test developers and users.
The purpose of the proposed voluntary national tests (VNT) is
to inform students (and their parents and teachers) about their
performance in 4th grade reading and 8th grade mathematics relative to the
standards of the National Assessment of Educational Progress and to
performance in the Third International Mathematics and Science Study. The
proposal does not suggest any direct use of VNT scores to make
decisions about the tracking, promotion, or graduation of individual students,
and thus it is not being developed to support those uses. However, states
and school districts would be free to use scores on the voluntary national
tests for these purposes. Given their design, the proposed voluntary
national tests should not be used for decisions about the tracking, promotion,
or graduation of individual students. The committee takes no position
on whether the voluntary national tests are practical or appropriate for
their primary stated purposes.
The committee sees a strong need for better evidence on the
intended benefits and unintended negative consequences of using
high-stakes tests to make decisions about individuals. A key question
is whether the consequences of a particular test use are educationally
beneficial for students, for example, by increasing academic achievement
or reducing dropout rates. It is also important to develop statistical
reporting systems of key indicators that will track both intended effects (such
as higher test scores) and other effects (such as changes in dropout or
special education referral rates). Indicator systems could include measures
such as retention rates, special education identification rates, rates of
exclusion from assessment programs, number and type of
accommodations, high school completion credentials, dropout rates, and indicators of
access to high-quality curriculum and instruction.
PROMOTING APPROPRIATE TEST USE
At present, professional norms and legal action (through
administrative enforcement or litigation) are the principal mechanisms available
to enforce appropriate test use. These mechanisms are inadequate.
Compliance with provisions of the Joint Standards for Educational and
Psychological Testing and the Code of Fair Testing Practices in Education
is largely voluntary, and enforcement is often weak. Legal action is
typically adversarial, time-consuming, and expensive, and applicable law
can vary by jurisdiction, making enforcement uneven.
New methods, practices, and safeguards could take any of
several forms, but in general they would appear at various points on a
continuum between professional norms and legal enforcement, some less
coercive, some more so. Deliberative forums, an independent oversight body,
labeling, and federal regulation represent a range of possible options
that could supplement professional standards and litigation as means of
promoting and enforcing appropriate test use.
The committee is not recommending adoption of any particular
strategy or combination of strategies, nor does it suggest that these four
approaches are the only possibilities. We do think, however, that
ensuring proper test use will require multiple strategies. Given the inadequacy
of current methods, practices, and safeguards, there should be further
research on these and other policy options to illuminate their
possible effects on test use. In particular, we would suggest empirical research
on the effects of these strategies, individually and in combination, on
testing products and practice, and an examination of the associated
potential benefits and risks.
Large-scale assessments, used properly, can improve teaching,
learning, and equality of educational opportunity. That tests are
sometimes used improperly should not discourage policymakers, teachers, and
parents. Rather, it should motivate action to ensure that educational
tests are used fairly and effectively. This report is a contribution to
that essential work.