Interest in including evidence of student learning in evaluations of teachers has been
growing. After all, if student learning is the primary goal of teaching, it appears straightforward
that it ought to be taken into account in determining a teacher’s competence. However, the
research literature includes many cautions about the problems of basing teacher evaluations
substantially on student test scores.
These include concerns about overemphasis on “teaching to the test” at the expense of
other kinds of learning (especially given the narrowness of most tests currently used in the United
States); problems of attributing student gains to specific teachers; and disincentives for teachers
to serve high-need students, for example, those who do not yet speak English and those who have
special education needs (and whose test scores therefore may not accurately reflect their
learning). Such disincentives could inadvertently reinforce current practices in which inexperienced teachers are
disproportionately assigned to the neediest students or schools, and in which high-need
students may be discouraged from entering or staying.
Researchers have been developing value-added methods (VAM) for looking at gains in
student test scores, and these have proved valuable for research on groups of teachers. However,
most researchers agree that VAM is not appropriate as a primary measure for
evaluating individual teachers. Reviews of research on value-added methodologies for estimating
teacher “effects” based on student test scores have concluded that these measures are too unstable
and too vulnerable to many sources of error to be used for teacher evaluation. A major report by the
RAND Corporation concluded that:
The research base is currently insufficient to support the use of VAM for high-stakes decisions
about individual teachers or schools.1
Similarly, Henry Braun of the Educational Testing Service concluded in his review of research:
VAM results should not serve as the sole or principal basis for making consequential
decisions about teachers. There are many pitfalls to making causal attributions of teacher
effectiveness on the basis of the kinds of data available from typical school districts. We still
lack sufficient understanding of how seriously the different technical problems threaten the
validity of such interpretations.2
According to these studies, the problems with using value-added models to determine teacher
effectiveness include:
Teachers’ ratings are affected by differences in the students who are assigned to them.
Students are not randomly assigned to teachers, and statistical models cannot fully adjust for the fact
that some teachers will have a disproportionate number of students who may be exceptionally difficult
to teach (students with poor attendance, who are homeless, who have severe problems at home, etc.) and
whose scores on traditional tests are frequently not valid (e.g., those who have special education
needs or who are English language learners). In addition, student attendance can have as large an
effect on student learning growth as teachers’ competence. All of these factors can create both
misestimates of teachers’ effectiveness and disincentives for teachers to want to teach the students
who have the greatest needs.
Value-added models of teacher effectiveness do not produce stable ratings of teachers.
Teachers look very different in their measured effectiveness when different statistical
methods are used.3 In addition, a given teacher may appear to have differential effectiveness
from class to class, from year to year, and even from test to test. Researchers have found
that teachers’ effectiveness ratings differ significantly when their students are evaluated on
different tests, even when these are within the same content area.4 Braun notes that ratings
are most unstable at the upper and lower ends of the scale, where many would like to use
them to determine high or low levels of effectiveness.
It is impossible to fully separate out the influences of students’ other teachers, as well as school conditions, on their apparent learning.
Prior teachers have lasting effects, for good or ill, on students’ later learning, and current teachers
also interact to produce students’ knowledge and skills. For example, the essay writing a student learns through his history
teacher may be credited to his English teacher, even if she assigns no writing; the math he
learns in his physics class may be credited to his math teacher. Specific skills and topics
taught in one year may not be tested until later years. A teacher who works in a well-resourced
school with specialist supports may appear to be more effective than one whose
students don’t receive these supports.
As Braun notes,
It is always possible to produce estimates of what the model designates as teacher effects.
These estimates, however, capture the contributions of a number of factors, those due to
teachers being only one of them. So treating estimated teacher effects as accurate indicators
of teacher effectiveness is problematic.
Economist Jesse Rothstein makes a similar point:
[Value-added methods] rely on strong, unverified assumptions about the teacher assignment
process. The VAMs in common use depend on incorrect assumptions, and richer models that
are not falsified by the data yield notably different estimates of teachers effects. Causal
inference from observational data on student tests and teacher assignments calls for a great
deal of caution… Value added estimates should be validated before being pressed into
service in accountability and compensation policy.5
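The first problem above, non-random assignment, can be illustrated with a small simulation. This is a minimal sketch under invented assumptions, not a real value-added model: the class size, the numbers of high-need students, the size of the penalty on their measured gains, and the noise level are all hypothetical values chosen only for illustration. Two teachers are given identical true effects, yet a naive comparison of average class gains makes the teacher with more high-need students look less effective.

```python
import random
import statistics


def class_mean_gain(n_students, n_high_need, teacher_effect, rng,
                    base_gain=10.0, high_need_penalty=6.0, noise_sd=2.0):
    """Mean test-score gain for one simulated class.

    All parameter values are hypothetical. The first `n_high_need`
    students have their measured gain lowered by `high_need_penalty`
    for reasons unrelated to the teacher (e.g. poor attendance).
    """
    gains = []
    for i in range(n_students):
        gain = base_gain + teacher_effect + rng.gauss(0.0, noise_sd)
        if i < n_high_need:
            gain -= high_need_penalty
        gains.append(gain)
    return statistics.mean(gains)


def estimated_gap(n_years=200, seed=0):
    """Average naive gap (teacher B minus teacher A) over simulated years.

    Both teachers have the same true effect (0.0). Teacher A is assigned
    3 high-need students out of 30; teacher B is assigned 18 out of 30.
    """
    rng = random.Random(seed)
    gaps = []
    for _ in range(n_years):
        mean_a = class_mean_gain(30, 3, 0.0, rng)
        mean_b = class_mean_gain(30, 18, 0.0, rng)
        gaps.append(mean_b - mean_a)
    return statistics.mean(gaps)


if __name__ == "__main__":
    gap = estimated_gap()
    print(f"Average estimated gap (B minus A): {gap:.2f}")
```

In this sketch the estimated gap is negative even though the two teachers are equally effective by construction. Actual value-added models adjust for prior scores and student characteristics, so they would reduce this gap; the point made by the studies cited above is that such adjustments cannot be counted on to remove it entirely.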
Finally, many analysts raise concerns about a system that would tie judgments about teaching
to the kinds of tests most commonly used in the United States, and to the kinds of tests required
to draw inferences about student growth. Neither allows for the rich assessment
of wide-ranging skills that is commonly used in high-achieving nations and needed to evaluate a full range of
educational goals. Value-added models require vertically scaled tests, which most states (including
large states like New York and California) do not use. In order to be scaled, tests must evaluate
content that is measured along a continuum from year to year. This reduces their ability to
measure the breadth of curriculum content in a particular course or grade level. Curriculum-based
tests focused on specific grade-level standards have advantages, but they do not allow a measure
of student growth, and are likely to miss gains for students who are lower-achieving or higher-achieving.
Most tests in the U.S. are also primarily multiple-choice; thus they do not measure or
encourage a focus on writing, research, scientific investigation, technology applications, or a host
of other critically important skills. Harvard University professor Dan Koretz points out that:
[V]alue-added models, taken by themselves, are not an adequate measure of overall
educational quality. Like any other measure based on standardized tests, VAMs provide a
valuable but incomplete view of students’ knowledge, skills, and dispositions. Because of the
need for vertically scaled tests, value-added systems may be even more incomplete than
some status or cohort-to-cohort systems. Value added-based rankings of teachers are highly
error-prone. And value-added modeling does nothing to address the … problems of an
excessive focus on standardized test scores in an accountability system: undue narrowing of
instruction, inappropriate test preparation, and the resulting inflation of test scores. Finally,
we have to accept that even within the range of outcomes assessed by the tests used in
VAMs, they cannot be counted on to give us true estimates of teachers’ value added…6
To understand the influences on student learning, more data about student learning, teachers’
practices, and context are needed.
1 Daniel F. McCaffrey, Daniel Koretz, J. R. Lockwood, Laura S. Hamilton (2005). Evaluating Value-Added
Models for Teacher Accountability. Santa Monica: RAND Corporation.
2 Henry Braun, Using Student Progress to Evaluate Teachers: A Primer on Value-Added Models
(Princeton, NJ: ETS, 2005), p. 17.
3 Rothstein, J. (2007). Do Value-Added Models Add Value? Tracking, Fixed Effects, and Causal
Inference. National Bureau of Economic Research.
4 Lockwood, J. R., McCaffrey, D. F., Hamilton, L. S., Stecher, B., Le, V. N., & Martinez, J. F. (2007). The
sensitivity of value-added teacher effect estimates to different mathematics achievement measures.
Journal of Educational Measurement, 44 (1), 47 – 67.
5 Rothstein, p. 32.
6 Koretz, D. (2008, Fall). A Measured Approach. American Educator, pp. 18-39, at p. 39.