The introduction explains the history of how today’s high-stakes testing evolved. In the 1950s and early 1960s, people still had faith in the educational system in the U.S. This faith was followed by doubts in the late 1960s and early 1970s, when articles were published about students who received high school diplomas but could not read.
Widespread distress about public education engendered a legislative response. New regulations in some states and districts required students to pass minimum competency tests.
Another factor was the Elementary and Secondary Education Act (ESEA) of 1965. This federal legislation gave federal funds to schools if they could evaluate and report their effectiveness. This led many schools to adopt standardized testing because the tests 1) had been developed by respected measurement companies, 2) were widely available, and 3) were regarded as technically first-rate.
By the late 1980s, most states mandated some sort of statewide testing program. Some of these tests are off-the-shelf standardized achievement tests; others are customized and built for the state. All are standardized in the sense that they are administered and scored in a uniform, predetermined manner.
The 1990s brought a tremendous increase in the reliance on students’ standardized achievement test scores as indicators of instructional quality. Newspapers began to publish the results of schools and districts. According to Dr. Popham, U.S. educators have been thrown into a score-boosting game they cannot win – at least not without harming children.
Chapter 1: Classroom Consequences of Unsound Testing Popham asserts that high-stakes testing puts misdirected pressure on educators, allows the misidentification of inferior and superior schools, produces curricular reductionism (with a tendency toward drill-and-kill instruction), and encourages test-pressured cheating. He believes that some terrible things are happening in U.S. schools – and they’re happening as a direct consequence of ill-conceived high-stakes testing programs.
Chapter 2: Why We Test This chapter deals with assessment’s ideal role within the instructional process, which follows something like this: Instruction – Content – Assessment – Inferences – Decisions – and back to Instruction. Classroom tests, the ones teachers construct, have three primary purposes: 1) to give students grades, 2) to motivate students, and 3) to help make instructional decisions. Large-scale assessment programs, the bulk of which are of the high-stakes variety, are in place chiefly because someone believes that the annual collection of students’ achievement scores will allow the public and educational policymakers to see whether educators are performing satisfactorily.
Chapter 3: The Mystique of Standardized Measuring Instruments A state-customized version of a standardized achievement test is called a criterion-referenced test, while the nationally distributed tests are described as norm-referenced tests. Most of the state tests are built by the same companies that market the national standardized achievement tests.
The book identifies two major evaluative shortcomings of these standardized tests: 1) teaching/testing mismatches (you’ll often find that half or more of what’s tested wasn’t even supposed to be taught in a particular district or state), and 2) a tendency to jettison items covering important content.
The tests are constructed to show a substantial score-spread, which contributes to a test’s measured reliability: the better the score-spread, the higher the test’s reliability. High reliability sells tests. Test companies use the term “p-value” to indicate the proportion of students who answer an item correctly; an item with a p-value of .85 means 85% answered it correctly. The items that make the best contribution to a test’s score-spread are those with p-values in the .40 to .60 range. Items with high p-values are almost always removed from the tests when the tests are revised.
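The p-value screening described above can be sketched in a few lines of Python. The item names, response data, and ten-examinee class below are invented purely for illustration:

```python
# Sketch of p-value screening. A "p-value" here is simply the
# proportion of examinees who answered an item correctly.

responses = {
    # item_id: list of 0/1 scores for ten hypothetical examinees
    "item_1": [1, 1, 1, 1, 1, 1, 1, 1, 1, 0],  # p = .90: widely mastered content
    "item_2": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # p = .50: maximizes score-spread
    "item_3": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # p = .20: very difficult item
}

def p_value(scores):
    """Proportion of examinees who answered the item correctly."""
    return sum(scores) / len(scores)

for item, scores in responses.items():
    p = p_value(scores)
    # Items inside the .40-.60 band best spread out examinees' scores,
    # so revision cycles tend to keep them and drop the rest.
    verdict = "keep" if 0.40 <= p <= 0.60 else "revise out"
    print(f"{item}: p = {p:.2f} -> {verdict}")
```

Note that item_1 is exactly the kind of item the book worries about: a high p-value may mean teachers taught the content well, yet the revision process discards it for contributing too little score-spread.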
Accordingly, standardized achievement tests should not be used to evaluate the quality of students’ schooling because the quest for wide score-spread tends to eliminate items covering important content that teachers have emphasized and students have mastered. Furthermore, students’ scores on standardized tests should not be used as indicators of instructional quality.
Chapter 4: Confounded Causality When policymakers create accountability systems centered on students’ test scores, they assume that higher test scores reflect better instruction. This just isn’t the case – it is like comparing apples to oranges. As any teacher knows, students’ cognitive quality at any given grade level can vary substantially from one year to the next. It is also impossible to devise truly equally difficult test forms for each grade level (and what about the small school with multi-age, multi-graded classrooms?).
Three distinct factors determine whether a student will correctly answer the items on a standardized achievement test: 1) what the student learned in school, 2) the student’s socioeconomic status (SES), and 3) the student’s inherited academic aptitudes.
What the student learned in school is what the test is supposed to measure: the kinds of skills and knowledge we hope the children will be learning.
The presence of SES-linked items means students from middle- and upper-SES backgrounds will perform better than those from lower-SES backgrounds. These items do a fantastic job of spreading out examinees’ scores, but a miserable job of helping evaluate a school staff’s real effectiveness. Across the five major national achievement tests, Dr. Popham found that 65% of the items in the Language Arts category were SES-linked.
The third factor is the student’s inherited academic aptitude, or genetic intellectual potential. Items that tap this factor are really camouflaged intelligence-test items and should not be used to evaluate the quality of instruction. Across the five major national achievement tests, Dr. Popham found that 55% of the items in the Science category measured inherited academic aptitude.
Standardized achievement tests should not be used to judge the quality of students’ schooling because factors other than instruction also influence students’ performance on these tests.
Chapter 5 gives ideas on creating large-scale tests that illuminate instructional decisions.
Chapter 6 has ideas to get maximum mileage out of classroom assessments.
Chapter 7 was illuminating, as it explains how we can collect credible evidence of instructional effectiveness. I found the idea of a split-and-switch version of a pretest/post-test of particular interest. Basically, you create two forms (A and B). For the pretest, one half of the class takes one form and the other half takes the other; you then file the pretests. Later, when the post-test is given, each student takes the form he or she did not take for the pretest. Only after the post-test do you grade all papers (with names on the back) to judge your instructional effectiveness.
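The form-assignment logic of the split-and-switch design can be sketched as follows. The student names, class size, and random seed are all invented for illustration:

```python
import random

# Split-and-switch sketch: two test forms (A and B), half the class takes
# each form as a pretest, then the forms are swapped for the post-test.

students = ["Ana", "Ben", "Cal", "Dee", "Eli", "Fay"]

random.seed(0)  # fixed seed only so this sketch is repeatable
order = random.sample(students, k=len(students))  # random split of the class
half = len(order) // 2

# Pretest: first half of the shuffled roster gets form A, second half form B.
pretest_form = {name: ("A" if i < half else "B") for i, name in enumerate(order)}

# Post-test: each student takes the form he or she did NOT see at pretest.
posttest_form = {name: ("B" if f == "A" else "A") for name, f in pretest_form.items()}

# Because each form's pretest papers and post-test papers come from different,
# randomly chosen halves of the class, both piles for a given form can be
# blind-scored together (names on the back) and then compared to estimate
# instructional effectiveness.
for name in students:
    print(f"{name}: pretest form {pretest_form[name]}, post-test form {posttest_form[name]}")
```

The design choice worth noting is that no student ever sees the same form twice, so post-test gains cannot be explained by mere familiarity with the pretest items.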
Chapter 8, the final chapter, shares 8 ways we can take action. They are: 1) Offer assessment literacy programs for all teachers and administrators, 2) provide assessment literacy programs for parents, 3) establish autonomous parent-action groups, 4) offer assessment-related briefing sessions for educational policymakers, 5) initiate a systematic public information campaign regarding local high-stakes tests, 6) conduct rigorous, security-monitored reviews of the items found in a high-stakes test, 7) implement defensible evaluative schemes for school and district level accountability, and 8) demand the installation of a more educationally appropriate high-stakes statewide test.
Personally, I found the book very informative and full of useful ideas. I would strongly recommend it as a “must read” in our age of increasing accountability.