Volume 31, Issue 2, pp. 27–32

Informing in the Information Age: How to Communicate Measurement Concepts to Education Policy Makers

Stephen G. Sireci
University of Massachusetts Amherst

Ellen Forte
edCount, LLC
Stephen G. Sireci, Center for Educational Assessment Amherst, 156 Hills South, University of Massachusetts, Amherst, MA 01003; [email protected]. Ellen Forte, edCount, LLC, 5335 Wisconsin Avenue NW, Suite 440, Washington, DC 20015.

Abstract

Current educational policies rely on educational assessments. However, the technical aspects of assessments are often unknown to policy makers, which is dangerous because sound assessment policy requires knowledge of the strengths and limitations of educational tests. In this article, we discuss the importance of informing policy makers of important psychometric issues that should be considered whenever tests are proposed for specific purposes. We discuss the types of information that are important to communicate to policy makers, how best to convey this information so that it can be understood, and how to be seen as a valuable source of information by education policy makers. We end with some specific steps organizations such as NCME can take to inform policy makers and advocate for valid educational assessment policies.

The 21st century has given educational assessment an unprecedented role in educational systems throughout the world. In the United States, the assessment mandates of No Child Left Behind (NCLB) are well known, and NCLB assessments affect millions of students, teachers, and school administrators throughout the United States. The fields of licensure assessment, certification, adult education, higher education, and employment are also experiencing increased activity in testing, as tests are used at virtually all levels of public accountability. International assessments are also rapidly increasing, with testing programs such as Trends in International Mathematics and Science Study (TIMSS; Mullis, Martin, & Foy, 2005), Programme for the International Assessment of Adult Competencies (Organization for Economic Co-operation and Development, 2004), the Program for International Student Assessment (Organization for Economic Co-operation and Development, 2006), and the Literacy Assessment Monitoring Program (Guadalupe, Tay-Lim, Cardoso, & Girardi, 2009) providing highly sought-after comparisons of how students and adults from different countries are doing with respect to math, science, reading, and various aspects of literacy. For example, the most recent TIMSS assessment involved 82 participating countries, which is just one indication that educational testing is not solely a phenomenon in the United States.

It is important to realize that the use of tests is initiated and managed by policy makers. Why are educational tests becoming increasingly common, and why are they relied upon so much by education policy makers? We believe there are three main reasons. First, they are seen as objective and quantifiable indicators of student achievement (McDonnell, 2004). Second, they are inexpensive, relative to other means for obtaining information about characteristics such as achievement, competence, and literacy. Third, in many cases, there are few, if any, alternatives.

As those who develop and advocate fair and appropriate testing practices, we can take pride in the first reason, but we find the other two discomforting. Thanks to over a century of developments in test construction; statistical models for scoring, equating, and gathering validity evidence for tests; and advances in reliability and validity theory, we can have confidence in the claim that educational tests can provide useful, reliable, and valid information. Tests can fulfill many of the purposes for which they are relied on. However, we also know that even in this amazing, technology-rich 21st century, educational tests still have serious limitations, and they are sometimes called upon to fulfill purposes beyond their reach.

Presently, the uses to which test scores are put often place an undue and untenable burden on the tests that yield these scores. There is great debate, and even ignorance, regarding the types of information tests can provide, and how they can best be used to provide information to students, parents, organizations, and education policy makers. Given this state of affairs, we think it is an ethical imperative for the measurement community to do all we can to inform policy makers of the strengths, benefits, and limitations of educational tests. We realize that such information can be extremely complex, which makes communication of key concepts difficult. However, the risks associated with overinterpretation and inappropriate test use, such as the denial of services to students or of jobs to teachers, obligate us to act. In this article, we attempt to help measurement practitioners, testing agencies, personnel at state departments of education, and others convey technical information about the psychometric properties of tests, and about appropriate test use, to policy makers. Such information is particularly critical now, because educational tests are key tools in education reform movements at the local, state, national, and international levels.

In the remainder of this article we (a) identify the most important measurement concepts that should be communicated to educational policy makers, (b) provide suggestions for how to effectively communicate these concepts to policy makers, (c) provide suggestions for being “at the table” when educational assessment policies are being formed, and (d) provide suggestions for how NCME and other organizations that support fair and appropriate testing practices can become more involved in the formation of sensible educational assessment policy.

Identifying the Psychometric Concepts to Communicate to Policy Makers

In an ideal world, all policy makers who make decisions related to assessment systems would complete several graduate courses in educational measurement. This being unlikely, we propose that policy makers have access to basic information about the two most fundamental concepts in educational measurement: validity and reliability. Thus, we target our instruction on these critical concepts.

One of the greatest potential dangers in educational testing is using a test for purposes for which it was not designed. A related danger is thinking we can design a test to fulfill very different purposes. For this reason, the Standards for Educational and Psychological Testing (American Educational Research Association [AERA], American Psychological Association, & National Council on Measurement in Education [NCME], 1999) define validity as “the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests” (p. 9). From this definition it is clear that it is not a test that is validated per se, but rather the use of a test for a particular purpose and the interpretations that result from that use. The measurement community applauds this consensus definition, but the potential for education policy makers to believe a test itself can carry the stamp of “valid,” and is hence appropriate for multiple purposes, still exists. Thus, the first message to communicate to policy makers, and to the public in general, is that evidence supporting the use of a test for one purpose does not justify its use for other purposes.

For example, many policy makers may not understand that a test developed to assess how well 6th grade students achieved objectives in the 6th grade math curriculum may not be appropriate as the sole determinant of whether their teachers were effective. Likewise, it would not be appropriate to use this same 6th grade math test in lieu of, for example, the grade-level test for 8th grade English learners or 8th grade students with disabilities in an attempt to “simplify” the language or the content of the test for these students.

The next critical lesson to communicate to policy makers is that every score from every educational assessment contains error, and that the amount of error associated with a test score must be (a) estimated, (b) reported, and (c) considered when forming educational policies. The amount of error in a test score is gauged by statistical approaches such as reliability estimates, standard errors of measurement, generalizability theory, and test information functions based on item response theory (IRT). In NCLB and many other contexts, estimates of decision consistency (the degree to which the same test would classify students into the same achievement level if they were retested) and decision accuracy (the degree to which a test classifies students into their “true” achievement level) are also relevant. In the next section, we provide suggestions for how to communicate such information. With respect to convincing policy makers that such information should be reported, we only need to point to the aforementioned Standards (AERA et al., 1999), which have been seen as authoritative in legal challenges to tests (Sireci & Parker, 2006). For example, the Standards dictate, “For each total score, subscore, or combination of scores that is to be interpreted, estimates of relevant reliabilities and standard errors of measurement or test information functions should be reported” (p. 31).
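To make this concrete for a technically inclined staffer, the classical relationship between reliability and the standard error of measurement can be shown in a few lines of code. The sketch below assumes an invented score scale (standard deviation of 15, reliability of .90); real programs would report SEMs estimated from their own data.

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

# Hypothetical scale: scores with SD = 15 and a reliability of .90.
sem = standard_error_of_measurement(sd=15.0, reliability=0.90)

observed = 500
# Approximate 95% band within which the student's "true" score plausibly lies.
low, high = observed - 1.96 * sem, observed + 1.96 * sem
print(f"SEM = {sem:.1f}; plausible true-score range: [{low:.0f}, {high:.0f}]")
```

Even at a reliability of .90, the 95% band spans nearly 20 scale-score points, which matters greatly when a cut score falls inside that band.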

In addition to informing policy makers about the need to report estimates of measurement error, it is important to illustrate how such error affects the educational policy decisions they may make. One useful approach is to take a specific policy they are considering and show them what the consequences might be, given measurement error. For example, the Texas legislature recently proposed legislation that would make the statewide assessments more rigorous, and would require high school students to take and pass at least six end-of-course exams to graduate with a recommended high school program or distinguished achievement program (Texas Education Agency, 2010). This well-intentioned legislation is designed to increase student achievement, but requiring passing scores on each of six tests for any certification decision may be problematic due to the reliability limitations of each test. Conjunctive decision models like this one (i.e., models that require each of two or more conditions to be met) can limit the overall passing rate to an extent not necessarily intended by policy makers, due to inflated “false negative” rates stemming from error compounded across the separate assessments.

In the next section, we use this example to illustrate how to communicate the potential dangers in such policies from a measurement perspective. Before concluding this section on which concepts to communicate to policy makers, we present a brief list of the key measurement concepts they should know before authorizing education policy.

  • Validity: Emphasize that validity is not a property of a test and that each use of a test must be defended using empirical evidence. They need to know there are five sources of validity evidence recommended by the Standards that should be used to defend the use of a test for a particular purpose.1 The nature and relative emphasis among these evidence sources depends upon the claims and assumptions one is making about the test scores and on the stakes associated with the use of these test scores (Kane, 2006). For example, if test scores are used to allow or deny a student access to educational services or a high school diploma, the validity burden is very high. A lower validity burden would be associated with scores that are used solely for formative program evaluation. Policy makers need to know that those to whom they assign the duty of implementing testing systems will be obligated to establish validity evidence for each testing purpose they legislate, and that it is not always easy to gather such evidence.

  • Measurement error: Convey that each test score or subscore contains error, and that these errors (a) aggregate as students take more tests and (b) tend to be greater when the number of items or students on which a score is based is small. Errors in assigning students to achievement levels based on test scores should be discussed, as should the error inherent in group statistics such as the percentage of proficient students.

  • Equating: Policy makers need to know that test scores from different test forms need to be placed on a common scale before comparisons can be made over time, across grades, or across different forms of a test. For example, two 6th-grade math tests must be equated before one can judge whether the scores from these tests are comparable (a minimal sketch of the simplest equating method follows this list).

  • Standard setting: Note that standard setting is a judgmental process, but there are well-researched procedures for setting standards on educational tests. Explain the difference between norm-referenced and criterion-referenced standards and that criterion-referenced standards are typically what are needed in achievement testing.

  • Test accommodations and score comparability: Accommodated test forms or test administrations are designed to promote access and fairness in testing, but they may yield scores that cannot be interpreted in the same way for all students. Explain the need for test accommodations, including test translations where appropriate (e.g., when a student is proficient in reading in his non-English native language and has been instructed in the content in that language) or inappropriate (e.g., when a student is a native speaker of a language other than English but is neither literate in that language nor instructed in that language), describe what a “standardized” test is, and describe when an accommodation might lead to scores that cannot be compared to or aggregated with those from a standard version of a test.
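As a concrete companion to the equating bullet above, the following minimal sketch shows the simplest case, mean-sigma linear equating, under a hypothetical randomly equivalent groups design. The form statistics are invented for illustration; operational programs rely on common items or common persons and more sophisticated methods.

```python
def linear_equate(x: float, mean_x: float, sd_x: float,
                  mean_y: float, sd_y: float) -> float:
    """Mean-sigma linear equating: map a score on form X onto form Y's
    scale by matching the two forms' means and standard deviations."""
    return sd_y / sd_x * (x - mean_x) + mean_y

# Hypothetical statistics from randomly equivalent groups of 6th graders:
# form Y (last year's test) was slightly harder than form X (this year's).
mean_x, sd_x = 48.0, 9.0   # this year's form
mean_y, sd_y = 45.0, 10.0  # last year's form

raw = 52.0
print(f"A raw score of {raw} on form X is comparable to "
      f"{linear_equate(raw, mean_x, sd_x, mean_y, sd_y):.1f} on form Y")
```

Without this step, comparing this year's raw scores to last year's would confound changes in student achievement with differences in form difficulty.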

In the next section, we provide some advice on how to describe some of these concepts to policy makers using language that is understandable and relevant to them.

Suggestions for Communicating with Educational Policy Leaders

Psychometrics is a highly technical and scientific domain that combines philosophy, psychology, and statistics. Policy makers' time is short, and they are unlikely to have a background in any of these disciplines. Thus, the most critical concepts must be identified and communicated quickly. In this section, we offer suggestions for translating “psychometricianese” into standard, everyday English. To communicate complex ideas to policy makers, we recommend three strategies: (a) use of “plain language,” (b) use of visuals, and (c) use of stories and examples.

Plain Language and Visuals

Plain language refers to language that a lay audience can understand the first time they read or hear it. The idea is to promote clarity in written documents and oral presentations. Although you would not know it by reading legislation such as NCLB, plain language is actually mandated by the U.S. government. For example, in 2011 President Obama issued an executive order stating that the regulatory system “must ensure that regulations are accessible, consistent, written in plain language, and easy to understand” (Executive Order 13563, 2011). There is even a Federal website to promote plain language: http://www.plainlanguage.gov/. This site has several examples of how plain language was used to improve clarity and accessibility of text written by Federal agencies. The following before/after example comes from a Department of Health and Human Services Brochure.2

Before.

“The Dietary Guidelines for Americans recommends a half hour or more of moderate physical activity on most days, preferably every day. The activity can include brisk walking, calisthenics, home care, gardening, moderate sports exercise, and dancing.”

After.

“Do at least 30 minutes of exercise, like brisk walking, most days of the week.”

Obviously, the rewrite is much shorter and to the point. Let us try some examples from educational measurement. First, to avoid embarrassing anyone except perhaps the first author, we will borrow from Sireci and Talento-Miller (2006), who explained a predictive validity study in the following way:

In predictive validity studies, it is desirable to find large and statistically significant multiple correlation coefficients, which account for a substantial amount of variation in the predictor. (p. 307)

Although this language might be appropriate for a measurement journal, it would be gibberish to policy makers. Using the principle of plain language, the idea behind a predictive validity study might be better described as, “A predictive validity study tells us how well test scores predict future performance.”
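In the same plain-language spirit, the essence of a predictive validity study fits in a toy computation. The data below are invented solely for illustration; a real study would use operational records and attend to complications such as range restriction.

```python
from statistics import correlation  # requires Python 3.10+

# Hypothetical data: admissions test scores and first-year college GPAs
# for ten students (invented numbers, for illustration only).
test_scores = [480, 520, 550, 590, 610, 630, 660, 690, 720, 750]
first_year_gpa = [2.1, 2.6, 2.4, 2.9, 3.0, 2.8, 3.3, 3.1, 3.6, 3.5]

r = correlation(test_scores, first_year_gpa)
print(f"Correlation between test scores and later grades: r = {r:.2f}")
# A value near 0 would mean the test tells us little about future
# performance; a value near 1 would mean it predicts it very well.
```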

The Plain Language site also advocates using pictures to convey complex meaning. For example, the information about car safety in the following text was replaced by Figure 1.


Figure 1. Visual illustration of communicating car safety information.2

This is a multipurpose passenger vehicle which will handle and maneuver differently from an ordinary passenger car, in driving conditions which may occur on streets and highways and off road. As with other vehicles of this type, if you make sharp turns or abrupt maneuvers, the vehicle may roll over or may go out of control and crash. You should read driving guidelines and instructions in the Owner's Manual, and WEAR YOUR SEAT BELTS AT ALL TIMES.

What a difference a picture makes! This strategy can also be used to communicate measurement concepts to policy makers. Consider trying to communicate how computerized adaptive testing works. A verbal description might sound like this:

Computerized adaptive tests tailor a test to an examinee by keeping track of an examinee's performance on each test question and then using this information to select the next item to be administered. The primary criterion for selecting the next item to be administered to an examinee is a desire to match the difficulty of the item to the examinee's current estimated proficiency. If the examinee gets an item correct, s/he is administered a more difficult item. If the examinee answers the item incorrectly, s/he is administered an easier item. As more items are administered the error associated with a student's score decreases.

A visual presentation of this same information is clearer, and can even be used to illustrate how error is reduced as items are administered (see Figure 2).


Figure 2. Visual illustration of computerized adaptive testing.
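For readers comfortable with a little code, the adaptive loop can also be sketched directly. This is a toy simulation, not an operational algorithm: the item bank, the examinee, and the shrinking step-size update are invented stand-ins for the IRT-based estimation that real computerized adaptive tests use.

```python
import math
import random

random.seed(1)
item_bank = [random.uniform(-3, 3) for _ in range(200)]  # item difficulties
true_ability = 1.0   # hypothetical examinee (unknown to the algorithm)
estimate, step = 0.0, 1.0

for administered in range(1, 11):
    # Select the unused item whose difficulty best matches the estimate.
    item = min(item_bank, key=lambda b: abs(b - estimate))
    item_bank.remove(item)

    # Simulate a Rasch-model response: P(correct) grows with ability - difficulty.
    p_correct = 1.0 / (1.0 + math.exp(-(true_ability - item)))
    correct = random.random() < p_correct

    # Harder items after a correct answer, easier after an incorrect one;
    # shrink the step as evidence accumulates so the estimate settles down.
    estimate += step if correct else -step
    step = max(step * 0.7, 0.1)
    print(f"Item {administered:2d}: difficulty {item:+.2f} -> estimate {estimate:+.2f}")
```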

Using Stories and Examples

The use of parables and stories has been an effective means for communicating and teaching complex information since the earliest days of recorded history. We do not know how the technique got lost in the teaching of psychometrics, but we recommend bringing it back. Stories of court cases in other states, or lessons learned by other departments of education will typically get the attention of policy makers. Other stories or examples designed to make a point can be planned in advance.

As an illustration, we return to the proposed policy that would require students to take and pass six tests to earn a college-ready high school diploma. A measurement specialist concerned about false negative errors in this system might tell the following story.

Requiring students to pass 6 tests may address all college readiness skills, but remember that each test is not a perfectly reliable measure and a student who is competent in a subject area may fail any test just due to chance factors. Let's say the chance a competent student will pass a test is 90%. If we hold that probability across all 6 tests, the chance she or he will pass all 6 tests is about 53%. Thus, the system you are proposing may fail about half of the students who actually deserve to pass.

Psychometricians may immediately be upset with this example because it does not consider correlations among the six tests, but that information is too complex to introduce at the outset. The idea is to get policy makers to consider the consequences of the aggregate measurement error inherent in their policies, and to get them to pay attention to that issue from an early stage. Once those concepts are understood, visuals and other aids can be brought in to evaluate different scenarios, such as using a compensatory model (e.g., pass five of six tests or achieve a certain total score across all tests) rather than a conjunctive model (pass all six tests) and the implications of providing retests. Hambleton and Slater (1997), for example, provided an interesting graph (pp. 34–36) to illustrate that if you keep adding requirements in a conjunctive testing system, you could eventually fail all competent candidates just due to measurement error!
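The arithmetic behind the story, and behind the compensatory alternative just mentioned, fits in a few lines. The sketch keeps the story's simplifying assumption that the six tests are independent; correlated tests would soften, but not eliminate, the effect.

```python
from math import comb

p = 0.90   # chance a truly competent student passes any one test
n = 6      # number of required tests (assumed independent, as in the story)

# Conjunctive model: the student must pass all six tests.
pass_all = p ** n

# A compensatory alternative: pass at least five of the six (binomial).
pass_at_least_5 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in (5, 6))

print(f"P(pass all {n} tests)        = {pass_all:.2f}")           # about 0.53
print(f"P(pass at least {n - 1} of {n}) = {pass_at_least_5:.2f}") # about 0.89
```

Relaxing the rule from “pass all six” to “pass at least five of six” raises a competent student's chance of success from about 53% to about 89%, which is precisely the kind of trade-off policy makers need to see before they legislate.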

Getting to the Table

Policy makers can be a tricky audience for psychometricians to reach because, as noted earlier, these two groups do not often speak the same language. Further, policy makers generally don't know what they don't know about testing and don't go looking for more information. Choosing which concepts to present and how to present them will pay off only if this information gets the attention of the intended audience.

There are several strategies for getting the attention of policy makers. As is the case for testing itself, the best strategy depends on the purpose of the message. Here, we describe three ways to get attention and, as a result, perhaps influence policy toward more meaningful measurement: connecting with the general media, connecting with politicians' education staff, and strategic publication. We describe each of these strategies below in relation to a single issue.

If one is hoping to shine light on a poor practice that is either happening or anticipated, such as the six-test conjunctive model described earlier, the most effective attention-getting strategy may be to connect with an education reporter at a major local newspaper or television station and offer to serve as a source for an article or segment on the practice. Explain the issue in layman's terms, starting with the reason the public and the policy makers should care: “If you adopt this model you will be denying diplomas to many qualified students. The graduation rate will plummet because of a statistical problem, not because students are not prepared properly, and more students may drop out as they face a seemingly unachievable obstacle. Thus, this policy may result in fewer, not more, prepared students.”

To avoid being seen as just another naysayer, and therefore someone to be ignored, it also helps to offer concrete alternatives or solutions such as those described earlier.

Getting to know the education staff of Chief State School Officers, Congressmen, Board of Education members, or legislators can also help get the attention of these policy makers. Recognize that the politicians and education agency leaders rely on their staff members to seek out and distill information for them. If you are truly interested in local, state, or federal education policy, find out who the relevant education staffers are and make a personal connection with them by attending a public meeting, making a phone call, or writing an email. Offer them something they probably need, such as a one-page easy-to-read statement like the one above about the conjunctive model or a short synopsis of some other related topic. Become a valued and trusted source, a voice of reason.

Finally, recognize that neither policy makers nor their staff is likely to pick up a peer-reviewed journal anytime soon. However, they and the education-focused public do read trade newspapers (e.g., Education Week, Education Daily, The Title I Monitor) and education websites and blogs. Clear, concise articles written for these types of media may have a far greater impact on policy than a dozen articles in peer-reviewed journals. If your goal is to influence policy and become part of the policy conversation, your audience is not other researchers. It's the public and those who represent them in policy-making entities.

Looking Forward: Suggestions for NCME and Other Organizations

Beyond our individual roles in communicating with policy makers, many of us are now looking to professional organizations, such as NCME, to play a strong role in advising policy makers. Many of us have found that lone voices have done little over the past decade to help policy makers and others to move beyond the rhetoric that drives much of our current, flawed education policies. Our lack of an organized voice has left a large void in critical thinking about testing and the use of test scores for making high stakes decisions about students, teachers, and systems.

NCME could step into this void and provide a powerful voice in support of measurement reform. As we noted in the NCME Newsletter in the Fall of 2008, the organization has a history of outreach, having been involved in development of the ABCs of School Testing (Joint Committee on Testing Practices, 1993), a video produced to communicate educational measurement concepts to teachers, school administrators, and parents, as well as in the development of the Standards for Teacher Competence in Educational Assessment of Students (American Federation of Teachers, NCME, & National Education Association, 1990). Although these efforts have not been replicated or revisited in the ensuing two decades, they may provide a good model for both products and collaborative efforts.

To support renewal of this kind of work and the development of new ways to reach policy makers, NCME could establish as priorities the inclusion of a policy course in graduate measurement programs and the creation of alliances among universities and state and local education agencies to support real-world connections between researchers and practitioners. Investing in this type of capacity building would enhance the body of accessible information about testing, perhaps through the sponsored publication of policy briefs on measurement topics, and build the capacity of both in-service practitioners and the graduate students who seek to join our field in the near future.

Perhaps the best place to start this endeavor would be by commissioning a monograph on the five concepts listed earlier (validity, measurement error, equating, standard setting, and testing accommodations and score comparability) intended for distribution among education writers and political and agency staffers. This monograph could not only describe these concepts and their relevance to real-world, timely measurement issues, but also provide language that others can use when trying to explain these issues further.

Swift completion and wide dissemination of such work are critical. Most states are currently involved in federally funded assessment development consortia and struggling with how to transition to these new assessments, how to interpret test data across this transition period, how to evaluate the quality of the new assessments, and how to use scores from both old and new assessments to inform decisions about students, teachers, schools, programs, and policies. In addition, the reauthorization of the Elementary and Secondary Education Act looms somewhere on the next 2-year time horizon and will certainly include new provisions for testing and the use of test scores for high stakes decisions. A policy brief and other capacity-building efforts could be invaluable in helping practitioners and policy makers understand the ramifications of their work, and could provide both a framework and a language for guiding testing practice in the future.

Conclusions

In this article, we stressed the need for measurement specialists to communicate better with education policy makers. We provided specific strategies for getting their attention and for communicating complex measurement issues clearly. The psychometric community cannot be insular. For educational tests to do more good than harm, we must share our expertise with policy makers and the public in general so that the strengths, benefits, and limitations of educational tests are understood. We know that educational testing programs always have consequences. By better informing policy makers of potential consequences, we can help them better define and meet their goals, which will lead to more positive consequences for all.

Notes

  • 1 The five sources are validity evidence based on test content, response processes, internal structure, relations to other variables, and testing consequences.
  • 2 http://www.plainlanguage.gov/examples/before_after/pub_hhs_losewgt.cfm