Volume 30, Issue 4, pp. 3–15

Test Development with Performance Standards and Achievement Growth in Mind

Steve Ferrara, CTB/McGraw-Hill

Dubravka Svetina, Indiana University

Sylvia Skucha, DePaul University

Anne H. Davidson, Alpine Testing Solutions
First published: 23 December 2011

Steve Ferrara, CTB/McGraw-Hill, 1200 G St., NW, Washington, DC 20005; [email protected]. Dubravka Svetina, Indiana University, Bloomington, IN; [email protected]. Sylvia Skucha, DePaul University, Chicago, IL; [email protected]. Anne H. Davidson, Alpine Testing Solutions, Orem, UT 84057.

Abstract

Items on test score scales located at and below the Proficient cut score define the content area knowledge and skills required to achieve proficiency. Alternatively, examinees who perform at the Proficient level on a test can be expected to be able to demonstrate that they have mastered most of the knowledge and skills represented by the items at and below the Proficient cut score. It is important that these items define intended knowledge and skills, especially increasing levels of knowledge and skills, on tests that are intended to portray achievement growth across grade levels. Previous studies show that coherent definitions of growth often occur as a result of good fortune rather than by design. In this paper, we use grades 3, 4, and 5 mathematics tests from a state assessment program to examine (a) how well descriptors for Proficient performance define achievement growth across grades, and (b) how the knowledge and skill demands of test items that define Proficient performance at each grade level may or may not define achievement growth coherently. Our purpose is to demonstrate (a) the results of one state assessment program's first attempt to train item writers to hit assigned proficiency level targets, and (b) how those efforts support and undermine coherent inferences about what it means to achieve Proficient performance from one grade to the next. Item writers' accuracy in hitting proficiency level targets and resulting inferences about achievement growth are mixed but promising.

Typically, when we design and develop grade level achievement tests and set performance standards as the means for making interpretations from test scores about what examinees know and can do, we implement two parallel systems:

  • A test development system, which begins with content standards and includes a general test design, test and item specifications, the items themselves, and perhaps other specifications such as a construct definition.

  • An interpretation system, which may include generic policy definitions and grade- and content-specific proficiency level descriptors, cut scores that delineate the proficiency levels, items that may be mapped to each level, plus guidance on interpreting test results.

We intend these two systems to be coordinated so that, together, they enable valid inferences about what examinees know and can do. Coordinating development and interpretation systems to enable inferences about achievement growth from one grade to the next—and to predict levels of achievement in the subsequent grade based on achievement in the current grade—significantly complicates the situation. In order to enable valid inferences about grade-to-grade achievement growth, test developers must be able to train item writers to write items that (a) elicit intended cognition and content area knowledge and skills (i.e., item response demands), and (b) are located on test score scales at targeted proficiency levels. In addition, test developers must be able to assemble test forms in which (a) items are located on the within-grade test scales so that item response demands are aligned with corresponding proficiency level descriptors, and (b) expectations for the content area knowledge and skills that represent Proficient performance increase coherently from grade to grade.

Current practice in test development and in building interpretation systems has evolved. For example, researchers now recommend writing proficiency level descriptors (e.g., in the form of generic policy definitions) to guide test development, the development of grade- and content-area-specific proficiency level descriptors, and standard setting (Bejar, Braun, & Tannenbaum, 2007; Egan, Schneider, & Ferrara, 2011; Perie, 2008). Similarly, methods to evaluate and improve alignment between test items and content standards are widely implemented (e.g., Porter, Polikoff, Zeidner, & Smithson, 2008; Webb, 2007) and required by No Child Left Behind (NCLB) peer review. While this explicit alignment requirement has permeated conventional requirements for educational achievement testing, efforts to align the knowledge and skill demands of items, at the locations where they map onto a test score scale, with the corresponding proficiency level descriptors have not been widely successful. Such alignment is necessary to ensure valid inferences about achievement growth from one grade level test to another.

Innovations in test design and development are heading in that direction. For example, researchers and test designers have demonstrated how to design and build coordinated, coherent assessment systems built on rigorous design validation research (e.g., evidence-centered test design; see Mislevy, 2006; Mislevy & Haertel, 2006; assessment engineering; Luecht, Dallas, & Steed, 2010); suggested how to adapt advances in cognitively based test development and validation for large-scale, operational programs (Gorin, 2006); illustrated a model for horizontally and vertically aligning assessment systems to support inferences about achievement growth (Martineau, Paek, Keene, & Hirsch, 2007); and proposed test development and performance standard setting procedures that are prospective (i.e., using performance level descriptors to guide the entire test development process), progressive (i.e., using content and performance standards that are articulated across grade levels), and predictive (i.e., using performance level descriptors and standards based on theoretical and empirical evidence of achievement growth trajectories; Bejar et al., 2007).

And practice seems to be following in that direction. Some states and their contractors now train item writers to aim at performance level targets as well as content standards targets. So, for example, item writers may receive assignments to write items that target a content strand, indicators within that strand, and a specified performance level (e.g., Proficient or Basic) or the corresponding range of the test scale. Hitting targets defined by ranges of scale scores—and even easy, medium, and difficult levels of classical p-values—is no easy thing. Even training judges to estimate the difficulty of existing items is no easy thing. For example, Hambleton and Jirka (2006) illustrated promising results from four studies of training content experts to judge item difficulty. They trained judges using an anchor-based method (i.e., anchor items and anchor item descriptions to illustrate item features associated with p-values at .25, .50, and .75) and an item mapping method (i.e., a full set of items and their p-values), plus feedback about other judges’ item difficulty estimates, and discussions of judges’ explanations for the estimates (see Hambleton and Jirka, 2006, pp. 408–415 for details). Judges in the anchor-based method rated two thirds of Law School Admission Test (LSAT) reading comprehension items to be easier than the actual p-values, but the final correlation between the median estimated and actual item difficulty was .59. Judges in the item mapping method estimated item difficulty for LSAT logical reasoning items more accurately. The final correlation between the median estimated and actual item difficulty was .84. In their comprehensive review and evaluation of training judges to estimate item difficulty, Hambleton and Jirka concluded that anchor items and anchor item descriptions contribute to accurate item difficulty estimates and that accuracy in estimating item difficulty is likely to vary across content areas and item types and depends on the judges’ knowledge of the content area and examinee population and the content and quality of training. Further:

More needs to be learned about what factors contribute to item difficulty so that judges can be better trained. More work especially should be committed to the prediction of item discriminating power. (p. 416)

Similarly, automatically generating items (e.g., Irvine & Kyllonen, 2002) using cognitive and linguistic theory and models shows both promise and considerable challenges. For example, Bejar et al. (2003) demonstrated that they could use item models to generate, on the fly, Graduate Record Examination items and achieve item difficulty parameter correlations between .77 and .88 (see Bejar et al., 2003, Table 3).
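To make concrete the accuracy criterion used in these studies, the following minimal sketch (in Python, with made-up numbers rather than data from Hambleton and Jirka or Bejar et al.) shows how estimated item difficulties can be compared with empirical p-values using a correlation and a mean signed difference.

import numpy as np
from scipy.stats import pearsonr

# Hypothetical median judged difficulties and the empirical p-values for the same items.
median_judged_p = np.array([0.30, 0.45, 0.55, 0.62, 0.70, 0.78, 0.84])
actual_p = np.array([0.25, 0.50, 0.48, 0.66, 0.73, 0.71, 0.88])

r, _ = pearsonr(median_judged_p, actual_p)
bias = np.mean(median_judged_p - actual_p)  # positive values mean judges rated items as easier than they were

print(f"Correlation between estimated and actual difficulty: {r:.2f}")
print(f"Mean signed difference (estimated minus actual): {bias:+.3f}")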

Table 1. Expectations for Growth in Mathematics Understanding and Skill Across Grades, Using the Proficient Level Descriptor
Content Strand Grade 3 Grade 4 Grade 5
Number and Operations Compose and decompose four-digit numbers. Compare and order four-digit numbers and justify reasoning. Estimate sums and differences of whole numbers. Model representations of fractions. Add (up to three addends) and subtract four-digit numbers. Model multiplication and division. Add and subtract up to five-digit whole numbers with regrouping. Add and subtract decimals through hundredths. Represent equivalence relationships between fractions and decimals. Divide four-digit dividends by one- and two-digit divisors. Model equivalent fractions. Compose and decompose five-digit numbers and decimal numbers. Model factors and multiples of whole numbers. Add and subtract fractions with like denominators. Use benchmark numbers. Explain two or more methods of multiplying and dividing whole numbers, and justifying the process. Estimate products and quotients of whole numbers. Compare integers, decimals, like and unlike fractions, and mixed numbers. Compose and decompose seven-digit numbers and decimals through thousandths. Model prime and composite numbers. Model equivalent fractions. Add, subtract, multiply, and divide using non-negative rational numbers. Estimate sums, differences, products, and quotients of non-negative rational numbers.
Algebra Create and describe extended growing and repeating patterns. Determine the value of missing quantities or variables within equations or number sentences. Use real number properties to develop multiple algorithms and to solve problems. Model inverse relationships of addition/subtraction. Create models for the concept of equality. Justify the process used to determine value of missing quantities or variables. Determine the value of variables in equations and justify the process used. Explain the inverse operations of addition/subtraction and multiplication/division. Explain the properties of basic operations. Analyze a given numeric pattern and generate a similar pattern. Construct input/output function tables and generalize the rule. Devise a rule for an input/output table. Determine the value of variables in equations and inequalities, justifying the process. Apply the properties of basic operations to solve problems. Apply inverse operations of addition/subtraction and multiplication/division to problem-solving situations.
Geometry Describe, compare, and analyze two-dimensional shapes by sides and angles. Explain and describe the process of decomposing, composing, and transforming polygons. Create three-dimensional shapes (prisms and pyramids) from two-dimensional nets, and create two-dimensional nets from prisms and pyramids. Analyze and describe the similarities and differences between and among two- and three-dimensional geometric shapes, figures, and models. Analyze the relationships between and among points, lines, line segments, angles, and rays. Locate ordered pairs in the first quadrant of the coordinate plane. Identify transformations, reflections, and model translations. Describe the characteristics of each type of transformation, reflection, and translation of two-dimensional figures. Analyze the characteristics of symmetry relative to polygons. Construct and analyze two- and three-dimensional shapes to solve problems. Explain the relationship between coordinates in each quadrant of the coordinate plane.
Measurement Estimate length, using fractional parts to the nearest half inch. Measure capacity, weight/mass, and length in both English and metric systems. Develop and use methods to solve problems involving perimeter. Estimate a given object to the nearest eighth of an inch. Convert capacity, weight/mass, and length within the English and metric systems. Describe relationships of rectangular area to numerical multiplication. Use appropriate tools to estimate and compare units for measurement. Estimate length to the nearest millimeter and sixteenth of an inch. Develop and compare formulas to calculate perimeter and area. Apply appropriate units for measuring length, mass, volume, and temperature. Develop formulas to estimate and calculate perimeter and area.
Data Analysis and Probability Analyze, predict, and model the number of different combinations of two or more objects and relate to multiplication. Interpret quantities represented on tables and graphs, make predictions, and solve problems based on the information. Interpret bar graphs, line graphs, and stem-and-leaf plots. Interpret the mean, median, and range of a set of data. Compare data and interpret quantities represented on tables and graphs. Use the mean, median, mode, and range to analyze a data set. Interpret quantities represented on tables and graphs to make predictions and to solve problems.
Table 2. Exact Agreement between Two Coders (in Percentages)
Response Demands Grade-Level Scale
3 4 5
Reading load 77 85 87
Depth of knowledge 74 92 91
Mathematical complexity 63 50 48
Question type 89 92 83
Ambiguous words 26 19 48
Mathematics vocabulary 65 62 43
Complex verbs 82 88 74
Pronouns 80 46 48
Prepositional phrases1 63 69 52
  • 1Final counts of prepositional phrases are based on a list of specified prepositions. See text for details.
Table 3. Descriptive Statistics for Locations of the Study Items on the Within-Grade Theta Scales, Using RP50 Theta Values, and Related Information
Item Locations Grade-Level Test
3 4 5
Study Items
 Lowest −3.72 −3.64 −3.08
 Highest −0.12 −0.18 −0.17
 Mean −1.43 −1.02 −0.84
SD 0.97 0.80 0.78
All Items in the 2008 Operational Test Forms
 Bottom of the scale −3.72 −3.64 −3.08
 Top of the test scale 2.64 3.02 2.63
 Mean −0.70 −0.02 0.16
SD 1.47 1.31 1.27
Cut Scores on the Operational Test Form Scales
 Proficient cut score 0.08 −0.06 −0.01
 Conditional standard error at the cut score (on the scale score scale) 4 3 3
 Percentage of students at/above Proficient 55 52 50
  • Note. There are 35, 26, and 23 study items at grades 3, 4, and 5 and 44, 45, and 50 total operational items at each grade. Theta scales are independently calibrated within grade.

These studies suggest that we can learn how to train item writers to develop items that hit difficulty range targets and targeted proficiency levels. We also need to be able to train item writers to hit those difficulty targets with items that elicit intended examinee cognitive processing, content area knowledge and skills, and response strategies. Once item writers can do that, test developers can assemble test forms that enable valid inferences about what examinees at different achievement levels know and can do and about how their knowledge and skills develop from grade to grade. For now, it is often the case that alignment of item response demands and the content knowledge and skill demands in proficiency level descriptors is a result of good fortune, not of design (Ferrara et al., 2007).

In this paper, we use grades 3, 4, and 5 mathematics tests from a state assessment program to examine (a) how a set of proficiency level descriptors for Proficient performance define achievement growth across grades, and (b) how the knowledge and skill demands of test items that define Proficient at each grade level may or may not define achievement growth coherently. Our purpose is to demonstrate (a) the results of one state assessment program's first attempt to train item writers to hit assigned proficiency level targets, and (b) how those efforts support and undermine coherent inferences about what it means to achieve Proficient performance from one grade to the next.

Background

Items that are located on test score scales within each test performance level (e.g., Basic, Proficient) define the knowledge and skills that examinees at that level are expected to be able to demonstrate. Ideally, these response demands should align closely with the knowledge and skill demands of the corresponding proficiency level descriptor. After all, proficiency level descriptors define the knowledge and skills that examinees at that level are expected to be able to demonstrate. Whether item response demands align with proficiency level descriptors is an empirical question. This is a bit of a simplification, of course. Examinees respond successfully to items at and below their estimated proficiency level, and they fail to respond successfully to items above their estimated proficiency level. However, in the context of standard setting, especially for item mapping methods like Bookmark and Item-Descriptor (ID) Matching (see Cizek & Bunch, 2007, chaps. 10 and 11, respectively), this is a reasonable conception. In item mapping standard-setting methods, panelists review items in ordered item books in which items are ordered empirically from the easiest to the most difficult item. They locate cut scores on the score scale underlying the ordering of the items based on their judgments about (a) what examinees need to know and to be able to do in order to respond successfully to each item, and (b) what makes each subsequent item in the ordered item book more difficult than the previous items. Specifically, they locate cut scores in the ordered item book, and on the underlying score scale, with the idea that they can expect examinees1 above the cut score to be able to respond successfully to most of the items below that cut score (and to items above the cut score, but with lower probability of success). In essence, the items located on the scale below the cut score define the knowledge and skills most likely to be possessed by examinees in the proficiency level above the cut score.

In a previous study, Ferrara et al. (2007) used this logic to examine the degree to which items in a statewide mathematics assessment program defined achievement growth coherently at the Proficient level across grades 3, 4, and 5. They found that the seven items located around Proficient cut scores reflected increasing knowledge and skill requirements in order to reach the Proficient level in each grade. They concluded that this positive result occurred as a result of good fortune rather than by design. In the current study, we extend the purpose, logic, and methodology from the previous study to a different state's mathematics test and include all items below the Proficient cut score. All design and development activities in this testing program focused on aligning the test development and score interpretation systems.

Method

This study includes items from the 2008 grades 3, 4, and 5 mathematics assessments from a state assessment program and several item response demands coding frameworks.

Data

State Assessment Program

This state assessment program includes assessments in reading, mathematics, and science in grades 3–8 and high school, as required by NCLB, as well as other content areas. This study focuses on the operational items in the mathematics tests at grades 3, 4, and 5. We selected only these grades (as opposed to using the entire grade span) because our main purpose is to evaluate the utility of using these coding frameworks to identify and illustrate coherence problems of proficiency standards (i.e., the cut scores and proficiency level descriptors) in typical state assessment programs that are intended to support inferences about achievement growth across grades. We did not set out to evaluate the coherence and articulation of inferences of achievement for the entire testing program.

This second edition of the state assessment program became operational in 2008, when within-grade operational scales were established. The mathematics tests contain only multiple choice items. The tests were scaled using the 3-parameter logistic IRT model with number correct scoring, independently, for each grade. Calibrations were based on items assigned to both operational and embedded field test slots (i.e., 55, 55, and 60 items in grades 3, 4, and 5) and more than 35,000 valid examinee response records in each grade. Small numbers of items were flagged for model misfit and differential item functioning. After expert review of flagged items, only one item in the grade 3 test was suppressed. Theta scales were converted to scale scores with target means and SDs of 150 and 10. After truncation of extreme theta scores (i.e., values beyond −4.0 and 4.0), final scale scores ranged between 88 and 183 for grade 3, 105 and 186 for grade 4, and 98 and 188 for grade 5. Performance standards were established on the scale score scales using the ID Matching method (Cizek & Bunch, 2007, chap. 11; Ferrara & Lewis, 2012).
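One plausible reading of the scaling description above is a linear conversion from truncated theta values to the scale score metric. The sketch below illustrates that kind of conversion; the function name and the population moments passed to it are assumptions for illustration, not the program's actual calibration constants.

import numpy as np

def theta_to_scale_score(theta, pop_mean, pop_sd, target_mean=150.0, target_sd=10.0):
    """Truncate extreme thetas at +/-4.0, then rescale so the examinee
    population has the target scale score mean and SD."""
    theta = np.clip(theta, -4.0, 4.0)
    return target_mean + target_sd * (theta - pop_mean) / pop_sd

# Illustrative call with hypothetical population moments and theta values.
print(theta_to_scale_score(np.array([-4.6, -1.4, 0.1, 2.6]), pop_mean=0.0, pop_sd=1.0))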

Test items are closely aligned to content Competencies (i.e., the state's content strands) in each grade. Competencies are explicitly articulated across grade levels. In corresponding fashion, proficiency level descriptors are explicitly articulated across grade levels. The Proficient level descriptor explicitly represents attainment of grade level knowledge and skills, using verbs to indicate the cognitive processes expected of students (e.g., describe, analyze, compare, predict), as defined by the four Depth of Knowledge levels used in the Web Alignment Tool (see Webb, 2007). Table 1 contains the Proficient descriptors for grades 3–5 and the five Competencies. This state presents separate proficiency level descriptors for each Competency to aid interpretations of test scores and planning for instruction and to highlight the articulation of expectations for achievement growth across grades. Using the Number and Operations Competency to illustrate, students at the Proficient level in grade 3 are expected to “compose and decompose four-digit numbers,” 4th graders are expected to “compose and decompose five-digit numbers and decimal numbers,” and 5th graders to “compose and decompose seven-digit numbers and decimals through thousandths.” In this illustration, growth in achievement expectations across grades is defined by increases in the complexity of numbers, from four-digit numbers to five-digit and seven-digit numbers, and from “decimal numbers” to decimal numbers through the thousandths. Growth expectations are defined similarly for the rest of this Competency (e.g., addition and subtraction, multiplication and division, and fractions) and for the other four Competencies as well.

In this testing program, knowledge and skills in the Number and Operations strand are emphasized over the other four Competencies in all grades, and knowledge and skill demands increase slightly across grades in Geometry and Measurement.

Item writers were instructed to write items to align closely with the performance level descriptors for Basic, Proficient, and Advanced. (They did not target the Minimal level because more than enough existing items were located at this level.) Item writers were trained extensively on general principles for writing items, the mathematics Competencies, and the performance level descriptors. Cut scores had not yet been set for this assessment, so the training included guidelines for targeting each proficiency level descriptor rather than scale score ranges based on cut scores and anchor items for each level. Item writers were instructed to target nouns in the Basic, Proficient, and Advanced proficiency level descriptors in each grade that represent mathematics understanding, and verbs that represent cognitive skills and depth of knowledge. For example, the Geometry competency for Proficient performance in Table 1 requires 3rd graders to “explain and describe [the target verbs] the process [the target noun] of decomposing, composing and transforming polygons”; 4th graders are required to “identify [the target verb] transformations, reflections, and model translations [the target nouns]”; and 5th graders to “describe [the target verb] the characteristics of each type of transformation, reflections, and translations of two-dimensional figures [the target nouns].” The directions about targeting specific proficiency level descriptors were explicit about mathematics knowledge and skills, but did not include model items to emulate or specific procedures to follow when targeting items at a specified proficiency level. For example, the training did not include discussing examples of items located in each proficiency level, as recommended in Hambleton and Jirka (2006, pp. 408–415), because cut scores for each level had not yet been established.

Nine to 23% of the items in the standard setting ordered item book that item writers targeted below the Proficient level actually were located below the Proficient cut score. Over 80% of the items in the ordered item book that item writers targeted at and above the Proficient level actually were located above the Proficient cut score in grades 3 and 4; only 17% in grade 5. Overall, the item writing targeting accuracy was 40% at grade 3 (i.e., 20 of 50 items were accurately targeted), 54% at grade 4 (29 of 54 items), and 14% at grade 5 (8 of 57). So, even though the item writer training lagged behind recommendations in the measurement research literature (an all-too-common observation), the results from the level of training that was provided represent a promising first attempt.
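The overall accuracy figures in the preceding paragraph are simple proportions of accurately targeted items; the short sketch below reproduces them.

hits = {3: 20, 4: 29, 5: 8}      # items located in the targeted proficiency level range
written = {3: 50, 4: 54, 5: 57}  # items written to a proficiency level target

for grade in (3, 4, 5):
    print(f"Grade {grade}: {hits[grade]}/{written[grade]} = "
          f"{100 * hits[grade] / written[grade]:.0f}% accurately targeted")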

Item Response Demands Coding Frameworks

The goal in coding item response demands is to identify the hypothesized knowledge and skill requirements of items that define Proficient performance in each grade level assessment. We define item response demands as the knowledge, understanding, skills, and processes that an item requires a student to call on in order to respond to it fully or partially successfully (Ferrara et al., 2007).

We identify hypothesized cognitive item response demands using empirically supported coding frameworks for reading load, depth of knowledge, mathematical complexity, and question type. The test blueprints in Table 4 indicate content area knowledge and skill demands that are targeted by the assessment. We have used these coding frameworks to evaluate the alignment of targeted and actual response demands in science assessment items (Ferrara et al., 2004), to compare targeted and intended science achievement constructs (Ferrara & Duncan, 2011), and to explore inferences about achievement growth across successive grade level assessments (Ferrara et al., 2007). We identify hypothesized linguistic item response demands by coding linguistic attributes of items from a recent item difficulty modeling study (Shaftel, Belton-Kocher, Glasnapp, & Poggio, 2006). We emphasize that the coding frameworks provide hypotheses, in the form of predictions, for knowledge and skill requirements of items that define Proficient performance in each grade level assessment. Inferences about item response demands based on these hypotheses have some empirical support from the studies cited here. Results from a study that compared item response demands codes with cognitive laboratory think-aloud verbal protocols for science multiple choice items and constructed response performance tasks indicate that response demands coding can predict evidence of examinee cognitive processing in verbal protocols with as high as 90% accuracy (Ferrara & Chen, 2011).

Table 4. Numbers (and Percentages) of Items in Each Mathematics Competency that Represent the Expected Knowledge and Skills of Students at the Proficient Level
Mathematics Competency Grade Level Scale
3 4 5
Study Items
 Number and Operations 12 (34) 7 (27) 6 (26)
 Algebra  7 (20) 6 (23) 5 (22)
 Geometry  5 (14) 5 (19) 5 (22)
 Measurement  6 (17) 4 (15) 3 (13)
 Data Analysis and Probability  5 (14) 4 (15) 4 (17)
 Total 35 26 23
All Items in the 2008 Operational Test Form Blueprints
 Number and Operations 17 (39) 16 (36) 15 (30)
 Algebra  7 (16)  7 (16)  8 (16)
 Geometry  7 (16)  7 (16) 10 (20)
 Measurement  6 (14)  8 (18)  9 (18)
 Data Analysis and Probability  7 (16)  7 (16)  8 (16)
 Total items in test 44 45 50
  • Note. Total number of study items across grades is 84; total number of items in the full test forms is 139. Percentages are rounded and may not sum to 100.

We chose these frameworks for this study because each provides unique information on the demands that items appear to place on examinees in order to respond successfully. (Spearman rank correlations, ρ, among the cognitive codes range between .41 and .56, median is .51; Pearson correlations, r, among the linguistic codes range from −.10 to .47, median is .17; and correlations among the cognitive and linguistic codes range between .06 and .53, median is .27.) Together, they provide a fairly comprehensive view of mathematics content knowledge, cognitive, and linguistic demands on examinees. They also provide information on cognitive and linguistic demands that may be relevant to the target achievement construct (e.g., the content standards targeted by items, knowledge types) or possibly irrelevant (e.g., linguistic demands, reading load) and on item features relevant to identifying response demands (e.g., question type, depth of knowledge). (See Huff & Ferrara, 2010, for a discussion.) These frameworks have proved useful in previous studies of test forms and items, are used widely in alignment studies for state assessment programs, or are familiar through their use in National Assessment of Educational Progress (NAEP) assessment frameworks. Perhaps most important, the content area knowledge and skills specifications and depth of knowledge, mathematical complexity, and question type frameworks capture key aspects of the proficiency level descriptors for Proficient and for the item writer training: the cognitive processes targeted by the writers, represented by the target verbs (see earlier), and the mathematical knowledge and skills, represented by the target nouns. We summarize each coding framework below and then describe coding procedures. Complete details on the coding frameworks are available in Ferrara et al. (2007).
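As a sketch of how the correlations reported in the preceding parenthetical could be computed, the code below applies Spearman's rho to a pair of ordinal cognitive codes and Pearson's r to a pair of linguistic counts. All codes shown are hypothetical, not the study's data.

from scipy.stats import spearmanr, pearsonr

# Hypothetical codes for eight items: ordinal cognitive levels (1-3) and raw linguistic counts.
reading_load = [1, 1, 2, 1, 3, 2, 1, 2]
depth_of_knowledge = [1, 2, 2, 1, 3, 2, 1, 1]
math_vocabulary = [0, 2, 1, 1, 3, 2, 0, 1]
prepositional_phrases = [1, 4, 2, 0, 6, 3, 2, 1]

rho, _ = spearmanr(reading_load, depth_of_knowledge)       # cognitive code pair
r, _ = pearsonr(math_vocabulary, prepositional_phrases)    # linguistic code pair

print(f"Spearman rho (reading load, depth of knowledge): {rho:.2f}")
print(f"Pearson r (mathematics vocabulary, prepositional phrases): {r:.2f}")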

Content Standards

This state assessment program uses five familiar content strands to define its state content standards and guide curriculum and test development: Number and Operations, Algebra, Geometry, Measurement, and Data Analysis and Probability (see Table 1).

Cognitive Response Demands Coding Frameworks

Reading load We define reading load as the amount and complexity of the textual and visual information provided with an item that an examinee must process and understand in order to respond successfully to an item. The focus of this framework is on item stems, response options, and other textual and visual material provided with an item. It does not pertain to reading passages and visual stimuli that appear with item sets except as relevant to evaluating the reading load of an item itself. This framework includes three levels of reading load: low, moderate, and high. Items with low reading load may include a small amount of text (e.g., one sentence or introductory phrase); items with moderate reading load include lower amounts of text and visuals and less complex text (i.e., in comparison to high reading load items); and items with high reading load may include a large amount of text (e.g., several sentences), much of which is complex linguistically or complex because of the content area concepts and terminology used.

Depth of knowledge (DOK) Webb (2007) defines four levels of knowledge for aligning item response demands with content standards and proficiency level descriptors. Recall (level 1) items require recalling information, such as facts, definitions, terms, or simple procedures and implementing simple algorithms or formulas (i.e., one-step, well-defined, straight algorithmic procedures). Skill/concept (level 2) items elicit cognitive processing beyond learned, automatized skills and processes. Level 2 items require examinees to make decisions about how to approach problems, in contrast to level 1 items that require rote responses and well-known algorithms. Strategic thinking (level 3) items require reasoning, planning, using evidence to respond, and using higher levels of thinking than do items at levels 1 and 2. Cognitive demands at level 3 are complex and abstract. Extended thinking (level 4) items are rare in state assessment programs like this one. Details on depth of knowledge coding and the Web Alignment Tool are available in Webb (2007) and at http://www.wcer.wisc.edu/wat/index.aspx.

Mathematical complexity According to the National Assessment Governing Board (NAGB), certain demands are placed on examinees’ thinking as they solve mathematics items. Identifying these demands can be used to determine the mathematical complexity of items. In the 2009 NAEP mathematics assessment framework (National Assessment Governing Board, n.d.), an item may evoke any of three levels of mathematical complexity: low, moderate, and high. Low complexity items may require examinees to recall a property or recognize a concept. Items in this category are straightforward, single operation items. Items of moderate complexity may require examinees to make connections between multiple concepts, multiple operations, and to display flexibility in thinking as they decide how to approach a problem. Items of high complexity may require examinees to analyze assumptions made in a mathematical model or to use reasoning, planning, judgment, and creative thought. Complexity demands assume that students are familiar with the mathematics concepts and skills required by an item.

Question type Question type categories are adapted from an Item Demands Analysis framework, developed in previous work on middle school science assessments (Ferrara et al., 2004, 2007; Ferrara & Duncan, 2011). In this study we use the content area skills section of the framework to identify the mathematics skill demands that these items place on examinees. Question types are categorical. Use/apply items may require examinees to use information provided with the item in visual displays (e.g., graphs, numerical charts) or in text, information generated as part of responding to the item (e.g., completing a calculation, using a mathematical formula to compute), or prior knowledge (e.g., an arithmetic fact, a computational procedure). Analyze/categorize/hypothesize items may require examinees to consider the components of a situation in order to group things according to identifiable features, observe and describe patterns in data (e.g., complete a numerical sequence), or compare and contrast (e.g., describe similarities or differences). Answer and explain items may require examinees to provide an answer to a question and then defend the answer (e.g., by providing a rationale for its plausibility or correctness) or explain the thought process or skills used to arrive at the answer. Detailed definitions of these coding categories and subcategories and empirical support for the coding categories appear in Ferrara et al. (2004).
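One convenient way to operationalize these four cognitive frameworks for coding and tallying is as fixed sets of category labels, as in the sketch below. The dictionary and function names are ours, for illustration; they are not part of the study's coding instrument.

COGNITIVE_FRAMEWORKS = {
    "reading_load": ("low", "moderate", "high"),
    "depth_of_knowledge": ("recall", "skill/concept", "strategic thinking", "extended thinking"),
    "mathematical_complexity": ("low", "moderate", "high"),
    "question_type": ("use/apply", "analyze/categorize/hypothesize", "answer and explain"),
}

def record_code(framework, code):
    """Reject codes that are not defined levels of the named framework."""
    if code not in COGNITIVE_FRAMEWORKS[framework]:
        raise ValueError(f"{code!r} is not a level of {framework}")
    return code

# Example: one coder's judgments for a single item.
item_codes = {f: record_code(f, level) for f, level in [
    ("reading_load", "low"),
    ("depth_of_knowledge", "recall"),
    ("mathematical_complexity", "low"),
    ("question_type", "use/apply"),
]}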

Linguistic Demands Frameworks

Mathematics items like those in this study place linguistic demands on examinees as they process items, formulate an understanding of item demands, and select or generate responses. While reading load might be a more general cognitive-linguistic demand, linguistic demands are specific to the language used in items. In this study we use a framework from an item difficulty modeling study of grades 4, 7, and 10 mathematics items (Shaftel et al., 2006). The authors concluded that “ambiguous wording, item length, difficult vocabulary, syntactic complexity with longer sentences, and comparison problems may contribute to item difficulty” (p. 110) and that “difficult mathematics vocabulary had a consistent effect on performance for all students at all grades” (p. 105). These language features of mathematics items can be represented as five linguistic attributes: (a) number of ambiguous, slang, multiple meaning, and idiomatic words or phrases, such as change, feet, function, set, and (it) took; (b) number of words that may be unusual or difficult and specific to mathematics (i.e., vocabulary), such as complementary, coordinate, equation, likelihood, perimeter, quotient, reflection, and symmetry; (c) number of complex verbs (i.e., verb forms of three words or more), such as had been going, would have gone; (d) number of relative pronouns, specifically, that, who, whom, whose, which (sometimes), why; and (e) number of prepositional phrases, such as phrases beginning with about, above, after, by, during, except, for, from, inside, instead of, into, like, of, over, past, since, and through.
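A rough sketch of how four of these five attributes could be counted with fixed word lists follows. The lists are tiny illustrative subsets of those described by Shaftel et al. (2006), not the lists used in this study, and complex verbs (multi-word verb forms) are omitted because they require parsing or manual coding.

import re

AMBIGUOUS = {"change", "feet", "function", "set", "took"}
MATH_VOCAB = {"complementary", "coordinate", "equation", "likelihood",
              "perimeter", "quotient", "reflection", "symmetry"}
REL_PRONOUNS = {"that", "who", "whom", "whose", "which", "why"}
PREPOSITIONS = {"about", "above", "after", "by", "during", "except", "for", "from",
                "inside", "into", "like", "of", "over", "past", "since", "through"}

def linguistic_counts(item_text):
    words = re.findall(r"[a-z']+", item_text.lower())
    return {
        "ambiguous_words": sum(w in AMBIGUOUS for w in words),
        "mathematics_vocabulary": sum(w in MATH_VOCAB for w in words),
        "relative_pronouns": sum(w in REL_PRONOUNS for w in words),
        "prepositional_phrases": sum(w in PREPOSITIONS for w in words),  # proxy: count listed prepositions
    }

print(linguistic_counts("What is the perimeter of the garden that took 12 feet of fence?"))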

Procedures

Item Response Demands Training and Coding Procedures

We examined the response demands of items that define Proficient performance. For each grade level assessment, we selected items with scale locations from the lowest obtainable scale score to the Proficient cut score plus the three items located just above the Proficient cut score. (We included three items above each cut score to account for the conditional standard error at the cut score.) In ordered item books used for item mapping standard setting methods, panelists make cut score decisions that indicate that students who are just barely Proficient (a) are likely to be able to respond successfully to these items (i.e., the Bookmark method), or (b) would be expected to respond successfully to these items because the item response demands match the demands described for the proficiency levels below Proficient (i.e., Item-Descriptor [ID] Matching). (See Cizek & Bunch, 2007, chaps. 10 and 11 for a description of these methods.) Examinees who just barely reach the Proficient level would be less likely to be able to respond to items located above the Proficient cut score because those items require knowledge and skills that those examinees are less likely to possess. Items in the analyses were selected based on their location on the theta scale and the cut score that defines Proficient. The numbers of items studied in grades 3, 4, and 5 are 35, 26, and 23, respectively.
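The selection rule just described can be stated compactly: order the items by their RP50 theta locations, keep every item at or below the Proficient cut score, and add the three items located just above it. A minimal sketch follows, with illustrative item locations and the grade 3 Proficient cut score from Table 3.

def select_study_items(item_locations, proficient_cut, n_above=3):
    """item_locations maps item id -> RP50 theta location on the within-grade scale."""
    ordered = sorted(item_locations.items(), key=lambda kv: kv[1])   # the "ordered item book"
    at_or_below = [item for item in ordered if item[1] <= proficient_cut]
    just_above = [item for item in ordered if item[1] > proficient_cut][:n_above]
    return at_or_below + just_above

locations = {"A": -3.72, "B": -1.43, "C": -0.12, "D": 0.20, "E": 0.60, "F": 1.90}
print(select_study_items(locations, proficient_cut=0.08))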

After each of the frameworks was adapted for the study, we coded items one grade at a time. Three researchers coded items independently. A co-author of this paper adapted the frameworks for this study, served as a coder for all frameworks, and trained a second coder, also a co-author, for the reading load, depth of knowledge, and question type frameworks. The trainer trained a different second coder (another co-author) for the mathematical complexity and linguistic frameworks. The trainer was a Ph.D. student in an educational psychology program, with a master's degree in educational psychology. The second coder for the reading load, depth of knowledge, and question type frameworks is a biology lab technician with a bachelor's degree in biology who is training as a science item writer. The trainer and this second coder were interns at CTB/McGraw-Hill in summer 2008. The second coder for the mathematical complexity and linguistic frameworks is a psychometrician at CTB/McGraw-Hill. The trainer provided definitions and examples for each framework, explained in a training session how to code items, and practiced coding with the other coders.

The researchers coded all items independently, one framework at a time, and coded items in each grade, one grade at a time. After coding was completed, inter-rater agreement was calculated. Final codes for all items were determined in consensus meetings between the first and second coders. Table 2 displays exact agreement rates between two coders (before consensus discussions) for the cognitive demands and linguistic demands codes.

Exact rater agreement rates varied across frameworks and grades, but were generally high, with some notable exceptions. Agreement rates tend to be lower in grade 3, the first grade coded for each framework, and increase in grades 4 and 5 for reading load and depth of knowledge. The lowest cognitive demand agreement rate is for mathematical complexity for grade 5 (48%). The highest agreement rates are for depth of knowledge for grades 4 and 5 and question type for grade 4, all over 90%. Exact plus adjacent agreement rates for continuous coding frameworks (e.g., reading load) all are 90% or higher. We found higher disagreement rates in the linguistic codes. While exact agreement on complex verbs was reasonable, other rates were unacceptably low. In response, we used a consensus process to specify lists of countable ambiguous words, mathematics vocabulary, complex verbs, pronouns, and prepositional phrases to remove all ambiguity, and recoded all items. Using those lists, we counted instances of each category. (The lists are available from the authors.)
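For the ordinal frameworks, the exact and exact-plus-adjacent agreement rates reported here and in Table 2 can be computed as in the sketch below; the reading load codes shown are hypothetical.

def agreement_rates(coder1, coder2):
    pairs = list(zip(coder1, coder2))
    exact = sum(a == b for a, b in pairs) / len(pairs)
    adjacent = sum(abs(a - b) <= 1 for a, b in pairs) / len(pairs)   # exact plus adjacent
    return exact, adjacent

# Reading load codes for ten items (1 = low, 2 = moderate, 3 = high).
coder1 = [1, 1, 2, 1, 3, 2, 1, 2, 1, 1]
coder2 = [1, 2, 2, 1, 3, 2, 1, 3, 1, 1]
exact, adjacent = agreement_rates(coder1, coder2)
print(f"Exact agreement: {exact:.0%}; exact plus adjacent: {adjacent:.0%}")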

For reporting purposes, we collapsed some linguistic demands counts into ranges of frequencies, based on size and variability of the counts. Counts for ambiguous words, mathematics vocabulary, and prepositional phrases varied widely and are reported as 0, 1–2, 3–5, and greater than 5. Complex verbs are reported in their original counts, 0 or 1. Pronouns are reported in their original counts, 0, 1, or 2.
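Collapsing the raw counts into those reporting ranges is a simple binning step, sketched below.

def count_range(n):
    """Map a raw linguistic count to the reporting ranges used for ambiguous words,
    mathematics vocabulary, and prepositional phrases."""
    if n == 0:
        return "0"
    if n <= 2:
        return "1-2"
    if n <= 5:
        return "3-5"
    return "more than 5"

print([count_range(n) for n in (0, 1, 4, 7)])   # ['0', '1-2', '3-5', 'more than 5']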

Results

Psychometric Characteristics of the Tests and the Sets of Study Items

Table 3 indicates the range of the locations of the grades 3, 4, and 5 study items on the within-grade scales and related information. We display the RP50 item locations to be consistent with the final locations of the Proficient cut scores on the within grade score reporting scales. Table 3 also contains information about the complete test forms.

The Proficient cut scores on the theta scales (i.e., .08, −.06, and −.01 for grades 3, 4, and 5) are located relatively close to the average difficulty of the operational items in each grade level test. Standard-setting panelists recommended, during a moderation process, an across grade articulation pattern of decreasing impact data (i.e., percentages of examinees at/above the Proficient level) across grades: 55, 52, and 50% of examinees for grades 3, 4, and 5 (see the last line of Table 3). They adjusted cut scores to achieve this articulation goal.

Mathematics Content Demands

Table 4 indicates the numbers of items included in the study that correspond to each content strand in each grade and the total numbers of items in the full tests. The response demands of these items represent the mathematics knowledge and skills that examinees at the Proficient level would be expected to be able to display. In addition, they should be consistent with the knowledge and skills represented in descriptors for the Minimal, Basic, Proficient, and Advanced levels.

The differences between the percentages of study items and of items on the 2008 operational test forms in each mathematics Competency are 5 percentage points or less for the grades 3 and 5 tests. Differences also are small in Geometry, Measurement, and Data Analysis and Probability for grades 4 and 5 and in Number and Operations for grade 5. For the grade 4 test, the percentage of study items is approximately 9 percentage points lower in Number and Operations and 7 percentage points higher in Algebra than in the operational blueprint. The study items that define Proficient cover the content strands quite consistently with the total test blueprint.

Cognitive Response Demands

Table 5 summarizes the counts of study items in each level or category for the four cognitive demands coding frameworks.

Table 5. Cognitive Response Demands Identified in Coding of the Study Items
Cognitive Demands Grade Level Scale
3 4 5 Total
Reading Load
 Low 27 (77) 18 (69) 12 (52) 57 (68)
 Moderate 7 (20) 4 (15) 9 (39) 20 (24)
 High 1 (3) 4 (15) 2 (9) 7 (8)
Depth of Knowledge
 Recall 15 (43) 19 (73) 14 (61) 48 (57)
 Skill/concept 17 (49) 7 (27) 7 (30) 31 (37)
 Strategic thinking 3 (9) 0 (0) 1 (4) 4 (5)
 Extended thinking 0 (0) 0 (0) 1 (4) 1 (1)
Mathematical Complexity
 Low 23 (66) 16 (62) 15 (65) 54 (64)
 Moderate 10 (29) 10 (38) 8 (35) 28 (33)
 High 2 (6) 0 (0) 0 (0) 2 (2)
Question Type
 Use/apply 30 (86) 19 (73) 19 (83) 68 (81)
 Analyze/categorize/hypothesize 5 (14) 6 (23) 2 (9) 13 (15)
 Answer and explain 0 (0) 1 (4) 2 (9) 3 (4)
  • Note. 35, 26, and 23 items (total = 84) are coded for grades 3, 4, and 5; column percentages in parentheses. Percentages are rounded and may not sum to 100.

Reading Load

The reading load for the grades 3, 4, and 5 tests is low (68% of all items have low reading load), and lowest at grade 3, as expected; fewer than 10% of all items have high reading load. Two thirds of these items demand that examinees read small amounts of text that is relevant only to the content area knowledge or skill that is required to respond; the processing required to select a response is obvious or explicitly stated; and stimulus materials (i.e., visual displays) in item stems are short, simple to process, and easy to understand. The number of items with moderate reading load increases at grade 5.

Depth of Knowledge

The depth of knowledge requirements that these items place on examinees are at the first two levels, recall and skill/concept. More than half of the items are recall items that require examinees to recall mathematics facts, definitions, and terms and apply simple procedures and rote algorithms. Another one-third of the items are skill/concept items that require examinees to decide how to approach problems and solve multi-step problems. Surprisingly, the percentage of recall items is lower at grade 3 than at grades 4 and 5, and the percentage of skill/concept items is higher at grade 3 than at grades 4 and 5.

Mathematical Complexity

The mathematical complexity of these items is primarily low. About two thirds of the items in each grade are low-complexity items, which are straightforward, single operation items that require examinees to recall a mathematical property or recognize a concept. Another third of the items are moderately complex; these items require examinees to make connections and to display flexibility as they decide how to approach a problem. The complexity of these items does not increase across grades.

Question Type

The majority of the items (81%) are use/apply items, which require examinees to use information provided with the item in visual displays (e.g., graphs, numerical charts) or text, information generated as part of responding to the item (e.g., completing a calculation, using a mathematical formula to compute), or prior knowledge (e.g., an arithmetic fact, a computational procedure). Another 15% are analyze/categorize/hypothesize items, which require examinees to consider the components of a situation in order to group things according to identifiable features, observe and describe patterns in data (e.g., complete a numerical sequence), or compare and contrast (e.g., describe similarities or differences). The number of analyze/categorize/hypothesize items is lower on the grade 5 test than on the grades 3 and 4 tests. Answer and explain items, which require examinees to provide or select and explain an answer, typically are in constructed response format. Only three answer and explain type items appear among the study items.

Linguistic Response Demands

Table 6 summarizes the counts of study items in each level or category for the five linguistics demands coding frameworks.

Table 6. Linguistic Response Demands Identified in Coding of the Study Items
Linguistic Demands Grade Level Scale
3 4 5 Total
Ambiguous Words
 0 6 (17) 6 (23) 3 (13) 15 (18)
 1–2 16 (46) 13 (50) 8 (35) 37 (44)
 3–5 11 (31) 6 (23) 10 (43) 27 (32)
 More than 5 2 (6) 1 (4) 2 (9) 5 (6)
Mathematics Vocabulary
 0 12 (34) 10 (38) 5 (22) 27 (32)
 1–2 23 (66) 16 (62) 15 (65) 54 (64)
 3–5 0 (0) 0 (0) 2 (9) 2 (2)
 More than 5 0 (0) 0 (0) 1 (4) 1 (1)
Complex Verbs
 0 31 (89) 22 (85) 20 (87) 73 (87)
 1 4 (11) 4 (15) 3 (13) 11 (13)
Pronouns
 0 3 (9) 11 (42) 16 (70) 30 (36)
 1 26 (74) 11 (42) 6 (26) 43 (51)
 2 6 (17) 4 (15) 1 (4) 11 (13)
Prepositional Phrases
 0 4 (11) 2 (8) 0 (0) 6 (7)
 1–2 13 (37) 8 (31) 8 (35) 29 (35)
 3–5 12 (34) 8 (31) 10 (43) 30 (36)
 More than 5 6 (17) 8 (31) 5 (22) 19 (23)
  • Note. 35, 26, and 23 items (total = 84) are coded for grades 3, 4, and 5; column percentages in parentheses. Percentages are rounded and may not sum to 100.

Ambiguous Words

Approximately three quarters of these items contain 1–5 ambiguous or multiple meaning words or phrases (e.g., feet, [it] took…). Shaftel et al. (2006) found that ambiguous words influenced item difficulty at grades 4, 7, and 10 and especially for 4th graders. Grade 5 items contain 3–5 ambiguous words somewhat more often than do grades 3 and 4 items.

Mathematics Vocabulary

Almost two-thirds of the items in our study (64%) contain at least one or two mathematics vocabulary terms that examinees are likely to find difficult (e.g., complementary, coordinate). The number of items with no mathematics vocabulary is lower at grade 5, where only one item contains more than five mathematics terms.

Complex Verbs

Overall, few of the studied items (13%) contain complex verbs (e.g., had been going) that would increase the syntactic complexity of the items.

Pronouns

Relative pronouns (e.g., that, which) can influence the syntactic complexity of mathematics items. Almost two thirds of these items (64%) contain one or two pronouns. The trend across grades actually is the opposite of what might be expected. Ninety-one percent of the grade 3 items contain one or two pronouns, while only 30% of the grade 5 items contain one or two pronouns.

Prepositional Phrases

Similarly, prepositional phrases (e.g., phrases beginning with after, except, instead of) can influence the syntactic complexity of mathematics items. Almost one quarter of all items (23%) include more than five prepositional phrases; more than one third (36%) include 3–5; and another one third (35%) include 1–2 prepositional phrases. The percentage of grade 3 items with three or more prepositions (51%) is lower than at grades 4 and 5 (62 and 65%, respectively).

Discussion

Results Related to Expectations for Achievement Growth

The developers of this state assessment program implemented a design and procedures to align the test development system with the test performance interpretation system. They articulated performance level descriptors across grades as a strategy to enable monitoring achievement growth. They trained item writers to write items whose empirical difficulty would be consistent—by design, if they were successful—with the performance levels they explicitly targeted. And standard-setting panelists articulated cut scores in the standard setting workshops by examining percentages of examinees at and above the Proficient level as part of the strategy for aligning the test development and performance interpretation systems. The item writers achieved some limited success in hitting proficiency level targets (40% and 54% at grades 3 and 4; only 14% at grade 5). Independent of the program developers’ efforts, in this study we evaluated the consistency of the hypothesized cognitive and linguistic demands that the items that define Proficient performance for these test forms place on examinees. In general, we found that cognitive and linguistic demands of the items that define Proficient performance for grades 3, 4, and 5 do not increase consistently, in a manner that reflects coherently increasing expectations for mathematics achievement.

We summarize the cognitive and linguistic demands of these mathematics items below. We expect cognitive demands that define Proficient performance to increase coherently across grade levels, and any linguistic demands, whether construct-relevant or construct-irrelevant, to increase across grade levels as well. We discuss response demands in the study items that increase consistently across grade levels and response demands that are disarticulated and undermine coherent expectations for demonstrating Proficient performance across grade levels. Table 7 summarizes the discussion and portrays patterns of cognitive and linguistic demands across grades.

Table 7. Summary of Item Response Demands Within and Across Grade Levels and Contributions to Coherent and Disarticulated Inferences About Achievement Growth
Response Demand Category Within Grade Across Grades Coherent Articulation?
Cognitive Demands
 Reading load Mostly low reading load at all grades Some moderate reading load at grade 5 (only) Reasonable
 Depth of knowledge Mostly recall (level 1) at all three grades Skill/concept (level 2) higher in grade 3 Disarticulation at grade 3
 Mathematical complexity Generally low No increase across grades Reasonable to expect an increase across grades
 Question type Majority are use/apply items Fewer analyze/categorize/ hypothesize items at grade 5 Disarticulation at grade 5
Vocabulary Demands
 Ambiguous words 75% of all items contain 1–5 ambiguous words Slight increase at grade 5 Reasonable
 Difficult mathematics terms ∼67% of all items contain 1–2 mathematics terms Number of items with no terms is lower at grade 5; only grade 5 items contain 3+ terms Inconsistent
Syntactical Complexity Demands
 Complex verbs Few items contain complex verbs, at all grades No increase Reasonable
 Pronouns ∼67% of all items contain 1–2 pronouns Number of pronouns decreases across grades Disarticulation
 Prepositions Almost all items contain 1–2 prepositions, some as many as 5 or more Percentage with 3 or more prepositions is lower at grade 3 Reasonable

Hypothesized Cognitive Demands

The hypothesized cognitive response demands of these items do not increase consistently across the grade level tests, as might be expected for items that represent Proficient performance at each grade. Specifically:

  • The reading load of these items is mostly low at all three grade levels. The number of items at the moderate level does increase at grade 5.

  • The depth of knowledge response demands are mostly at the lowest level (recall) at all three grades. Depth of knowledge demands are slightly higher at grade 3: the percentage of recall (level 1) items is lower, and the percentage of skill/concept (level 2) items is higher, at grade 3 than at grades 4 and 5.

  • The mathematical complexity of these items generally is low. Complexity does not increase across grades.

  • The majority of the items are use/apply items, which require examinees to use information provided with the item. The number of analyze/categorize/hypothesize items, which may require more complex cognitive processing, is lower on the grade 5 test than on the grades 3 and 4 tests.

Hypothesized Vocabulary Demands

As with the cognitive demands, hypothesized vocabulary demands do not increase consistently across the grade level tests, in contrast to what might be expected. Specifically:

  • Approximately three quarters of these items contain at least one and as many as five ambiguous words. Grade 5 items contain 3–5 ambiguous words somewhat more often than do grades 3 and 4 items.

  • Almost two thirds of the items contain at least one or two mathematics terms that require specific mathematics understanding. In contrast to what might be expected, the number of items with no mathematics vocabulary is lower at grade 5; only grade 5 items contain three or more mathematics terms.

Hypothesized Influences of the Syntactic Complexity of These Mathematics Items

The hypothesized linguistic response demands related to syntactic complexity of these items do not increase consistently across the grade level tests, in contrast to what might be expected. Specifically:

  • Few of these items contain complex verbs that would increase the syntactic complexity of these items.

  • Almost two thirds of these items contain one or two pronouns. The trend across grades actually is the opposite of what might be expected. Almost all grade 3 items contain one or two pronouns, while only approximately one-third of the grade 5 items contain one or two pronouns.

  • Almost all of these items contain one or two prepositional phrases; some contain as many as five or more. The percentage of grade 3 items with three or more prepositions is slightly higher than at grades 4 and 5. (These items are not unusually wordy or complex compared to other state assessment program items.)

Using the Coding Frameworks to Examine the Articulation of Response Demands Across Grade Levels

These hypothesized cognitive and linguistic demands are useful for examining across-grade articulation, particularly for assessing achievement growth across grades. Specifically, testing program developers can examine item response demands and item scale locations during test forms assembly to select items that are more closely aligned with targeted performance level descriptors in each grade level. Doing so should improve the parallelism of annual test forms within the same grade. And, because performance level descriptors define increasing expectations for Proficient performance across grades (that is, growth expectations), doing so could improve the clarity and validity of inferences about achievement growth as examinees continue through their schooling careers.
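As a concrete illustration of this kind of selection step, the sketch below filters a coded item pool by scale location and targeted response demands. The names (Item, candidates_for_proficient, the cut score values) and the exact matching rule are illustrative assumptions rather than the assessment program's operational procedure; in practice such a filter would operate alongside, not instead of, the content blueprint and statistical targets.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Item:
    item_id: str
    scale_location: float            # location on the within-grade IRT scale
    content_strand: str              # blueprint strand the item measures
    demand_codes: Dict[str, int] = field(default_factory=dict)  # e.g., {"reading_load": 1}


def candidates_for_proficient(pool: List[Item],
                              proficient_cut: float,
                              next_cut: float,
                              target_codes: Dict[str, int]) -> List[Item]:
    """Return items located between the Proficient and next cut scores whose
    hypothesized response demands match the targets implied by the grade's
    proficiency level descriptor (illustrative matching rule only)."""
    selected = []
    for item in pool:
        in_band = proficient_cut <= item.scale_location < next_cut
        matches = all(item.demand_codes.get(k) == v for k, v in target_codes.items())
        if in_band and matches:
            selected.append(item)
    return selected


# Hypothetical usage: grade 4 items targeted at Proficient with level-2 depth of knowledge.
# pool = [...]  # coded item bank for the grade
# grade4_candidates = candidates_for_proficient(
#     pool, proficient_cut=450.0, next_cut=500.0,
#     target_codes={"depth_of_knowledge": 2})
```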

Table 7 summarizes the response demands within and across grade levels and conclusions about the contributions of these response demands to coherence and disarticulation in describing Proficient performance across grade levels.

Table 7 indicates that the items in this study that define Proficient performance increase response demands in reasonable ways in terms of reading load, ambiguous words, complex verbs, and numbers of prepositional phrases. The items that define Proficient performance introduce disarticulation in defining growth in Proficient performance in terms of response demands related to depth of knowledge, mathematical complexity, question type, difficult mathematics terms, and numbers of relative pronouns.

Examining the Influence of Hypothesized Item Response Demands on Item Difficulty and Discrimination

Because targeting proficiency levels during item writing and test forms assembly is, in essence, predicting item difficulty, we also investigated the degree to which these item response demand coding frameworks predict item difficulty. We followed the tradition of item difficulty modeling studies (e.g., Gorin & Embretson, 2006) and used the item codings as independent variables to predict item p-values in linear regressions. In addition, because item discriminations (i.e., item-total test score correlations) represent the relationship of an item to the construct represented by all items in a test, we also used them as the dependent variable in separate regressions to conduct item discrimination modeling analyses. We conducted regression analyses using the study items only and using all items (we coded the items above the Proficient cut scores in order to conduct these analyses), and we conducted regressions within and across grades.

Item difficulty modeling aims to identify item features that are related to item cognitive processing and to estimate the impact of those features on item difficulty (Bejar, 1993; Bennett, 1999). Item difficulty modeling studies in a range of contexts have produced R2 values as low as .04 for mathematics items (e.g., Shaftel et al., 2006) and as high as .35 for listening comprehension items and .58 for reading comprehension items on the paper–pencil version of the Test of English as a Foreign Language (TOEFL; see Huff, 2003, table 2.1). These results suggest that other factors determine the empirical difficulty of test items for an examinee population. (We hypothesize that opportunity to learn [e.g., Herman, Klein, & Abedi, 2000; McDonnell, 1995] is one of those factors.) Item discrimination modeling aims to evaluate the relationship between the response demands of individual items and the response demands of the achievement construct represented by the entire test. And, because item discriminations (i.e., item-total, point-biserial correlations) are an item quality indicator, it is valuable to know how well response demand codes predict discriminations.
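As a minimal sketch of this kind of analysis, the regressions described above could be set up as follows, assuming the nine codings and the classical item statistics are assembled in a single item-level table. The file and column names are hypothetical, and this is an illustration of the general approach rather than the authors' analysis code.

```python
import pandas as pd
import statsmodels.api as sm

# One row per item; file and column names are hypothetical.
items = pd.read_csv("item_codings.csv")

cognitive = ["reading_load", "depth_of_knowledge", "math_complexity", "question_type"]
linguistic = ["ambiguous_words", "math_terms", "complex_verbs", "pronouns", "prepositions"]
X = sm.add_constant(items[cognitive + linguistic])

# Item difficulty modeling: regress classical p-values on the nine codings.
difficulty_fit = sm.OLS(items["p_value"], X).fit()

# Item discrimination modeling: same predictors, item-total correlation as the DV.
discrimination_fit = sm.OLS(items["item_total_r"], X).fit()

print(difficulty_fit.rsquared, discrimination_fit.rsquared)
print(difficulty_fit.summary())
```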

Results

After conducting full model regressions (i.e., models that included all four cognitive and all five linguistic codings) to identify significant predictors, we estimated reduced models that retained only the significant predictors. Of the 16 full model regressions (i.e., difficulty and discrimination regressions × three within-grade and one across-grade regressions × study-item versus all-item regressions, or 2 × 4 × 2), six were significant. In all cases except one, the statistical power for significant models was greater than .92; power was low or moderate (i.e., .28 to .54) for all non-significant models. R2 for significant full models ranged from .27 to .62, consistent with the results from other studies reported earlier.
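The full-versus-reduced comparison can be sketched as follows. The paper does not report the exact reduction procedure, so this single-pass reduction (fit the full model, retain predictors with p below α, refit) is an assumption made for illustration; looping it over the two dependent variables, the three within-grade data sets plus the pooled across-grade set, and the two item sets yields the 16 regressions described above.

```python
import pandas as pd
import statsmodels.api as sm


def fit_full_and_reduced(y: pd.Series, X: pd.DataFrame, alpha: float = 0.05):
    """Fit the full model, then refit keeping only predictors with p < alpha.

    Single-pass reduction, used here only to illustrate the comparison of
    full and reduced models; not the authors' documented procedure.
    """
    full = sm.OLS(y, sm.add_constant(X)).fit()
    keep = [c for c in X.columns if full.pvalues[c] < alpha]
    if not keep:                      # no significant predictors: keep the full model
        return full, full
    reduced = sm.OLS(y, sm.add_constant(X[keep])).fit()
    return full, reduced
```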

Item difficulty modeling results, reduced models. We discuss the across-grades model based on all test items here because it compares favorably with the significant within-grade models (R2 = .28, p < .01, α = .05).2 Reading load, question type, number of ambiguous words, number of mathematics terms (i.e., the mathematics vocabulary coding framework), and number of relative pronouns are significant predictors of item difficulty.

Item discrimination modeling results, reduced models. Although the across-grades model based on all test items is significant (p < .05, α = .05) and mathematics vocabulary is a significant predictor, the model explains only 3% of the variance in item discrimination. The across-grades model based only on the study items is significant (R2 = .26, p < .01, α = .05). In this model, reading load, depth of knowledge, number of ambiguous words, and number of mathematics terms are significant predictors of item discrimination.

In all regressions, residual plots for both models suggest only minor heteroskedasticity and departures from normality. Results for all 16 regressions are available from the authors.

Noteworthy findings. Reading load is the only consistent cognitive response demand related to item difficulty and discrimination; it is significant in all 16 regression models. Among the linguistic demands, number of ambiguous words is a consistently significant predictor of difficulty and discrimination. Number of mathematics terms also is a consistently significant predictor of difficulty and discrimination. Mathematics terms (i.e., the mathematics vocabulary coding framework), discussed here as a linguistic demand, also can be thought of as a cognitive demand in the sense that understanding mathematics terms likely indicates some understanding of the mathematics concepts and procedures the terms represent.

The Role of Item Response Demands in Item Difficulty and Discrimination

We had hoped that modeling the relationship between item features that represent hypothesized response demands and item difficulty and discrimination statistics would illuminate achievement growth expectations and provide additional validation support for the item response demands coding frameworks. We found that reading load, ambiguous words, and mathematics vocabulary are consistently related to item difficulty and discrimination. These results provide empirical support for using these three coding frameworks in item writer training and in other item evaluations and research when the goal is to predict item difficulty or discrimination. The results represent disconfirming evidence for the other coding frameworks, even though other studies demonstrate their usefulness for evaluating item response demands and their accuracy in predicting actual examinee cognitive processing observed in think-aloud protocols (Ferrara & Chen, 2011). These results suggest that training item writers on item response demands alone would provide only limited guidance in successfully targeting proficiency levels. And they are consistent with Hambleton and Jirka's (2006) conclusion that anchor items and anchor item descriptions may contribute to accurate item difficulty predictions.

Conclusion

The results of this study suggest that establishing performance standards that enable coherent inferences about what can be expected of Proficient students across grade levels is a significant challenge, even where care is taken to attend to supporting those inferences. Our results from examining expectations for Proficient performance suggest that item selection should attend to cognitive and linguistic demands, like those identified here, as well as the content blueprint and psychometric considerations that typically are targeted. Unintended shifts across grade levels in cognitive and linguistic demands for Proficient performance obscure what it means to perform at the Proficient level and grow in achievement across grades. Likewise, failing to attend to cognitive and linguistic demands of within-grade alternate test forms can undermine necessary assumptions about the parallelism of those forms. Results from this study suggest that the likelihood of achieving a fully parallel development and interpretation system, and of enabling valid inferences about achievement growth across grades, can be enhanced by training item writers more effectively to hit their proficiency level descriptor targets and by assembling within-grade test forms from items that (a) meet content blueprint requirements, (b) align with targeted proficiency levels, and (c) represent coherent cognitive and linguistic response demands.

The roles of content, cognitive, linguistic, and other response demands in item development, test design and development, and validation of inferences from test performances suggest many research questions. Suggestions for future studies include (a) conducting empirical research on the efficacy of the coding frameworks we used in this study, especially those that have limited or no empirical support (e.g., mathematical complexity, depth of knowledge); (b) conducting studies like this one for other tests, grades, and content areas; (c) examining how accurately judgmental coding of items “predicts” evidence of actual cognitive processing observed in think-aloud studies (see Ferrara & Chen, 2011); (d) examining the role of opportunity to learn in item difficulty and discrimination statistics and modeling studies; and (e) conducting validation studies that compare item quality and alignment with intended proficiency level targets for item writers trained under current, standard protocols and under enhanced conditions like those discussed here.

Acknowledgments

Parts of the research reported in this paper are based on work supported by a National Science Foundation grant award #0126088.

    Notes

  1. In the case of the Bookmark and ID Matching methods, the probability of successful responses is based on a Response Probability (RP) criterion, most often .50 or .67, that is used to locate and order items on the underlying IRT scale. Examinees with estimated proficiency, θ, equal to the RP50 location of an item have an estimated 50% chance of responding correctly to that item (see the illustrative formula following these notes).
  2. R2 for all across-grade regressions in this study are lower than R2 for within-grade regressions, primarily for the grade 3 regressions.
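As an illustration of note 1, under a two-parameter logistic IRT model (an assumption made only for illustration; the note does not specify the model), an item with discrimination a and difficulty b has its RP location at

```latex
P(\theta) = \frac{1}{1 + \exp[-a(\theta - b)]},
\qquad
\theta_{RP} = b + \frac{1}{a}\ln\!\left(\frac{RP}{1 - RP}\right),
```

so the RP50 location is simply b, and the RP67 location is roughly b + 0.7/a.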