Volume 52, Issue 1, pp. 55–69
Original Article

A General Linear Method for Equating With Small Samples

Anthony D. Albano

University of Nebraska–Lincoln

First published: 05 March 2015

Abstract

Research on equating with small samples has shown that methods with stronger assumptions and fewer statistical estimates can lead to decreased error in the estimated equating function. This article introduces a new approach to linear observed-score equating, one which provides flexible control over how form difficulty is assumed versus estimated to change across the score scale. A general linear method is presented as an extension of traditional linear methods. The general method is then compared to other linear and nonlinear methods in terms of accuracy in estimating a criterion equating function. Results from two parametric bootstrapping studies based on real data demonstrate the usefulness of the general linear method.

Equating methods estimate a statistical relationship between two score scales that is used to define their equivalence. When these scales come from two forms of the same test, created according to the same specifications and having similar statistical properties, equating is said to produce equivalent scores across forms, so that the forms may be treated as interchangeable (Kolen & Brennan, 2004). When measures differ in some way, for example in content, reliability, or intended population, a different type of relationship is estimated, and the linked scales are said to be less than equivalent to some degree (Holland & Dorans, 2006).

The equivalence between two scales, X and Y, is estimated by an equating function that expresses any score x on form X in terms of its equivalent value on Y, while adjusting for the difficulty difference between them. To be effective, equating functions must accurately summarize the difficulty difference between the two forms so that it may be fully accounted for when transforming scores. When sample sizes are large, or population data are available, an equipercentile equating function can be used to adjust for a difficulty difference that potentially varies by score point. When samples are not large, sampling error can make equipercentile equating unreliable, and a simpler equating function may be more appropriate. Rather than estimate a difficulty difference that varies by score point, the observed-score equating functions described below make different assumptions regarding the difficulty difference between forms and how this difference does or does not change across the X and Y score scales.

Numerous studies have explored equating methods that approximate the equipercentile function under smaller sample sizes (e.g., Babcock, Albano, & Raymond, 2012; Livingston, 1993; Livingston & Kim, 2010; Skaggs, 2005). As with any procedure based on statistical estimates, these methods involve a trade-off between random sampling error and bias: as one tends to decrease, the other tends to increase; however, previous research has shown that some methods involving stronger assumptions and fewer statistical estimates can lead to decreases in sampling error without significant increases in bias. For example, Kim, von Davier, and Haberman (2008) showed the utility of averaging different observed-score functions with the identity equating function. Livingston and Kim (2009) demonstrated a curvilinear approximation to equipercentile equating that combines linear and identity equating. Both of these simplified functions led to reductions in standard error (SE) without significant increases in bias, compared to other methods.

This article introduces a new approach to observed-score equating with small samples. The approach combines information from multiple sources, including the score scales and empirical distributions of X and Y, to flexibly control how form difficulty is assumed versus estimated to change across the X and Y scales. The result is an equating function that can be tailored to the specific features of a testing program. The traditional linear methods, identity, mean, and linear equating, are first presented below, and a more general linear function is introduced. Two methods that extend or modify the traditional identity function, circle-arc and synthetic equating, are also discussed. Finally, the general linear method is compared to other observed-score methods in two parametric bootstrapping studies based on real data.

Observed-Score Equating Functions

As noted above, observed-score equating functions make different assumptions regarding the difficulty difference between forms and how this difference does or does not change across the X and Y scales. Based on these assumptions, form difficulty is then estimated, to some extent, using empirical score distributions for individuals taking X and Y. For simplicity, most of the discussion below is based on a situation where individuals taking X and Y are sampled from the same population. As a result, scores are presumed to come from an equivalent-groups or single-group equating design, where estimated differences in the X and Y score distributions reflect a combination of (a) actual differences in form difficulty, (b) random error caused by the sampling process, and (c) systematic error or bias caused by violations of the assumptions of a given function.

The assumptions of a given function can be expressed in the form of a line within the coordinate space defined by X and Y (for a geometric discussion of equating in the context of average functions, see Holland & Strawderman, 2011). The ordered possible scores on X and their corresponding equated scores on Y make up the coordinates $(x, l_Y(x))$ for this equating line, where x is an observed score on X and $l_Y(x)$ is a function linking or equating x to the Y scale. For example, the simplest equating function, the identity function, makes no adjustment to scores on X when converting them to Y. Thus, scores are considered equated when they have the same value. Coordinates for the identity equating line are found simply as $(x, x)$.

A General Linear Function

To develop the general linear function, a more general form of the traditional identity function is first presented. Note that, in addition to assuming that X and Y do not differ in difficulty, identity equating also assumes that the scales for each form are equal. This may not always be the case. Instead, differences in the score scales for X and Y, perhaps caused by the removal or modified scoring of one or more items, may provide useful information regarding differences in their empirical distributions. A more flexible form of the identity function is formulated here to take scale differences into account. Coordinates for scores of x and y are found based on their relative positions within each scale:
$$\frac{x - x_1}{x_2 - x_1} = \frac{y - y_1}{y_2 - y_1} \quad (1)$$

Here, $(x_1, y_1)$ and $(x_2, y_2)$ are coordinates for any two points in the scales of X and Y, for example, the minimum and maximum possible scale values. The coordinate $(x_1, y_1)$ defines a point through which the line will pass. Solving Equation 1 for y and rearranging terms produces a more general form of the identity function:

$$id_Y(x) = \frac{\Delta_Y}{\Delta_X}x + y_1 - \frac{\Delta_Y}{\Delta_X}x_1, \quad (2)$$

where $\Delta_Y = y_2 - y_1$ and $\Delta_X = x_2 - x_1$, and $id_Y(x)$ denotes the identity function applied to scores on X. In this linear form, the slope and intercept of the equating line are $\Delta_Y/\Delta_X$ and $y_1 - (\Delta_Y/\Delta_X)x_1$, respectively, as opposed to 1 and 0 for the traditional identity line.
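To make the slope–intercept form of Equation 2 concrete, the following base R sketch (an illustrative helper, not code from the equate package) computes the general identity function from two known points in each scale, such as the minimum and maximum possible scores:

```r
# General identity function of Equation 2: map x to the Y scale using
# two known points, (x1, y1) and (x2, y2), from the scales of X and Y.
general_identity <- function(x, x1, x2, y1, y2) {
  slope <- (y2 - y1) / (x2 - x1)  # Delta_Y / Delta_X
  slope * x + y1 - slope * x1     # intercept is y1 - slope * x1
}

# When the scales match, the traditional identity is recovered:
general_identity(42, x1 = 0, x2 = 100, y1 = 0, y2 = 100)   # 42
# With unequal scales, scores are adjusted proportionally:
general_identity(42, x1 = 5, x2 = 250, y1 = 15, y2 = 200)  # about 42.9
```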

In its more general form, the identity function still does not estimate any parameters of the X and Y score distributions. However, the traditional assumption of no difficulty difference between forms is modified. Forms are now assumed only to differ in difficulty as a result of known differences in their scales. Because the traditional and more general identity functions do not involve parameter estimates, the SEs for the equating line $(x, id_Y(x))$ will always be zero. To the extent that form difficulty differences do exist, the identity lines will both be biased. However, by accounting for form difficulty differences resulting entirely from scaling differences, the general form of the identity function can potentially be less biased than the traditional form.

The identity line in Equation 2 can also be shifted upward or downward so as to pass through any coordinate pair $(\alpha_X, \alpha_Y)$; this point on the equating line could be based on known scale values, such as $(x_1, y_1)$, or on estimates from the distributions of X and Y. Furthermore, the slope of the identity line, expressed by $\Delta_Y/\Delta_X$, can also be replaced by any estimate $\beta_Y$ of the variability in Y over any estimate $\beta_X$ of the variability in X. This results in the general linear equating function:

$$lin_Y(x) = \frac{\beta_Y}{\beta_X}(x - \alpha_X) + \alpha_Y. \quad (3)$$

The mean and linear equating functions, as traditionally defined, are obtained by using the means for X and Y to estimate the coordinate pair $(\alpha_X, \alpha_Y) = (\mu_X, \mu_Y)$ through which the line passes, and then replacing the slope $\beta_Y/\beta_X$ with 1 in mean equating and with the ratio of standard deviations (SDs) for Y over X in linear equating.
In linear equating, it is assumed that the difficulty difference between forms is captured by their means and SDs. Any change in scale length need not be included in the equation, as its impact on the difficulty difference is captured in the SDs. Coordinates for the linear equating line are thus obtained using standardized deviation scores:
$$\frac{x - \mu_X}{\sigma_X} = \frac{y - \mu_Y}{\sigma_Y} \quad (4)$$
In mean equating, it is assumed that the mean difference between forms suffices to explain a constant change in difficulty across the score scale. Thus, coordinates for the mean line are obtained using mean deviations. SDs, instead of being estimated, are assumed to be equal. However, in its more general form, the mean function can account for any impact that the scales of X and Y may have on variability in the respective distributions. Coordinates for the general mean equating line are then defined by mean deviations relative to their position within each scale:
$$\frac{x - \mu_X}{\Delta_X} = \frac{y - \mu_Y}{\Delta_Y} \quad (5)$$
The general mean function does not estimate parameters for variability. In this way, it is similar to the traditional and general identity functions, and the traditional mean function. However, in contrast to the traditional functions, the general mean function assumes that the constant difficulty difference between forms may change, linearly, due to the difference in scaling for the forms. As with the general identity function, this may lead to less bias than the traditional mean function.

The general form of the linear function can be used to obtain other linear functions that combine information from the scales and empirical distributions of X and Y. For example, in practice the SD may be estimated reliably for one form (e.g., Y) but not the other (X), perhaps because of a small sample size; in this case, estimating the slope of Equation 3 with $\beta_Y = \sigma_Y$ and a fixed value for $\beta_X$ may be preferable to either assuming the slope to be 1, as in traditional identity and mean equating, or estimating both parts of it, as in traditional linear equating. In another situation, the mean may adequately represent average difficulty on one form (e.g., X) but not the other (Y); in this case, $\alpha_X = \mu_X$ and a fixed value for $\alpha_Y$ may better estimate the central difference in form difficulty. These examples represent different combinations of identity, mean, and linear functions for X and Y in situations where sample size may limit the reliability of certain estimates.
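As a sketch of how these combinations might be computed, the base R function below implements Equation 3 directly; the helper name and the numeric values are illustrative assumptions, not quantities from the studies reported here:

```r
# General linear function of Equation 3: a line through (alpha_x, alpha_y)
# with slope beta_y / beta_x.
general_linear <- function(x, alpha_x, alpha_y, beta_x, beta_y) {
  (beta_y / beta_x) * (x - alpha_x) + alpha_y
}

x  <- 0:50          # score scale for form X
mx <- 31; sx <- 7   # estimated mean and SD for X
my <- 28; sy <- 6   # estimated mean and SD for Y

general_linear(x, mx, my, 1, 1)     # traditional mean equating (slope 1)
general_linear(x, mx, my, sx, sy)   # traditional linear equating
general_linear(x, mx, my, 6.5, sy)  # mixed: beta_x fixed at an assumed
                                    # value, beta_y estimated from data
```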

The general linear function can also be applied with any nonequivalent-groups linear method that estimates means and SDs of X and Y in the target population T using weighted combinations of estimates from the nonequivalent-groups design. Thus, the Tucker and Levine linear methods for nonequivalent groups (Kolen & Brennan, 2004) can both be used to obtain parameters for Equation 3 (for details, see Albano, 2014). Combined forms of the resulting function can also be estimated, as described above.

Two other methods for combining equating functions have been demonstrated in the literature. The first reviewed here is referred to as circle-arc equating (Livingston & Kim, 2009); it involves a curvilinear combination of the mean and identity lines. The second is referred to as synthetic equating (Kim et al., 2008); it involves a combination of linear equating functions by averaging (see Holland & Strawderman, 2011).

Circle-Arc Functions

Nonlinear equating functions are appropriate when X and Y differ nonlinearly in difficulty, that is, when difficulty differences fluctuate across the score scale. As noted above, equipercentile equating is the most flexible method for accounting for these fluctuating difficulty differences. In equipercentile equating, no assumptions are made about how the forms relate in terms of difficulty. Instead, difficulty differences are estimated at each score point; thus, each coordinate on the equipercentile curve is estimated using information from the distributions of X and Y.

Circle-arc equating also defines a nonlinear relationship between score scales; however, only three score points in X and Y are used to do so: the low and high points, as defined above for the identity function, and a midpoint $(x^*, y^*)$ that is estimated based on the score distributions for X and Y. On their own, the low and high points define the identity linking function $id_Y(x)$, a straight line. When the midpoint does not fall on the identity linking line, it can be connected to $(x_1, y_1)$ and $(x_2, y_2)$ by the circumference of a circle with center $(x_c, y_c)$ and radius r. Having obtained these values, the circle-arc function is then defined as

$$circ_Y(x) = y_c \pm \sqrt{r^2 - (x - x_c)^2}, \quad (6)$$

where the second quantity, under the square root, is added to $y_c$ when $y^* > id_Y(x^*)$ and subtracted when $y^* < id_Y(x^*)$. When $y^* = id_Y(x^*)$, the circle-arc function is the identity function.

Livingston and Kim (2010) refer to the arc connecting $(x_1, y_1)$, $(x^*, y^*)$, and $(x_2, y_2)$ as symmetric circle-arc equating. They also describe a simpler approach, which they refer to as simplified circle-arc equating. In this approach, the circle-arc function is decomposed into the linear component defined by $(x_1, y_1)$ and $(x_2, y_2)$, which is the identity function, and the circumference of a circle passing through the points $(x_1, 0)$, $(x^*, y^* - id_Y(x^*))$, and $(x_2, 0)$. The low and high points in the simplified circle-arc thus reduce to $(x_1, 0)$ and $(x_2, 0)$. All three points are then used to find the center coordinates and radius of the new circle, $(x_c^*, y_c^*)$ and $r^*$, which are combined with the identity function to obtain the simplified circle-arc function:

$$circ_Y^*(x) = id_Y(x) + y_c^* \pm \sqrt{r^{*2} - (x - x_c^*)^2}. \quad (7)$$
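The center and radius of the symmetric circle-arc can be found by requiring the circle to be equidistant from the three points, which yields two linear equations in the center coordinates. The base R sketch below is one way to implement Equation 6 under that approach; it is an illustration, not the equate package implementation:

```r
# Symmetric circle-arc (Equation 6): equate x via the circle through the
# low point p1, estimated midpoint pm, and high point p2, each c(x, y).
# Assumes the three points are not collinear.
circle_arc <- function(x, p1, pm, p2) {
  A <- rbind(2 * (pm - p1), 2 * (p2 - pm))  # perpendicular-bisector system
  b <- c(sum(pm^2) - sum(p1^2), sum(p2^2) - sum(pm^2))
  center <- solve(A, b)                     # center (xc, yc)
  r <- sqrt(sum((p1 - center)^2))           # radius
  # Add the square root when the arc lies above the center, else subtract.
  s <- if (pm[2] > center[2]) 1 else -1
  center[2] + s * sqrt(r^2 - (x - center[1])^2)
}

# Low and high points from the scales, midpoint from the form means:
circle_arc(x = 0:50, p1 = c(0, 0), pm = c(31, 28), p2 = c(50, 50))
```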

Research has shown that the circle-arc method can produce significantly lower total error than other methods for equating with small samples (e.g., Livingston & Kim, 2009, 2010). A drawback to the simplified function is that it does not maintain the symmetry property of equating (for a discussion of symmetry and other properties of equating, see Holland & Dorans, 2006). The loss of symmetry may be worth the gains that the method achieves in equating accuracy (Livingston & Kim, 2011). However, given that the symmetric and simplified functions have been shown to produce very similar results (e.g., Livingston & Kim, 2010), symmetric circle-arc equating may be preferable.

Both of the circle-arc methods involve curvilinear combinations of the mean and identity lines. Note that Livingston and Kim (2009) recommend using the lowest and highest meaningful score points, that is, Equation 2, to define the identity component of the resulting composite function. The circle-arc functions can also be formulated to combine the mean with different forms of the general linear function in Equation 3. In each case, the identity component is simply replaced by the general linear function, and the circle-arc is redefined to pass through the low and high points of Equation 3 as opposed to Equation 2.

Synthetic Functions

As noted above, the circle-arc equating functions involve a curvilinear combination of the identity and mean functions, where the circle-arc overlaps with the identity function at the low and high points and with the mean function at the midpoint $(x^*, y^*)$. Kim et al. (2008) demonstrated a different approach to incorporating information from the identity function using what they refer to as synthetic linking and equating. The synthetic function is a weighted average of the identity and another linear or nonlinear observed-score function:

$$syn_Y(x) = w_1 f_1(x) + w_2 f_2(x), \quad (8)$$

where one of the functions to be combined, $f_1$ or $f_2$, is $id_Y(x)$, and the weights $w_1$ and $w_2$ are values between 0 and 1 that sum to 1.
Similar to the general circle-arc functions, synthetic functions can also involve combinations of general linear functions. Combining linear and identity linking functions, for example, results in a synthetic function with slope

$$a_{syn} = w_1 a_1 + w_2 a_2 \quad (9)$$

and intercept

$$b_{syn} = w_1 b_1 + w_2 b_2, \quad (10)$$

where $a_1$ and $a_2$ are the slopes, and $b_1$ and $b_2$ the intercepts, for the first (linear) and second (identity) functions. The synthetic weights are used to control the influence of a given function on the final result. In the example above, when the linear weight is 1, the identity weight is 0, and $syn_Y(x)$ reduces to the linear function; conversely, when the linear weight is 0, the identity weight is 1, and $syn_Y(x) = id_Y(x)$. As shown by Kim et al. (2008), the size of the weight determines the error associated with the synthetic function. When the identity weight is .5, for example, the SE of $syn_Y(x)$ is half of what it is when the identity weight is zero, and the bias is an equal mix of the two components: half the bias of linear equating plus half the bias of identity equating.
Like the simplified circle-arc function, the synthetic function will generally not be symmetric. However, symmetry can be maintained for any combination of two or more linear functions when the weights are adjusted by the slopes of the linear functions being combined (Holland & Strawderman, 2011). For two linear functions, the adjusted weight $W_1$ is

$$W_1 = \frac{w_1(1 + a_1^2)^{-1/2}}{w_1(1 + a_1^2)^{-1/2} + w_2(1 + a_2^2)^{-1/2}}. \quad (11)$$
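To illustrate, the sketch below averages two linear functions, each represented as a slope–intercept pair, with the weight adjustment of Equation 11 applied as reconstructed above (the helper name and values are assumptions for illustration):

```r
# Symmetric synthetic average of two linear functions (Equations 9-11).
# f1 and f2 are lists with slope a and intercept b; w1 is the nominal
# weight given to f1.
synthetic <- function(x, f1, f2, w1, symmetric = TRUE) {
  if (symmetric) {
    # Equation 11: reweight by the inverse length of each direction vector.
    u1 <- w1 / sqrt(1 + f1$a^2)
    u2 <- (1 - w1) / sqrt(1 + f2$a^2)
    w1 <- u1 / (u1 + u2)
  }
  a <- w1 * f1$a + (1 - w1) * f2$a  # Equation 9: averaged slope
  b <- w1 * f1$b + (1 - w1) * f2$b  # Equation 10: averaged intercept
  a * x + b
}

lin <- list(a = 6/7, b = 1.5)  # an estimated linear function
id  <- list(a = 1, b = 0)      # the traditional identity function
synthetic(x = 0:50, lin, id, w1 = .5)
```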

As shown above, the synthetic, like other averaging methods, combines two functions after the parameters of each have been determined. The general linear method can also incorporate information from two separate functions; however, this information can be used to obtain the intercept and slope independently of one another. As a result, the intercept could be based on the means for two forms, while the slope is based on an average of information from the identity and linear functions. In this way, different functions, that is, different sources of information, can be used to obtain $\alpha_X$ and $\alpha_Y$ independently of one another, and $\beta_X$ and $\beta_Y$ independently of one another. This is demonstrated below.

Summary

The goal of linking and equating is to estimate the form difficulty difference from X to Y and describe how it changes across the score scale. To effectively meet this goal, any linking or equating function should take into account the known differences in the difficulty of X and Y that result from features of the test forms themselves. Linear and equipercentile equating naturally account for differences in form features, such as scale length; however, identity and mean equating, as traditionally defined, do not. This article demonstrates a more general form of the linear function for testing applications where form differences produce predictable changes in the scales for each form.

In the context of circle-arc equating, Livingston and Kim (2009) recommend incorporating information about the prespecified end-points of the X and Y scales. This idea is extended here to the traditional identity and mean functions. In identity and mean equating, difficulty is assumed to either not differ or only differ by a constant, regardless of known differences in X and Y. The general linear functions take these known differences into account. As a result, in some situations they may be better suited to describe the equivalence between X and Y. One such situation involves performance assessment, as demonstrated in the first study below (for examples of equating with performance assessment and other forms of assessment where scale length can vary, see Albano & Rodriguez, 2012; Betts, Pickart, & Heistad, 2009; Francis et al., 2008; Montague, Penfield, Enders, & Huang, 2010).

As with the synthetic and other composite methods, the main motivation for a general linear method is the reduction of SE in situations where the difficulty difference between forms is best estimated using a combination of multiple sources of information, including the score distributions on X and Y. The general linear method allows for control over which sources of information have more impact on the final equating function. In this way, it is similar to the synthetic method. However, the general linear method extends this control individually to the slope and intercept that define the equating line, and it does so while maintaining symmetry between the functions linking X to Y and Y to X. This control can be useful in situations where sample sizes differ for X and Y, as demonstrated in the second study below. (Livingston and Kim [2009] also demonstrate equating with samples of different sizes.)

Comparing Equating Methods

Study 1

Data for Study 1 come from two forms of a reading fluency test, READ1 and READ2, where the total score was calculated as the number of words read correctly from a passage of text in 60 seconds. The reading passages were written according to the same test specifications and were intended to be equivalent in structure and difficulty; however, they differed slightly in length and in expected minimum and maximum scores. The forms were administered to 2,089 elementary students with counterbalancing in a single sitting, making this a single-group equating design. Each word read correctly was worth 1 score point. Means were 79 for READ1 and 70 for READ2, and SDs were 46 and 35, respectively. Due to the construction of the passages, the effective score ranges were [5, 250] for READ1 and [15, 200] for READ2. Note that such large score scales are expected to be more problematic than smaller ones when equating with small samples; longer score scales can result in relatively fewer individuals per score point.

Equating was performed using the equate package (Albano, 2014), within the statistical environment R (R Core Team, 2013). The equate package is open-source software for observed-score equating, supporting a variety of data collection designs and equating methods. Loglinear models were first used to smooth the distributions of READ1 and READ2, preserving the first two univariate moments of each and the first bivariate moment. Figure 1 contains the smoothed bivariate frequency distribution. The scatterplot reveals that smoothed scores on form READ1 extend to about 250, whereas smoothed scores on READ2 taper off around 200.
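Loglinear presmoothing of this kind can be expressed as a Poisson regression on the score-point frequencies, where the polynomial terms determine which moments of the observed distribution are preserved. The sketch below, with made-up counts and a deliberately small score range, shows the general idea; the study itself used the presmoothing tools in the equate package:

```r
# Loglinear smoothing of a bivariate score distribution: linear and
# quadratic terms for each form preserve the first two univariate
# moments, and the x:y term preserves one bivariate moment.
grid <- expand.grid(x = 0:50, y = 0:50)      # all score combinations
set.seed(1)
grid$count <- rpois(nrow(grid), lambda = 2)  # placeholder frequencies

fit <- glm(count ~ x + I(x^2) + y + I(y^2) + x:y,
           family = poisson, data = grid)
grid$smoothed <- fitted(fit)                 # smoothed frequencies
```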

[Figure 1. Study 1 smoothed bivariate frequency distribution for two forms of an oral reading fluency test, with the equipercentile criterion equating line superimposed.]

Linking and equating methods were compared using parametric bootstrapping, where the smoothed score distributions for READ1 and READ2 served as a pseudopopulation from which samples were drawn with replacement. The equipercentile equating function based on the smoothed distributions was used as the criterion or true function. Samples of four different sizes, $n =$ 30, 50, 100, and 300, were drawn from the pseudopopulation, and linking and equating functions were estimated using each of the methods listed below, with sample sizes always matching for READ1 and READ2. The reference form was READ2.

The resampling procedure was repeated 1,000 times, and the results were then summarized for a given method based on the SD across replications (SE), the difference between the mean estimated function across replications and the criterion equating function (bias), and the square root of the sum of squared SE and squared bias (root mean squared error, RMSE). Weighted mean SE, bias, and RMSE were also obtained using the smoothed pseudopopulation frequencies on X as weights.
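In outline, the resampling and error summary might look like the following base R sketch. Here `estimate_fun` is a hypothetical stand-in for any one of the methods compared, `criterion` holds the criterion equated score at each X score point, and, for brevity, only one form is resampled, whereas the studies drew samples for both forms:

```r
# Parametric bootstrap: resample from the smoothed pseudopopulation,
# re-estimate the equating at each replication, then summarize error.
bootstrap_error <- function(scores, probs, criterion, estimate_fun,
                            n, reps = 1000) {
  est <- replicate(reps, {
    xs <- sample(scores, n, replace = TRUE, prob = probs)
    estimate_fun(xs)  # equated value at each X score point
  })
  se   <- apply(est, 1, sd)          # SD across replications
  bias <- rowMeans(est) - criterion  # mean estimate minus criterion
  rmse <- sqrt(se^2 + bias^2)
  # Weighted means use the pseudopopulation frequencies on X as weights.
  c(se   = weighted.mean(se, probs),
    bias = weighted.mean(bias, probs),
    rmse = weighted.mean(rmse, probs))
}
```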

Fourteen different equating functions were compared in Study 1: the traditional and general forms of the identity, mean, and circle-arc functions; synthetic averages of the linear and traditional/general identity functions, with identity weights of .75 for function S1, .50 for S2, and .25 for S3; and linear and equipercentile functions. This resulted in two forms each of the identity, mean, and circle-arc functions, two forms of each of the three synthetic averages, and one form each of the linear and equipercentile functions. Synthetic weights were adjusted using Equation 11 to maintain symmetry.

The general linear equating functions differed from their respective traditional forms in the estimation of the linear slope as $\Delta_Y/\Delta_X$ rather than 1. $\Delta_Y$ was fixed to 200 − 15 = 185 and $\Delta_X$ was fixed to 250 − 5 = 245, resulting in $\Delta_Y/\Delta_X = .76$. Any general identity functions were also given a midpoint based on the medians of the effective scales of X and Y, so that the equating line passed through the coordinate (127.5, 107.5). Note that the linear slope based on the pseudopopulation was also .76. This suggests that the general forms of the identity, mean, and circle-arc functions, along with the three synthetic functions using $\Delta_Y/\Delta_X$ to define the identity slope, should better estimate the linear change in difficulty across the scale, on average, than the corresponding functions where the slope is fixed at 1.

Table 1 contains the weighted mean RMSE for each equating method by sample size. Values were all multiplied by 100 to improve readability. Figure 2 shows a subset of these results for functions with values below 1. The general identity function resulted in a weighted mean RMSE of .51, which is indicated in Figure 2 by the horizontal solid black line. Abbreviations in Table 1 and Figure 2 are the first letter of the corresponding function, where I, M, and C denote identity, mean, and circle-arc, respectively; G is added for the general form of I, M, C, S1, S2, and S3; and L and E denote the linear and equipercentile functions. Note that Figure 2 only shows the general forms of each function, along with the linear and equipercentile functions, as the remaining functions based on the traditional forms all had errors above 1.

[Figure 2. Study 1 weighted mean RMSE, multiplied by 100, for the linear and equipercentile functions, and all other functions based on the general linear form. The horizontal line is the constant error of .51 for the identity function. Plotting symbols are the first letter of the corresponding method, with G denoting the general form. GS1, GS2, and GS3 represent the general synthetic functions with identity weights of .75, .50, and .25, respectively.]
Table 1. Study 1 Weighted Mean RMSE (×100) by Sample Size
N I M C S1 S2 S3 GI GM GC GS1 GS2 GS3 L E
30 4.49 4.12 3.60 3.40 2.38 1.52 .51 .90 .82 .49 .66 .91 1.21 1.45
50 4.49 3.98 3.48 3.39 2.33 1.39 .51 .68 .65 .45 .53 .70 .93 1.11
100 4.49 3.88 3.40 3.38 2.29 1.28 .51 .49 .50 .42 .42 .51 .65 .75
300 4.49 3.81 3.34 3.37 2.25 1.17 .51 .31 .35 .40 .32 .32 .39 .44

Note

  • I, M, and C are identity, mean, and circle-arc, respectively; S1, S2, and S3 are the synthetic functions with identity weights of .75, .50, and .25; G indicates the general linear form of the corresponding function; L and E are linear and equipercentile.

Error for all of the functions decreased as sample size increased. The general synthetic functions GS1 and GS2 tended to outperform all others, with GS1 having the lowest error at sample sizes 30 and 50, and GS1 and GS2 having the lowest values at sample size 100. At sample size 300, GM had the lowest error, GS2 and GS3 were slightly larger than GM, and all the methods shown in Figure 2 were below .50. The results of Study 1 show that all of the functions begin to produce similar amounts of error as sample size increases. However, the general linear forms of each function produced noticeably lower error than their respective traditional forms, and were also lower than linear and equipercentile equating at sample sizes of 30, 50, and 100.

Study 2

Data for Study 2 are based on two forms of a large-scale certification exam, FORM1 and FORM2, each containing 200 multiple-choice items and each administered to over 5,000 examinees. Items were scored correct/incorrect, resulting in a maximum possible score of 200. Total and anchor test means for FORM1 were 158.5 and 28.7, with SDs of 19.5 and 4.3. Total and anchor test means for FORM2 were 160.7 and 28.7, with SDs of 19.3 and 4.3. These values indicate that the examinees taking each form were similar in ability, and that the forms were similar in difficulty.

To artificially create more of a need for equating, the total score mean for FORM2 was adjusted downward to 148.0 and the SD was increased to 25.0. The FORM2 anchor score mean was modified only slightly to 28.0, and the SD was increased to 4.8. Correlations between total and anchor tests were .86 in FORM1, FORM2, and the modified version of FORM2, referred to as FORM2R.

The bivariate distributions in FORM1 and FORM2R were smoothed to preserve the first three univariate moments and one bivariate moment. As in Study 1, these smoothed distributions served as pseudopopulations from which samples were drawn with replacement. The sample size for FORM1 varied as $n =$ 20, 50, 100, and 500, whereas the sample size for FORM2R was always 500. Equating functions were estimated at each of 1,000 replications, with FORM2R as the reference form, and error was summarized as described in Study 1. The chained equipercentile equating function based on the smoothed distributions was used as the criterion function.

Study 2 compared 13 equating functions. These included identity, Tucker mean, Tucker and chained linear, chained circle-arc, and chained and frequency estimation equipercentile. Three general linear functions were also estimated, with the X component of the linear slope, $\beta_X$, obtained as a weighted combination of the Tucker SDs for FORM1 ($\hat{\sigma}_{XT}$) and FORM2R ($\hat{\sigma}_{YT}$) in the target population. In the first general linear function, G1, $\beta_X$ from Equation 3 was estimated as $.25\hat{\sigma}_{XT} + .75\hat{\sigma}_{YT}$; in G2, it was estimated as $.50\hat{\sigma}_{XT} + .50\hat{\sigma}_{YT}$; and in G3, it was estimated as $.75\hat{\sigma}_{XT} + .25\hat{\sigma}_{YT}$. Note that with a weight of 0 on $\hat{\sigma}_{XT}$, the slope would be estimated as $\hat{\sigma}_{YT}/\hat{\sigma}_{YT} = 1$, effectively resulting in mean equating. With a weight of 1, the slope would be $\hat{\sigma}_{YT}/\hat{\sigma}_{XT}$, resulting in linear equating. Thus, the three general linear functions examined here represent compromises between mean and linear equating in the X component of the slope; the G1 slope was closer to the mean slope of 1, the G3 slope was closer to the linear slope, and the G2 slope was midway between the two. In these and the other unchained equating methods (i.e., Tucker mean, Tucker linear, and frequency estimation), the population weights for determining T were set proportional to sample size (for details, see Albano, 2014).
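As a worked sketch of this slope construction, assuming the notation reconstructed above, the X component can be mixed from the two SDs as follows (the helper name is illustrative, and the values are loosely based on the FORM1 and FORM2R total-score SDs reported earlier rather than the actual Tucker SDs in T):

```r
# Slope for the Study 2 general linear functions: beta_Y is the FORM2R
# (Y) SD in T, and beta_X is a weighted mix of the FORM1 and FORM2R SDs.
g_slope <- function(sigma_xt, sigma_yt, w) {
  beta_x <- w * sigma_xt + (1 - w) * sigma_yt
  sigma_yt / beta_x
}

g_slope(19.5, 25.0, w = 0)    # slope 1: equivalent to mean equating
g_slope(19.5, 25.0, w = .50)  # G2: beta_x midway between the two SDs
g_slope(19.5, 25.0, w = 1)    # sigma_YT / sigma_XT: linear equating
```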

Three synthetic average functions were also obtained. These differed from the general linear functions in that they involved averages of the traditional Tucker linear and identity functions. As a result, both the intercepts and slopes were averaged, compared with only the slope being averaged in the general linear functions. The identity function was weighted as in Study 1, using .75 in function S1, .50 in function S2, and .25 in S3. As a result, S1 relied on the identity the most, and S3 relied on it the least. Weights were again adjusted using Equation 11 to maintain symmetry.

Table 2 contains the weighted mean RMSE, again multiplied by 100, for each function across the four sample sizes for Study 2. Trends in the results across sample sizes were less evident in Study 2 than in Study 1, making them difficult to visualize in a figure. Error again tended to decrease as sample size increased. At FORM1 sample size 20, error was lowest for G1, followed closely by G2; at sample size 50, error was lowest for G2, followed by G1 and G3; at sample size 100, error was again lowest for G2; at sample size 500, error was lowest for chained equipercentile, followed by G2, G3, chained linear, and frequency estimation equipercentile. For identity equating, mean RMSE (i.e., bias) multiplied by 100 was constant at 3.53. The synthetic function relying the most on the identity, S1, had the next highest error overall. Results were mixed for the remaining functions, with some higher than others at different sample sizes.

Table 2. Study 2 Weighted Mean RMSE (×100) by FORM1 Sample Size
N I MT LC LT CC EC EF G1 G2 G3 S1 S2 S3
20 3.53 1.56 1.85 2.07 2.52 2.11 2.30 1.45 1.46 1.66 2.67 1.92 1.61
50 3.53 1.25 1.21 1.39 1.87 1.29 1.36 1.08 1.01 1.11 2.65 1.81 1.19
100 3.53 1.10 .93 1.09 1.58 .94 .95 .92 .81 .86 2.65 1.78 1.03
500 3.53 .95 .60 .82 1.37 .54 .60 .74 .57 .58 2.65 1.74 .87

Note

  • I is identity; MT is Tucker mean; LC is chained linear; LT is Tucker linear; CC is chained circle-arc; EC and EF are chained and frequency estimation equipercentile; G1, G2, and G3 are the general linear functions, with weights of .25, .50, and .75 applied to the X component of the linear slope; and S1, S2, and S3 are the general synthetic functions with identity weights of .75, .50, and .25.

Discussion

This article introduces a new approach to estimating linear change in difficulty from one test form to another in the context of small-sample equating. The general linear method offers flexible control over the sources of information used to estimate the linear intercept and slope in equating. This control can be useful in situations where additional information, for example, from the test forms themselves or from other score distributions, can reduce bias and improve accuracy of the estimated equating function.

Study 1 showed how information from the scales of X and Y can be used to improve the estimation of the equating function. In this study, one of the synthetic averages combining the general identity and linear functions outperformed all others, including the general identity function, at sample sizes of 30, 50, and 100. At a sample size of 300, the different general functions and linear and equipercentile functions examined began to produce similar amounts of error. These findings indicate that known information from the scales of X and Y should be accounted for when considering any simplified form of the linear function, that is, identity and mean functions and combinations of them. The results of Study 1 are especially relevant to applications of small-sample equating with performance assessments and other types of assessment with variable test length, where the number of items or scale points may have a predictable impact on variability (e.g., Albano & Rodriguez, 2012; Betts et al., 2009; Francis et al., 2008; Montague et al., 2010).

Results from Study 2 showed that combining information from multiple sources to estimate individual parameters in the linear function can improve equating accuracy when sample sizes are small. In this study, the general linear functions relied on the SD of the reference form, FORM2R, to different degrees, to estimate the linear slope. Even at the largest FORM1 sample size of 500, combining information from FORM1 and FORM2R to estimate the FORM1 component of the linear slope resulted in error levels similar to those of linear and equipercentile equating and better than those of other methods. These findings indicate that, with unequal sample sizes, information from a more reliable source, that is, the form with a larger sample size, can be used to improve the accuracy of linear equating functions. The results of Study 2 are especially relevant to testing programs that have a small sample on one test form, e.g., a new test form with limited use, but not the other (e.g., Livingston & Kim, 2009).

Studies 1 and 2 are only two simple examples of how the general linear method can improve equating accuracy. The test forms used in these studies were chosen to demonstrate the strengths of the method; thus, the results found here may only generalize to testing programs involving similar conditions. In Study 1, forms with known differences in their scales were purposefully chosen to show how these scaling differences could be incorporated into the linear equating function. As is evident in Figure 1, the difficulty difference between forms READ1 and READ2 in the pseudopopulation was very nearly linear; as a result, the linear functions tended to do well overall. In situations where the true difficulty difference between forms is curvilinear, the linear methods, including the general linear method, would likely produce more bias and thus higher RMSE. In Study 2, forms differed in their means by roughly 10 points, or about half of a standard deviation on form FORM1. The difficulty difference in the pseudopopulation also appeared to be linear, based on the descriptive statistics for these forms. As a result, identity equating performed poorly, whereas the linear methods performed well, especially at small sample sizes. Again, in situations where the true form difficulty difference is not linear, results for the linear and general linear functions would likely not be as strong. This issue of the match between true and estimated equating functions, or the extent to which different functions violate their assumptions about the true equating function, is an important area for future research.

In addition to addressing assumption violations, future studies employing the general linear function could also explore its application in other testing situations, for example, with shorter score scales, as those examined here included 200 or more score points. Applications that incorporate other sources of information could also be explored. For example, recent research has demonstrated the use of collateral information, including information from external variables and prior equating functions, to improve estimation of the equating function (e.g., Bränberg & Wiberg, 2011; Kim, Livingston, & Lewis, 2011; Livingston & Kim, 2011). Such collateral information may also be used to improve estimation of specific components of the general linear function.

When basing estimates on multiple sources of information, choosing an appropriate weighting system may be a challenge. Kim et al. (2008) referred to this issue in the context of synthetic equating. Although a clear solution does not exist, the choice of weights is much like the choice of an equating function in general: it involves a balance between the error expected to result from estimation based on samples (random sampling error) and violation of the assumptions of a method (bias). More weight on the identity function, or any predetermined source of information, suggests less confidence in the estimation of a difficulty difference and, conversely, more confidence in the assumption that a difficulty difference does not exist. On the other hand, less weight on known sources of information indicates more trust in estimates and less need for assumptions. In practice, a decision must be made based on prior experience within a testing program and available research on similar programs. Sample sizes that differ for X and Y can also impact the choice of weights; setting weights proportional to sample size may be appropriate, and incorporating SEs for the parameters being estimated, in a sort of Bayesian framework (e.g., Livingston & Kim, 2011), is also a possibility. The extent to which a composite or average equating function relies on different sources of information is another important area of research.

In summary, identity and mean equating are presented here as simplified forms of linear equating, where additional assumptions are used in place of estimated parameters with the goal of reducing SE while not increasing bias. The general linear function improves upon the traditional forms of identity and mean equating by allowing for finer control over the assumptions that are made and the sources of information that are used to estimate the linear slope and intercept. As demonstrated here, the general linear method shows promise in small-sample equating applications.

Biography

  • ANTHONY D. ALBANO is Assistant Professor, Educational Psychology, University of Nebraska–Lincoln, 114 Teachers College Hall, Lincoln, NE 68588; [email protected]. His primary research interests include multilevel modeling of item-response data, linking and equating, assessment literacy, and technology in testing and test development.
