A Framework for Policies and Practices to Improve Test Security Programs: Prevention, Detection, Investigation, and Resolution (PDIR)
Abstract
Test security is not an end in itself; it is important because we want to be able to make valid interpretations from test scores. In this article, I propose a framework for comprehensive test security systems: prevention, detection, investigation, and resolution. The article discusses threats to test security, roles and responsibilities, rigorous training for everyone involved in testing, and an evaluation of current practices in test security. I call on everyone responsible for testing programs—the Association of Test Publishers, Council of Chief State School Officers, National Council on Measurement in Education, U.S. Department of Education, and state assessment program managers, their vendors, and the research community—to collaborate on guidelines and practices for security violation prevention, detection, investigation, and resolution.
Test security is not an end in itself: test security is important because we want to be able to make valid interpretations from test scores and take appropriate actions based on those interpretations. We can make valid interpretations and uses of test scores only if test data have integrity. So, test security is an important means to an important end.
In this article, I propose a comprehensive framework for thinking about, planning, implementing, and operating test security systems. The framework is intended for testing program managers and their vendors, though anyone involved in the testing process should be concerned about test security and may find the framework helpful. The framework has four parts: prevention, detection, investigation, and resolution (PDIR).1 The comprehensiveness of test security systems is especially important because it appears that, in practice, prevention and detection efforts often do not address all security risks, investigations are not always conducted rigorously (e.g., Judd, 2012) or at all, and test security violations may not be resolved fully (e.g., Office of the Inspector General, 2014; Toppo, 2013). Designing comprehensive and effective security systems requires systems thinking2 (e.g., Nichols, 2016). So, in addition to addressing the four parts of test security thinking, the framework also addresses all parties in the testing process before, during, and after test administrations. Anyone involved in the testing process can pose a threat to test security. Everyone involved in the testing process is responsible for protecting test security and minimizing risks of security violations. Finally, the PDIR framework implies and guides remediation of weaknesses that leave a program vulnerable to security threats and of failures when security breaches are discovered.
The framework also identifies threats to test security, countermeasures, and available detection measures. I also summarize the best thinking from several sources on preventing test security violations and detecting violations that may have occurred. Recommendations on these two parts of the framework have appeared in multiple publications, especially in recent years, and in numerous conferences, including the annual Conference on Test Security (see https://cete.ku.edu/2016-conference-test-security). I do not propose frameworks for investigation and resolution. Little has been published and discussed publicly about investigation and resolution—and practice in these two areas is inadequate (Ferrara, 2014). Such frameworks should be developed by experts in professional investigation techniques; educational policy and law; and teacher certification, employment, and legal requirements. I describe some professional investigation principles and propose that education and testing managers, policy makers, and authorities collaborate to develop best practices for investigation and resolution, to ensure comprehensive frameworks for test security.
As part of preparing to write this article, I conducted a survey of state educational testing programs on their PDIR policies and practices. The three-minute survey of 15 selected-response items was administered online, using Survey Monkey, for 2 weeks in February 2016.3 I sent the survey to 51 program directors; 13 responded, for a response rate of 25%. Because the survey responses were anonymous, I had no means of evaluating the diversity or representativeness of the responders. Survey results appear in the appendix; I refer to selected results in several sections of the article. Given the small sample of respondents, I use the results as anecdotes rather than generalizable results. Those survey responses are supplemented by results from the 2015 test security survey of the Association of Test Publishers (Association of Test Publishers [ATP], 2015). Test publishers and their vendors (e.g., testing center and online platform providers) responded to a lengthy online survey, for a 13.5% response rate. The majority of respondents were from certification, licensure, and vendor organizations (ATP, 2015, pp. 11–12).
Background
A test security system includes rules, guidelines, requirements, and procedures—including training—to protect security, avoid breaches, and enforce security. Rigorous test security systems by themselves are insufficient. Once a test is printed and distributed or released to examinees in technology-based environments, secure test material no longer is controlled by testing program managers and their vendors. At that point, responsibility for test security has shifted to testing site managers and test administrators—and, for that matter, examinees. So, producing data with integrity relies on an effective test security system and a local culture of professional ethics and general ethical behavior.4 While it is not necessary that local testing site managers and test administrators buy into the culture, it is necessary that they abide by it. The same applies to examinees as well, of course. A culture of professional ethics encourages professionally ethical thinking and behavior, reporting of irregularities, and cooperation with investigations when irregularities may have occurred (Ferrara, 2012).
Threats to Test Security
To design a comprehensive test security system, it can be helpful to consider threats to test security: (a) cheating by examinees, test administrators, and other educators; (b) inappropriate (e.g., disorderly) test administrations; (c) exposure/disclosure of secure test material; (d) inappropriate test preparation (e.g., practicing responding to secure items with examinees); and (e) additional threats to security, including threats in technology-based testing, electronic threats to secure content and examinee test responses, and threats in performance assessment and for other constructed-response items. Table 1 lists selected examples of cheating and other threats to test security. It includes threats from various participants before, during, and after test administrations. Croft (2014) outlines test security threats that are unique to computer-based testing. Testing program vendors who responded to the ATP survey ranked exposure and dissemination of test content through social media as the most important concern for their clients (ATP, 2015, p. 95) and ranked computer intrusions, test administrator coaching of examinees, and manipulation of examinee responses as the least important concerns. In addition, testing program managers, school administrators, and parents who choose to opt out of testing (e.g., Schweig, 2016) pose threats to data integrity by excluding students from test administrations, particularly when the excluded students are not a random group.
Table 1. Selected Examples of Cheating and Other Threats to Test Security

Role | Before Test Administration | During Test Administration | After Test Administration |
---|---|---|---|
Examinees | Acquiring test items | Copying or supplying answers; using cheat sheets | Divulging secure test material; failing to report security violations by other examinees or test administrators |
Test Administrators | Divulging or teaching secure test material | Providing answers; indicating answers that should be changed | Changing answers on answer documents; tampering with test files |
Other Staff in Local Testing Sitesa | Failing to train local test administration staff, or failing to monitor test content security | Failing to monitor test administrations and protect test content security | Failing to acknowledge, report, investigate, or resolve violations |
Testing Program Managers and Operations Vendorsb | Failing to publicize expectations for professional behavior or to provide effective training on test administration procedures | Failing to observe test administrations and local protection of secure test content | |
- Note. Adapted from Fremer & Ferrara (2013, table 2.1); used with permission.
- aOther staff include professional testing site managers and staff, school testing directors, and other administrative and support staff. In schools, instructional aides and other staff may assist with managing test administrations and secure materials.
- bTesting program managers and vendors may not be present in local testing sites.
Additional Threats to Test Security
Some additional threats do not fit neatly into the before-during-after test administration framework. (a) Limited item banks can lead to item overexposure when, for example, some items appear in multiple test administrations because too few items are available in all or parts of the item bank. Item overexposure enables test administrators to recall specific items and to divulge those items intentionally or inadvertently. Limited item banks and item overexposure are a particular problem in computer-adaptive testing (Cohen & Wollack, 2006, pp. 362–370), especially when limited numbers of items are available at specific ranges on the theta scale. (b) Digital device–based testing in schools often requires long test administration calendar periods. Long administration windows threaten item exposure, because teachers and early examinees can divulge items to subsequent examinees. Device-based testing poses a range of other threats, including opportunity to hack into testing servers, fraudulently log into examinee sessions, log examinee keystrokes, and move item information outside of the test delivery system (Croft, 2014; see also Foster, 2013). (c) Tiny cameras and cell phones can be used to record and transmit test items (see Cohen & Wollack, 2006, pp. 362–370). (d) Memorability is an intractable problem in performance assessments (Ferrara, 2014). Constructed-response items, essay prompts, performance tasks, and simulations are memorable because of their uniqueness (Ferrara, 1997), which exposes them as public knowledge and precludes reuse. (e) Nonsecure tests used for high-stakes purposes can create security risks. For example, some periodic tests intended to provide formative feedback to students may be used for teacher evaluation. (f) Sharing of tests and items in multistate consortia and sharing of item banks, long-standing ideas to reduce testing costs that are now in wide practice, expose secure test items widely and increase the possibility of security violations. For example, if cheating were expected to occur in only .01% of testing groups (cf. Fremer & Ferrara, 2013, pp. 17–18), we might expect as many as 1,300 cases of cheating in a consortium with approximately 5 million examinees in Grades 3–8. (g) Threats to test security may be particularly worrisome in licensure, certification, and other testing situations (e.g., student test performance used for teacher evaluation) where test results are important to individuals, and individuals may be personally motivated to get an edge. (h) All of these threats are exacerbated by the ease of sharing information on websites and through social media.
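To illustrate the scale issue in point (f), the following is a minimal sketch that computes an expected number of affected testing groups from an assumed per-group base rate. The examinee count, group size, and base rate below are placeholders for illustration only, not the figures from Fremer and Ferrara (2013).

```python
# Illustrative only: expected number of security incidents at consortium scale.
# All parameter values below are hypothetical placeholders.

def expected_incidents(n_examinees: int, group_size: int, base_rate: float) -> float:
    """Expected number of testing groups with a security violation,
    given an assumed per-group base rate of violations."""
    n_groups = n_examinees / group_size
    return n_groups * base_rate

if __name__ == "__main__":
    # Hypothetical consortium: 5 million examinees, ~25 examinees per testing group,
    # and an assumed base rate of 0.1% of groups affected.
    print(expected_incidents(n_examinees=5_000_000, group_size=25, base_rate=0.001))
    # -> 200.0 expected groups under these assumptions
```

Even under conservative assumptions like these, the expected count is large enough that consortium-scale programs cannot treat violations as isolated curiosities.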
Prevention
Preventing security violations is the first and perhaps most effective countermeasure to test security threats. Minimizing threats can be supported by clear guidelines and requirements for reducing those threats and avoiding security breaches, communicating about the guidelines and requirements, training all professional staff and examinees on the guidelines and requirements, and reinforcing and enforcing them. A strong and public declaration that test security violations will not be tolerated, reinforced by effective detection, investigation, and resolution actions, is a good starting place. Even with the most rigorous prevention measures and efforts, test security violations always are a threat. So, a realistic goal is a combination of maximizing prevention, relying on rigorous detection and investigation procedures, and intervening forcefully when possible security violations arise. To my knowledge, no one has conducted test security practice efficacy studies. So, I base these recommendations on experience and communication with colleagues.
Security guidelines (as opposed to strict rules) are necessary because, inevitably, situations arise that require professional judgment where rules may not apply directly or explicitly (e.g., applying a no-camera-phones rule to camera pens; administering a test just outside of the testing window because of a significant testing site disruption at the end of the window). In other cases, decisions may be clear, simply by abiding by test security rules (e.g., by locking up secure test materials and guarding the chain of custody). Guidelines and requirements should address managing secure test materials and content at all times, conducting orderly test administrations, intervening when necessary, training test administrators and testing site staff, and ensuring that all security guidelines and requirements are implemented rigorously. Regular communication about the responsibilities of all roles in the testing process (see Table 1 above) to protect test security and behaviors that are prohibited is crucial for effective test security systems. Training the people who serve in those roles about their responsibilities and test security guidelines and requirements builds on that basis. As above, I am not aware of systematic research on the efficacy of test security training. I have observed in a range of testing programs that training can be cursory and superficial, and that test administrators and their supervisors vary widely in how closely they appear to pay attention during training and in how rigorously they implement the guidelines and requirements in local testing sites. My observations indicate that, often, training involves a cursory review of test security guidelines in test administration preparation sessions. Often, little attention is devoted to threats to test security, how the people in roles in the testing process can avoid and counter those threats, and how their actions are part of a larger system of test security and data integrity. The judgment, rules, guidelines, and requirements necessary to protect test security suggest the need for test security manuals that supplement test administration manuals and for rigorous training.
The main responsibilities of testing program managers and their vendors are to develop guidelines and requirements to protect test security, to communicate those guidelines and requirements, to train test administrators and other local testing site staff, to monitor implementation, and to build and implement countermeasures for threats that seem likely for a testing program. Table 2 summarizes the main test security responsibilities of people serving in key roles in the testing process. Wollack and Case (2017) address protecting test security in the context of maintaining fair, or comparable, testing conditions for all examinees. A study of a single licensure examination indicates no deleterious effects on item and test form characteristics of publicly releasing a large number of items (Buckendahl & Gerrow, 2016), a strategy often considered a way of undermining the motivation to reconstruct or otherwise compromise test item content.
Table 2. Main Test Security Responsibilities of People Serving in Key Roles in the Testing Process

Role | Before Test Administration | During Test Administration | After Test Administration |
---|---|---|---|
Examinees should | Avoid and repudiate attempts to acquire secure test content | Not copy or supply test items or answers; report suspicious or blatant activities of others | |
Test administrators should | | | |
Other staff in local testing sites should | Publicize expectations for professional behavior and provide training; monitor test content security | | Acknowledge or report possible violations |
Testing program managers and operations vendors should | | | |
Countermeasures
Identifying the highest priority threats to test security and identifying parties in the testing process who may pose those threats is a good first step in developing countermeasures. The goal in creating countermeasures is to reduce the risk of threats to test security. In fact, using the term risk reduction, rather than “prevention,” acknowledges that no one is able to eliminate—to prevent—security violations. Everyone with a role in the testing process can contribute to minimizing the risk that violations will occur.
Cohen and Wollack (2006, pp. 362–370) propose three categories of countermeasures: human observation and prevention, electronic, and psychometric countermeasures. Electronic countermeasures are increasingly important because of the rapid and ongoing expansion of online and device-based testing and the shocking array of digital devices that can be used to breach test security. These threats range from smart phones to old-fashioned hacking of test content and examinee response files. The PDIR framework treats psychometric countermeasures or data forensics as part of the investigation process that should follow when test security violations are suspected. They also can serve as prevention measures (e.g., deterring cheating because of the risk of detection) and play a role during investigation (e.g., conducting response similarity analyses after a report of sharing test content or supplying answers) and resolution (e.g., determining the severity of the violation and subsequent sanctions).
The Handbook of Test Security (Wollack & Fremer, 2013) is another source of advice on preventing test security violations. Foster (2013, chapter 3) recommends prevention solutions for technology-based testing, specifically against theft of test item files, item theft during test administrations, inappropriate retakes, and cheating, as well as general deterrence solutions (pp. 71–78). Other authors recommend prevention measures for classroom testing (Woodruff, 2013, pp. 90–98) and large-scale writing assessment (Lane, 2013, pp. 117–118), legal precautions and measures (Fitzgerald & Mulkey, 2013, pp. 134–144), and protections for the physical security of local testing sites and testing companies (Scicchitano & Meade, 2013, chapter 7). Test security case studies describe lessons learned from test security violations in testing programs for certification and licensure (Carson, 2013, pp. 264–282), clinical testing (Williams, Rzepa, Sitarenios, & Wheldon, 2013, pp. 287–288), educational testing (Kingston, 2013, pp. 308–309), and employment testing (Bartram & Burke, 2013, pp. 328–329); Hatherill (2013, chapter 16) provides a commentary on the case studies.
Table 3 summarizes countermeasures for specific threats to test security posed by parties in the testing process.
Table 3. Countermeasures for Specific Threats to Test Security Posed by Parties in the Testing Process

Role | Before Test Administration | During Test Administration | After Test Administration |
---|---|---|---|
Examinees | Acquiring test items: Management of chain of custody of secure printed materials and protection of access to digital files | Copying or supplying answers, using cheat sheets: Management of disallowed materials brought into testing sessions, observation of examinee behavior, intervention on suspicious behavior; spacing, orienting, and shielding computer screens so examinees cannot see others’ screens | Divulging secure test material: Postadministration surveys; observations and surveys about test preparation activities prior to the next administration |
Test Administrators | Divulging/teaching secure material: Surveys about test preparation activities prior to test administration | Providing answers, indicating answers that should be changed: Test administration observations; response similarity analyses | Changing answers on answer documents, tampering with test files: Erasure/wrong-to-right answer change analyses |
Other Staff in Local Testing Sites | Failing to publicize expectations for professional behavior or provide training; failing to monitor management of chain of custody: Communication and training | Failing to monitor test administrations and test content security: Targeted and random monitoring of test administrations | Failing to acknowledge, report, investigate, or resolve violations: Policies and practices to support enforcement and follow-through |
Testing Program Managers and Operations Vendors | Failing to provide effective test administration and security training: Comprehensive communication activities, training plans, and policies to ensure all test administrators and local testing site staff participate | Failing to observe test administrations and local protection of secure test content: Observations of randomly selected testing sites, sites where reports suggest worrisome behavior or lax procedures, and sites with previous violations | |
Live observations in local testing sites of all phases in the testing process play a significant role in discouraging test security violations. Testing program managers can schedule observations that focus on protection of secure test materials in local testing sites before, during, and after test administrations; training of local testing site staff; and test administration sessions themselves. As Cohen and Wollack (2006) observe, “a surprising number of security breaches stem from disgruntled or dishonest employees” (p. 363) involved in any part of the testing process. Knowing that a testing site may be visited during any phase of the testing process encourages local staff to follow guidelines and requirements to protect secure material, train test administration staff, and conduct secure and orderly test administrations. That knowledge also may reduce temptation to violate security or allow lax practice and even discourage those who may want to cheat. (It also may drive cheating efforts deeper underground, where they may be more difficult to detect.)
Testing program managers can conduct observations of training sessions for test administrators and testing site staff and determine whether the training is rigorous, not pro forma. They can conduct observations of test administration sites before, during, and after test administration sessions to determine whether administrations are orderly and test security is protected. Some observations should be unannounced and in random testing sites and sessions. Some observations should be at sites where previous activity suggests worrisome behavior or lax procedures. To reinforce the influence of observations, testing program managers can file reports with testing site managers and their supervisors, with recommendations for improving procedures and protections. Testing programs do not have the resources to observe all testing phases and sites, so publicizing and implementing a system of targeted and random observations may be the most feasible prevention measure.
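As one illustration of how a targeted-plus-random observation schedule might be assembled, the sketch below draws a fixed number of unannounced visits from two pools: sites flagged by prior reports or previous violations, and sites selected at random from the rest. The site identifiers, pool sizes, and visit quotas are hypothetical, not a prescribed procedure.

```python
# Minimal sketch of a targeted-plus-random schedule of unannounced observations.
# Site names and quotas are hypothetical illustrations.
import random

def plan_observations(all_sites, flagged_sites, n_targeted, n_random, seed=2016):
    """Return sites to visit: targeted sites first, then a random sample
    drawn from the remaining sites."""
    rng = random.Random(seed)
    targeted = rng.sample(sorted(flagged_sites), min(n_targeted, len(flagged_sites)))
    remaining = sorted(set(all_sites) - set(targeted))
    randomly_chosen = rng.sample(remaining, min(n_random, len(remaining)))
    return {"targeted": targeted, "random": randomly_chosen}

if __name__ == "__main__":
    sites = [f"site_{i:03d}" for i in range(1, 201)]      # 200 hypothetical sites
    flagged = {"site_007", "site_042", "site_113"}         # prior reports or violations
    print(plan_observations(sites, flagged, n_targeted=3, n_random=10))
```

Publicizing that such a schedule exists, without revealing which sites will be visited, is what carries most of the deterrent value described above.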
Reporting forms and channels may help to deter security violations and could help with follow-up investigations. Many programs provide forms and communication channels to document the chain of custody of secure test materials and protection of secure computer files, testing system login protections, and other protections against hackers; report test administration irregularities (e.g., disruptive behavior, suspected cheating behavior); report disruptions during test administrations (e.g., fire alarms, Internet connection problems); and enable anonymous whistle-blowers to report suspicious or lax behavior.5
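As a purely hypothetical illustration of the kind of record such forms and channels might capture, the sketch below defines a minimal irregularity-report structure that accommodates chain-of-custody documentation, administration disruptions, and anonymous whistle-blower reports. The field names are illustrative, not a prescribed format.

```python
# Hypothetical structure for a test security irregularity report.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

@dataclass
class IrregularityReport:
    site_id: str                      # local testing site identifier
    phase: str                        # "before", "during", or "after" administration
    category: str                     # e.g., "chain of custody", "disruption", "suspected cheating"
    description: str                  # narrative account of what was observed
    reported_at: datetime = field(default_factory=datetime.now)
    reporter: Optional[str] = None    # None permits anonymous whistle-blower reports
    materials_involved: list = field(default_factory=list)  # e.g., secure booklet IDs

report = IrregularityReport(
    site_id="site_042",
    phase="during",
    category="disruption",
    description="Fire alarm interrupted the session; booklets were collected and resealed.",
)
```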
Such observation and reporting measures may be the only practically feasible countermeasures available to testing program managers. Once test material is available or accessible at local testing sites, testing program managers and their vendors have to rely on the professionalism of testing site staff and local ethical cultures. Because there always is a likelihood of mischief, errors, and laxness somewhere, testing program managers also must implement test security violation detection procedures, as in the “trust, but verify” concept in United States–Russia relations since the 1980s (e.g., https://en.wikipedia.org/wiki/Trust,_but_verify; http://trustbutverifybook.com/), rather than blindly trusting everyone to protect test security.
Other, more invasive, countermeasures are possible. Video surveillance in testing centers could deter cheating by examinees and test administrators, deter imposters, and help in follow-up investigations. (See Constitutional Project, 2006, for recommended principles, rules, and procedures for public video surveillance systems.) Noise generators and cell phone nullifiers in testing sites that scramble incoming and outgoing signals could be helpful but are illegal (see https://www.fcc.gov/general/jammer-enforcement). And simply requiring all examinees to check their technology devices outside the door could help. However, putting classroom teachers in the role of screeners seems incompatible with their primary roles and responsibilities as educators and helpers.
Detection
Detection of possible cheating has benefited from a surge of attention in the psychometric and testing communities, due to efforts of the U.S. Department of Education (e.g., the Symposium on Data Integrity; see U.S. Department of Education, 2013), Association of Test Publishers publications,6 Council of Chief State School Officers’ publications (e.g., Council of Chief State School Officers and Association of Test Publishers, 2013; Olson & Fremer, 2013), the annual Conference on Test Security (see https://cete.ku.edu/2016-conference-test-security), and researcher publications like the Handbook of Test Security (Wollack & Fremer, 2013), Test Fraud: Statistical Detection and Methodology (Kingston & Clark, 2014), and the Handbook of Quantitative Methods for Detecting Cheating on Tests (Cizek & Wollack, 2017). Table 4 summarizes statistical and other methods for detecting possible test security breaches. The table addresses the roles and responsibilities in the testing process from Table 2 and test security threats posed in Table 1. Table 4 includes only those observational and statistical detection methods with logical promise or some empirical support for efficacy or effectiveness in detecting possible test security violations, where statistical methods have acceptable reliability and accuracy under specified conditions.
Table 4. Statistical and Other Methods for Detectinga Possible Test Security Breaches

Supported or Promising Detection Methodsb | Research Findings on Efficacy or Effectivenessc |
---|---|
Detection Before Test Administration | |
Test Administrators and Local Testing Staff Failing to Protect Secure Test Materials and Test Content | |
(1) Systematic and random audits of local sites; video surveillance of secure materials; see also methods in Detection After Administration section regarding unusual score gains and similar response patterns | – |
Caveats for use: Video surveillance may conflict with local privacy protections and might be prohibitively expensive | |
Test Administrators or Examinees Acquiring and Divulging Secure Test Content | |
(2) Best detection methods available may be web and social media monitoring and relying on whistleblowers; see also Detection After Test Administration section regarding unusual score gains and similar response patterns | Number of whistleblower reports may be less than the number of actual violations |
Caveats for use: Accuracy and comprehensiveness of whistleblower reports is not widely known | |
(3) Examinees who intentionally take and fail an examination an unusual number of times, and who may have unusually low scores, in order to memorize test content (Carson, 2013, p. 266) | No norms cited for unusual number of retakes and unusually low scores |
Caveats for use: Interviewing such examinees may not yield confessions but may curb the behavior—or may offend an innocent examinee | |
Test Administrators or Others Providing Inappropriate Test Preparation | |
(4) Systematic and random observations of classrooms and whistleblower reports | Number of whistleblower reports may be less than the number of actual violations; Koretz (2015) defines inappropriate test preparation as reallocation of instructional resources and coaching |
Caveats for use: Accuracy and comprehensiveness of whistleblower reports is not widely known; there may be no general agreement on definitions of appropriate and inappropriate test preparation | |
Testing Program Managers and Operations Contractors Failing to Publicize Expectations for Professional Behavior | |
(5) Systematic and random audits of use of training materials, schedules, and sessions | A majority of respondents to the Association of Test Publishers survey use multiple communication channels, but percentages are low for several channels (ATP, 2015, pp. 37, 41) |
Caveats for use: Use multiple communication channels to publicize expectations, including at the beginning and end of administration, on testing site posts, in administrator manuals and examinee information, and via social media | |
Testing Program Managers and Operations Contractors Failing to Provide Effective Training on Test Administration and Security Procedures | |
(6) Audits to confirm complete implementation of effective training plans | – |
Caveats for use: This threat is challenging to address: Managers and contractors typically train local testing site managers and rely on them to deliver effective training to test administrators; simply providing information on test administration and security threats may be less effective than direct training and involving trainees in discussions of concerns, threats, and mitigation | |
Detection During Test Administration | |
Examinees Impersonating Other Examinees | |
(7) Authentication of IDs, eye scans, and other high-tech methods (regarding technology-based testing; Foster, 2013, p. 76) | Availability of efficacy studies not known |
Caveats for use: Efficacy of these methods probably can be assumed; effective implementation in specific testing sites should be confirmed (Foster, 2013, p. 76) | |
Examinees Supplying Answers to Other Examinees and Copying Answers From other Examinees | |
(8) Providing orderly administration conditions and observing examinee behavior during test administration; video surveillance | Effectiveness of training of test administrators on effective observation techniques is not widely known |
Caveats for use: Observing suspicious behavior and recording it as evidence can be sensitive and can disturb orderly administrations; video surveillance may conflict with local privacy protections | |
Test Administrators Supplying Answers to Examinees | |
(9) Random and systematic audits of test administrations; video surveillance | Effectiveness of training of test administrators and of observation techniques is not widely known |
Caveats for use: Observing suspicious behavior by test administrators would likely disrupt orderly administrations; video surveillance may conflict with local privacy protections | |
Test Administrators and Examinees Acquiring and Transmitting Secure Test Content Using Technology | |
(10) Monitoring internet and social media traffic for secure test content and intervention | Efficacy of publicized monitoring and intervention efforts is promising; effectiveness is not widely known |
Caveats for use: Public complaints of intrusion on personal privacy and potential lawsuits is worrisome; demanding student passwords to social media accounts (Herold, 2015) seems heavy-handed | |
(11) (a) Requiring and confirming that examinees have not brought any microrecording devices, cell phones, miniature microphones and cameras, concealable internet receivers, etc., into the testing room (e.g., Chajewski, Kim, Antal, & Sweeney, 2014, p. 101); (b) noise generators and blockers that can scramble signals to and from devices (Cohen & Wollack, 2006, p. 363); (c) monitoring signals from cheaters’ devices | (a) Publicizing and enforcing this requirement seems likely to be effective, though perhaps uncomfortable for educators and examinees; (b) signal blocking probably would be highly effective, though illegal; (c) capturing outgoing signals from devices could capture possible secure content, but processing all signals to identify secure content requires resources and may not be 100% reliable |
Caveats for use: (a) It may not be reasonable to expect educators to frisk or wand examinees or to ask examinees to turn pockets inside out and remove baggy clothing for hand inspection; (b) electronic devices can be expensive, and radio signal jammers that nullify cell phone reception are illegal (see https://www.fcc.gov/general/jammer-enforcement); (c) capturing outgoing signals probably constitutes wiretapping, and most likely would require a court order |
Detection After Test Administration | |
Identification of Unusual Group Score Gains | |
(12) (a) Jacob and Levitt algorithm (Cohen & Wollack, 2006, pp. 369 ff.; Maynes, 2013, p. 192); (b) analysis of regression residuals and score differencing (Maynes, 2013, pp. 186–188; a regression-residual sketch appears after this table); (c) cumulative logistic regression model to build longitudinal databases to detect unusual gains in performance-level percentages (Clark, Skorupski, & Murphy, 2017); (d) regression-based local outlier detection algorithm (RegLOD; Simon, 2014) compares gains for similar schools rather than all schools; (e) histograms and test score plots that indicate unusual or suspicious shifts in performance of a testing group (Maynes, 2013, p. 192); (f) Bayesian hierarchical linear modeling (BHLM) and the posterior probability of cheating (PPoC) to detect group-level cheating (Skorupski, Fitzpatrick, & Egan, 2017; Skorupski & Egan, 2014) | (a) “Some support of this model” (Cohen & Wollack, 2006, p. 370); regression successful in identifying “extreme instances in group-based test fraud” (Maynes, 2013, p. 187); (b) score differencing may “rectify” the tendency to flag smaller schools instead of larger schools (Maynes, 2013, p. 187); (c) initial illustration looks promising; also the MLR and z-score approaches “hold promise” for lower performing cheating groups and for average performing groups in larger schools, in contrast with multilevel logistic regression for higher performing schools and for smaller schools in Gaertner and McBride (2017, p. 273); (d) RegLOD evaluated in only one study; (e) two examples cited in Maynes (2013, p. 192) suggest that this approach may be helpful for identifying possible violations; (f) BHLM method “appears to have great promise” (Skorupski et al., 2017, p. 243) |
Caveats for use: (a) “Much more work is necessary” (Cohen & Wollack, 2006, p. 370) to test model appropriateness, error rates, and evaluative criteria (see also “literature on pass rate analysis is relatively thin,” Gaertner & McBride, 2017, p. 264); need to account for violation of regression assumptions in test data (Maynes, 2013, p. 187); (b) these approaches may be instructive but allow alternative explanations that must be investigated (Maynes, 2013, p. 192); (c) problems can arise from missing test scores and inadequate numbers of predictors that correlate with the proficiency-level percentages (Clark et al., 2017, pp. 259, 260); (d) needs more testing; (e) promising exploratory tool to be used with other indicators; and (f) BHLM/PPoC research is promising but limited | |
Examinees Impersonating Other Examinees | |
(13) Similarity of item response patterns, test scores, and time to complete a test for several examinees in immediately successive test administrations (regarding technology-based tests; Foster, 2013, p. 76) | Efficacy and effectiveness are not widely known |
Caveats for use: Effectiveness of these approaches is not widely known | |
Improbably Similar Response Patterns Due to Test Content Preknowledge, Examinee Answer Sharing or Copying, or Test Administrators Supplying Answers During Test Administration | |
(14) Response similarity indices, comparisons of performance on secure and compromised items | (a) Cohen and Wollack (2006) conclude that Angoff's B and H indices are effective but not suited for short tests and small numbers of examinees (p. 366), S2 needs research, and ω controls Type I error well for a range of situations (p. 368); (b) they report that Segall's sharing response model with selected response Trojan Horse items “appear[s] to have potential” based on simulations (p. 365); see also Eckerly (2017, pp. 109–110); (c) differential person functioning with differential item functioning appears promising and needs more research (O'Leary & Smith, 2017); (d) the divergence algorithm is “a work in progress” (Belov, 2017, p. 175); (e) Sotaridona, Wibowo, and Hendrawan (2014) propose a combination of a parametric and nonparametric method to detect low answer similarity rates (e.g., 10%); (f) Maynes (2014, 2017) summarizes research and evaluates a number of response similarity indices; (g) Zopluoglu (2017, table 2.1) lists answer copying and response similarity indices and concludes that the ω, GBT, K, and VM indices perform well in terms of power and Type I errors (p. 33); (h) see also Allen (2014) for estimates of baseline rates of identical incorrect responses on selected response items; (i) Maynes (2014) demonstrates analysis of score differences to identify performance differences with a set of test responses that may indicate answer copying, disclosed items, inordinate answer changes, and test administration disruptions; (j) Belov (2016) compared the performance of eight different item preknowledge detection statistics; Sinharay (2017) proposed two new detection statistics with desirable Type I error rates and statistical power |
Caveats for use: Large examinee samples are needed; cannot distinguish the source of responses and the copier; statistics alone cannot rule out other explanations for response similarities; as always, controlling for Type I errors reduces statistical power; sensitive to the number of similar responses; power decreases as student scores increase (Maynes, 2017, pp. 54–55); most research on answer copying and response similarity focuses on individual test takers (Maynes, 2013, p. 190); Maynes (2014) claims that the score difference approach does not require assuming normality, large item sets, or large numbers of examinees | |
(15) Person fit indices | Zopluoglu (2017; table 2.1) lists person fit indices and concludes that, although person fit indices generally have limited use in detecting answer copying, HT and D(θ) appear to perform well enough in some contexts (p. 33); a review by Karabatsos supported HT and U3 (cited in Kim, Woo, & Dickison, 2017, p. 77), who also support the lz and lz* indices (p. 73) |
Caveats for use: Type I error rates and power always must be considered; most evaluations are based on data simulations (Zopluoglu, 2017, p. 33), rather than real test data; a person fit flag alone is not evidence of item preknowledge (Eckerly, 2017, p. 104) |
(16) Clustering algorithms to detect group collusion (Maynes, 2013, p. 190), deterministic gated IRT model (see Eckerly, 2017), hierarchical growth models, and factor analysis (Maynes, 2013, p. 196) | Analysis of group-based collusion has “potential to be…powerful” but “much work is needed to understand the nature of group-based collusion, how to measure it, and how to investigate it” (Maynes, 2013, p. 191) |
Caveats for use: Cluster analysis looks “quite promising” though more research is indicated (Wollack & Maynes, 2017, pp. 147, 148) | |
Improbably Similar Constructed Responses Due to Test Content Preknowledge, Examinee Answer Sharing, or Test Administrator Intervention | |
(17) “Saving Private Ryan phenomenon”:d Human scorers may notice repetition of highly similar responses (e.g., in the Maryland School Performance Assessment Program; Ferrara, 1997); plagiarism detection programs could be useful; Maynes's (2014) approach of analyzing score differences to detect disclosure of selected response items may be useful for constructed response items as well | No research is known to exist on applying response similarity detection in scoring projects or plagiarism software |
Caveats for use: The “Saving Private Ryan phenomenon” relies on luck—implementing it as a detection method is cost-prohibitive; research on practical feasibility and effectiveness of using plagiarism software for detection is needed | |
Detection Methods Specifically for Computer-Based Testing | |
(18) (a) Aberrant response times (e.g., rapid responding and high test scores) on computer-based tests that may indicate item preknowledge; Lewis, Lee, and von Davier (2014) demonstrate using regression analysis to identify individual test takers and the CUSUM index (Cohen & Wollack, 2006) to identify items with rapid responding in multistage adaptive tests; (b) other papers that focus on rapid or slow responding in general, not specifically applied to test security, may be relevant, including Meyer (2010), van der Linden (2009), and Wise and Kong (2005) | (a) CUSUM person fit index for computer adaptive tests demonstrated modest power (.65) only for large amounts of item preknowledge (Cohen & Wollack, 2006, p. 366); (b) Market Basket Analysis (unsupervised machine learning) “appears promising” (Kim et al., 2017, p. 95); (c) Bayesian models for response time, response similarity correlations (Maynes, 2013, p. 191); (d) non-IRT, IRT, and hierarchical IRT approaches (Boughton, Smith, & Ren, 2017); (e) item overexposure control is most effective using explicit modeling techniques (Cohen & Wollack, 2006, p. 364); (f) Wise, Ma, and Theaker (2014) found evidence of lower effort in fall administrations than in spring in a pretest–posttest student growth design for teacher evaluation using response time effort measures for computer-based testing and response accuracy measures in computer adaptive testing |
Caveats for use: (a) CUSUM requires large numbers of items, as in a follow-up study by Meijer (2002); a study of 36 fit statistics for detecting cheating (Karabatsos, 2003) found “they performed rather poorly…[because of] the lack of published distribution theory…[and because] they detect other types of aberrance” (Maynes, 2013, p. 189); (b) Market Basket approach not adequately studied; (c) research may be limited to the two studies cited in Maynes (2013); (d) hierarchical IRT models show only modest performance (Boughton et al., 2017, p. 189); (e) eventually, items will become overexposed, so large item banks and systematic item retirement are required; (f) research is needed to manage the effects of low student effort on test scores (Wise et al., 2014, p. 185) |
Improbable Numbers of Wrong-to-Right (WTR) Answer Changes due to Answer Sheet Tampering | |
(19) (a) State versus local testing group mean difference t-test and variations (Maynes, 2013, pp. 184–185); (b) generalized binomial test (GBT), ω, and ω-based erasure detection index (EDI) extension to group detection (Wollack & Eckerly, 2017); (c) Mroch, Lu, Huang, and Harris (2014) examined indices for individual examinee wrong-to-right erasures and answer changes; (d) Primoli (2014) used t-tests of school average wrong-to-right answer change rates to identify aberrant rates, conditional on recent average yearly progress (AYP) status | (a) “Quite credible” (Maynes, 2013, p. 185); Bishop and Egan (2017) review a number of promising approaches (e.g., hierarchical linear modeling), most of which are supported by still inconclusive research; (b) GBT and ω are “two of the most powerful indexes available for detecting” response similarity (Wollack & Eckerly, 2017, p. 215) and the EDI for group detection shows good Type I error control and power for moderate or large breaches (p. 230), while Sinharay and Johnson (2017) propose statistics that do not require standard normal distribution assumptions and provide slight improvements; (c) Mroch et al. (2014) found that erasures and answer changes were “relatively rare events” (p. 145), with higher rates on harder items, lower rates for lower ability examinees, and differences in rates across subject area tests; (d) schools that failed to make AYP goals for four or more years had the most disproportionately high wrong-to-right answer changes |
Caveats for use: (a) Results depend in part on intensity settings on optical mark readers; appropriateness of statistical assumptions about distributions of WTR answer changes is not known (Maynes, 2013, p. 185); different flagging criteria yield different results; (e.g., Bishop & Egan, 2017, p. 211); (b) EDI for group detection investigated in only one study; (c) Mroch et al. (2014) acknowledge the need to establish baseline rates and define aberrant individual examinee answer changing behaviors; and (d) only a single study, flagging criterion is arbitrary | |
Test Administrators and Hackers Tampering With Test Response Files | |
(20) Monitoring for failure of strict adherence to chain of control protocols, immediate shipment of materials out of the local testing site, and video surveillance may detect tampering with answer sheets; monitoring for hacker attacks. Note: No parallel method is known widely to detect tampering with student responses or response files in computer-based test administrations; Tiemann and Kingston (2014) provide baseline data for answer changing in a computer-based test | Availability of effectiveness studies not known |
Caveats for use: Chain of control and immediate shipping may be the only effective protection against answer sheet tampering, which is challenging for multiple testing dates in the same testing site and with makeup administrations for absentees; video surveillance may conflict with privacy protections | |
Test Administrators, Examinees, or Organized Groups Acquiring and Divulging Secure Test Content From a Recent Test Administration | |
(21) (a) Setting up and publicizing a dedicated email address, Twitter account, website, and phone number for reporting potential violations; (b) monitoring the Internet and social media for secure test content; (c) highly similar response choices and unusual score gains in subsequent test administrations; see Improbably Similar Response Patterns Due to Test Content Preknowledge, Examinee Answer Sharing or Copying, or Test Administrators Supplying Answers During Test Administration and Identification of Unusual Group Score Gains above; (d) shifts in classical item statistics, shifts in group performance, subsets of examinees with item preknowledge, response similarity, and collusion analysis (Maynes, 2013, pp. 179–180) | (a) Accuracy and comprehensiveness of whistleblower reports is not widely known; (b) effectiveness of Internet monitoring is not widely known; (c) see above; (d) see above |
Caveats for use: Finding score gains and other evidence after administration is too late to repair data integrity for the previous administration; accuracy and comprehensiveness of whistleblower reports is not widely known; hiring professional services to monitor web activity may be too costly; assessment program staff monitoring may require significant time and expertise | |
Operational Contractors Failing to Account for All Secure Test Material or Erase Computer Files with Test Content | |
(22) Testing program manager due diligence to confirm that contractors have completed these steps | – |
Caveats for use: – |
- Note. General principle and caveats for use of statistical detection methods: Statistical test results alone do not provide adequate evidence to indicate that a test security violation has been detected; alternate, innocuous explanations may explain evidence. Other, more direct and concurring evidence is required. Generated evidence may support or fail to support suspicions or allegations of test security violations; lack of supporting evidence alone may not refute suspicions or allegations. Some detection measures (e.g., observations, video surveillance, and use of statistical detection methods) also may act as deterrence measures. When primary references are not cited in the table, references to original publications are available in the compendium references provided in this table. Original references are not cited in the table here for ease of use.
- CAT = computer-adaptive testing; SR = selected response items; TBT = technology-based testing.
- aDetect means generate evidence to indicate a possible test security violation.
- bTable includes selected observational and statistical detection methods with empirical support or logical promise for efficacy in detecting possible test security violations, where statistical methods are supported by evidence of acceptable reliability and accuracy under specified conditions.
- cIn this table, effectiveness refers to the degree to which a detection method achieves its goals; efficacy refers to the ability or potential of a method to be effective.
- dIn the movie Saving Private Ryan, typists in the World War II War Department notice by chance that three separately typed letters of condolence are addressed to the same Mrs. Ryan of Paton, Iowa.
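To make the regression-residual approach in Table 4, entry 12(b), concrete, here is a minimal sketch that regresses current-year school mean scores on prior-year means and flags schools with unusually large positive standardized residuals. The data, the z-score threshold, and the flagging rule are hypothetical illustrations, not the procedures evaluated by Maynes (2013); as the table note cautions, a flag identifies a school for follow-up and is not evidence that a violation occurred.

```python
# Minimal sketch of flagging unusual group score gains via regression residuals.
# Data and the flagging threshold are hypothetical; a flag only identifies a
# school for follow-up, it does not establish that a violation occurred.
import numpy as np

def flag_unusual_gains(prior_means, current_means, z_threshold=3.0):
    """Regress current-year school means on prior-year means and return the
    indices of schools whose standardized residuals exceed z_threshold."""
    x = np.asarray(prior_means, dtype=float)
    y = np.asarray(current_means, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)          # simple linear regression
    residuals = y - (intercept + slope * x)
    z = (residuals - residuals.mean()) / residuals.std(ddof=1)
    return np.where(z > z_threshold)[0], z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prior = rng.normal(500, 20, size=100)               # hypothetical school means
    current = prior + rng.normal(0, 5, size=100)
    current[17] += 40                                    # one school with an implausible gain
    flagged, z_scores = flag_unusual_gains(prior, current)
    print(flagged)   # likely flags school 17 under these assumptions
```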
Before considering the contents of Table 4, it is important to remember that a statistical flag alone may not represent adequate evidence that a test security violation has occurred. For example, explanations for inordinate numbers of wrong-to-right answer changes include both cheating (e.g., educators tampering with examinee responses, test administrators telling examinees to change answers) and innocent behavior (e.g., distributing the wrong test form, then erasing responses and restarting with the correct test form). The same is true for verbal reports of possible security violations. For example, observing a test administrator providing an approved read-aloud accommodation for students with disabilities could be mistaken for assisting examinees with responding to test items. As a general principle, no single piece of evidence is adequate to support a claim that a test security violation has occurred and that sanctions may be warranted. Concerns and single pieces of evidence about possible test security violations almost always require additional investigation and evidence.
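To illustrate how such a wrong-to-right (WTR) screen might be computed, and why it can only be a starting point, the following is a minimal sketch with hypothetical counts and an arbitrary threshold that compares each school's WTR answer-change rate with the mean rate across schools. Any school it flags would still require the corroborating evidence described above.

```python
# Minimal sketch of a school-level wrong-to-right (WTR) answer-change screen.
# Counts and the flagging threshold are hypothetical; elevated rates have
# innocuous explanations and require corroborating evidence.
import math

def wtr_screen(school_counts, z_threshold=3.0):
    """school_counts: {school: (wtr_changes, total_answer_changes)}.
    Returns schools whose WTR rate is more than z_threshold standard
    deviations above the mean rate across schools."""
    rates = {s: wtr / total for s, (wtr, total) in school_counts.items() if total > 0}
    mean = sum(rates.values()) / len(rates)
    sd = math.sqrt(sum((r - mean) ** 2 for r in rates.values()) / (len(rates) - 1))
    return {s: (r - mean) / sd for s, r in rates.items() if (r - mean) / sd > z_threshold}

if __name__ == "__main__":
    counts = {f"school_{i}": (30 + (i % 5), 100) for i in range(50)}  # typical schools
    counts["school_49"] = (85, 100)                                   # unusually high WTR rate
    print(wtr_screen(counts))   # flags school_49 under these hypothetical counts
```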
Several general observations about detection methods and the research summarized in Table 4 are worth noting:
- – While most attention in professional publications focuses on statistical detection methods, we also must rely on observations of testing sites and situations and reports by testing center staff and others to detect possible security violations.7 A corollary also seems apparent: The concern that people in responsible roles in the testing process may fail to report a possible test security threat is a perpetual one (as are concerns about potential overreporting).
- – Some detection methods also serve as prevention measures (e.g., observations of protection of secure materials and test administrations, publicizing detection methods) and even encouragements toward professional, ethical behavior.
- – Research on statistical detection methods is limited; it is “still in its infancy,” “overall results…have been mixed” (Maynes, 2013, p. 194), and findings are scattered across a number of publications and not organized into taxonomies. In addition, much of the research has focused on a small number of types of cheating (e.g., inordinate response similarities, inordinate numbers of wrong-to-right answer changes; a minimal descriptive similarity example appears after this list). Further, as always, many statistical methods cannot be used to detect possible cheating with small numbers of examinees (common in small schools and for test administration accommodations, like reading items aloud, delivered in small testing groups) or with small numbers of items.
- – Research on the distribution theory for some detection statistics is not known, the power of some statistics has not been studied (although false positive and negative error rates often are examined in data simulations), and the credibility of detection statistics varies (Maynes, 2013, p. 184; see also Harris & Huang, 2017, p. 308).
- – Systematic summaries of test security violation incidence rates are lacking, so assumptions about distributional characteristics for detection statistics can lead to faulty false positive rates and can make it challenging to determine which situations should be investigated (Harris & Huang, 2017, p. 308). “This may be compounded in situations involving next-generation assessments, where there is also little historical knowledge of examinee behavior” (Harris & Huang, 2017, p. 308) available to establish baseline rates. Current estimates of teacher and examinee cheating, for example, can vary as widely as 1%–2% to 4%–5% (Fremer & Ferrara, 2013, p. 18).
- – Simulation studies suffer from these inadequacies, which hampers comparisons of results across studies and limits generalizability (Wollack & Cizek, 2017, p. 395).
- – Our industry would benefit from attention from technology and legal experts on prevention and detection methods that are effective and legal.
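As the simplest concrete example of the response similarity screens mentioned above (in the spirit of the baseline rates of identical incorrect responses in Allen, 2014), the sketch below counts identical incorrect selected responses for a pair of examinees. The response strings and key are hypothetical, and, consistent with the caveats above, a high count alone cannot distinguish copying, preknowledge, or coincidence.

```python
# Minimal sketch of a descriptive similarity screen for a pair of examinees:
# the count and proportion of items answered incorrectly with the same option.
# Responses and key are hypothetical; high similarity alone is not evidence
# of copying, preknowledge, or any other violation.

def identical_incorrect(responses_a, responses_b, key):
    """Return (count, proportion) of items on which both examinees chose the
    same incorrect option. Proportion is relative to items both got wrong."""
    both_wrong = [(a, b) for a, b, k in zip(responses_a, responses_b, key)
                  if a != k and b != k]
    if not both_wrong:
        return 0, 0.0
    identical = sum(1 for a, b in both_wrong if a == b)
    return identical, identical / len(both_wrong)

if __name__ == "__main__":
    key = list("ABCDABCDABCD")
    examinee_1 = list("ABCDABCDABCA")   # hypothetical response strings
    examinee_2 = list("ABCAABCDABCA")
    print(identical_incorrect(examinee_1, examinee_2, key))   # -> (1, 1.0)
```

In operational use, such raw counts would be evaluated against empirical baseline rates rather than any fixed cutoff.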
Developing detection methods for the many threats to test security will benefit from continuation of current individual efforts plus organized, systematic research and development programs that address all threats, rather than limited types of threats in Table 4.
So, detection methods research, development, and refinement are emerging, but the state of the art in detection is still quite limited. Detection methods are limited to selected response items, even though short constructed-response items, performance tasks (Ferrara, 1997), and essay prompts (Lane, 2013, chapter 5) are increasingly common in state testing programs. Similarly, I have not yet found publicized attention to widely used next-generation assessment approaches like technology-enabled items (e.g., drag-and-drop and hot spot items) and technology-enabled features like item sets associated with animated, sometimes manipulatable, simulations. And the array of assessment approaches encouraged in the Every Student Succeeds Act (ESSA; i.e., projects, portfolios, and locally designed formative assessments to create a summative test score; see http://www.ed.gov/essa?src=rn) will expand the need for prevention and detection methods. These limitations put current security practices in operational testing programs at risk, at least in educational testing. As things stand now, perhaps only half of state educational testing programs use statistical detection analyses (e.g., Bello & Toppo, 2011; Government Accountability Office, 2013).
Investigation
When a test security violation is suspected or alleged8 as a result of detection analyses, it may be necessary to undertake an investigation. The goal of investigations is to gather evidence to support or refute allegations. Of course, the evidence may be insufficient to support allegations, even if suspicions linger. Because results of investigating possible test security violations may be tried in federal, state, or local administrative courts (e.g., for actions against a teacher's license), civil courts (e.g., infringement of copyrighted material; M. Croft, personal communication, July 6, 2016), and even criminal courts (e.g., as in the Atlanta Public Schools case), the evidence produced in an investigation must be able to withstand courtroom scrutiny and must follow rules of evidence. Investigation is a profession, just like education and psychometrics. It is a profession with principles of practice, practical and scientific methods, regulations and guidelines, prepractice training and continuing education, professional societies, journals, and licensure requirements. Principles and techniques from that profession that appear relevant to investigating possible test security violations include the following:
- – Requirements for effective interviews and interrogations (Lushbaugh & Weston, 2016, chapter 7), maintaining the interviewer's self-control and direction of the interview or interrogation, conditions for conducting interviews and interrogations, logging interviews and interrogations, reviewing statements (Kinnee, 1994, pp. 342–348), charting investigations (Kinnee, 1994, chapter 6), and court preparation (Kinnee, 1994, chapter 22).
- – Leads, informants, and investigative techniques (Lushbaugh & Weston, 2016, chapter 6).
- – Theft of time, products, and services (e.g., Stephens, 2008, pp. 215 ff.), such as stealing secure, state-owned test content.
- – Gathering and storing evidence and preserving a crime scene (e.g., location of a test security violation; Stephens, 2008, chapter 10).
- – Rules of evidence (Lushbaugh & Weston, 2016, chapter 2) for federal criminal and civil courts and for local courts (e.g., Stephens, 2008, pp. 192 ff.) that govern whether evidence is admissible.
- – Cyber sleuthing (Stephens, 2008, pp. 229–230), which hackers can use to access test content, program managers can use to protect against hacking, and investigators can use to gather evidence about possible hacking.
- – Use of investigative equipment (e.g., covert cameras; Stephens, 2008, chapter 17 and p. 253) that are available for use for cheating, protecting against security violations, and gathering evidence about possible violations.
- – Interviewing possible witnesses to a violation and interrogating suspects, including differences in techniques most appropriate for men, women, and children (e.g., Stephens, 2008, p. 264); mirroring and matching respondent behavior to establish rapport (e.g., Stephens, 2008, pp. 265–266); reading body language to try to detect evasiveness and lying and posing questions to avoid evasiveness (e.g., Stephens, 2008, pp. 266–267); “behavior symptom analysis” to evaluate attitudes, verbal, nonverbal, and paralinguistic behavior, baiting techniques, and behavior provoking questions (see http://www.reid.com/training_programs/r_interview.html); and principles for obtaining usable confessions (e.g., Stephens, 2008, pp. 273–274).
- – Interviewing school-age children (e.g., Stephens, 2008, p. 264), which is particularly sensitive and can stir up public concern and undermine investigations.
Within educational measurement, several key documents touch on investigations. For example:
- The Standards for Educational and Psychological Testing (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 2014), which recommends that the “type of evidence and the general procedures…to investigate the irregularity should be explained” when it is deemed necessary to withhold a test score (standard 8.11, p. 137).
- The Code of Fair Testing Practices in Education (Joint Committee on Testing Practices, 2004), which recommends that we “describe procedures for investigating and resolving circumstances that might result in canceling or withholding scores, such as failure to adhere to specified testing procedures” (p. 10).
- The Council of Chief State School Officers’ TILSA Test Security Guidebook (Olson & Fremer, 2013), which provides useful guidance on preparing interview questions, some guidance on features of defensible follow-up investigations (i.e., be conservative, respect privacy, and maintain records), but none on how to conduct investigations.
- Operational Best Practices for Statewide Large-Scale Assessment Programs (Council of Chief State School Officers & Association of Test Publishers, 2013), which addresses primarily prevention and detection.
- Testing and Data Integrity in the Administration of Statewide Student Assessment Programs (National Council on Measurement in Education, 2012), which recommends identifying “qualified and trained” investigators and developing policies regarding turning investigations over to a “third party” (p. 6).
- An examination of states’ test security statutes and regulations (Croft, 2014), which observed, for example, that:
- – “Many states have laws related to preventing test security breaches, but few specify detection or investigation methods” (p. ii).
- – “Few states have specific information about the investigation procedures” (p. 14).
- – “A greater number of states provided information about penalties for violating test security than provided information about methods to detect irregularities or conduct investigations” (p. 17).
- – “Overall, states with investigation information focus on the sanctions but provide little information on investigation procedures” (p. 24).
Who conducts investigations for educational testing programs when suspicions, allegations, or evidence warrant it? In practice, it seems to be primarily principals, local assistant superintendents, and state assessment directors and staff. Three of the 13 states that responded to the survey for this article reported that the state department of education (only) conducts investigations, five reported that local school system staff (only) conduct investigations, and five reported that both state and local system staff conduct investigations. Two of the states also reported that they would hire an outside contractor to conduct investigations. Contrast that with other widely reported test security cases. The Atlanta School Board hired the auditing firm KPMG and the test security company Caveon to investigate widely publicized cheating in the Atlanta Public Schools (Kingston, 2013, p. 302).10 In addition, the District of Columbia Public Schools system hired Alvarez & Marsal, a “global professional services firm specializing in…performance improvement and business advisory services” (see http://www.alvarezandmarsal.com/), and Caveon (e.g., Gillum & Bello, 2011)11 to investigate allegations of wrong-to-right answer changing.
Most state and local educational administrators have not been trained in professional investigation techniques. In addition, the knowledge, skills, and abilities (KSAs) and job requirements for investigators seem incompatible with the KSAs and job requirements required of educational administrators. This is evident in the Occupational Information Network, O*Net, a free, online database sponsored by the U.S. Department of Labor/Employment and Training Administration, which contains standardized and occupation-specific descriptors for 974 occupations in the United States.12 A search on the O*Net category Investigators yielded 184 occupations (on November 5, 2016; see http://www.onetonline.org/find/result?s=investigators&a=1). An O*Net search for Fraud Examiners, Investigators and Analysts (http://www.onetonline.org/link/summary/13-2099.04) and Education Administrators, Elementary and Secondary School (at http://www.onetonline.org/link/summary/11-9032.00) reveals some similarities. Both jobs require similar types of knowledge, skills, and abilities; for example, knowledge of human psychology and behavior; active listening, judgment, and decision-making skills; and “problem sensitivity… the ability to tell when something is wrong or is likely to go wrong.” However, required work activities reveal how different these jobs and their requirements are, as indicated in Table 5.
Table 5. Work Activities of Investigators and of Elementary and Secondary School Administrators
Investigators | Elementary and Secondary Administrators |
---|---|
Work Activities | |
Prepare evidence for presentation in court | Observe teaching methods…to determine areas where improvement is needed |
Testify in court regarding investigation findings | Collaborate with teachers to… |
Interview witnesses or suspects and take statements | Counsel and provide guidance to students regarding personal, academic, vocational, or behavioral issues |
Conduct in-depth investigations | Plan and lead professional development activities… |
- Note. Adapted from O*Net categories Fraud Examiners, Investigators, and Analysts (http://www.onetonline.org/link/summary/13-2099.04) and Education Administrators, Elementary and Secondary School (http://www.onetonline.org/link/summary/11-9032.00); emphases added.
As Table 5 indicates, work activities required of investigators include gathering and presenting facts and evidence about and from people who are under suspicion. The focus is on finding credible evidence that supports or disconfirms suspicions about unprofessional, unethical, and even criminal behavior for the purpose of convicting or clearing those under suspicion. In contrast, tasks required of school administrators focus on helping, counseling, and collaborating with people to help them improve professional learning and performance. O*Net lists three work values for each of these occupations; two, independence and working conditions, are common to both professions. Relationships is another work value for educators: “Occupations that satisfy this work value allow employees to provide service to others and work with coworkers in a friendly noncompetitive environment.” In contrast, another work value for investigators is achievement: “Occupations that satisfy this work value are results-oriented and allow employees to use their strongest abilities, giving them a feeling of accomplishment.” These values conform to stereotypes about kindly teachers and steely, objective detectives.
Should educators conduct investigations of possible test security violations? Clearly, some could be trained to be effective investigators. However, state and local educational administrators also face conflicts of interest. A conflict of interest is “a set of circumstances that creates a risk that professional judgment or actions regarding a primary interest will be unduly influenced by a secondary interest” (quoted and cited at https://en.wikipedia.org/wiki/Conflict_of_interest). Local school system administrators investigating possible test security violations in a school in that school system face a conflict of interest: On the one hand, their job as the investigator is to discover evidence that may support the suspected violation and protect the integrity of the school system's testing program and data; on the other hand, their job as a school system administrator is to protect the school system's reputation and avoid unfavorable publicity. Because this conflict of interest can be identified and defused before impropriety occurs, the appearance of a potential conflict of interest should be avoided. Testing program vendors face a conflict of interest for similar reasons. This is not a new insight (e.g., see G. Cizek quoted in Kingston, 2013, p. 302); it is one that is often not acknowledged and acted upon. My view is that when a test administration irregularity is reported or suspected—one that appears not to be a security breach, such as an examinee discovered to have a cell phone during a testing session but with no indication of using it—a school system official could follow up. However, when evidence suggests the need for investigation of alleged test security violations, states or school systems should hire professionals to conduct those investigations—at least when the most damaging and egregious offenses are suspected. I recognize that state and local budgets may not be able to afford professional investigation services. Perhaps states could form consortia to share the cost of a retainer fee for investigative service needs that may arise in the future.
In fact, relying on state and local education staff to conduct investigations probably is the most common practice. In that case, the people assigned to conduct investigations should be trained in professional investigation techniques and should be positioned so that they are free from any real or perceived conflict of interest in the outcome of an investigation. Or perhaps the USDE should create a fund to support states’ investigative service needs.
Evidence that exists about educational test security investigation practices around the country is not positive. For example, the Atlanta Journal-Constitution examined more than 130 cases in several major cities with high-profile cheating cases and surveyed state education departments (Judd, 2012). Under the headline “School Cheating Thrives While Investigations Languish,” Judd cited the “haphazard manner in which many states and school districts handle reports of cheating on high-stakes achievement tests,” observed that “investigating allegations of cheating remains a low priority in many states, despite high-profile scandals,” and reported that states badly underbudget for test security investigations even though Georgia spent $2.2 million in Atlanta and Dougherty County.
Resolution
Resolving a suspected, alleged, or confirmed test security violation involves determining, to the extent possible, whether test security was violated, the extent of damage to data integrity and to the validity and usefulness of test scores, and who committed or may have committed the violation(s). Resolution does not always follow detection, however. For example, the U.S. Department of Education's Office of the Inspector General audited the Department and five state education agencies (SEAs) and reported:
The Department and all five SEAs had systems of internal control designed to prevent and detect inaccurate, unreliable, or incomplete statewide test results. However, these systems did not always require corrective action if indicators of inaccurate, unreliable, or incomplete statewide test results were found. (Office of the Inspector General, 2014, p. 1; emphasis added)
While the sample of states in this audit is small and the time period is limited (the 2007–2008 through 2009–2010 school year administrations), most likely it represents current practice in the United States. In addition, three questions in the survey of state assessment programs conducted for this article are relevant to this discussion. Eleven of the 13 responding states answered yes to the question, “Does your state have legislation, state board of education regulations, or other rules that…protect test security or test data integrity?” However, only 7 of the 12 responding states answered yes that they have legislation, regulations, or rules to “guide sanctions if test security violations are detected and supported by evidence.” In response to the question “In your opinion, if a possible test security violation is detected and supported by sufficient evidence, how likely is it that sanctions on individuals, schools, or local school systems will be enforced?” only three states responded highly likely, seven responded somewhat likely, and two responded somewhat unlikely. Further, (a) a USA Today article cites a PBS Frontline broadcast report by education journalist John Merrow and suggests that then-DC Schools Chancellor Michelle Rhee did not follow up on evidence of widespread erasures and wrong-to-right answer changes in the DC CAS program at multiple schools (Toppo, 2013); and (b) a recent Baltimore Sun article documented that two Maryland 10th-grade students had posted test questions from a statewide English test on Twitter and that “any disciplinary actions against the students would be decided by the schools” (Bowie, 2015, p. 1), even though posting the material is a violation of protections for copyrighted test material.
In some cases, public opinion seems to oppose resolving some test security violations. For example, bloggers and tweeters supported a Columbia University professor's posting of three PARCC Grade 4 test questions. PARCC's strong efforts to have the posters remove the items were considered by some to be infringing on free speech rights to discuss test flaws (see Lewin, 2016). Publicized opinions in some instances characterized monitoring social media as a violation of privacy rather than as an extension of classroom monitoring for cheating (e.g., Strauss, 2015). Such opposition may seem justified in light of concerns about school systems forcing students to reveal their social media passwords and about state laws (in Louisiana, Maine, Michigan, Rhode Island, and Utah) that give schools such access in order to monitor social media posts for possible exposure of secure test content (see Herold, 2015). When evidence does support an allegation, deciding how to resolve the violation raises difficult questions, for example:
- – Should the sanction be a personnel matter for the employer (e.g., a local school system and the board of education)? Should the employer demote or fire the violators? Should the employer place a letter in the violators’ personnel files as a warning to other potential employers?
- – Should the sanction be a state certification matter, where educator certification can be suspended or withdrawn, and managed by a state department of education and state board of education?
- – Should the violator or the school system repay the state for retesting or loss of secure test property?
- – Has everyone who was involved in the violation been identified? Or were some caught, others not? What is the level of culpability for each violator and the appropriate level of punishment?
- – How can the degree and type of sanction be calibrated for the degree of test security violation? This requires additional consideration of loss of the testing program's test content and replacement costs; loss of data to meet local, state, and federal reporting requirements; loss of public trust in the testing program and responsible testing agency and staff; potential loss of appropriate program placement and instruction (e.g., remediation) to students; and the potential damage to school-age children because of the unethical actions of highly influential adults, their teachers, and other school leaders.
- – Are the amount, types, and trustworthiness of the evidence about the violation adequate to support the degree and types of sanction deemed appropriate? Will the evidence support local personnel sanctions? State educator certification sanctions? Will the evidence withstand public and media scrutiny and potential challenges in administrative law, civil, and criminal courts?
- – Are the people who collected the evidence, and the procedures they followed, adequate to support the degree and types of sanctions and to withstand potential legal challenges?
Other considerations come into focus when suspicions have been raised and investigations undertaken. Specifically, the reputation of anyone suspected or accused of a violation may suffer irreparable harm, and the public may come to distrust test data. Likewise, when sanctions are imposed or overturned (e.g., by a state board of education or law court), public confidence in the sanctioning body may suffer, and that body could face retaliation through court cases. It is difficult and risky to try to do the right thing. Unprofessionally conducted investigations and weak evidence exacerbate that risk.
Consequences of not following through can be dire, however, either because the “failure of districts and states to adopt and enforce effective test security policies can result in litigation involving parents, employees, administrators, whistleblowers and others adversely affected by falsified scores” (Phillips, 2011, p. 6) or because it undermines the culture of professional ethics, test data integrity, and test score validity.
Resolving a suspected test security violation requires an authorized body to weigh evidence, make judgments about who is responsible for the violation, the degree of their responsibility, and the extent of loss and damage to the testing program and to examinees and others, and determine the types and levels of sanctions to impose on those who are responsible. The authorized body cannot act capriciously. Its decisions will be accepted by the public and be able to withstand legal challenge if those decisions are based on well-publicized (a) policy, practices, guidelines, and procedures for prevention, detection, and investigation; and (b) sanctions that correspond to various types and degrees of violations. Sanctions calibrated to the severity of violations would not be unlike sentencing guidelines that criminal court judges may follow.
Aside from the evidence cited above, we do not know a great deal about the extent to which state testing programs follow through with sanctions when security violations are supported by evidence. We do know a fair amount about prescribed sanctions for test security violations. Croft (2014) found guidance in an examination of 38 states’ test security “state statutes and regulations” (p. 3), where statutes and regulations have the force of law (M. Croft, personal communication, June 6, 2016). Croft found that (a) a greater number of states provide information on penalties for violating test security than they do on preventing and investigating violations; (b) educator penalties range from requiring additional professional development to suspension or revocation of certification; (c) 12 states penalize school districts in which violations occur; and (d) six states permit civil penalties for violations, while seven states’ statutes specify criminal penalties. It is not clear that the information on penalties for violations includes guidance on calibrating those penalties with the severity and impact of the violations. Respondents to the ATP survey, most of whom represent certification, licensure, and testing vendor organizations, reported that the most common actions in response to a security breach were invalidating test scores and conducting security integrity investigations (ATP, 2015, p. 79). Very few respondents reported more severe actions, such as personnel actions, referral to law enforcement, and pursuing civil legal actions.
I have written elsewhere that the U.S. Department of Education, which mandates state testing and accountability requirements through the Every Student Succeeds Act, and state departments of education have the responsibility and authority to make and enforce test security policy (Ferrara, 2014). In fact, they, local school systems, schools, and licensure and certification boards have an inherent interest in the integrity of test data. Guidance in resolving test security incidents is essentially nonexistent in previously cited key publications on test security, beyond the models of state statutes and regulations in Croft (2014).
Discussion and Recommendations
In the wake of the shocking cheating scandal in the Atlanta Public Schools (see, for example, Kingston, 2013), the Atlanta Journal-Constitution conducted analyses of reading and mathematics test scores for 69,000 schools nationwide. The analyses identified “suspicious test scores in roughly 200 school districts” (see Perry, 2012).13 I cannot attest to the rigor and appropriateness of the analysis methods and interpretation of these results. However, the pervasiveness of inordinate increases and decreases in school performance is alarming—and points to the need for rigorous, comprehensive test security systems in K-12 and licensure and certification testing programs.
The U.S. Department of Education's Office of the Inspector General found in March 2014 (Office of the Inspector General, 2014), among other things, that “The Department Could Strengthen Its Monitoring of States’ Test Results and Test Administration Procedures” (finding number 1) and “SEAs Could Strengthen Their Oversight of Statewide Test Administration and Security” (finding number 2). Finding number 2 implies that states should improve monitoring of test administrations, use of forensics, and resolution of security problems. The Department of Education agreed with these recommendations. Efforts to improve test security have come from an array of interested organizations: the U.S. Department of Education itself, the Association of Test Publishers, Council of Chief State School Officers, the annual Conference on Test Security, individual states and school districts, testing contractors, and individual researchers through conference papers and journal articles.
The review here suggests that these disparate responses are making solid contributions to improving test security policies, practices, and perhaps outcomes—and suggests that a coordinated and more systematic approach would be more efficient and probably more effective. For example, rather than 51 separate state assessment programs (plus those of U.S. territories) working independently to develop test security policies and practices, a committee of state stakeholders could develop guidelines for PDIR for adaptation and implementation. A detection subcommittee could include statisticians and psychometricians who could conduct additional reviews of test security data forensics methods (e.g., in presentations from the 2014–2016 Conference), make recommendations for their use against specific types of security threats, and develop a research agenda for additional development of forensics in the research community. An investigation subcommittee could include expert investigators, and a resolution subcommittee could include legal, legislative, and regulatory experts. Many state and other assessment programs do not have staff and financial resources to develop comprehensive security systems—and will remain at risk for significant security breaches until they do. Pooling resources—and getting significant support from the federal government, which requires state testing programs and demands trustworthy accountability data—is essential. Testing contractors and the measurement research community bear some responsibility here, as well. But the U.S. Department of Education and the state departments of education are the testing program sponsors who are most responsible for ensuring test security and data integrity, and are at risk of failing to provide adequate protections. The committee could be selected and convened by any one or a combination of the organizations that support the interests of educational, certification, and licensure testing programs: the Association of Test Publishers, Board on Testing and Assessment of the National Research Council, Council of Chief State School Officers, and the National Council on Measurement in Education. Other organizations could provide specialized guidance; for example, the National Association of State Boards of Education, on reasonable expectations and sanctions for educators.
The U.S. Department of Education has taken an appropriate first step. The revised peer review guidance for state assessment programs includes two critical elements for preparing assessment peer review submissions: critical elements 2.5—Test Security and 2.6—Systems for Protecting Data Integrity and Privacy (see U.S. Department of Education, 2015). Providing the means for states to collaborate and develop guidance for PDIR is the critical next step. Department of Education funding should support a coordinated effort among the testing programs and other professional communities with stakes in large-scale testing. The support for using federal education dollars is in the public record: A colloquy between Senators Barbara Mikulski and Lamar Alexander during hearings on December 8, 2015, on the Every Student Succeeds Act confirmed that states are provided the “flexibility to use [federal education] funds to preserve and maintain the integrity and validity of these important [state] assessments” (Assessment Security, 2015, p. S8469). The bipartisan, bicameral bill was signed into law by President Obama two days later.
Some Additional Points
We do not have a clear idea of test security systems, practices, and effectiveness in K-12 testing. Licensure and certification testing programs do not divulge much about test security (e.g., Ferrara & Lai, 2016, p. 610). Many states do not have the capability to conduct statistical analyses to detect security violations and have not included those analyses in their contracts. States and their contractors tend to focus on a limited range of security threats (i.e., answer changing, response similarity, and spikes in group performance). Scattered reports indicate uneven efforts and effectiveness in prevention, investigation, and resolution. Systematic collection of this information would be valuable to the educational testing industry, primarily to determine the extent of vulnerability of state testing programs. The ATP (2015) survey suggests that licensure and certification testing programs, which are reputed to be responsive to security threats and violations, are not doing much better: 20% of responding programs report not using any “major active or passive security procedures” (p. 12).
Implementing effective security plans is good in itself. Working proactively to detect and prevent violations is even better. It is associated with lower-impact security breaches (e.g., only one examinee or testing site involved, retiring 20 or fewer items) and a “significantly lower occurrence of high impact breaches” (ATP, 2015, pp. 12, 73). Testing program managers can build modest systems at first around the highest-priority, most worrisome security risks and expand incrementally. They can start from the threats in Table 1, roles and responsibilities in Table 2, and prevention countermeasures in Table 3 to design observational, disclosure, and statistical detection procedures for the highest risks; implement procedures and decision criteria for determining when investigation is necessary and who will conduct investigations; and work with education leaders to build a strong commitment to completing rigorous investigations and resolving all security violations.
Our research community can expand its support of test security by developing tools and procedures to address the threats to test security in Table 4 that have not yet received adequate attention. In addition, we can collaborate with testing program managers and legal experts to develop guidance on using statistical methods to discover unknown security violations and on developing corroborating evidence to minimize the risks of false positive errors.
Other writers cited have raised concerns about several issues, which I address here in closing. These are issues that the committees I propose above could consider in their deliberations.
Should testing programs use statistical forensics in exploratory fashion, to find potential but unreported possible test security violations, or only as follow-up to an allegation? Most of the researchers cited in Table 4 warn against the risk of false positive identification of violations, and they take steps to minimize false positive error risk. Wainer (2014) advises vehemently against using statistical detection methods in exploratory fashion. He cites a standard from Educational Testing Service's security policy: “Statistical evidence of cheating is almost never used as the primary motivator of an investigation” (p. 9), and he provides three case studies to illustrate his point.
From a statistical point of view, exploratory searches for violations seem unwise because of the risk of statistical and inferential errors and the possibility of damaging peoples’ reputations and livelihood and undermining trust in a testing program. On the other hand, testing program managers have a responsibility to protect test security, especially against threats to security that may be hidden or blatant and nevertheless unreported (think of the Atlanta Public Schools case). It makes sense to conduct exploratory forensic analyses as part of due diligence, provided that strong corroborating evidence is required before a formal investigation proceeds. This is a program operations and policy decision that may require support from the testing program's authorizing body.
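To illustrate the false positive concern, the sketch below applies a Benjamini-Hochberg false discovery rate adjustment to a hypothetical exploratory scan so that only the strongest statistical signals are carried forward for corroborating evidence. The group labels, p-values, and threshold are assumptions for illustration, not a prescribed procedure.

```python
# Hypothetical sketch: disciplining an exploratory scan of many groups with a
# Benjamini-Hochberg (BH) false discovery rate adjustment before any follow-up.
def benjamini_hochberg(p_values_by_group, fdr=0.05):
    """Return the groups whose p-values survive the BH step-up rule at `fdr`."""
    ranked = sorted(p_values_by_group.items(), key=lambda kv: kv[1])
    m = len(ranked)
    keep_through = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= fdr * rank / m:  # BH criterion for the rank-th smallest p-value
            keep_through = rank
    return dict(ranked[:keep_through])

# Hypothetical p-values from, say, an answer-similarity scan of six groups.
scan = {"g1": 0.0004, "g2": 0.003, "g3": 0.04, "g4": 0.20, "g5": 0.60, "g6": 0.90}
print(benjamini_hochberg(scan))  # only g1 and g2 survive at a 5% false discovery rate
```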
How can testing program managers and their detection services combine results from multiple statistical and nonstatistical detection procedures to determine if a security violation should be suspected and investigated? Part V of Kingston and Clark's Test Fraud (2014) contains three chapters on using multiple methods to detect possible security violations. Several other chapters there and in other sources cited in Table 4 illustrate using multiple methods to detect various types of security violations. Little advice exists, however, to help testing program managers and their security services with weighting various forms of evidence and combining them to guide reasonable decisions about dismissing an allegation or suspicion or continuing with further detection efforts and launching an investigation. Stephens (2008) describes types of evidence, discusses combining evidence to create a proof (pp. 174–176), and lists rules of evidence (e.g., admissible evidence; see pp. 192–198). Lushbaugh and Weston (2016) devote several chapters to evidence, but they are focused primarily on major crimes. As discussed earlier, our field needs professional advice on using evidence to make decisions about determining whether and how to undertake additional investigation and to level charges of test security violations.
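For combining several statistical indicators, one classical option, offered here only as a sketch under strong assumptions (approximately independent indicators of comparable quality), is Fisher's method for combining p-values. The indicator names and values below are hypothetical; how to weight such evidence, and how to fold in nonstatistical evidence, remains the kind of professional judgment called for above.

```python
# Sketch only: combining p-values from approximately independent detection
# analyses with Fisher's method. Indicator names and values are hypothetical.
from math import log
from scipy.stats import chi2

def fisher_combined_p(p_values):
    """Fisher's method: -2 * sum(ln p_i) is chi-square distributed with
    2k degrees of freedom when all k null hypotheses are true."""
    p_values = list(p_values)
    stat = -2.0 * sum(log(p) for p in p_values)
    return chi2.sf(stat, df=2 * len(p_values))

# Hypothetical indicators for one classroom: individually suggestive,
# jointly much stronger.
indicators = {"answer_similarity": 0.03, "wtr_erasures": 0.02, "score_gain": 0.10}
print(round(fisher_combined_p(indicators.values()), 4))
```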
How can testing program managers deliver adequate investigative services and ensure appropriate resolution of major test security violations? Alone, they cannot. States need guidance and options for hiring investigation services and deciding when evidence suggests the need for professional investigations. And they need guidance in distinguishing security violations from gaffes that are more appropriately addressed by gathering information to repair test administration errors and other less damaging threats to test security and data integrity. Testing program authorizing bodies may be more likely to take strong action when a security violation has been confirmed—and to insist that all suspicions and allegations are investigated thoroughly—if regulations or statutes are in place and are explicit about requirements for investigating and resolving violations. State boards of education and legislatures would benefit from guidance from other states and from educator licensure and education law experts to develop and implement these requirements. And they might feel more assured in implementing them if they knew that other states were doing similar things.
Comprehensive test security systems are characterized by strong policies and practices for PDIR. We need to do more, and do better at each element.
Acknowledgments
The author thanks Michelle Croft, Joseph Martineau, the Minnesota Department of Education and its Technical Advisory Committee, the editor, and two anonymous reviewers for their insightful comments and additional ideas for this article, and Scott Norton and Amy Kinsman for their support on the survey conducted for this study.
Notes
Appendix: Results From Survey of State Assessment Program Directors Conducted for This Study
Survey item | Yes | No |
---|---|---|
Does your state have legislation, state board of education regulations, or other rules that: | ||
(1) Protect test security or test data integrity? | 11 | 2 |
(2) Prohibit cheating on state tests? | 8 | 4 |
(3) Require investigation of possible test security violations? | 9 | 4 |
(4) Guide sanctions if test security violations are detected and supported by sufficient evidence? | 7 | 5 |
Does your state department of education or testing program collect information or conduct analyses to detect the following possible test security violations: | ||
(5) Cheating by teachers, other school personnel, or students? | 10 | 3 |
(6) Inappropriate test preparation by teachers or students? | 8 | 4 |
(7) Exposure/disclosure of secure test material? | 9 | 3 |
(8) Unruly test administrations? | 6 | 5 |
Does your state department of education or testing program conduct or arrange for: | ||
(9) Observations of test administrations in schools? | 10 | 3 |
(10) Observations of school and school system practices intended to protect test security before, during, or after test administrations? | 10 | 2 |
If a test security violation is suspected, who conducts or manages investigations to determine if sufficient evidence exists to warrant sanctions? (Check all that apply) | ||
State department of education staff | 3 | |
Local school system staff | 5 | |
Both state and local staff | 4 | |
A state or local law enforcement agency | 0 | |
A contractor hired by the state or local school system | 0 | |
State and local staff working with a contractor | 1 | |
Comment: “We would most likely hire Caveon.” | ||
In your opinion, if a possible test security violation is detected and supported by sufficient evidence, how likely is it that sanctions on individuals, schools, or local school systems will be enforced? | ||
Highly likely | 3 | |
Somewhat likely | 7 | |
Somewhat unlikely | 2 | |
Not at all likely | 0 |
- Note. Fifty-one state assessment directors were invited to respond to the online survey during February, 2016; 13 responded, for a response rate of 25%.