Procedural parameters in equivalence-based instruction with individuals diagnosed with autism: A call for systematic research
Abstract
Equivalence-based instruction (EBI) is an efficient and efficacious methodology for establishing equivalence classes that has been used to teach various academic skills to neurotypical adults. Although previous reviews confirmed the utility of EBI with participants with developmental disabilities, it remains unclear whether certain procedural parameters are associated with positive equivalence outcomes. We extended previous reviews by categorizing studies that used EBI with individuals diagnosed with autism spectrum disorder and assessing whether any procedural parameters were associated with better equivalence responding. Given the wide variability of procedural parameters in EBI research, the procedural permutations best suited to forming equivalence classes with individuals diagnosed with autism spectrum disorder are still unknown. Thus, this paper serves as a call to action for applied researchers. Specifically, we encourage and invite researchers to systematically investigate the variables, or combinations of variables, that may lead to successful equivalence class formation.
A growing body of research supports the utility of equivalence-based instruction (EBI) as an approach to teach symbolic relations (e.g., Brodsky & Fienup, 2018; Gibbs & Tullis, 2021; Rehfeldt, 2011). Equivalence-based instruction incorporates conditional discrimination teaching procedures to establish equivalence classes consisting of physically dissimilar and often socially significant stimuli (Fienup et al., 2010). A strength of EBI is its efficacy, as only a few stimulus relations need to be established directly (i.e., via reinforcement), whereas others emerge without explicit teaching (i.e., emergent relations).1
Equivalence-based instruction has led to performance improvements across various individuals, academic topics, contexts, and teaching formats (e.g., worksheets, online) with college students (Brodsky & Fienup, 2018). Brodsky and Fienup (2018) asserted that EBI via matching-to-sample2 may be a viable approach to increasing student performance efficiency based on its generative aspects. In a typical matching-to-sample preparation, a sample is presented alongside an array of comparison stimuli. Participants are taught to select (often by pointing to) a comparison stimulus conditional upon the sample. After teaching conditional (baseline) relations among stimuli, their substitutability is verified by testing for the properties of reflexivity, symmetry, and transitivity (Sidman, 1971). In the reflexive relation (e.g., A = A), identical stimuli are related to themselves. For example, if (A) is a picture of a monkey, reflexivity occurs if the picture of a monkey is matched to an identical picture. A symmetrical relation refers to the reversibility of the stimuli (e.g., if A = B, then B = A). One learns to point to the printed word “monkey” (B) in the presence of the picture of a monkey (A) and then, without further teaching, identifies the picture in the presence of its printed word. In the transitive relation, after teaching two other stimulus relations, the emergence of previously untrained relations is observed (e.g., if A = B and B = C, then A = C). Transitivity and, ultimately, equivalence relations emerge when, in the presence of the picture of the monkey (A), the printed word “macaco” (Portuguese for monkey; C) is selected, and vice versa, respectively. Transfer of function (Sidman & Tailby, 1982) occurs when one class member acquires a specific function (e.g., discriminative) and all other class members come to show the same function (e.g., Miguel et al., 2009).
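The taught and derived relations described above can be summarized computationally. The sketch below is purely illustrative and assumes the monkey example (the stimulus labels A, B, and C are placeholders; nothing here comes from the reviewed studies): starting from two directly taught relations, reflexivity, symmetry, and transitivity follow by closure.

```python
# Illustrative sketch of derived relations in the monkey example:
# A = picture of a monkey, B = printed word "monkey", C = printed word "macaco".

taught = {("A", "B"), ("B", "C")}  # baseline relations established via reinforcement


def equivalence_closure(pairs):
    """Return the reflexive, symmetric, and transitive closure of the taught pairs."""
    stimuli = {s for pair in pairs for s in pair}
    relations = set(pairs)
    relations |= {(s, s) for s in stimuli}           # reflexivity: A = A
    relations |= {(b, a) for (a, b) in relations}    # symmetry: if A = B, then B = A
    changed = True
    while changed:                                   # transitivity: if A = B and B = C, then A = C
        new = {(a, d) for (a, b) in relations for (c, d) in relations if b == c}
        changed = not new <= relations
        relations |= new
    return relations


derived = equivalence_closure(taught) - taught  # relations that emerge without direct teaching
print(("A", "C") in derived)  # the transitive A-C relation emerges without teaching
```

Equivalence tests probe exactly these derived pairs: the seven relations in `derived` (reflexive, symmetric, and transitive/equivalence) are never reinforced directly, yet follow from the two taught pairs.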
The emergence of untrained relations has been explained from three different conceptual frameworks. Sidman (1994, 2000) viewed the formation of equivalence classes as a product of the reinforcement contingency. In other words, all members of the contingency (antecedent stimulus, response, and consequence) become equivalent under specific conditions. Consequently, sample and comparison stimuli become substitutable for each other when teaching conditional discriminations.3 Alternatively, Hayes et al. (2001) suggested that performance during equivalence tasks is a form of generalized responding under the control of contextual cues (e.g., the procedure itself) due to a history of reinforcement across multiple exemplars. Contextual cues specify the type(s) of relations among stimuli, as stimuli may have a variety of relations beyond sameness (e.g., temporal, opposite). How each set of stimuli should be related (i.e., relational frame) will depend on the contextual cues present during reinforcement. This approach became known as relational frame theory. Finally, Horne and Lowe (1996) posited that equivalence class formation (at least for verbally sophisticated organisms; Miguel, 2018) is highly dependent on both speaker and listener behaviors in that sample stimuli occasion speaker behavior (i.e., tacts and intraverbals), the product of which results in listener behavior in the form of comparison selection. This approach became known as the (bidirectional) naming hypothesis (see Miguel, 2016). As all three approaches are based on operant principles to explain emergent responding, the current review includes studies targeting individuals with autism spectrum disorder (ASD) and equivalence regardless of their theoretical account.
McLay et al. (2013) evaluated whether children with ASD could form equivalence classes via EBI, examining (a) which specific characteristics contribute to its success, (b) which procedural variables (e.g., independent variables, assessments) affect equivalence class formation, and (c) which areas need further investigation. They reviewed nine studies with 49 participants. Targeted skills included learning coin values, identifying states and their capitals, and following picture activity schedules, along with some “proof-of-concept” examples (rather than teaching skills; e.g., Eikeseth & Smith, 1992). McLay et al. noted inconsistent reporting of both the prerequisites for and the most effective procedural permutations for forming equivalence classes. However, McLay et al. only included studies that applied Sidman's (1994, 2000) conceptual framework for interpreting emergent relations.
Recently, Gibbs and Tullis (2021) reviewed studies on emergent responding for individuals with ASD and developmental disabilities from different theoretical perspectives. Gibbs and Tullis sought to (a) investigate the current evidence of emergent relations using a quality outcome tool, (b) determine whether certain learner characteristics lead to emergent responding, (c) specify assessment tools for identifying prerequisite skills, and (d) evaluate instructional procedures correlated with an increased likelihood of demonstrating emergent responding. Most participants formed equivalence classes, and Gibbs and Tullis concluded that participants' age and learning history may have been associated with a higher probability of equivalence class formation. However, there is no established standard for reporting prerequisite skills in equivalence research (e.g., Lee et al., 2015). Gibbs and Tullis also reported various instructional strategies to produce equivalence classes such as matching-to-sample, tact training, and stimulus pairing. Finally, Gibbs and Tullis completed the Single Case Analysis Research Framework (Ledford et al., 2020) to evaluate each study's methodological rigor. Generally, they found the overall quality of studies to be poor due to insufficient data or lack of procedural fidelity.
Several methodological variations in basic research on stimulus equivalence have been shown to influence equivalence outcomes (Arntzen, 2012), including observing responses (Perez et al., 2020), errorless teaching (Schilmoeller et al., 1979), number of comparisons (Carrigan & Sidman, 1992), number of classes and members (Arntzen & Holth, 2000), stimulus modality (Green, 1990), and performance criteria (Arntzen, 2012). The effect of specific training structures, which specify the number and function of the stimuli (i.e., nodes) that interrelate baseline relations, is one of the most studied procedural parameters in stimulus equivalence research (Ayres-Pereira & Arntzen, 2021). Structures that involve a single node (i.e., many-to-one and one-to-many structures) are more predictive of equivalence class formation than structures with more than one node (i.e., linear series; Arntzen & Holth, 1997). Moreover, the protocol in which baseline relations are individually taught to mastery, followed by sequential tests of emergent relations (i.e., simple-to-complex), appears to be the most efficacious (Adams et al., 1993; Fields et al., 1997).
It remains unclear whether the best basic and translational research practices are consistent with applied contexts where students learn responses to socially significant material (Brodsky & Fienup, 2018). Arntzen's (2012) preliminary work on identifying the specific procedural parameters necessary for equivalence class formation in the human-operant laboratory may serve as a preliminary guide. For example, Arntzen (2004) found the linear series training structure was the least effective at producing equivalence outcomes. Conversely, several applied studies have demonstrated success with the linear series structure (e.g., Dixon, Stanley, et al., 2017). Thus, identifying the procedural variables that produce optimal equivalence outcomes in clinical settings seems warranted.
Moreover, given the increase in EBI research (Gibbs & Tullis, 2021), it seems important to identify any associations between specific procedural parameters and positive equivalence outcomes with individuals with ASD. For example, Fienup et al. (2015) compared the simple-to-complex and simultaneous training protocols across three- and four-member classes with neurotypical adults. Three-member classes produced larger effects (i.e., more participants passed the first test, and less time was needed to form equivalence classes) than four-member classes. Conversely, Haydu and de Paula (2008) found that the type of teaching arrangement (e.g., many-to-one vs. one-to-many protocols) affected whether the number of members per class (e.g., three vs. four) differentially influenced equivalence class formation. Thus, identifying the most efficient training structure may assist applied researchers in selecting which parameter permutations to adopt when designing EBI. The primary purpose of the present paper was to update and extend the McLay et al. (2013) review by incorporating all conceptual frameworks of equivalence to evaluate whether any specific variables or combinations of variables are more likely to produce equivalence class formation.
Our review differs from existing reviews in three ways. First, we calculated the proportion of participants who passed equivalence tests without remedial teaching to identify any associations between specific procedural parameters and successful equivalence class formation. Second, given that a majority (72%) of Board Certified Behavior Analysts report working with individuals with ASD (Behavior Analyst Certification Board, n.d.), we framed this discussion as a call to applied researchers, who could then systematically investigate different variables and share their findings as a guide for practitioners. Third, we included all equivalence studies from various conceptual frameworks following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses model (Moher et al., 2009; see respective checklist in Supporting Information).
METHOD
Inclusion criteria
Studies had to meet the following criteria to be included: (a) published in English in an academic, peer-reviewed journal, (b) at least one participant with a formal diagnosis of ASD as explicitly stated by the authors, (c) objective measures of transitive or equivalence relations and/or transfer or transformation of function, (d) interpretations of emergent relations that corresponded to an equivalence relation (e.g., frames of coordination in relational frame theory), and (e) used nonarbitrary stimuli (described more below). Any study with one or more individuals with ASD was included, but only those participants with an ASD diagnosis were reviewed. For example, if there were three participants in a study but only one had ASD, then the study was included but only the information related to the individual with ASD was reviewed for analysis. In addition, descriptions such as diagnoses of intellectual and developmental disabilities, pervasive developmental disorder, or ambiguous phrasing (e.g., they attended an autism school) were not accepted. Furthermore, because the main purpose of our review was to identify relevant areas for applied researchers, we only included studies that performed EBI using relevant or “meaningful” stimuli for the individual. Meaningful stimuli varied across participants; for example, learning to identify different trees and leaves was considered meaningful for one child because the child lived near farms, had curriculum goals to learn to identify trees and plants, and both the child and the parents reported the target as desirable (e.g., Arntzen et al., 2014). We excluded studies if they (a) did not test for equivalence or derived relations, (b) included frames other than coordination (i.e., frames that do not denote equivalence relations, e.g., opposition, hierarchical), (c) provided an unclear or vague identification of an ASD diagnosis (e.g., a developmental delay), or (d) used abstract stimuli (i.e., meaningless symbols or words).
Search procedures
The second author identified and screened articles in the PsycINFO and Education Resources Information Center academic databases, initially in Spring 2018 and again during March–April 2022 (to update the search through December 2020). Figure 1 displays a representation of the search methods. A search was conducted using All Text as the field option, with no restriction on publication start date, through December 2020, combining “autism spectrum disorder” OR “ASD” with each of the following terms: “equivalence relations,” “derived responding,” “stimulus equivalence,” “relational responding,” “relational responses,” “derived relations,” “reversibility,” “symmetry,” and “transitivity.” These search terms were identical to those used by McLay et al. (2013). Based on our initial search of the databases, we identified 450 possible articles.

Additionally, we searched the following journals: Journal of Applied Behavior Analysis, The Psychological Record, Research in Developmental Disabilities, Research in Autism Spectrum Disorders, and The Analysis of Verbal Behavior. We selected these journals given their history of publishing works on either stimulus equivalence studies or involving individuals diagnosed with ASD. On each journal's homepage, the keywords “equivalence relations” OR “stimulus equivalence” were used, as these were the search terms used by McLay et al. (2013). The journal search terms did not include all the database search terms to reduce redundancy. The results from the journal searches yielded 858 possible articles, five of which were not identified by the database search parameters.
After identifying the preliminary studies and removing duplicates across the database and journal searches (n = 280), three researchers independently screened each remaining study to determine whether each met the inclusion criteria. A total of 1,028 articles were screened by title and abstract for inclusion, of which 934 were excluded due to not meeting the inclusion criteria described above. If it was not clear from the title or abstract whether an article met all the inclusion criteria, the article was automatically included for a full-text eligibility assessment to ensure screening of the greatest number of articles possible. A total of 94 articles underwent full-text eligibility review. The same researchers completed an ancestry search from references of articles that met full-text eligibility, which resulted in full-text eligibility screening of seven additional articles. A total of 71 articles that did not meet the criteria for full-text eligibility assessment were excluded with reason (see Figure 1). Overall, 30 studies were included in our analysis, seven of which were not included in previous reviews, with 93 participants and 100 applications (see Table 1). In some cases, applications of EBI were counted, as some studies included more than one experiment (e.g., Experiment 1 used a many-to-one structure, and Experiment 2 used a one-to-many structure; Arntzen et al., 2010).
Study | Number of participants | Age (years) | Content taught | % of participants who passed derived test without remediation
---|---|---|---|---
Arntzen et al. (2010)1 | 1 | 16 | Piano Chords | 100 |
Arntzen et al. (2014)2 | 1 | 17 | Trees/Leaves | 100 |
Daar et al. (2015)2 | 3 | 10–11 | Person Place | 100 |
Dixon et al. (2016)2 | 2 | 13 & 15 | Geometry | 100 |
Dixon, Belisle, Stanley, Munoz, et al. (2017)2 | 2 | 10 | Gustatory | 100 |
Dixon, Belisle, Stanley, Speelman, et al. (2017)2 | 1 | 8–9 | Shape, Letters, Colors | 100
Dixon, Stanley, et al. (2017)2 | 2 | 9 & 12 | Geography | 100 |
Dunne et al. (2014)2 | 9 | 3–5 | Un/Familiar Pictures | 33 |
Fairchild et al. (2020) | 2 | 5 & 7 | Letter Phonemes | 50 |
Groskreutz et al. (2010)1 | 6 | 4–18 | Line Drawings | 100 |
Hill et al. (2020) | 4 | 11 | Musical Notes | 100 |
Keintz et al. (2011)1 | 2 | 2 & 6 | Coins | 50 |
LeBlanc et al. (2003)1 | 2 | 6 & 13 | States & Capitals | 100
Lee et al. (2015)2 | 4 | 3–5 | Dog Breeds | 50 |
May et al. (2013)2 | 3 | 6–11 | Fictitious Characters | 100 |
McKeel & Matas (2017)2 | 3 | 23, 24, 63 | Edibles & Pictures | 100 |
McLay et al. (2016)2 | 10 | 4–11 | Numbers | 50 |
Miguel et al. (2009)1 | 2 | 6 | Preferred Activities | 100 |
Noro (2005) | 1 | 5 | Photos, Drawings | 100 |
Omori et al. (2011) | 4 | 13–17 | Spelling, Reading | -- |
Omori & Yamamoto (2013)2 | 3 | 11–14 | Kanji Characters | 100 |
Rosales et al. (2014)2 | 2 | 5 & 6 | Pictures, Text | 0 |
Sprinkle & Miguel (2012) | 4 | 5–7 | Spanish | 50
Stanley et al. (2018)2 | 3 | 13–18 | Science, Math, History | 67 |
Still et al. (2015)2 | 11 | 4–12 | Pictures, Text | 91 |
Stromer et al. (1996) | 1 | 41 | Pictures, Text | 100 |
Tanji et al. (2013)2 | 3 | 9–11 | Pictures, Text | 100 |
Varella & de Souza (2015)2 | 1 | 3 | Letters | 100 |
Walsh et al. (2014)2 | 2 | 5 & 6 | Animals | 0 |
Yorlets et al. (2018) | 1 | 10 | States | 100 |
- 1 Studies included in McLay et al. (2013) review.
- 2 Studies included in Gibbs and Tullis' (2021) review.
- -- indicates no individual data for % of passing.
Data classification
Study characteristics
Two reviewers independently reviewed each article and entered the corresponding data into an electronic spreadsheet. We summarized the following independent variables in a manner similar to that used by McLay et al. (2013) and Gibbs and Tullis (2021): (a) participant demographics (chronological age, reported sex), (b) skills taught (e.g., learning Spanish words), (c) developmental or language assessments (e.g., Verbal Behavior Milestones Assessment and Placement Program; Sundberg, 2008; Peabody Picture Vocabulary Test; Dunn & Dunn, 2007), (d) teaching procedure(s) and training structure(s), and (e) main findings/emergent responses (i.e., number of participants who passed derived relations tests [transitivity, equivalence, or transfer/transformation of function with and without remediation]). Additionally, we added the following variables unique to our review: (a) teaching parameters (i.e., teaching protocol, number of comparisons, number of classes, and number of members in each class), (b) procedural variations (e.g., observing response, errorless teaching, response topography, passing criteria), and (c) testing of generality and maintenance of derived relations.
Coded variables
We further coded the following variables for each participant for the purpose of conducting the data analysis: (a) training structure (i.e., linear, one-to-many, or many-to-one structure); (b) teaching protocol4 (i.e., simultaneous, simple-to-complex, or complex-to-simple procedure); (c) number of comparisons (i.e., one, two, three, or more) on the first teaching trial (i.e., if baseline conditions were with one comparison during the first six trials of the first block and then increased to three comparisons for the last six trials of the same block, the number of comparisons was coded as one); (d) number of classes (i.e., two, three, four, or more); (e) stimulus members per class (i.e., two, three, four, or more); (f) requirement of an observing response to the sample (i.e., yes or no); (g) errorless teaching (i.e., yes or no) if the first taught response to a comparison was errorless (i.e., 0-s prompt delay); (h) response topography (i.e., listener or speaker); (i) stimulus modality (e.g., visual–visual, auditory–visual, a combination of visual and auditory, or other [e.g., gustatory]); and (j) passing criteria (i.e., each study's passing criterion on derived relations tests, without remediation), which was categorized as 79% or less, between 80% to 89%, or 90% and above.
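The coding scheme above can be represented as a simple per-application record. The sketch below is hypothetical: the field names, value vocabularies, and example values are ours, chosen only to mirror the categories listed, not taken from the review's actual spreadsheet.

```python
# Hypothetical record mirroring the per-participant coding scheme described above.
from dataclasses import dataclass


@dataclass
class CodedApplication:
    training_structure: str        # "linear", "one-to-many", or "many-to-one"
    teaching_protocol: str         # "simultaneous", "simple-to-complex", or "complex-to-simple"
    comparisons_first_trial: int   # number of comparisons on the first teaching trial
    num_classes: int
    members_per_class: int
    observing_response: bool       # observing response to the sample required?
    errorless_teaching: bool       # True if the first taught response used a 0-s prompt delay
    response_topography: str       # "listener" or "speaker"
    stimulus_modality: str         # e.g., "visual-visual", "auditory-visual"
    passing_criterion: str         # "<=79%", "80-89%", or ">=90%"


# Example: one application coded from a hypothetical study
app = CodedApplication(
    training_structure="one-to-many",
    teaching_protocol="simple-to-complex",
    comparisons_first_trial=3,
    num_classes=3,
    members_per_class=3,
    observing_response=True,
    errorless_teaching=False,
    response_topography="listener",
    stimulus_modality="auditory-visual",
    passing_criterion=">=90%",
)
print(app.training_structure)
```

A typed record like this makes the later tallies (e.g., pass rates per training structure) a matter of filtering and counting rather than re-reading each article.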
Data analysis
Descriptive analysis
Due to the wide variability of procedural parameters used across studies, the dependent variables were (a) the percentage of studies that reported each aforementioned variable, (b) the percentage of participants who passed derived relations tests without remedial teaching (based on the original authors' reporting), and (c) the percentage of participants who passed with remedial instruction. Because passing criteria also varied widely across studies, we classified the percentage of participants who passed into three criterion categories (79% or less, 80% to 89%, or 90% and above).
We calculated percentages by dividing the frequency of studies (or applications) reporting each variable by the total number of studies (or applications) and multiplying it by 100. Similarly, we divided the number of participants who passed equivalence tests without remedial teaching by the total number of participants who experienced a particular procedural parameter and multiplied it by 100. Table 1 depicts some of the characteristics of each study and the data for the percentage of participants who passed derived tests for each study.
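As a sketch, the calculation reduces to a single ratio. The counts below (43 of 54 participants passing under the one-to-many structure) are taken from Table 2 for illustration; the function name is ours.

```python
# Sketch of the pass-rate calculation described above.
def percentage(count, total):
    """Percentage, rounded to the nearest whole number."""
    return round(count / total * 100)


# Participants who passed equivalence tests without remedial teaching,
# out of those exposed to a given procedural parameter (one-to-many structure, Table 2)
print(percentage(43, 54))  # 80
```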
Interrater agreement
Two raters independently compared their findings from the initial search of databases and journals for interrater agreement purposes. The overall agreement on the number of studies identified was 94%. We resolved disagreements as a group, and a final consensus of 100% was reached. During the screening phase, agreement was 100% for the inclusion of the same 30 studies. During data coding, the raters were divided into two groups (i.e., the first and second authors, and two research assistants), each responsible for half of the studies, so that 100% of studies were reviewed for interrater agreement. One rater from each group read and scored each study in that half across all the independent variables. Subsequently, a second independent rater from each group read the same half and scored each variable. The second author then calculated exact interrater agreement on every variable for all 30 studies. An agreement was defined as both raters scoring the same response for the same category (e.g., both reviewers scored a linear series structure). A disagreement was defined as the raters scoring different responses (e.g., many-to-one vs. one-to-many structure) for the same category and application. We then divided the number of agreements by the sum of agreements and disagreements and multiplied the result by 100. The mean interrater agreement was 96.2% (range: 87.5%–100%). We resolved disagreements as a group, and a final consensus of 100% was reached.
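The point-by-point agreement calculation above can be sketched as follows. The rater codes are hypothetical (made-up values for one coded variable), chosen only to illustrate the agreements/(agreements + disagreements) formula.

```python
# Sketch of the point-by-point interrater agreement calculation described above.
def interrater_agreement(rater1, rater2):
    """Agreements divided by (agreements + disagreements), times 100."""
    agreements = sum(a == b for a, b in zip(rater1, rater2))
    return agreements / len(rater1) * 100


# Hypothetical codes for one variable (training structure) across eight applications
r1 = ["LS", "OTM", "MTO", "OTM", "LS", "OTM", "MTO", "LS"]
r2 = ["LS", "OTM", "MTO", "MTO", "LS", "OTM", "MTO", "LS"]
print(interrater_agreement(r1, r2))  # 87.5 (one disagreement out of eight codes)
```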
RESULTS
Thirty studies met the inclusion criteria (see Table 1) including 93 participants and 100 applications. Given that some participants underwent multiple teaching procedures within the same study (e.g., one-to-many and many-to-one procedures, speaker and listener), each procedure was evaluated separately.
Descriptive data
Participant characteristics
We summarized chronological age, skills taught, and the percentage of participants who passed derived tests without remediation in Table 1. The age of participants ranged from 3 to 63 years (M = 10). Most participants were between 11 and 20 years old (73%), with the remainder being 6–10 (17%), 0–5 (6%), or 21 years old or older (3%). Sixty-nine participants (74%) were reported as male, and 17 (18%) were reported as female (no information was reported for seven participants, 8%).
Skills taught varied across studies and were mainly specific to participants' educational goals. The most common topics were academic skills such as reading (32%), math (16%), geography (12%), and leisure skills (12%; e.g., playing the piano). See Table 1 for a full list of skills taught in each study.
Assessments
A variety of assessments was reported to evaluate participants' baseline level of language (e.g., Peabody Picture Vocabulary Test; 36%, n = 11) and intelligence quotient (e.g., Wechsler Intelligence Scale for Children; Wechsler, 1949; 23%, n = 7). Four studies (13%) used the Promoting the Emergence of Advanced Knowledge (Dixon, 2014) curriculum or a combination of assessments that included the Verbal Behavior Milestones Assessment and Placement Program and similar alternatives (e.g., Assessment of Basic Language and Learning Skills–Revised; Partington, 2010; Wechsler Intelligence Scale for Children). For instance, Dixon et al. (2016) reported the scores from the Verbal Behavior Milestones Assessment and Placement Program and Promoting the Emergence of Advanced Knowledge assessments for participants learning to tact geometrical shapes and the number of sides of each shape. Finally, for 26% of studies (n = 8), assessment data were not reported.
Teaching parameters
Table 2 depicts the results for the various teaching parameters.
Independent variable | Category | Category | Category
---|---|---|---
Training structure | LS (n = 27): 63% (17/27) | *MTO (n = 11): 100% (11/11) | *OTM (n = 54): 80% (43/54)
Training protocol | SIM (n = 56): 75% (42/56) | *STC (n = 11): 82% (9/11) | CTS (n = 7): 100% (7/7)
Stimulus modality | A-V (n = 21): 90% (19/21) | V-V (n = 25): 64% (16/25) | A + V (n = 47): 79% (37/47)
Number of classes | 2 (n = 36): 78% (28/36) | 3 (n = 24): 92% (22/24) | 4 or more (n = 35): 69% (24/35)
Number of members | 2 (n = 3): 100% (3/3) | 3 (n = 74): 73% (54/74) | 4 or more (n = 21): 95% (20/21)
Errorless teaching | *Yes (n = 33): 79% (26/33) | No (n = 50): 80% (40/50) |
Observing response | *Yes (n = 34): 82% (28/34) | No (n = 42): 79% (33/42) |
Number of comparisons | 1 (n = 6): 67% (4/6) | 2 (n = 14): 57% (8/14) | *3 or more (n = 70): 83% (58/70)
Passing criteria | 90% or above (n = 43): 74% (32/43) | 80%–89% (n = 42): 88% (37/42) | 79% or less (n = 10): 60% (6/10)
Response topography | Speaker (n = 12): 75% (9/12) | Listener (n = 84): 79% (66/84) | Mix (n = 2): 100% (2/2)
- Note. Percentages are out of the total number of participants who underwent that parameter and not the total number of participants reviewed. LS = linear series, MTO = many-to-one, OTM = one-to-many, SIM = simultaneous, STC = simple-to-complex, CTS = complex-to-simple, A = auditory, and V = visual.
- * Indicates recommendations from basic laboratory studies with neurotypical adults.
Training structure
The most common training structure was the one-to-many procedure (47%, n = 14), followed by the linear series (26%, n = 8), then a mix of one-to-many and many-to-one procedures (13%, n = 4), and finally many-to-one procedures (7%, n = 2). For example, Dixon, Belisle, Stanley, Speelman, et al. (2017) used a linear series structure to teach shapes, letters, and colors, whereas Lee et al. (2015) used a one-to-many structure to teach categories of dog breeds. Two studies (7%) either did not report the training structure or used structures other than linear series, one-to-many, and many-to-one types. To illustrate, Noro (2005) taught AB relations to mastery, followed by CA relations, and then tested for emergent relations. The results indicated that 100% (11 out of 11), 80% (43 out of 54), and 63% (17 out of 27) of participants passed equivalence tests without remedial instruction when exposed to many-to-one, one-to-many, and linear series training structures, respectively (Table 2).
Teaching protocol
The most frequently used teaching protocol was the simple-to-complex procedure (40%, n = 12), followed by simultaneous (30%, n = 9) and then the complex-to-simple procedure (13%, n = 4). For example, Arntzen et al. (2014) used the simultaneous protocol by teaching printed names of trees to pictures of trees and pictures of leaves. Symmetry (pictures of trees to printed names and pictures of leaves to printed names) and equivalence relations (pictures of trees to pictures of leaves and pictures of leaves to pictures of trees) were tested by interspersing taught, symmetrical, and equivalence trials in the same block. Daar et al. (2015) used a simple-to-complex protocol to relate pictures of community helpers (e.g., teachers) to pictures of community locations (e.g., classrooms) and printed job functions (e.g., teach kids). The researchers tested for symmetry (e.g., picture of the location to picture of community helper) before testing for equivalence (e.g., picture of the location to printed function). Finally, five studies (17%) did not report or provide enough information to determine the type of protocol. Table 2 shows that 100% of participants (seven out of seven) passed equivalence tests without remedial teaching when using complex-to-simple protocols, whereas 82% (nine out of 11) and 75% (42 out of 56) of participants did so for the simple-to-complex and simultaneous protocols, respectively.
Number of comparisons
Three comparisons were used most frequently during initial baseline trials (67%, n = 20), followed by one comparison (13%, n = 4) and two comparisons (10%, n = 3). Dixon, Belisle, Stanley, Speelman, et al. (2017) presented three comparisons, including a picture of a triangle, square, and circle, in the presence of an auditory sample, “Which goes with the triangle?” Uniquely, Arntzen et al. (2010) used serialized trials to teach piano chords in which the number of comparisons started at one, then increased to two, and finally to three comparisons by the fifth block. Last, two studies did not report how many comparisons were presented during teaching trials. The highest proportion of participants passed equivalence tests without remedial instruction when studies used three or more comparisons (83%, 58 out of 70), compared with one (67%, four out of six) or two comparisons (57%, eight out of 14; Table 2).
Number of classes and members per class
Most studies reported having four or more classes (40%, n = 12), followed by three classes (30%, n = 9) and two classes (27%, n = 8); one study did not report on this parameter. The highest proportion of participants passed equivalence tests without remedial instruction when studies involved three classes (92%, 22 out of 24), compared with two (78%, 28 out of 36) and four or more classes (69%, 24 out of 35). In addition, most studies reported three-member classes (70%, n = 21), and nine studies (30%) had four or more members. For example, Arntzen et al. (2014) taught a participant four, three-member equivalence classes consisting of names and pictures of trees and leaves. In contrast, Omori et al. (2011) taught participants two, four-member classes consisting of printed English words and pictures to teach spelling and reading. Moreover, 100% (three out of three), 95% (20 out of 21), and 73% (54 out of 74) of participants did not require remedial teaching when studies involved two-member, four-or-more-member, and three-member classes, respectively (Table 2). For instance, Dunne et al. (2014) taught three, two-member classes relating familiar and unfamiliar pictures to nine children with ASD. After learning the AB relation, eight out of nine children passed symmetry and equivalence tests, although the number of trials required varied. The researchers also examined whether Verbal Behavior Milestones Assessment and Placement Program scores correlated with the number of trials needed to meet the passing criterion, but the relation was statistically nonsignificant.
Stimulus modality
The stimulus modality most often used was a combination of auditory and visual (45%, n = 13), followed by visual–visual (35%, n = 10), auditory–visual (10%, n = 3), and other (e.g., gustatory; 13%, n = 4) modalities. For example, Keintz et al. (2011) taught two participants to identify coins and their corresponding values using both auditory–visual (e.g., dictated name to actual coin) and visual–visual (e.g., actual coin to printed price) modalities. In contrast, Walsh et al. (2014) presented only visual stimuli (pictures of animals, printed English and Irish words of animals) in teaching and testing of equivalence relations. Ninety percent (19 out of 21) of participants did not require remedial teaching with the auditory–visual stimulus modality, compared with 79% (37 out of 47) and 64% (16 out of 25) when the modality was a combination of auditory and visual stimuli and when it was visual–visual, respectively (Table 2).
Procedural variations
Observing response
Nearly half of the studies required an observing response to the sample (47%, n = 14), whereas 33% (n = 10) did not. For example, Daar et al. (2015) presented visual comparison stimuli followed by an immediate auditory sample (e.g., "Which goes with the doctor?") before participants selected a comparison. Six studies did not report whether an observing response was required. A slightly higher proportion of participants passed equivalence tests without remedial teaching when the observing response was included (82%, 28 out of 34) versus when it was not (79%, 33 out of 42; Table 2).
Errorless teaching
More than half of the studies did not use errorless teaching procedures (53%, n = 16), whereas 11 studies did (37%). Three studies did not provide enough information to determine whether they used errorless teaching strategies. Tanji et al. (2013) did not use errorless teaching, as they provided a token and approving feedback for correct responses and corrective feedback for incorrect selections from the onset of teaching baseline relations. In contrast, Groskreutz et al. (2010) incorporated a progressive prompt delay, beginning with a 0-s prompt to select an item and progressing to 1-, 2-, 3-, 4-, and 5-s delays. The prompt delay increased following two consecutive nine-trial blocks with 89% or higher correct responding. Approximately the same percentage of participants did not require remedial teaching when errorless teaching was included (79%, 26 out of 33) as when it was not (80%, 40 out of 50; see Table 2).
Response topography
Three studies (10%) used both speaker and listener instruction, and another two studies (6%) used speaker instruction only. For example, Sprinkle and Miguel (2012) taught sets of three, three-member classes of kitchen items. For one set, the taught response was pointing to the correct comparison (i.e., listener); for another set, participants were taught to tact the comparisons (i.e., speaker). The remaining studies (83%, n = 25) taught listener responses (i.e., auditory–visual conditional discriminations), which consisted of pointing to the correct comparison. Both participants (100%, two out of two) passed without remedial teaching when researchers used both speaker and listener instruction, relative to 79% (66 out of 84) and 75% (nine out of 12) of participants taught via listener and speaker instruction, respectively (Table 2).
Main findings
Passing criteria
Passing criteria for derived relations tests varied across studies, ranging from 90% or above (47%, n = 14) to 75% or above (3%, n = 1). Twelve studies (40%) reported a criterion between 80% and 89%. Three studies did not report their passing criterion. Only one study (Tanji et al., 2013) reported a passing criterion of 75% or above across two posttests. It is important to note that even when passing criteria were similar, some studies had additional requirements that changed the degree of stringency. To illustrate, McKeel et al. (2017) set a passing criterion of 90% correct across three consecutive 10-trial blocks, whereas McLay et al. (2016) set a criterion of 100% correct on three consecutive trials across 2 days or on three out of four consecutive trials.
Tests of equivalence relations
The most common derived relations tests were transitivity or equivalence and symmetry tests (80%, n = 24), followed by equivalence tests only (10%, n = 3) or transfer-of-function tests only (7%, n = 2). For example, Stromer et al. (1996; Study 2) tested the accuracy of emergent responses after individuals learned to match printed words to pictures and pictures to printed words, to write the names of pictures and objects, and to select objects based on written words. During the transfer-of-function test, the individual wrote the names of visible sample items on a list, matched the printed names to comparison items on a shelf in another room, and then retrieved those items from the shelf to take back to the first room. One study (Omori et al., 2011) did not test for symmetry, transitivity, or equivalence after teaching students that a picture of an animal (e.g., a camel) was the same as the printed word, with a dictated word presented as a consequence for correct comparison selection. However, the researchers tested for other derived responses, such as the participant providing a written response in the presence of an animal picture.
Based on the specific passing criteria used across studies, 34% of participants (n = 43) passed when the criterion was 90% or higher, 88% (n = 42) passed when the criterion was 80%–89%, and 60% (n = 10) passed when the criterion was less than 79%; these data were not reported for five applications.
Number of participants who passed transitivity/equivalence
Overall, most participants passed transitivity and/or equivalence tests the first time (77%, n = 77), whereas 21% (n = 21) did not. Data for two participants were not provided. Fifteen participants received remedial teaching, and 73% of those (11 out of 15) then passed derived relations tests after remediation (e.g., Walsh et al., 2014).
Remediation type
The most frequently reported remediation strategy was tact instruction (n = 4, 44%), followed by reteaching baseline relations (n = 3, 33%), multiple exemplar training (n = 1, 11%), and continued exposure (n = 1, 11%). For example, for two of nine participants who failed equivalence tests, Dunne et al. (2014) retaught baseline relations and directly taught selection responses using a new set of stimuli. For four participants, remediation consisted of reteaching baseline relations using the same stimuli.
Generalization and maintenance
Only 23% (n = 7) of studies assessed stimulus or response generalization, and only 20% (n = 6) conducted maintenance probes. Tanji et al. (2013) tested for both generalization and maintenance of reading and spelling by presenting new spelling and reading sets and probing 3–4 weeks after the initial sets were learned. Dixon, Belisle, Stanley, Speelman, et al. (2017) tested the generalization of category names (e.g., shapes, letters) as intraverbal responses following the establishment of equivalence classes. Finally, Groskreutz et al. (2010) tested for maintenance of identifying line drawings or photos and printed words, for two participants, at 1- and 4-month follow-ups.
DISCUSSION
Consistent with previous reviews (Gibbs & Tullis, 2021; McLay et al., 2013), results from the current paper support the conclusion that individuals with ASD can form equivalence classes with certain instructional practices in place. Specifically, our data suggest that individuals taught with one of the following permutations had the most efficacious outcomes: a many-to-one training structure, a complex-to-simple teaching protocol, three or more comparisons, three or more classes with four or more members each, an observing response, and both speaker and listener response topographies. However, like McLay et al. (2013), we found that despite differences in the overall proportion of participants who passed equivalence tests, no outcome differences could be conclusively attributed to any specific variable (e.g., training structure, training protocol).
Similar to Gibbs and Tullis (2021), we caution that our results should be interpreted in light of the limited methodological rigor of some EBI studies. We attempted to apply the Single-Case Analysis and Review Framework (SCARF) to evaluate the quality and rigor of EBI studies; however, we were less concerned with the teaching of baseline relations and most interested in whether participants formed equivalence classes and under what conditions. A robust analysis was difficult to complete given the types of experimental designs and probe data frequently used in EBI studies. For example, studies consistently scored low on the minimum number of data points (e.g., probe data) or demonstrations of effect required for a rigorous study (e.g., a two-tier multiple-baseline design; data available upon request). Thus, more rigorously designed EBI studies (e.g., experimental designs) are warranted.
When examining each variable in isolation, our findings are somewhat consistent with the basic literature recommendations for best practices (see Arntzen, 2012). However, there is variability across teaching procedures within the applied EBI literature with individuals with ASD. For example, researchers incorporated differing emergent passing criteria across studies (75%–90%), making it difficult to identify which specific variables or combination of variables may predict equivalence class formation.
Another area of caution related to the interpretation of our results is the inconsistent application of procedural parameters across studies. Consequently, we were unable to conduct statistical modeling that would produce meaningful findings (i.e., a lack of statistical power makes it unlikely to identify small but potentially relevant interactions between parameters [an increased probability of Type II errors], and certain variables had a small number of cases and limited data). Thus, another option was to analyze each variable separately (e.g., with a chi-square test of independence) to investigate the association between passing equivalence tests and each procedural parameter. Although such analyses may produce interesting results, their utility may be limited. In research and practice, procedural parameters are not implemented independently of one another (Fienup et al., 2015). Thus, simply testing for significant associations involving individual procedural components may not capture the meaningful interactions between these parameters and does not correspond to real-world applications (e.g., how do the passing criteria interact with the training structure or protocol type?). Another possibility was to select certain combinations of procedural parameters hypothesized to predict passing equivalence tests and to conduct more focused analyses of how they influence treatment outcomes. However, as mentioned above, the threat of Type II errors remains a concern. Therefore, more data are needed to conduct statistical analyses that can fully capture the influence and interactions of independent variable combinations. Researchers may also consider standardizing emergent-test passing criteria to increase the likelihood of detecting variables that influence treatment outcomes.
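To make the single-parameter analysis concrete, the following stdlib-only Python sketch computes a chi-square test of independence for one such 2 × 2 comparison, using the observing-response counts reported earlier (28 of 34 participants passed with an observing response, 33 of 42 without). This is an illustrative calculation only, not an analysis conducted in the reviewed studies:

```python
# Chi-square test of independence for a 2x2 contingency table (stdlib only).

def chi_square_2x2(table):
    """Return the chi-square statistic for a 2x2 contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (table[i][j] - expected) ** 2 / expected
    return stat

# Rows: observing response required / not required
# Columns: passed without remediation / did not pass
table = [[28, 6],
         [33, 9]]

stat = chi_square_2x2(table)
# With 1 degree of freedom, the .05 critical value is 3.84;
# a statistic below that value is nonsignificant.
print(f"chi-square = {stat:.3f}")
```

Consistent with the concern raised above, such a test on a single parameter with so few cases is underpowered, and a nonsignificant result here says nothing about how the parameter interacts with others.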
The future of EBI research
Applied researchers should focus on systematically evaluating the parameters studied by basic researchers (Arntzen, 2012) within experimentally rigorous designs. Arntzen (2012) systematically identified some parameters within the basic literature that may be more likely to lead to equivalence class formation. However, those studies incorporated mainly arbitrary stimuli with neurotypical adults, so it is unclear whether the same parameters would be relevant with more “meaningful” stimuli or with persons with developmental disabilities. Therefore, applied researchers should build on these findings with more meaningful stimuli to determine whether the same effects occur within applied studies. Researchers can then investigate which combination(s) of variables may lead to optimal outcomes. Moreover, parametric or component analyses could elucidate the effect of combined or individual procedural variables on equivalence class formation.
Applied researchers should consistently report demographic data, assessment results, preexisting skills, and verbal repertoires for individuals with ASD (McLay et al., 2013). Autism spectrum disorder is commonly associated with health or developmental challenges, so reporting this information is relevant when interpreting the generality of findings. Only a few studies reported results from standardized assessments, and the assessments used varied, despite O'Donnell and Saunders (2003) advocating for consistency in assessment protocols to facilitate comparisons of findings across studies. By using reliable and comparable measures, we can better understand the skills that are necessary for forming equivalence classes. For example, children scoring in Level 6 of the Assessment of Basic Learning Abilities-Revised (Kerr et al., 1977; i.e., those who could perform auditory–visual discriminations) consistently passed equivalence tests (de Melo Wider et al., 2020; Vause et al., 2005), suggesting that auditory–visual discriminations may be a prerequisite to forming equivalence classes.
Further, comparisons of different groups of individuals with ASD could also provide information on the verbal repertoires required for emergent performance. By controlling for participant demographics or characteristics across groups, researchers can systematically manipulate independent variables and compare findings. For example, Devany et al. (1986) compared three groups—neurotypical preschoolers, children with intellectual and developmental disabilities who independently used speech or signs, and children with intellectual and developmental disabilities with no formal communication method—to assess the role that verbal behavior may play in the formation of equivalence relations. Participants with more advanced verbal behavior passed equivalence tests more frequently than those with less advanced verbal behavior. These findings, along with single-case data (e.g., Lee et al., 2015), suggest that both listener and speaker behaviors (i.e., bidirectional naming; Miguel, 2016, 2018) are associated with positive equivalence outcomes.
Some limitations of the current review should be mentioned. Emergent-test criteria (from 75% to 90%, with varying numbers of trials to criterion) and training protocols (e.g., simple-to-complex) were inconsistent across studies. We categorized "passing" emergent tests based on each study's reporting. The stringency of emergent criteria has been found to greatly affect testing performance (Bortoloti et al., 2013); had a standard criterion been used across all studies, a chi-square test might have yielded clearer outcomes. Additionally, the variability in training protocols across studies required us to categorize them based on the testing phase only. Types of training protocols may have differing effects on how participants perform on emergent tests (Arntzen, 2012), so this inconsistency precludes us from drawing clear conclusions.
As the focus of our review was to synthesize the EBI literature for individuals with ASD to guide applied research, we included only individuals with ASD and excluded individuals with intellectual or developmental delays (with no diagnosis of ASD). We recognize that EBI likely has utility for these populations as well; however, we sought to focus our analysis on evaluations of persons with ASD. Future reviews could compare EBI procedures for individuals with diverse developmental delays to determine how delays in certain areas may (or may not) influence the emergence of equivalence classes. A final consideration across all three published EBI reviews is the exclusion of unpublished studies, which raises the potential for publication bias because published studies are more likely to report successful outcomes (Tincani & Travers, 2019). Publication bias may have minimized or eliminated cases in which individuals with ASD did not form equivalence classes. This information would be invaluable in helping researchers identify the skills or procedural parameters necessary to promote emergent performance, so we recommend the inclusion of unpublished studies in subsequent reviews.
In closing, EBI has growing support as a teaching methodology to produce emergent responding for learners with ASD. Although our review attempted to include only studies with “meaningful” stimuli, in some cases it may be difficult to determine what is socially significant for certain individuals. To that end, minimal attention has focused on the social validity of EBI procedures (Wolf, 1978). Applied researchers should incorporate social validity measures and collaborate with educators and caregivers when developing EBI to determine the acceptability and relevance of the targets and procedures for each individual (Gibbs & Tullis, 2021).
Although we believe that EBI should be adopted to promote novel responding across various topics for learners with developmental delays, such as ASD, we have yet to generate clear recommendations for practitioners on how to implement EBI successfully. Thus, any attempt to recommend a specific set of parameters should be viewed as premature (e.g., Maguire & Allen, 2022). The precise procedural parameters required to promote the most efficient responding for individuals with developmental disabilities remain largely unknown. This uncertainty may discourage clinicians from adopting EBI technology if they do not know how to design lessons that are most likely to produce positive equivalence outcomes. In the absence of clear data suggesting which parameters should be incorporated, those using EBI in clinical settings run the risk of selecting a combination of variables that may be ineffective, inefficient, or both. Until more research on specific procedural parameters is published, practitioners should proceed with caution and seek consultation or supervision from individuals who are proficient with the vast stimulus control literature before designing curricula.
ACKNOWLEDGMENTS
We would like to thank A. J. Guarino for his assistance with data analyses, as well as Shannon Luoma, Vanessa Lee, and Svea Love for their assistance with data collection.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
ETHICS APPROVAL
No human or animal participants were used in the production of this review article.