Cue Reliance in L2 Written Production
We would like to thank the anonymous Language Learning reviewers and the editors for their valuable comments, which helped us to considerably improve the quality of this article.
Abstract
Second language learners reach expert levels in relative cue weighting only gradually. On the basis of ensemble machine learning models fit to naturalistic written productions of German advanced learners of English and expert writers, we set out to reverse engineer differences in the weighting of multiple cues in a clause linearization problem. We found that, while German advanced learners succeeded in identifying important cues, their assignment of cue importance differed from that of the expert control group. Even at advanced levels, learners are found to rely on a smaller set of perceptually salient cues than native speakers do, focusing on cues that exhibit relatively high cue availability and relatively low cue reliability. Our findings suggest that the principles of the Unified Model of first and second language acquisition, which have been extensively supported for comprehension, also underlie the written production of advanced second language learners.
Introduction
Experience-based (also known as emergentist or usage-based) models of language hold that using language hinges on the capacity to learn from environmental input (see Baayen, Hendrix, & Ramscar, 2013; Bod, 2009; Chang, Dell, & Bock, 2006; Daelemans & van den Bosch, 2005; Elman, 2009; McClelland et al., 2010; Ramscar & Gitcho, 2007). As the human brain continuously adjusts itself to sensory experience via the strengthening and weakening of connections among billions of neurons, humans capture the statistical regularities of the linguistic input they are exposed to and apply this knowledge in both language comprehension and language production. This capacity to detect statistical regularities is at work not only in the earlier stages of language acquisition but remains active throughout life (Chang et al., 2006; Farmer, Fine, & Jaeger, 2011), and it operates not only in the acquisition of a first language (L1) but also in that of a second language (L2) (MacWhinney, 2008, 2011). Linguistic knowledge, then, can be viewed as a byproduct of the formation of connections resulting from exposure to the probabilistic patterns underlying the linguistic input (see Ellis, 1999, and Wiechmann, Kerz, Snider, & Jaeger, 2013, for a recent compilation of developments in model architectures). Figure 1 sketches some of the crucial assumptions of this perspective.

Experience-based theories of language learning and processing assume that learners induce linguistic knowledge from input by way of complex automatic distributional analyses of linguistic exemplars (Bates & MacWhinney, 1987; Bod, 1998; Chang et al., 2006; Conway, Bauernschmidt, Huang, & Pisoni, 2010; Dell, Reed, Adams, & Meyer, 2000; Hunt & Aslin, 2010; Perruchet & Pacton, 2006; Saffran, Aslin, & Newport, 1996; inter alia; see Chang, Janciauskas, & Fitz, 2012, for a recent overview). This knowledge comprises a multitude of simple and complex associative relationships among its constitutive elements, and the distributional analyses of co-occurring features are guided by selective attention to cues in the input that enable the mapping of form–function relations during message comprehension. These cues vary in their informativeness, often expressed in terms of a cue's availability (i.e., how often the cue is present in the input), its reliability (i.e., how often the cue leads to the correct outcome), and derived notions. Language learning, in this view, can be modeled as an error-based implicit learning process in which deviations between expected forms and actually observed forms serve as error signals that cause the continuous adjustment of cue weights (Chang, 2002, 2009; Chang et al., 2006; Elman, 1990, 1993; Fitz, 2009; Fitz & Chang, 2008; Jaeger & Snider, 2013). In the most straightforward model architectures, comprehension and production employ a shared sequencing system so that the cue weights learned through the statistical analysis of the input serve as probabilistic constraints governing the learner's linguistic output in language production (Chang et al., 2006). Linguistic knowledge can hence be conceived of as a highly dynamic constraint satisfaction system, in which every experience with language affects the adjustment of cue weights (Seidenberg & MacDonald, 1999; see MacDonald, 2013, for a recent overview).1 L1 learning appears to be rational in the sense that it leads to “an end-state model of language that is a proper reflection of input and that optimally prepares speakers for comprehension and production” (Ellis, 2006, p. 164). L2 learning, on the other hand, is constrained by additional factors such as interference, overshadowing, or blocking, and additional probabilistic rules governing learner productions may be introduced to the system through explicit instruction, all of which may affect the relative weighting of individual constraints (Ellis, 2008). However, the basic mechanics of L1 and L2 learning are the same, as both L1 learners and L2 learners have to figure out which cues are most predictive for a given mapping (Ellis, 2008; MacWhinney, 2008, 2011).
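Because the notions of cue availability and cue reliability do much of the explanatory work in what follows, a minimal worked example may be useful. The counts below are invented for illustration and do not come from the study.

```r
# Toy illustration of cue availability and reliability (invented counts,
# not data from the study).
n_exemplars   <- 1000  # all relevant exemplars in the input
n_cue_present <- 150   # exemplars in which the cue is present
n_cue_correct <- 140   # cue present AND it signals the correct outcome

availability <- n_cue_present / n_exemplars    # how often the cue is there: 0.15
reliability  <- n_cue_correct / n_cue_present  # how often it is right: ~0.93

# A cue with this profile is rare but trustworthy (low availability, high
# reliability); a cue present in every exemplar but only probabilistically
# associated with the outcome shows the opposite profile.
```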
Consider, as a case in point, the positional choice that arises with adverbial clauses in complex sentences: the adverbial clause may either precede or follow its associated main clause, as in (1).
- (1a) Although Peter cannot afford it, he bought her a lovely gift.
- (1b) Peter bought her a lovely gift, although he cannot afford it.
In the vast majority of cases, both positional options are grammatical and the preferential use of one arrangement rather than the other is codetermined by multiple factors. While the exact causal dynamics are not yet fully understood, corpus-based investigations have identified various factors that demonstrably affect the ordering choice (e.g., Diessel, 2008; Wiechmann & Kerz, 2013). These include the relative length (syntactic weight) of the clausal constituents, semantic properties signaled through lexical choices of the subordinating conjunction, and discourse-level effects like bridging, that is, whether the adverbial clause (AC) connects to information from the preceding discourse as signaled, for example, by anaphoric items (see further details below).
In this study, we employ ensemble machine learning models to investigate how German advanced learners of English weigh a set of cues to solve a clause linearization problem in high-level linguistic action planning. In particular, by way of a comparative analysis of naturalistic productions of complex sentence constructions, we investigate in what ways (temporally unconstrained) offline sentence-level planning processes of German advanced learners of English differ from those of expert writers with respect to their reliance on specific cues.2 We assume that mismatches in constraint weighting are, at least partially, caused by properties of a cue that concern its learnability, such as its detectability, availability, and reliability, so that an assessment of the cue weights of learners permits inferences about general aspects of L2 learning. For example, there is evidence that in early phases of learning, both L1 and L2 learners focus on cues that are highly available and that high cue reliability becomes more important than cue availability only at later phases of learning (Matessa & Anderson, 2000; Taraban & Palacios, 1993). Following this pattern, advanced learners’ solutions to high-level language planning problems, such as the linearization of clausal constituents in complex sentences, should be governed more strongly than those of experts by highly available and detectable cues and less strongly by less available but highly reliable cues. Our approach sets out to reverse engineer the cue weights through a distributional analysis of the linguistic output, meaning that we reason from language productions to properties of the system underlying the productions.3 We do so by exploiting the connection between: (a) statistical regularities in the input, (b) internalized cue weights, and (c) statistical regularities in the output (as sketched in Figure 1).
It should be stressed that in this study we are interested in uncovering the types of informational sources that learners rely on during high-level linguistic action planning. We are not concerned with accurate language use. Of course, the assessment of what knowledge learners rely on is related to questions of accurate usage, but these two issues are logically independent of each other: The reliance on a specific informational source (lexical, structural, or discourse functional) does not imply the accurate calibration of the subregularities within that informational source.
Method
Corpus
The advanced learner data were retrieved from a total of 50 term papers produced by German students of English linguistics in their second and third year of study (∼216,000 words) and a same-sized expert control corpus of peer-reviewed articles appearing in various journals on language studies.4 The learner data were produced under comparable conditions (same amount of contact time spent in a previous seminar, same lecturer, and same formatting suggestions). The target constructions were identified by matching a set of subordinators (see below) in the two corpora, yielding a total of 1,471 data points. A total of 601 of these retrieved sentences were produced by experts. To maximize the comparability of results, we reduced the larger learner data set by taking a random sample that matched the size of the expert data set (Nlearner = 601; Nexpert = 601).
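The size-matching step can be sketched in a few lines of R; the object names (learner_data, expert_data) are ours, not the authors’.

```r
# Hypothetical sketch of the size matching described above: draw a random
# sample of learner sentences equal in size to the expert data set.
set.seed(42)  # any fixed seed, for reproducibility
idx <- sample(nrow(learner_data), size = nrow(expert_data))
learner_matched <- learner_data[idx, ]
stopifnot(nrow(learner_matched) == nrow(expert_data))  # N = 601 in both sets
```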
Corpus Annotation and Modeling of the Data
Each target construction was annotated for the following six variables:
- Semantic subtype
- ○ Causal ACs were distinguished from concessive ACs (following the classification in Quirk, Greenbaum, Leech, & Svartvik, 1985).
- ■ The preferred position of an AC is influenced by the semantic relation that holds between the situations described by the clausal elements. For example, causal clauses that are used to justify a belief tend to follow their main clause. Concessive clauses that are used to “set the stage” (Verstraete, 2004) tend to precede their main clause.
- Subordinator
- ○ Lexical realization of the subordinating conjunction: because, although, whereas, since, while, as, and even (subsuming even if and even though).
- ■ For each conjunction, there is a probability distribution describing the tendency for an AC headed by that conjunction to occur in sentence-initial position. This preference (or dispreference) is register-specific, that is, it may vary across situational contexts (Kerz & Wiechmann, in press).
- Proportional length of AC (syntactic weight)
- Structural complexity of AC
- ○ An AC was considered complex if it exhibited at least one embedded clause.
- ■ Deeper embedding increases the complexity of a clause. More complex constituents tend to follow simpler constituents (Diessel, 2008).
- Deranking of verb forms in AC
- ○ An AC was considered balanced if it was tensed (= finite). Participial, infinitival, or verbless clauses were labeled as “deranked” (Cristofaro, 2003).
- ■ Deranked ACs are more likely to precede their associated main clauses than balanced ACs.
- Bridging
- ○ An AC was taken to serve a bridging function if it included an anaphoric item that was co-referential with a noun phrase (NP) or clausal constituent of a preceding sentence (cf. Biber, Johansson, Leech, Conrad, & Finegan, 1999).
- ■ ACs that link informational units across sentences tend to precede their associated main clause (Verstraete, 2004).
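To make this annotation scheme concrete, a single data point might be coded as below. The column names and factor levels are our own shorthand; the paper does not spell out its coding conventions.

```r
# Hypothetical coding of one annotated sentence (names and levels are ours).
one_row <- data.frame(
  position     = factor("initial", levels = c("initial", "final")),  # response: AC preposed or postposed
  subordinator = factor("although"),                                 # lexical head of the AC
  sem_type     = factor("concessive", levels = c("causal", "concessive")),
  length_prop  = 0.35,   # proportional length of the AC (syntactic weight)
  complex      = TRUE,   # at least one embedded clause in the AC
  deranked     = FALSE,  # FALSE = balanced (tensed); TRUE = deranked
  bridging     = TRUE    # anaphoric link to the preceding discourse
)
```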
Our approach to understanding human cue weighting relies on techniques from machine learning, which as a class have been employed with great success in reproducing linguistic choice behavior, suggesting that probabilities of occurrence are somehow available to the human processing system (see Baayen, 2011, for a recent discussion). In principle, the assessment of cue weights can be approached via a variety of statistical and algorithmic techniques that generate estimates of effect size or variable importance. However, not all techniques are equally well suited for the task at hand. For example, variable importance estimates from regression models are more likely to suffer from the effects of correlating variables (multicollinearity; cf. Belsley, Kuh, & Welsch, 1980; for a discussion in a linguistic context, see Tagliamonte & Baayen, 2012; Wiechmann, 2012; Wiechmann & Kerz, 2013). While correlations among predictors are not harmful to the overall predictive power of a model, multicollinearity may lead to erratic changes in the corresponding expression of variable importance in regression models. As correlating predictors constitute the norm rather than the exception in linguistic contexts—in fact, strong correlations “probably provide exactly the redundancy that makes human learning of language data robust” (Baayen, 2011, p. 14)—regression is often not the most suitable analytical tool if the goal is the accurate estimation of variable importance. In our case, we can expect strong correlations between length and complexity, or between semantic type of AC and subordinator.
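The instability that multicollinearity induces in regression coefficients, and hence in any importance measure read off them, can be simulated in a few lines. This is our own toy demonstration, not an analysis from the study.

```r
# Two nearly collinear predictors: predictions stay stable, coefficients do not.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # x2 is almost a copy of x1
y  <- x1 + rnorm(n)

coefs <- replicate(5, {
  i <- sample(n, n, replace = TRUE)    # bootstrap resample of the data
  coef(lm(y[i] ~ x1[i] + x2[i]))[2:3]  # slopes for x1 and x2
})
round(coefs, 2)  # the two slopes swing erratically from resample to resample,
                 # although the effect they jointly carry is stable
```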
To assess cue weights as accurately as possible, we employed two techniques that are known to be less vulnerable to the distorting effects of multicollinearity: discrete adaptive boosting models with single-layer decision trees (Freund & Schapire, 1996) and random forest models with conditional inference trees (Strobl, Boulesteix, Kneib, Augustin, & Zeileis, 2008). Both techniques belong to the family of ensemble methods, that is, methods that do not rely on a single model but on many models. They differ, however, in the way they solve a classification task. Random forests can be viewed as a refined version of bootstrap aggregation (bagging), in which many decision trees are fit to resampled versions of the training data, and the final classification is arrived at by majority vote (Breiman, 1996, 2001; Hastie, Tibshirani, & Friedman, 2009). The random forest variant employed here has as its base model decision trees using recursive partitioning by conditional inference and expresses cue weights through a permutation variable importance measure (Hothorn, Hornik, & Zeileis, 2006), which adjusts for correlations between predictor variables and is not biased when predictor variables vary in their number of categories or scale of measurement.5 The random forest was set up in such a way that its constituent trees were allowed to be of arbitrary size as long as a variable to be included in the tree introduced a statistically significant split based on multiplicity-adjusted p values from a permutation test (see Strasser & Weber, 1999). Each classifier in the forest, or each conditional inference tree, is thus by itself a strong classifier.
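The paper does not name its software, but the cited method (Strobl et al., 2008; Hothorn et al., 2006) is implemented in the R package party; the following is a sketch under that assumption, reusing the hypothetical column names from the coding example above and assuming a data frame d of such rows.

```r
# Conditional inference forest with unbiased settings and conditional
# permutation variable importance (sketch; variable names are ours).
library(party)
cf <- cforest(position ~ subordinator + sem_type + length_prop +
                complex + deranked + bridging,
              data = d,  # the annotated data frame
              controls = cforest_unbiased(ntree = 1000, mtry = 3))
vi <- varimp(cf, conditional = TRUE)  # adjusts for correlated predictors
sort(vi, decreasing = TRUE)           # cue weighting, most important first
```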
The fundamental rationale of adaptive boosting is to sequentially fit weak classifiers, that is, classifiers that perform only slightly better than chance, to reweighted versions of the training data and employ a weighted majority vote to arrive at a final classification decision regarding whether an AC is preposed or postposed.6 With an increasing number of iterations, boosting algorithms focus on hard-to-classify cases and produce a dynamic similar to human processing with respect to infrequent events: In human processing, infrequent events can have a strong impact on produced behaviors, as evidenced by inverse frequency effects. For example, less frequent structures tend to yield stronger priming effects (Bock, 1986; Ferreira, 2003; Scheepers, 2003). This method has proven effective for reducing bias and variance and for improving misclassification rates (Bauer & Kohavi, 1999; Breiman, 1998; Dietterich, 2000). The boosting variant used here produces a classification model as an ensemble of decision stumps, that is, single-split decision trees (cf., e.g., Breiman, Friedman, Stone, & Olshen, 1984).7 The use of decision stumps increases robustness against overfitting and also improves the assessment of variable importance. For purposes of exposition, we will focus on reporting the results of the boosting model and use results obtained from the random forest technique to complement these findings.
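Analogously, discrete AdaBoost over decision stumps is available in the R package ada (which the Results section names as the source of the importance measure); a sketch, again with our hypothetical variable names:

```r
# Discrete AdaBoost over decision stumps (single-split trees).
library(ada)
library(rpart)  # supplies rpart.control for the base learner
fit <- ada(position ~ subordinator + sem_type + length_prop +
             complex + deranked + bridging,
           data = d,
           type = "discrete",  # discrete AdaBoost (Freund & Schapire, 1996)
           iter = 1000,        # number of boosting iterations
           control = rpart.control(maxdepth = 1, cp = -1,  # force stumps
                                   minsplit = 0, xval = 0))
summary(fit)  # training error of the ensemble
```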
Results
Overall Model Performance: Prediction Accuracy
To assess the strength of the predictive relationships investigated in the study, we first employed a holdout method and split the data into disjoint subsets that served as training data (70%) and testing data (30%). The models were evaluated in terms of prediction accuracy on the test data, defined as the proportion of correctly classified cases in the test set (i.e., 1 minus the test error rate).8 The only parameter to tune in a boosting model is the stopping time, that is, the number of iterations in which the instances in the training data are reweighted. We ran a total of 5,000 boosting iterations and then inspected the development of the error rate. Figure 2 shows this development across iterations for the expert data.

The classifier reaches its optimal performance at around 1,000 iterations. At later iterations, the model adjusts its estimates of variable importance so as to better fit the training data but does not generalize as well to the test data. The “1,000 iterations” expert model displays a training error of 0.21 (corresponding to 79% prediction accuracy) and a test error rate of 0.27 (corresponding to 73% prediction accuracy). The “1,000 iterations” expert model was then submitted to a 10-fold cross-validation, which resulted in an average error rate of 0.24.9 We applied the same procedure to the learner data. That is, we first set up a boosting model using the same specifications as described above. The “1,000 iterations” learner model reached a training error rate of 0.14 (86% prediction accuracy) and a test error rate of 0.16 (84% prediction accuracy). Accuracy after a 10-fold cross-validation was 86%.
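The evaluation procedure can be sketched as follows (our code, under the same assumptions as above). The n.iter argument of predict() allows the test error to be read off at any earlier stopping point, which is how a curve like the one in Figure 2 can be traced.

```r
# 70/30 holdout split, test-error curve across stopping points, and a
# plain 10-fold cross-validation of the "1,000 iterations" model (sketch).
set.seed(7)
stumps   <- rpart.control(maxdepth = 1, cp = -1, minsplit = 0, xval = 0)
in_train <- sample(nrow(d), size = round(0.7 * nrow(d)))
train <- d[in_train, ]
test  <- d[-in_train, ]

fit <- ada(position ~ ., data = train, type = "discrete", iter = 5000,
           control = stumps)
test_err <- sapply(seq(100, 5000, by = 100), function(k) {
  mean(predict(fit, newdata = test, n.iter = k) != test$position)
})

folds  <- sample(rep(1:10, length.out = nrow(d)))  # 10-fold cross-validation
cv_err <- sapply(1:10, function(f) {
  m <- ada(position ~ ., data = d[folds != f, ], type = "discrete",
           iter = 1000, control = stumps)
  mean(predict(m, newdata = d[folds == f, ]) != d$position[folds == f])
})
mean(cv_err)  # average error rate across the 10 folds
```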
The random forest model performed competitively but slightly worse (average prediction accuracy was about 2% below that of a corresponding boosting model). Furthermore, the random forest model was much more biased toward the majority class, that is, it showed a much stronger tendency toward reducing the classification error of the larger of the two classes of the response. Almost all of its errors were made in predicting the less frequent class, that is, sentence-initial positions. Predicting rare events is notoriously difficult for statistical procedures (e.g., Joshi, Kumar, & Agarwal, 2001). What is more, without some kind of bias correction, most statistical procedures tend to underestimate the effects of infrequent events on a response (Tomz, King, & Zeng, 2003). In adaptive boosting models, infrequent factor-level combinations can have relatively strong impacts on the estimation of variable importance. Because the technique effectively changes the underlying data distribution with each iteration, it handles imbalanced data sets much better than most classification algorithms. In our study, the boosting model in fact made only about half as many errors in the minority class.
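The class-specific behavior described above is easiest to see in a confusion matrix; a short sketch continuing the code above:

```r
# Per-class error rates from a confusion matrix (sketch).
pred <- predict(fit, newdata = test)
conf <- table(observed = test$position, predicted = pred)
conf
# Row-wise error rates: a majority-class bias shows up as a much higher
# error for the rarer class (here, sentence-initial ACs).
1 - diag(conf) / rowSums(conf)
```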
Variable Importance: Assessing Cue Weights
In pursuit of our main goal to assess which cues learners and experts rely on most, we next investigated the estimates of the variable importance (VI) obtained from both methods. Figures 3 and 4 show, for experts and learners respectively, the cue weighting as estimated via the VI measure implemented in ada (cf. Hastie et al., 2009, p. 367 and following). We tracked the VI estimates across iterations to better understand in what direction the cue weights are adjusted before reaching their optimal values. As the numerical value of the estimates varies with the number of boosting iterations, all estimates were standardized.
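One way to obtain such a trajectory is to refit the model at increasing stopping points and rescale the resulting scores. This is our sketch, and it assumes that ada's varplot() returns the importance scores when called with type = "scores" and plot.it = FALSE; it reuses the stumps control and training data from the evaluation sketch above.

```r
# Track variable importance across stopping points and standardize (sketch).
vi_at <- function(k) {
  m <- ada(position ~ ., data = train, type = "discrete", iter = k,
           control = stumps)
  s <- varplot(m, plot.it = FALSE, type = "scores")  # named VI scores
  s[sort(names(s))]  # fix a common variable order across refits
}
vi_path <- sapply(c(100, 250, 500, 750, 1000), vi_at)
prop.table(vi_path, margin = 2)  # each stopping point's scores rescaled to sum to 1
```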


All adjustments were unidirectional, that is, with increasing learning iterations, there was only adjustment toward the final estimate. Comparing the final cue weightings in Figures 3 and 4, the first thing to note is that the rank order of the investigated cues is the same in both groups. In both expert and learner language, clause positioning relies most heavily on three variables (subordinator-specific preferences, the length differential between the AC and its corresponding main clause, and the presence or absence of a bridging context) and depends less heavily on the three remaining variables (complexity of the AC, semantic type of the AC, and deranking of the AC). Second, we observe that the weighting of the three most important predictors differs between the two models. While in the expert model the three important predictors are judged to be about equal in importance, there is a clear downward gradient in learner language: Here, subordinator is about twice as important as length, which in turn is roughly twice as important as bridging. Third, concerning the adjustment of the constraints over iterations, we observe for both models that subordinator and length were heavily underestimated in earlier iterations, while bridging was overestimated. The results of the random forest technique are presented in Figures 5 and 6.


The estimates from the random forest technique support the results obtained from the boosting models in that subordinator is judged to be very important for both experts and learners and bridging is judged to be more important for the experts than for the learners. The two techniques arrived at different estimates with respect to length, however, which the random forest considers relatively less important in expert language and virtually unimportant in learner language. With respect to length, the solution of the random forest is more similar to that of a boosting model run for up to 800 iterations. However, as shown in Figure 2, the boosting models adjusted the weights quite drastically over the next 200 iterations before reaching their most predictive calibration at around 1,000 iterations. Because the boosting models’ prediction errors on the test data decreased noticeably in the interval between iterations 800 and 1,000, there is no reason to assume that the later adjustments indicate overfitting. In consequence, it seems conservative to assume that the boosting models’ estimates are the better approximations of variable importance. Finally, we found further support for the relevance of length for expert but not for learner productions from the inspection of the best (= most predictive) decision trees in the random forest. Figure 7 presents the best trees for experts and learners, respectively.

The tree partitioning the expert data makes use of four variables, including both weight-related constraints (length and complexity of the AC), suggesting that the relative weight of the AC does play a significant role in the expert choices. The absence of these variables in the learner tree suggests that these constraints do not factor (as prominently) into the learner choices.10 We may further note that the best tree in the forest describing the expert data also assigns more importance to bridging than to subordinator, reflecting the near-equivalent importance of these constraints in expert production.
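Extracting the single most predictive tree from a fitted forest is implementation-specific, but the flavor of Figure 7 can be approximated by fitting one conditional inference tree of the kind the forest aggregates (our sketch, with the hypothetical names used throughout):

```r
# A single conditional inference tree over the expert data (an approximation
# of the forest's "best tree", not its literal extraction).
library(party)
ct <- ctree(position ~ subordinator + sem_type + length_prop +
              complex + deranked + bridging,
            data = expert_data)
plot(ct)  # each split shows which cue partitions the positional choice
```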
Discussion
Our analyses yielded two main findings. First, learner choices were easier to predict than expert choices (a 10-fold cross-validated “1,000 iterations” boosting model produced an error rate of 0.24 when classifying unseen expert productions and of 0.16 for unseen learner productions). Second, learners assigned proportionally more cue weight to subordinator and proportionally less weight to the discourse-level factor, that is, bridging. The role of the relative length of the AC could not be decisively determined on the basis of the available data. We will discuss these findings in turn.
In regard to the first finding, the general limitations of the models’ predictive success indicate that the investigated constraints describe only a subset of all constraints involved in the investigated positional choice. The fact that the learner model was considerably more predictive than the expert model suggests that, relative to experts, the investigated constraints cover a larger portion of the full constraint set of learners. When combined with the finding that discourse-level constraints were more important in expert language, this suggests that the missing constraints are likely to be found at the discourse level.11 The differences in cue strength, the second finding, can be detailed along three parameters.
First, we observe a difference in the number of important cues: While experts assign importance to three cues (subordinator, bridging, and length), learner choices are clearly most strongly influenced by one cue (subordinator). This reflects the general finding that learners at first focus on fewer constraints to understand an aspect of grammar (Bates & MacWhinney, 1987; Ellis, 2006; MacWhinney, Bates, & Pleh, 1985).
Second, we observe a difference in the types of cues that L2 learners focus on. Experimental research into cue strength typically reports that L2 learners pick up frequent (i.e., highly available) cues first. With growing experience, they tend to rely more and more on reliable cues (Matessa & Anderson, 2000; Taraban & Palacios, 1993). While the definitions of cue reliability and availability used in the Competition Model would have to be adapted to fit the probabilistic task investigated here, our models clearly support these findings irrespective of the details of their mathematical expression: Bridging (by way of anaphoric relations) is a highly reliable but not very available constraint. In other words, it is relatively rare, but when it applies, it almost invariably co-occurs with sentence-initial positioning of the AC. This highly reliable predictor is considerably less important in the learner model than in the expert model. The most important variable in the learner models, subordinator, is a highly available but less reliable cue: There is a subordinating conjunction in every exemplar in the input that learners use to build up their production constraints, but the positional preferences are only probabilistic and interdependent with other cues, rendering the cue less reliable (cf. Wiechmann & Kerz, 2013, for a detailed discussion).
Third, our result that learners assign proportionally more cue weight to subordinator and proportionally less weight to the discourse-level bridging factor supports the general finding that in language, as in other cognitive domains, humans most readily learn detectable cues, that is, statistical regularities among elements that are perceptually salient and temporally proximal. Functional similarities (without perceptual similarity) and temporally nonadjacent generalizations are harder to detect and thus harder to learn (Creel, Newport, & Aslin, 2004; Endress, Nespor, & Mehler, 2009; for discussions of cue detectability in the Competition Model, see also Bates & MacWhinney, 1982, and MacWhinney, 2008). Learners’ positional planning is thus expected to be most strongly influenced by statistical regularities of variables like subordinator, whose distributions are easier to track. Semantic or discourse-functional cues such as bridging are not directly perceivable, and identifying their statistical regularities requires a deeper analysis of the linguistic input. Indeed, this is reflected in the lower weight assigned to the bridging constraint in learner language, which is harder to detect because it is (a) relatively infrequent and (b) multiply realizable, meaning that many formal devices can be used to establish the anaphoric link (e.g., different types of demonstrative pronouns and NPs with demonstrative determiners). We should note that not every semantic cue is on an equal footing with respect to its detectability. In her study of adult L2 learners’ sensitivity to phonological, morphological, and semantic cues to French grammatical gender, Carroll (1999) reports high levels of sensitivity to a semantic cue, namely natural gender. However, natural gender represents a concept that arguably is established earlier in development than the concept of textual cohesion modulated by the bridging variable. In this sense, statistical regularities of natural gender are likely to be easier to exploit than those of bridging. Generally, adult L2 learners tend to fare better with internal interfaces, such as syntax–semantics, than with external interfaces, such as syntax–pragmatics (see Donaldson, 2012; Sorace & Serratrice, 2009).12 If learning proceeds via an error-based implicit learning mechanism, we would expect such findings, because deviations from the former are more likely to provoke informative feedback (e.g., corrections) than utterances that merely deviate from the latter. In the extreme case, when communicative success does not depend on it, learners might never “get around to noticing low salience cues” (Ellis, 2006, p. 170).
Limitations and Conclusion
We would like to point out some caveats and limitations of the present study. We believe we have demonstrated the empirical reality of systematic differences in cue reliance between experts and learners. On the basis of the available data, we cannot, however, ascertain the role of interactions with L1 knowledge (i.e., any interference effects). L2 learners typically attempt to transfer cue weightings from the L1 first, whenever they can perceive correspondences between items in the L1 and L2 (Robinson & Ellis, 2008; also MacWhinney, 2011). For the phenomenon investigated here, however, there is reason to believe that transfer effects play only a subsidiary role. Prior research suggests that transfer of item-based syntactic patterns is very limited, as such patterns cannot be readily matched across languages, meaning that item-specific preferences must be learned from the bottom up without any support from the L1 (MacWhinney, 2011). In both expert and learner productions, clause serialization was found to be heavily influenced by item-specific preferences of individual subordinators. Nevertheless, our results suggest that the lexical preferences of learners differ substantially from those of experts. Readers will find evidence of this in Figure S3 of the Supporting Information online, which presents graphically the differences in subordinator-specific positional preferences after controlling for bridging and length, as estimated from a regression model. The estimates in Figure S3 are best viewed as rough approximations of the true lexical preferences, as they incorporate only three constraints and do not reflect any dynamics from their interaction. However, they do suggest that learners have calibrated the constraint subordinator differently than experts: Although, since, and while are either balanced or display a bias toward initial positioning in expert language, but not in learner language. Because and whereas are biased toward final positioning in both data sets, but much more pronouncedly so in learner language. The less frequent subordinators, even though/if and as, are preferred in initial position in learner language, but not in expert language. The differences in lexical preference cannot be reduced to frequency, as the correlations between the coefficient differences (Δβ = β_expert − β_learner) and either expert or learner frequency were not significant at α = .05. Some of the deviation from expert behavior may be predictable from the semantic proximity of an English subordinator to its nearest German correspondent form (Dong, Gui, & MacWhinney, 2005). Still, the semantic space of adverbial relations is likely to be partitioned differently across the two languages, which will impede successful transfer. As a principled answer to these questions requires comparable data from German, these issues need to be addressed in future research.
Overall, our results reveal that the basic principles of the Unified Model of first and second language acquisition (MacWhinney, 2008, 2011), which have been documented so extensively for comprehension, also underlie the written production of advanced L2 learners. We have also shown how the reverse engineering of cue weights by way of ensemble methods constitutes a fruitful complement to other computational and experimental approaches.