Cue Reliance in L2 Written Production
We would like to thank the anonymous Language Learning reviewers and the editors for their valuable comments, which helped us to considerably improve the quality of this article.
Abstract
Second language learners reach expert levels in relative cue weighting only gradually. On the basis of ensemble machine learning models fit to naturalistic written productions of German advanced learners of English and expert writers, we set out to reverse engineer differences in the weighting of multiple cues in a clause linearization problem. We found that, while German advanced learners succeeded in identifying important cues, their assignment of cue importance differed from that of the expert control group. Even at advanced levels, learners are found to rely on a smaller set of perceptually salient cues than native speakers do, focusing on cues that exhibit relatively high cue availability and relatively low cue reliability. Our findings suggest that the principles of the Unified Model of first and second language acquisition, which have been extensively supported for comprehension, also underlie the written production of advanced second language learners.
Introduction
Experience-based (also known as emergentist or usage-based) models of language hold that using language hinges on the capacity to learn from environmental input (see Baayen, Hendrix, & Ramscar, 2013; Bod, 2009; Chang, Dell, & Bock, 2006; Daelemans & van den Bosch, 2005; Elman, 2009; McClelland et al., 2010; Ramscar & Gitcho, 2007). As the human brain continuously adjusts itself to sensory experience via the strengthening and weakening of connections among billions of neurons, humans capture the statistical regularities of the linguistic input they are exposed to and apply this knowledge in both language comprehension and language production. This capacity to detect statistical regularities is at work not only in the earlier stages of language acquisition but remains active throughout life (Chang et al., 2006; Farmer, Fine, & Jaeger, 2011), and it operates not only in the acquisition of a first language (L1) but also in that of a second language (L2) (MacWhinney, 2008, 2011). Linguistic knowledge, then, can be viewed as a byproduct of the formation of connections resulting from exposure to the probabilistic patterns underlying the linguistic input (see Ellis, 1999, and Wiechmann, Kerz, Snider, & Jaeger, 2013, for a recent compilation of developments in model architectures). Figure 1 sketches some of the crucial assumptions of this perspective.

Experience-based theories of language learning and processing assume that learners induce linguistic knowledge from input by way of complex automatic distributional analyses of linguistic exemplars (Bates & MacWhinney, 1987; Bod, 1998; Chang et al., 2006; Conway, Bauernschmidt, Huang, & Pisoni, 2010; Dell, Reed, Adams, & Meyer, 2000; Hunt & Aslin, 2010; Perruchet & Pacton, 2006; Saffran, Aslin, & Newport, 1996; inter alia; see Chang, Janciauskas, & Fitz, 2012, for a recent overview). This knowledge comprises a multitude of simple and complex associative relationships among its constitutive elements, and the distributional analyses of co-occurring features are guided by selective attention to cues in the input that enable the mapping of form–function relations during message comprehension. These cues vary in their informativeness, often expressed in terms of a cue's availability (i.e., how often the cue is present in the input), its reliability (i.e., how often the cue leads to the correct outcome), and derived notions. Language learning, in this view, can be modeled as an error-based implicit learning process in which deviations between expected forms and actually observed forms serve as error signals that cause the continuous adjustment of cue weights (Chang, 2002, 2009; Chang et al., 2006; Elman, 1990, 1993; Fitz, 2009; Fitz & Chang, 2008; Jaeger & Snider, 2013). In the most straightforward model architectures, comprehension and production employ a shared sequencing system so that the cue weights learned through the statistical analysis of the input serve as probabilistic constraints governing the learner's linguistic output in language production (Chang et al., 2006). Linguistic knowledge can hence be conceived of as a highly dynamic constraint satisfaction system, in which every experience with language affects the adjustment of cue weights (Seidenberg & MacDonald, 1999; see MacDonald, 2013, for a recent overview).1 L1 learning appears to be rational in the sense that it leads to “an end-state model of language that is a proper reflection of input and that optimally prepares speakers for comprehension and production” (Ellis, 2006, p. 164). L2 learning, on the other hand, is constrained by additional factors such as interference, overshadowing, or blocking, and additional probabilistic rules governing learner productions may be introduced to the system through explicit instruction, all of which may affect the relative weighting of individual constraints (Ellis, 2008). However, the basic mechanics of L1 and L2 learning are the same, as both L1 learners and L2 learners have to figure out which cues are most predictive for a given mapping (Ellis, 2008; MacWhinney, 2008, 2011).
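Because the notions of cue availability and cue reliability do much of the explanatory work in what follows, a minimal worked example may be useful. The counts below are invented for illustration and do not come from the study.

```r
# Toy illustration of cue availability and reliability (invented counts,
# not data from the study).
n_exemplars   <- 1000  # all relevant exemplars in the input
n_cue_present <- 150   # exemplars in which the cue is present
n_cue_correct <- 140   # cue present AND it signals the correct outcome

availability <- n_cue_present / n_exemplars    # how often the cue is there: 0.15
reliability  <- n_cue_correct / n_cue_present  # how often it is right: ~0.93

# A cue with this profile is rare but trustworthy (low availability, high
# reliability); a cue present in every exemplar but only probabilistically
# associated with the outcome shows the opposite profile.
```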
Consider, as a case in point, the positional choice that arises with adverbial clauses in complex sentences: the adverbial clause may either precede or follow its associated main clause, as in (1).
- (1a) Although Peter cannot afford it, he bought her a lovely gift.
- (1b) Peter bought her a lovely gift, although he cannot afford it.
In the vast majority of cases, both positional options are grammatical and the preferential use of one arrangement rather than the other is codetermined by multiple factors. While the exact causal dynamics are not yet fully understood, corpus-based investigations have identified various factors that demonstrably affect the ordering choice (e.g., Diessel, 2008; Wiechmann & Kerz, 2013). These include the relative length (syntactic weight) of the clausal constituents, semantic properties signaled through lexical choices of the subordinating conjunction, and discourse-level effects like bridging, that is, whether the adverbial clause (AC) connects to information from the preceding discourse as signaled, for example, by anaphoric items (see further details below).
In this study, we employ ensemble machine learning models to investigate how German advanced learners of English weigh a set of cues to solve a clause linearization problem in high-level linguistic action planning. In particular, by way of a comparative analysis of naturalistic productions of complex sentence constructions, we investigate in what ways (temporally unconstrained) offline sentence-level planning processes of German advanced learners of English differ from those of expert writers with respect to their reliance on specific cues.2 We assume that mismatches in constraint weighting are, at least partially, caused by properties of a cue that concern its learnability, such as its detectability, availability, and reliability, so that an assessment of the cue weights of learners permits inferences about general aspects of L2 learning. For example, there is evidence that in early phases of learning, both L1 and L2 learners focus on cues that are highly available and that high cue reliability becomes more important than cue availability only at later phases of learning (Matessa & Anderson, 2000; Taraban & Palacios, 1993). Following this pattern, advanced learners’ solutions to high-level language planning problems, such as the linearization of clausal constituents in complex sentences, should be governed more strongly than those of experts by highly available and detectable cues and less strongly by less available but highly reliable cues. Our approach sets out to reverse engineer the cue weights through a distributional analysis of the linguistic output, meaning that we reason from language productions to properties of the system underlying the productions.3 We do so by exploiting the connection between: (a) statistical regularities in the input, (b) internalized cue weights, and (c) statistical regularities in the output (as sketched in Figure 1).
It should be stressed that in this study we are interested in uncovering the types of informational sources that learners rely on during high-level linguistic action planning. We are not concerned with accurate language use. Of course, the assessment of what knowledge learners rely on is related to questions of accurate usage, but these two issues are logically independent of each other: The reliance on a specific informational source (lexical, structural, or discourse functional) does not imply the accurate calibration of the subregularities within that informational source.
Method
Corpus
The advanced learner data were retrieved from a total of 50 term papers produced by German students of English linguistics in their second and third year of study (∼216,000 words) and a same-sized expert control corpus of peer-reviewed articles appearing in various journals on language studies.4 The learner data were produced under comparable conditions (same amount of contact time spent in a previous seminar, same lecturer, and same formatting suggestions). The target constructions were identified by matching a set of subordinators (see below) in the two corpora, yielding a total of 1,471 data points. A total of 601 of these retrieved sentences were produced by experts. To maximize the comparability of results, we reduced the larger learner data set by taking a random sample that matched the size of the expert data set (Nlearner = 601; Nexpert = 601).
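The size-matching step can be sketched in a few lines of R; the object names (learner_data, expert_data) are ours, not the authors’.

```r
# Hypothetical sketch of the size matching described above: draw a random
# sample of learner sentences equal in size to the expert data set.
set.seed(42)  # any fixed seed, for reproducibility
idx <- sample(nrow(learner_data), size = nrow(expert_data))
learner_matched <- learner_data[idx, ]
stopifnot(nrow(learner_matched) == nrow(expert_data))  # N = 601 in both sets
```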
Corpus Annotation and Modeling of the Data
Each target construction was annotated for the following six variables:
- Semantic subtype
- ○ Causal ACs were distinguished from concessive ACs (following the classification in Quirk, Greenbaum, Leech, & Svartvik, 1985).
- ■ The preferred position of an AC is influenced by the semantic relation that holds between the situations described by the clausal elements. For example, causal clauses that are used to justify a belief tend to follow their main clause. Concessive clauses that are used to “set the stage” (Verstraete, 2004) tend to precede their main clause.
- Subordinator
- ○ Lexical realization of the subordinating conjunction: because, although, whereas, since, while, as, and even (subsuming even if and even though).
- ■ For each conjunction, there is a probability distribution describing the tendency for an AC headed by that conjunction to occur in sentence-initial position. This preference (or dispreference) is register-specific, that is, it may vary across situational contexts (Kerz & Wiechmann, in press).
- Proportional length of AC (syntactic weight)
- Structural complexity of AC
- ○ An AC was considered complex if it exhibited at least one embedded clause.
- ■ Deeper embedding increases the complexity of a clause. More complex constituents tend to follow simpler constituents (Diessel, 2008).
- Deranking of verb forms in AC
- ○ An AC was considered balanced if it was tensed (= finite). Participial, infinitival, or verbless clauses were labeled as “deranked” (Cristofaro, 2003).
- ■ Deranked ACs are more likely to precede their associated main clauses than balanced ACs.
- Bridging
- ○ An AC was taken to serve a bridging function if it included an anaphoric item that was co-referential with a noun phrase (NP) or clausal constituent of a preceding sentence (cf. Biber, Johansson, Leech, Conrad, & Finegan, 1999).
- ■ ACs that link informational units across sentences tend to precede their associated main clause (Verstraete, 2004).
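To make this annotation scheme concrete, a single data point might be coded as below. The column names and factor levels are our own shorthand; the paper does not spell out its coding conventions.

```r
# Hypothetical coding of one annotated sentence (names and levels are ours).
one_row <- data.frame(
  position     = factor("initial", levels = c("initial", "final")),  # response: AC preposed or postposed
  subordinator = factor("although"),                                 # lexical head of the AC
  sem_type     = factor("concessive", levels = c("causal", "concessive")),
  length_prop  = 0.35,   # proportional length of the AC (syntactic weight)
  complex      = TRUE,   # at least one embedded clause in the AC
  deranked     = FALSE,  # FALSE = balanced (tensed); TRUE = deranked
  bridging     = TRUE    # anaphoric link to the preceding discourse
)
```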
Our approach to understanding human cue weighting relies on techniques from machine learning, which as a class have been employed with great success in reproducing linguistic choice behavior, suggesting that probabilities of occurrence are somehow available to the human processing system (see Baayen, 2011, for a recent discussion). In principle, the assessment of cue weights can be approached via a variety of statistical and algorithmic techniques that generate estimates of effect size or variable importance. However, not all techniques are equally well suited for the task at hand. For example, variable importance estimates from regression models are more likely to suffer from the effects of correlating variables (multicollinearity; cf. Belsley, Kuh, & Welsch, 1980; for a discussion in a linguistic context, see Tagliamonte & Baayen, 2012; Wiechmann, 2012; Wiechmann & Kerz, 2013). While correlations among predictors are not harmful to the overall predictive power of a model, multicollinearity may lead to erratic changes in the corresponding expression of variable importance in regression models. As correlating predictors constitute the norm rather than the exception in linguistic contexts—in fact, strong correlations “probably provide exactly the redundancy that makes human learning of language data robust” (Baayen, 2011, p. 14)—regression is often not the most suitable analytical tool if the goal is the accurate estimation of variable importance. In our case, we can expect strong correlations between length and complexity, or between semantic type of AC and subordinator.
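The instability that multicollinearity induces in regression coefficients, and hence in any importance measure read off them, can be simulated in a few lines. This is our own toy demonstration, not an analysis from the study.

```r
# Two nearly collinear predictors: predictions stay stable, coefficients do not.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)  # x2 is almost a copy of x1
y  <- x1 + rnorm(n)

coefs <- replicate(5, {
  i <- sample(n, n, replace = TRUE)    # bootstrap resample of the data
  coef(lm(y[i] ~ x1[i] + x2[i]))[2:3]  # slopes for x1 and x2
})
round(coefs, 2)  # the two slopes swing erratically from resample to resample,
                 # although the effect they jointly carry is stable
```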
To assess cue weights as accurately as possible, we employed two techniques that are known to be less vulnerable to the distorting effects of multicollinearity: discrete adaptive boosting models with single-layer decision trees (Freund & Schapire, 1996) and random forest models with conditional inference trees (Strobl, Boulesteix, Kneib, Augustin, & Zeileis, 2008). Both techniques belong to the family of ensemble methods, that is, methods that do not rely on a single model but on many models. They differ, however, in the way they solve a classification task. Random forests can be viewed as a refined version of bootstrap aggregation (bagging), in which many decision trees are fit to resampled versions of the training data, and the final classification is arrived at by majority vote (Breiman, 1996, 2001; Hastie, Tibshirani, & Friedman, 2009). The random forest variant employed here has as its base model decision trees using recursive partitioning by conditional inference and expresses cue weights through a permutation variable importance measure (Hothorn, Hornik, & Zeileis, 2006), which adjusts for correlations between predictor variables and is not biased when predictor variables vary in their number of categories or scale of measurement.5 The random forest was set up in such a way that its constituent trees were allowed to be of arbitrary size as long as a variable to be included in the tree introduced a statistically significant split based on multiplicity-adjusted p values from a permutation test (see Strasser & Weber, 1999). Each classifier in the forest, or each conditional inference tree, is thus by itself a strong classifier.
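The paper does not name its software, but the cited method (Strobl et al., 2008; Hothorn et al., 2006) is implemented in the R package party; the following is a sketch under that assumption, reusing the hypothetical column names from the coding example above and assuming a data frame d of such rows.

```r
# Conditional inference forest with unbiased settings and conditional
# permutation variable importance (sketch; variable names are ours).
library(party)
cf <- cforest(position ~ subordinator + sem_type + length_prop +
                complex + deranked + bridging,
              data = d,  # the annotated data frame
              controls = cforest_unbiased(ntree = 1000, mtry = 3))
vi <- varimp(cf, conditional = TRUE)  # adjusts for correlated predictors
sort(vi, decreasing = TRUE)           # cue weighting, most important first
```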
The fundamental rationale of adaptive boosting is to sequentially fit weak classifiers, that is, classifiers that perform only slightly better than chance, to reweighted versions of the training data and employ a weighted majority vote to arrive at a final classification decision regarding whether an AC is preposed or postposed.6 With an increasing number of iterations, boosting algorithms focus on hard-to-classify cases and produce a dynamic similar to human processing with respect to infrequent events: In human processing, infrequent events can have a strong impact on produced behaviors, as evidenced by inverse frequency effects. For example, less frequent structures tend to yield stronger priming effects (Bock, 1986; Ferreira, 2003; Scheepers, 2003). This method has proven effective for reducing bias and variance and for improving misclassification rates (Bauer & Kohavi, 1999; Breiman, 1998; Dietterich, 2000). The boosting variant used here produces a classification model as an ensemble of decision stumps, that is, single-split decision trees (cf., e.g., Breiman, Friedman, Stone, & Olshen, 1984).7 The use of decision stumps increases robustness against overfitting and also improves the assessment of variable importance. For purposes of exposition, we will focus on reporting the results of the boosting model and use results obtained from the random forest technique to complement these findings.
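Analogously, discrete AdaBoost over decision stumps is available in the R package ada (which the Results section names as the source of the importance measure); a sketch, again with our hypothetical variable names:

```r
# Discrete AdaBoost over decision stumps (single-split trees).
library(ada)
library(rpart)  # supplies rpart.control for the base learner
fit <- ada(position ~ subordinator + sem_type + length_prop +
             complex + deranked + bridging,
           data = d,
           type = "discrete",  # discrete AdaBoost (Freund & Schapire, 1996)
           iter = 1000,        # number of boosting iterations
           control = rpart.control(maxdepth = 1, cp = -1,  # force stumps
                                   minsplit = 0, xval = 0))
summary(fit)  # training error of the ensemble
```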
Results
Overall Model Performance: Prediction Accuracy
To assess the strength of the predictive relationships investigated in the study, we first employed a holdout method and split the data into disjoint subsets that served as training data (70%) and testing data (30%). The models were evaluated in terms of prediction accuracy on the test data, defined as the proportion of correctly classified cases in the test set (i.e., 1 minus the test error rate).8 The only parameter to tune in a boosting model is the stopping time, that is, the number of iterations in which the instances in the training data are reweighted. We ran a total of 5,000 boosting iterations and then inspected the development of the error rate. Figure 2 shows this development across iterations for the expert data.

The classifier reaches its optimal performance at around 1,000 iterations. At later iterations, the model adjusts its estimates of variable importance so as to better fit the training data but does not generalize as well to the test data. The “1,000 iterations” expert model displays a training error of 0.21 (corresponding to 79% prediction accuracy) and a test error rate of 0.27 (corresponding to 73% prediction accuracy). The “1,000 iterations” expert model was then submitted to a 10-fold cross-validation, which resulted in an average error rate of 0.24.9 We applied the same procedure to the learner data. That is, we first set up a boosting model using the same specifications as described above. The “1,000 iterations” learner model reached a training error rate of 0.14 (86% prediction accuracy) and a test error rate of 0.16 (84% prediction accuracy). Accuracy after a 10-fold cross-validation was 86%.
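The evaluation procedure can be sketched as follows (our code, under the same assumptions as above). The n.iter argument of predict() allows the test error to be read off at any earlier stopping point, which is how a curve like the one in Figure 2 can be traced.

```r
# 70/30 holdout split, test-error curve across stopping points, and a
# plain 10-fold cross-validation of the "1,000 iterations" model (sketch).
set.seed(7)
stumps   <- rpart.control(maxdepth = 1, cp = -1, minsplit = 0, xval = 0)
in_train <- sample(nrow(d), size = round(0.7 * nrow(d)))
train <- d[in_train, ]
test  <- d[-in_train, ]

fit <- ada(position ~ ., data = train, type = "discrete", iter = 5000,
           control = stumps)
test_err <- sapply(seq(100, 5000, by = 100), function(k) {
  mean(predict(fit, newdata = test, n.iter = k) != test$position)
})

folds  <- sample(rep(1:10, length.out = nrow(d)))  # 10-fold cross-validation
cv_err <- sapply(1:10, function(f) {
  m <- ada(position ~ ., data = d[folds != f, ], type = "discrete",
           iter = 1000, control = stumps)
  mean(predict(m, newdata = d[folds == f, ]) != d$position[folds == f])
})
mean(cv_err)  # average error rate across the 10 folds
```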
The random forest model performed competitively but slightly worse (average prediction accuracy was about 2% below that of a corresponding boosting model). Furthermore, the random forest model was much more biased toward the majority class, that is, it showed a much stronger tendency toward reducing the classification error of the larger of the two classes of the response. Almost all of its errors were made in predicting the less frequent class, that is, sentence-initial positions. Predicting rare events is notoriously difficult for statistical procedures (e.g., Joshi, Kumar, & Agarwal, 2001). What is more, without some kind of bias correction, most statistical procedures tend to underestimate the effects of infrequent events on a response (Tomz, King, & Zeng, 2003). In adaptive boosting models, infrequent factor-level combinations can have relatively strong impacts on the estimation of variable importance. Because the technique effectively changes the underlying data distribution with each iteration, it handles imbalanced data sets much better than most classification algorithms. In our study, the boosting model in fact made only about half as many errors in the minority class.
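The class-specific behavior described above is easiest to see in a confusion matrix; a short sketch continuing the code above:

```r
# Per-class error rates from a confusion matrix (sketch).
pred <- predict(fit, newdata = test)
conf <- table(observed = test$position, predicted = pred)
conf
# Row-wise error rates: a majority-class bias shows up as a much higher
# error for the rarer class (here, sentence-initial ACs).
1 - diag(conf) / rowSums(conf)
```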
Variable Importance: Assessing Cue Weights
In pursuit of our main goal to assess which cues learners and experts rely on most, we next investigated the estimates of the variable importance (VI) obtained from both methods. Figures 3 and 4 show, for experts and learners respectively, the cue weighting as estimated via the VI measure implemented in ada (cf. Hastie et al., 2009, p. 367 and following). We tracked the VI estimates across iterations to better understand in what direction the cue weights are adjusted before reaching their optimal values. As the numerical value of the estimates varies with the number of boosting iterations, all estimates were standardized.
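One way to obtain such a trajectory is to refit the model at increasing stopping points and rescale the resulting scores. This is our sketch, and it assumes that ada's varplot() returns the importance scores when called with type = "scores" and plot.it = FALSE; it reuses the stumps control and training data from the evaluation sketch above.

```r
# Track variable importance across stopping points and standardize (sketch).
vi_at <- function(k) {
  m <- ada(position ~ ., data = train, type = "discrete", iter = k,
           control = stumps)
  s <- varplot(m, plot.it = FALSE, type = "scores")  # named VI scores
  s[sort(names(s))]  # fix a common variable order across refits
}
vi_path <- sapply(c(100, 250, 500, 750, 1000), vi_at)
prop.table(vi_path, margin = 2)  # each stopping point's scores rescaled to sum to 1
```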


All adjustments were unidirectional, that is, with increasing learning iterations, there was only adjustment toward the final estimate. Comparing the final cue weightings in Figures 3 and 4, the first thing to note is that the rank order of the investigated cues is the same in both groups. In both expert and learner language, clause positioning relies most heavily on three variables (subordinator-specific preferences, the length differential between the AC and its corresponding main clause, and the presence or absence of a bridging context) and depends less heavily on the three remaining variables (complexity of the AC, semantic type of the AC, and deranking of the AC). Second, we observe that the weighting of the three most important predictors differs between the two models. While in the expert model the three important predictors are judged to be about equal in importance, there is a clear downward gradient in learner language: Here, subordinator is about twice as important as length, which in turn is roughly twice as important as bridging. Third, concerning the adjustment of the constraints over iterations, we observe for both models that subordinator and length were heavily underestimated in earlier iterations, while bridging was overestimated. The results of the random forest technique are presented in Figures 5 and 6.


The estimates from the random forest technique support the results obtained from the boosting models in that subordinator is judged to be very important for both experts and learners and bridging is judged to be more important for the experts than for the learners. The two techniques arrived at different estimates with respect to length, however, which the random forest considers relatively less important in expert language and virtually unimportant in learner language. With respect to length, the solution of the random forest is more similar to that of a boosting model run for up to 800 iterations. However, as shown in Figure 2, the boosting models adjusted the weights quite drastically over the next 200 iterations before reaching their most predictive calibration at around 1,000 iterations. Because the boosting models’ prediction errors on the test data decreased noticeably in the interval between iterations 800 and 1,000, there is no reason to assume that the later adjustments indicate overfitting. In consequence, it seems conservative to assume that the boosting models’ estimates are the better approximations of variable importance. Finally, we found further support for the relevance of length for expert but not for learner productions from the inspection of the best (= most predictive) decision trees in the random forest. Figure 7 presents the best trees for experts and learners, respectively.

The tree partitioning the expert data makes use of four variables, including both weight-related constraints (length and complexity of the AC), suggesting that the relative weight of the AC does play a significant role in the expert choices. The absence of these variables in the learner tree suggests that these constraints do not factor (as prominently) into the learner choices.10 We may further note that the best tree in the forest describing the expert data also assigns more importance to bridging than to subordinator, reflecting the near-equivalent importance of these constraints in expert production.
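Extracting the single most predictive tree from a fitted forest is implementation-specific, but the flavor of Figure 7 can be approximated by fitting one conditional inference tree of the kind the forest aggregates (our sketch, with the hypothetical names used throughout):

```r
# A single conditional inference tree over the expert data (an approximation
# of the forest's "best tree", not its literal extraction).
library(party)
ct <- ctree(position ~ subordinator + sem_type + length_prop +
              complex + deranked + bridging,
            data = expert_data)
plot(ct)  # each split shows which cue partitions the positional choice
```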
Discussion
Our analyses yielded two main findings. First, learner choices were easier to predict than expert choices (a 10-fold cross-validated “1,000 iterations” boosting model produced an error rate of 0.24 when classifying unseen expert productions and of 0.16 for unseen learner productions). Second, learners assigned proportionally more cue weight to subordinator and proportionally less weight to the discourse-level factor, that is, bridging. The role of the relative length of the AC could not be decisively determined on the basis of the available data. We will discuss these findings in turn.
In regard to the first finding, the general limitations of the models’ predictive success indicate that the investigated constraints describe only a subset of all constraints involved in the investigated positional choice. The fact that the learner model was considerably more predictive than the expert model suggests that, relative to experts, the investigated constraints cover a larger portion of the full constraint set of learners. When combined with the finding that discourse-level constraints were more important in expert language, this suggests that the missing constraints are likely to be found at the discourse level.11 The differences in cue strength, the second finding, can be detailed along three parameters.
First, we observe a difference in the number of important cues: While experts assign importance to three cues (subordinator, bridging, and length), learner choices are clearly most strongly influenced by one cue (subordinator). This reflects the general finding that learners at first focus on fewer constraints to understand an aspect of grammar (Bates & MacWhinney, 1987; Ellis, 2006; MacWhinney, Bates, & Pleh, 1985).
Second, we observe a difference in the types of cues that L2 learners focus on. Experimental research into cue strength typically reports that L2 learners pick up frequent (i.e., highly available) cues first. With growing experience, they tend to rely more and more on reliable cues (Matessa & Anderson, 2000; Taraban & Palacios, 1993). While the definitions of cue reliability and availability used in the Competition Model would have to be adapted to fit the probabilistic task investigated here, our models clearly support these findings irrespective of the details of their mathematical expression: Bridging (by way of anaphoric relations) is a highly reliable but not very available constraint. In other words, it is relatively rare, but when it applies, it almost invariably co-occurs with sentence-initial positioning of the AC. This highly reliable predictor is considerably less important in the learner model than in the expert model. The most important variable in the learner models, subordinator, is a highly available but less reliable cue: There is a subordinating conjunction in every exemplar in the input that learners use to build up their production constraints, but the positional preferences are only probabilistic and interdependent with other cues, rendering the cue less reliable (cf. Wiechmann & Kerz, 2013, for a detailed discussion).
Third, our result that learners assign proportionally more cue weight to subordinator and proportionally less weight to the discourse-level bridging factor supports the general finding that in language, as in other cognitive domains, humans most readily learn detectable cues, that is, statistical regularities among elements that are perceptually salient and temporally proximal. Functional similarities (without perceptual similarity) and temporally nonadjacent generalizations are harder to detect and thus harder to learn (Creel, Newport, & Aslin, 2004; Endress, Nespor, & Mehler, 2009; for discussions of cue detectability in the Competition Model, see also Bates & MacWhinney, 1982, and MacWhinney, 2008). Learners’ positional planning is thus expected to be most strongly influenced by statistical regularities of variables like subordinator, whose distributions are easier to track. Semantic or discourse-functional cues such as bridging are not directly perceivable, and identifying their statistical regularities requires a deeper analysis of the linguistic input. Indeed, this is reflected in the lower weight assigned to the bridging constraint in learner language, which is harder to detect because it is (a) relatively infrequent and (b) multiply realizable, meaning that many formal devices can be used to establish the anaphoric link (e.g., different types of demonstrative pronouns and NPs with demonstrative determiners). We should note that not every semantic cue is on an equal footing with respect to its detectability. In her study of adult L2 learners’ sensitivity to phonological, morphological, and semantic cues to French grammatical gender, Carroll (1999) reports high levels of sensitivity to a semantic cue, namely natural gender. However, natural gender represents a concept that arguably is established earlier in development than the concept of textual cohesion modulated by the bridging variable. In this sense, statistical regularities of natural gender are likely to be easier to exploit than those of bridging. Generally, adult L2 learners tend to fare better with internal interfaces, such as syntax–semantics, than with external interfaces, such as syntax–pragmatics (see Donaldson, 2012; Sorace & Serratrice, 2009).12 If learning proceeds via an error-based implicit learning mechanism, we would expect such findings, because deviations from the former are more likely to provoke informative feedback (e.g., corrections) than utterances that merely deviate from the latter. In the extreme case, when communicative success does not depend on it, learners might never “get around to noticing low salience cues” (Ellis, 2006, p. 170).
Limitations and Conclusion
We would like to point out some caveats and limitations of the present study. We believe we have demonstrated the empirical reality of systematic differences in cue reliance between experts and learners. On the basis of the available data, we cannot, however, ascertain the role of interactions with L1 knowledge (i.e., any interference effects). L2 learners typically attempt to transfer cue weightings from the L1 first, whenever they can perceive correspondences between items in the L1 and L2 (Robinson & Ellis, 2008; also MacWhinney, 2011). For the phenomenon investigated here, however, there is reason to believe that transfer effects play only a subsidiary role. Prior research suggests that transfer of item-based syntactic patterns is very limited, as such patterns cannot be readily matched across languages, meaning that item-specific preferences must be learned from the bottom up without any support from the L1 (MacWhinney, 2011). In both expert and learner productions, clause serialization was found to be heavily influenced by item-specific preferences of individual subordinators. Nevertheless, our results suggest that the lexical preferences of learners differ substantially from those of experts. Readers will find evidence of this in Figure S3 of the Supporting Information online, which presents graphically the differences in subordinator-specific positional preferences after controlling for bridging and length, as estimated from a regression model. The estimates in Figure S3 are best viewed as rough approximations of the true lexical preferences, as they incorporate only three constraints and do not reflect any dynamics from their interaction. However, they do suggest that learners have calibrated the constraint subordinator differently than experts: Although, since, and while are either balanced or display a bias toward initial positioning in expert language, but not in learner language. Because and whereas are biased toward final positioning in both data sets, but much more pronouncedly so in learner language. The less frequent subordinators, even though/if and as, are preferred in initial position in learner language, but not in expert language. The differences in lexical preference cannot be reduced to frequency, as the correlations between the coefficient differences (Δβ = β_expert − β_learner) and either expert or learner frequency were not significant at α = .05. Some of the deviation from expert behavior may be predictable from the semantic proximity of an English subordinator to its nearest German correspondent form (Dong, Gui, & MacWhinney, 2005). Still, the semantic space of adverbial relations is likely to be partitioned differently across the two languages, which will impede successful transfer. As a principled answer to these questions requires comparable data from German, these issues need to be addressed in future research.
Overall, our results reveal that the basic principles of the Unified Model of first and second language acquisition (MacWhinney, 2008, 2011), which have been documented so extensively for comprehension, also underlie the written production of advanced L2 learners. We have also shown how the reverse engineering of cue weights by way of ensemble methods constitutes a fruitful complement to other computational and experimental approaches.