The learner as corpus designer

Guy Aston

Advanced School for Interpreters and Translators, University of Bologna

1. Introduction

In recent years it has been suggested that it may be both useful and motivating for teachers and learners to construct their own corpora to analyse with appropriate interrogation software. Such suggestions include the construction of:

· collections of texts written by learners themselves (Seidlhofer 2000), or which have been previously read by them during their courses (Gavioli 1997, Willis 1998);

· collections of texts which illustrate a particular text-type and/or domain of use (Bowker 1998; Maia 2000a, b; Pearson 1998, 2000), or which illustrate particular linguistic features (Bertaccini & Aston 2001; Varantola 2000).

Proposals of the first type see ‘home-made’ corpora as a means to cast further light on previously encountered texts and linguistic features, where analysis is facilitated by virtue of the learners’ prior familiarity with the texts involved. Proposals of the second type instead see them as a means to investigate unfamiliar features, domains or text-types which are inadequately documented by existing resources – particularly in the area of translator training, where the ability to construct specialized corpora is increasingly seen as fundamental for the identification of domain-specific terminology (Pearson 1998, 2000; Maia 2000b).

The main ideas underlying these proposals appear to be that:

· ‘home-made’ corpora may be more appropriate for learning purposes than pre-compiled ones, insofar as they can be specifically targeted to the learner’s knowledge and concerns;

· such corpora permit analyses which would not otherwise be readily feasible, providing a specialized hypertextual environment for the study of particular texts and instances;

· by compiling corpora for themselves, learners may gain insight into how to select and use corpora appropriately, acquiring skills and knowledge which may be of value to them in the future.

Against these potential benefits, however, we must balance the costs of corpus construction. A considerable amount of work is likely to be involved, and in comparison with corpora constructed by professional researchers, the quality of the product is likely to be relatively low. Home-made corpora will typically be more opportunistic, less carefully designed and edited, and less comprehensively encoded and annotated than those compiled by experts. Consequently, teachers and learners may be unconvinced that it is worth building corpora of their own. In this paper I shall outline an intermediate strategy which, I argue, can provide some of the same benefits while considerably reducing these costs, by providing specialized environments for particular areas of study, while at the same time offering insights into how to design, select and use corpora appropriately.

2. Corpora in language learning

In both corpus linguistics and language pedagogy it is a well-established principle that design must be based on an analysis of users’ objectives. From this perspective, there are at least five types of corpus-based activity that appear relevant to language learners (Aston 2000, 2001b):

· form-focused activity, aiming to establish and practise the use of particular linguistic features (“data-driven learning”: Johns 1991);

· meaning-focused activity, aiming either to establish meanings in a particular corpus text or to understand the concepts referred to and the functions realized in a particular text-type – what we might term ‘data-driven cultural learning’;

· skill-focused activity, aiming to develop particular reading skills and strategies, particularly of a ‘bottom-up’ variety (Brodine 2001);

· reference activity, where corpora are used for support in tasks involving other texts, in particular as aids to reading, writing and translating (e.g. Owen 1996; Zanettin 2001 amongst many others);

· browsing activity, where learners alternate between the previous types of activity in serendipitous explorations of the corpus (Bernardini 2000a, b).

Home-made corpora can lend themselves to all these uses. We may briefly exemplify them in relation to a learner-designed corpus of astrophysics research articles (Raffa 2000). From a form-focused perspective, this corpus is an excellent resource for identifying astrophysical terminology and establishing its uses. From a meaning-focused one, it provides many opportunities to learn about white dwarfs, black holes, red giants, etc., as well as about the general methodology of astrophysical research. From a reading skills perspective, it can provide focused practice in such areas as the parsing of complex nominal groups, or the resolution of anaphoric and cataphoric reference in this kind of text. And obviously, it can serve as a reference tool while reading, writing or translating astrophysics research articles (the corpus was originally designed to provide a resource for non-native speakers engaged in astrophysical research), since it provides an intertextual background against which to construct and evaluate interpretative or productive hypotheses. Last but not least, it is a corpus that can be browsed serendipitously, travelling from one linguistic or cataclysmic variable to another.

All these types of activity, it should be stressed, lend themselves to being contextualized in a framework of communication, since they provide numerous opportunities for report and discussion of linguistic, social, cognitive and methodological issues. They thereby not only allow extensive communicative practice, but can also develop linguistic awareness and encourage learning autonomy (Aston 2000, 2001b).

2.1 Why make your own corpora?

Why, however, should learners bother to construct their own corpora in order to engage in activities of these kinds? To use an analogy, making your own corpus seems rather like making your own fruit salad. Why make your own when you can buy a tin off the supermarket shelf? The reasons (for both corpora and fruit salads) appear similar:

· Control. You can devise your own recipe, choosing your own ingredients, thereby obtaining assortments that may be unavailable in pre-packaged versions. There was no publicly-available corpus of astrophysics research articles which could be used to investigate the particular linguistic and conceptual characteristics of the latter. The BNC, for instance, contains only 10 written texts which mention astrophysical white dwarfs – too small and heterogeneous a sample to warrant generalizations in this domain.

· Certainty. If you make your own fruit salad, you have a good idea of what went into it, and this makes it easier to decide what that strange-looking bit was, or why it tastes too bitter or too sweet. It is much easier to interpret concordances or numerical data if you know exactly what texts a corpus consists of, since this allows a greater degree of top-down processing. It takes some time to gain sufficient familiarity with a pre-packaged corpus to recognize particular texts, or to interpret results in the light of its particular quirks. With one you have made yourself it is easier to make adjustments, and to recognize the limits to inferences.

· Creativity. Corpus-making, like cooking, can be fun, giving scope for individual panache. It is also gratifying when your fruit salad turns out to be delicious, or your corpus a useful resource.

· Critical awareness. Through trial and error, and consulting books and experts, you will probably become a better chef (whether of corpora or fruit salads) as you compare the effects of different proportions of different ingredients, or discover that mixing popular science with research articles is not always a good idea. Even if you are unsatisfied with the results of your efforts, the experience of making your own seems likely to make you a more critical corpus user, increasing awareness of how design affects the results – results which are (to quote Sinclair 1991: 13) “only as good as the corpus”.

· Communication. Making your own corpus or fruit salad can have more social spin-offs than opening a supermarket tin, providing lots to talk about with co-constructors and with other chefs, as well as with the consumers of the end product. Making your own opens up a whole range of opportunities for learners to discuss how best to compile and encode corpora for particular purposes, and to discuss how good they actually are for these purposes and how they might be improved.

2.2 Why use standard pre-packaged corpora?

Since there is a market for tinned fruit salad, there must presumably be some arguments in its favour. Pre-packaged corpora typically offer advantages compared with the home-made variety in terms of:

· Reliability. A pre-packaged corpus (provided it is well-designed and fits your needs) is likely to give more reliable results. Just as tinned fruit salads are subject to quality controls based on market research, it is more likely that a pre-packaged corpus will be reasonably “representative” of the population it aims to cover (Biber 1993), and carefully “balanced” amongst the different types of text which make up that population.

· Documentation. Pre-packaged corpora generally provide better documentation than home-made ones. With an off-the-shelf fruit salad, it is easier to find out the exact sugar content, and exactly how many calories you are consuming per portion. Pre-packaged corpora will generally include metatextual information about individual texts and their sources, and categorizations of their contents. They may also incorporate details of text structure, annotation of part-of-speech or syntactic features, etc.

· Designer software. Many off-the-shelf corpora come with specially-designed interrogation software, such as the SARA interface, which was designed to allow the BNC’s metatextual documentation to be exploited in interrogating the corpus. All-purpose, plain text concordancers such as Wordsmith Tools (Scott 1999) or MonoConc (Barlow 1998) cannot generally interpret such information satisfactorily.

· Convenience. It is clearly less effort to use a pre-packaged corpus than to make your own. All you have to do, as it were, is open the tin. Most readers of this paper will have their own favourite corpora, and many will feel that using them is vastly preferable to going through the effort of designing and constructing their own. This is perhaps the main factor to justify the compromise strategy to be outlined in the next section.

3. The pick’n’mix compromise

One way to avoid much of the effort involved in constructing your own corpus (or fruit salad) is to steal the necessary ingredients from elsewhere. The web is one prolific source of corpus ingredients (Bertaccini & Aston 2001), but these may require complex searches and considerable adaptation before they can be used (Pearson 2000). A more attractive strategy may be to extract a subcorpus from a larger corpus, whose texts can be treated as pre-prepared ingredients, just waiting to be selected in the desired proportions. The analogy here might be with a (fruit) salad bar, where you put together your own mixture from a series of bowls, each containing one kind of fruit which is ready washed, peeled and chopped into pieces of the right size. Here you can control the ingredients, selecting those which appeal to you. You can decide for a preponderance of raspberries, or indeed select or avoid one particularly bloated raspberry. You can omit the grapefruit or add another dollop of cream. The following conversational extract illustrates the process fairly well (however dubious you may be as to the result):

PS0A2: Right there’s erm (.) it’s a tin of fruit salad but I’ve put in some er kiwi and grapes (.) so it’s fresh fruit, it’s in its own juice, so it’s not in a heavy thick juice, there’s Vienetta or you can have a bit of each

PS09U: Well I’ll have a little bit of each then please

(BNC Sampler: KC2)

As well as an increase in control, the fruit salad bar provides an increase in certainty (you have a clearer idea what went in), and also in opportunity for creativity and communication. Repeated attempts may also lead to increased critical awareness. The fruit salad bar is convenient, requiring little effort bar that of decision-making. Substantial reliability can be maintained (assuming the components are subject to quality control), allowing you to draw on the documentation provided for each component, and to exploit it in designing and consuming your own mixture.

In the same way, if you construct your own subcorpus from the ingredients provided by a larger corpus, you can (within the limits of what is on offer) choose your own text-types, and indeed individual texts. Not only can you thereby increase control over and certainty as to the content, but you can also indulge your creativity, and exploit opportunities to communicate about your strategies and their results. In the process you may – through trial and error – become more critically aware of what are (and are not) useful subcorpora, and what are (and are not) appropriate design criteria. As we shall see in the next section, constructing your own subcorpora in this manner can maintain much of the reliability and conserve the documentation attached to the original corpus, as well as allowing you to exploit software specifically designed for use with that corpus. It is also far less work than compiling a corpus of your own.

4. The SARA subcorpus option

Recent releases of the BNC Sampler (1999) and the just-published BNC World edition (2001) come with enhanced software (SARA98) which allows users to define and then analyse subcorpora within the corpus in question. Since SARA can be used with any TEI-conformant corpus, the procedures outlined below are not, in theory, limited to the BNC, but can also be applied to other corpora which adopt similar encoding principles.

The SARA subcorpus option allows the user to define subcorpora consisting of:

· one or more specific texts selected from a list of all the texts in the corpus;

· all those texts which contain solutions to a particular query, for instance all those containing the word “Austria”.

Since SARA permits queries concerning the metatextual information provided with the texts as well as regarding their linguistic content, this second procedure can also be used to define subcorpora consisting of all the texts belonging to a particular design category (spoken/written, monologue/dialogue, imaginative/informative, etc.), or to a particular descriptive one (e.g. produced by a particular type of author/for a particular type of audience, or belonging to a particular genre: cf. 4.4 below and Lee, this volume). The two procedures can also be combined, with manual editing of the list of texts obtained from a particular query.
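The two selection procedures, and their combination, can be illustrated with a minimal sketch. This is not SARA's interface: the `Text` class, the function names, the text identifiers and the `mode` metadata label are all hypothetical, and the three-text corpus is invented for the purpose.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Text:
    """A corpus text: an identifier, its tokens, and header metadata."""
    text_id: str
    words: List[str]
    meta: Dict[str, str] = field(default_factory=dict)


def texts_containing(corpus: Iterable[Text], word: str) -> List[str]:
    """Ids of all texts with at least one solution to a one-word query."""
    target = word.lower()
    return [t.text_id for t in corpus
            if any(w.lower() == target for w in t.words)]


def texts_in_category(corpus: Iterable[Text], key: str, value: str) -> List[str]:
    """Ids of all texts whose metadata places them in a given category."""
    return [t.text_id for t in corpus if t.meta.get(key) == value]


def edit_selection(ids: List[str], add: Iterable[str] = (),
                   remove: Iterable[str] = ()) -> List[str]:
    """Manually adjust a list of text ids obtained from a query."""
    dropped = set(remove)
    kept = [i for i in ids if i not in dropped]
    return kept + [i for i in add if i not in kept and i not in dropped]


# Invented corpus; ids and metadata values are hypothetical.
corpus = [
    Text("T01", "Austria is a republic".split(), {"mode": "written"}),
    Text("T02", "well I went there last year".split(), {"mode": "spoken"}),
    Text("T03", "we flew to Austria".split(), {"mode": "spoken"}),
]

by_query = texts_containing(corpus, "Austria")
by_category = texts_in_category(corpus, "mode", "spoken")
combined = edit_selection(by_query, add=["T02"], remove=["T01"])
```

In each case the subcorpus is simply a list of text identifiers, which is why a query result and a hand-picked selection can be freely mixed.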

Once defined, a subcorpus can be saved and used as the basis for subsequent queries. It is also possible to index a saved subcorpus for easy re-use in subsequent sessions. In the rest of this paper I illustrate some practical examples of indexed subcorpora extracted from the BNC Sampler, and relate these examples to the learner uses of corpora described earlier (cf. 2 above).

4.1 A specific text as subcorpus

Scrolling through the list of texts in the BNC Sampler, I was struck to discover the Monster Raving Loony Party’s Draft manifesto for the British General Election of 1992 (AP6: Figure 1). Given the resurgence of extremist political parties in Europe today, I felt that participants at this conference, like many learners, might share my curiosity concerning this text:

Figure 1. Selecting texts for inclusion in the subcorpus

If we select this text and save it as a subcorpus, we can then begin to pose queries about it. In the first place, we can simply ask for a list of the most common words in it, shown in Figure 2:

Figure 2. Words whose frequency in the subcorpus is greater than 20

One striking feature here is the high frequency of the modals will and shall. This is presumably because manifestos announce programmes, providing declarations of intent as to future action. This is confirmed when we look at the results of a query for these two forms (followed by be + past participle: Figure 3):

Figure 3. Shall/will be VVN in the Monster Raving Loony Party manifesto

While the distribution of shall and will in these citations is not easy to account for, the concordance clearly demonstrates how a subcorpus consisting of just one text can highlight its distinctive formal characteristics, and also cast light on its style and meanings – as well as providing ample opportunity for discussion.
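A frequency list of the kind shown in Figure 2 can be approximated in a few lines. The sketch below is illustrative only: the function name is hypothetical, and the crude lower-casing and regular-expression tokenization are assumptions which will not reproduce the word segmentation of the BNC exactly.

```python
import re
from collections import Counter
from typing import List, Tuple


def frequent_words(text: str, threshold: int = 20) -> List[Tuple[str, int]]:
    """Words occurring more than `threshold` times, most frequent first."""
    # Assumption: a word is a run of letters and straight apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return [(word, n) for word, n in counts.most_common() if n > threshold]


# Invented input: "we" and "shall" occur 21 times each, "will" 20 times.
sample = "We shall " * 21 + "will " * 20
result = frequent_words(sample)
```

With the default threshold only words occurring more than 20 times are kept, so in the invented sample will is excluded while we and shall are listed.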

4.2 A bad language subcorpus

Subcorpora need not, of course, be limited to a single text. If we carry out a query in the spoken texts of the BNC Sampler for forms beginning with the characters fuck, we find 225 occurrences in 14 texts. In Figure 4, these texts are listed in order of the number of occurrences found (see the Query 1 column), so we can easily select all those which contain these forms as a subcorpus:

Figure 4. Texts containing forms of fuck[1]

In a browsing activity, this subcorpus could be employed to explore various aspects of bad language use. For example, we can generate a list of the collocates of the forms in question to cast light on their typical usage within these texts (Figure 5). What emerges most strikingly here is the collocate oh, which occurs no less than 45 times in a span of 3 words to the left and to the right. As a curiosity, it then comes naturally to ask what other words (if any) oh precedes in this subcorpus (Figure 6):

Figure 5. Collocates of forms of fuck in the bad language subcorpus, distinguished by part-of-speech and ranked by frequency

Figure 6. Right collocates of oh in the bad language subcorpus, ranked by z-score

Ranking the collocates in a span of 2 words to the right of oh in order of significance (Figure 6), we obtain a list which suggests that we have indeed created a subcorpus of bad language texts, including a number of other expressions with oh which learners wishing to improve their abusive competence might explore to their profit. It might also be of interest to compare the usage of male and female speakers – is it women or men who say oh dear? This is another area which can be investigated thanks to the metatextual information provided in the corpus and the specialized design of the interrogation software.

Clearly, the number of texts included in this subcorpus is very small, and we cannot assume that they constitute a reasonable cross-section of spoken texts involving bad language. Nonetheless they may still enable the user to generate, if not to definitively test, hypotheses as to use in this area. The other subcorpora discussed in the next sections are similarly too small to permit definitive conclusions, but they can again provide interesting suggestions as to language use in certain kinds of contexts.
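For readers curious about what lies behind collocate lists such as those in Figures 5 and 6, the following sketch counts the words found within a given span of a node and computes a z-score for one of them. The function names are hypothetical, the token sequence is invented, and the z-score shown is one common textbook formulation (observed minus expected co-occurrences, divided by the standard deviation); it is an assumption that this resembles the calculation performed by SARA.

```python
from collections import Counter
from math import sqrt
from typing import Callable, List


def collocates(tokens: List[str], is_node: Callable[[str], bool],
               left: int = 3, right: int = 3) -> Counter:
    """Count the words occurring within the span around each node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if is_node(tok):
            window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
            counts.update(w.lower() for w in window)
    return counts


def z_score(observed: int, colloc_freq: int, node_freq: int,
            corpus_size: int, span: int) -> float:
    """Assumed formulation: (O - E) / sqrt(E * (1 - p))."""
    p = colloc_freq / (corpus_size - node_freq)  # chance of collocate per slot
    expected = p * node_freq * span              # E over all window slots
    return (observed - expected) / sqrt(expected * (1 - p))


# Invented ten-word sequence; the node is "oh", with a span of 2 to the right.
tokens = "oh dear what a mess oh dear me oh no".split()
right_of_oh = collocates(tokens, lambda t: t.lower() == "oh", left=0, right=2)
z_dear = z_score(observed=right_of_oh["dear"], colloc_freq=tokens.count("dear"),
                 node_freq=tokens.count("oh"), corpus_size=len(tokens), span=2)
```

A query for all forms beginning with a given string would simply use a different test for the node, for instance `lambda t: t.lower().startswith(prefix)`.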

4.3 Subcorpora of encoded categories of texts

The spoken texts in the BNC fall into two main classes, demographic and context-governed. These labels, which refer to the way in which recordings were collected, distinguish free conversations from talk recorded in settings of an institutional nature – classrooms, courtrooms, business meetings, and the like. The context-governed texts may be either monologue or dialogue – a feature which is again indicated in the BNC text classification. A search for context-governed monologue in the BNC Sampler finds 17 texts, while a search for context-governed dialogue finds 29. We can use the same procedure as in the last section to list the texts which match these queries, and to form separate subcorpora of context-governed monologue and dialogue texts.

Analysing these two subcorpora, we find considerable differences. If we compare the 200 most frequent words in each, we find that could, had, he, know, their, were, when, who, and your are ranked more than 20 positions higher in the monologue subcorpus, while ’ll, ’m, any, no, pounds, right, yeah and yes are more than 20 positions higher in the dialogue subcorpus. The differences in the frequencies of yeah, yes, no and right suggest that speakers may be less concerned with explicitly negotiating agreement when they hold a monopoly of the floor (for instance, we find that there are no occurrences of all right in the monologue subcorpus). There also appears to be a difference in the use of pronouns: for instance, we find that we is relatively more frequent in dialogue, and you in monologue (Figure 7):

	we	you
Monologue	2014	4253
Dialogue	4949	6635

Figure 7. Occurrences of we and you in the monologue and dialogue subcorpora

Perhaps this is again due to the unwillingness of speakers in monologue contexts to claim shared attitudes with their audiences, given that the latter have little chance to disagree. By examining sample sets of citations, it may however be possible to advance other hypotheses to account for these differences, and in any case for learners to reflect on the linguistic differences between monologue and dialogue settings.
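The two comparisons made in this section can be sketched as follows. The function names are hypothetical, and the treatment of the rank comparison (only words appearing in both top-200 lists are considered) is an assumption, since the procedure is not spelt out above. The pronoun figures are those of Figure 7. Because the two subcorpora differ in size, the raw counts are not directly comparable, so the sketch computes the share of we in the combined total of we and you for each subcorpus.

```python
from collections import Counter
from typing import Dict, List


def top_ranks(tokens: List[str], n: int = 200) -> Dict[str, int]:
    """Map each of the n most frequent words to its rank (1 = most frequent)."""
    ranked = Counter(tokens).most_common(n)
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}


def ranked_higher(ranks_a: Dict[str, int], ranks_b: Dict[str, int],
                  min_gap: int = 20) -> List[str]:
    """Words in both lists ranked more than min_gap positions higher in A."""
    return sorted(w for w in ranks_a
                  if w in ranks_b and ranks_b[w] - ranks_a[w] > min_gap)


def share(count_a: int, count_b: int) -> float:
    """Proportion of count_a in the combined total of the two counts."""
    return count_a / (count_a + count_b)


# Occurrences of we and you as reported in Figure 7.
figure7 = {
    "monologue": {"we": 2014, "you": 4253},
    "dialogue": {"we": 4949, "you": 6635},
}
we_share = {name: share(c["we"], c["you"]) for name, c in figure7.items()}
# we_share["monologue"] is about 0.32, we_share["dialogue"] about 0.43
```

On this measure we accounts for roughly a third of the two pronouns in monologue and rather more than two fifths in dialogue, which is the sense in which we is relatively more frequent in dialogue and you in monologue.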

4.4 Subcorpora based on other categorizations

It is also possible to define subcorpora which are based on different categorizations from those originally encoded by the corpus designers. Lee (this volume) provides a personal categorization of the written texts in the BNC, based on criteria such as academic vs non-academic, prose vs poetry, fiction vs non-fiction.[2] We can use Lee’s lists to define subcorpora from the Sampler corresponding to such categories as:

· academic non-fiction (13 texts);

· non-academic non-fiction (15 texts);

· prose fiction (13 texts).

Looking at wordlists for these three subcorpora, we discover a number of items which appear to be more common in one than in the other two. For example, the following adverbs in -ly all occur more than 15 times in the subcorpus indicated, and less than 10 times in each of the others:

· academic non-fiction: accordingly, essentially, eventually, largely, namely, notably, respectively, surprisingly;

· non-academic non-fiction: effectively, merely, normally, obviously, possibly, specially;

· prose fiction: carefully, quietly, slightly, slowly, softly, surely, truly.
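The selection criterion used for these lists (more than 15 occurrences in one subcorpus, fewer than 10 in each of the others) can be expressed as a simple filter over the three wordlists. In the sketch below the function name is hypothetical and the counts are invented for illustration; they are not the Sampler figures. Matching on the ending -ly is also a simplification, since it would admit words such as family unless part-of-speech information were used as well.

```python
from typing import Dict, List


def distinctive(counts: Dict[str, Dict[str, int]], target: str,
                high: int = 15, low: int = 10) -> List[str]:
    """Words in -ly occurring more than `high` times in the target wordlist
    and fewer than `low` times in every other wordlist."""
    others = [c for name, c in counts.items() if name != target]
    return sorted(
        word for word, n in counts[target].items()
        if word.endswith("ly") and n > high
        and all(c.get(word, 0) < low for c in others)
    )


# Invented counts, for illustration only.
wordlists = {
    "academic": {"largely": 18, "slowly": 2, "only": 40},
    "non-academic": {"largely": 4, "slowly": 3, "only": 45},
    "fiction": {"largely": 1, "slowly": 25, "only": 50},
}
academic_items = distinctive(wordlists, "academic")
fiction_items = distinctive(wordlists, "fiction")
```

A word like only, which is frequent in all three wordlists, is correctly excluded, since it fails the second half of the criterion.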

If we take just one of the items from the academic non-fiction list, largely, we can follow the traditional procedures of data-driven inductive learning to explore its uses in this genre (Johns 1991). On the one hand, largely appears to qualify participial predicates with a negative semantic prosody, collocating with expressions like annihilated by, confined to, denied to, ignored, invalidated by and limited to. On the other hand, it appears to qualify linking expressions introducing causes, such as accounted for, based upon, because, dependent upon and due to (Figure 8):

Figure 8. Largely in academic non-fiction (Lee’s categorization)

There are two apparent exceptions to these patterns in this concordance – largely by catching fish and largely genetic – which I leave to the reader to account for.

A concordance of this kind casts light on the language of the text-type in question by providing a limited, relatively homogeneous set of citations, which are easier to categorize and interpret than ones taken from a broader variety of texts. While in no way permitting an exhaustive account of the ways in which the word largely can be used, the concordance clearly illustrates two uses which would seem to play a significant role in this text-type, and which may therefore be of use to those who are learning to deal with such texts in their reading or writing.
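The format of such a concordance is easily reproduced. The sketch below builds key-word-in-context lines for a node word and sorts them by what follows the node, which tends to bring recurrent patterns together. The function name is hypothetical and the example sentence is invented; it is not taken from the Sampler.

```python
from typing import List, Tuple


def kwic(tokens: List[str], node: str, width: int = 4) -> List[Tuple[str, str, str]]:
    """Key-word-in-context lines as (left context, node, right context)."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Invented example sentence.
sentence = ("the effect was largely due to chance "
            "and was largely ignored by critics").split()
lines = kwic(sentence, "largely", width=2)

# Sorting on the right context groups similar continuations together.
for left, node, right in sorted(lines, key=lambda line: line[2]):
    print(f"{left:>12}  {node}  {right}")
```

Sorting on the right context is one design choice among several; sorting on the left context would instead highlight what largely follows.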

A subcorpus of this kind could also be used in other ways: to generate exercises in reading academic prose (for instance in parsing nominal groups containing particular heads: Brodine 2001), or as a reference tool to assist learners with other features of academic writing. Taking instead a browsing perspective, investigating largely may lead the learner to examine near-synonyms (mostly, mainly, to a large extent, for the most part), or to examine other collocates of such expressions as confined to and limited to.

Nor need the learner’s investigations be confined to this subcorpus alone. We have seen that largely appears particularly frequent in texts classed as written academic prose, but in analysing its uses in those texts, we have not posed the question of whether it has the same uses in other text types. The entry for largely in the Collins Cobuild dictionary (1995) suggests that this may not be the case, since it cites examples from the Bank of English which fit into neither the causative nor the negative prosody category: The fund is largely financed through government borrowing … I largely work with people who already are motivated … Their weapons have been largely stones. Working with a subcorpus frequently invites comparison of the results obtained with those from other subcorpora, or indeed from the whole corpus from which the subcorpus has been derived. Starting from a limited set of texts of a single type will simplify this process insofar as it reduces the initial number and variability of citations – providing, that is, a line of approach to the analysis of samples drawn from the full corpus (Gavioli 2001).

5. Conclusions

Through these examples I hope to have illustrated how subcorpora derived from the BNC Sampler can allow learners to carry out activities of each of the types listed in 2 above, in particular:

· to study and compare forms in particular texts or text-types, contrasting these with those in other texts or text-types;

· to study and compare meanings in particular texts or text-types, contrasting these with those in other texts or text-types;

· to carry out focussed reading practice;

· to adopt appropriate reference tools for particular tasks;

· to carry out focussed browsing.

At the same time, I would argue, subcorpora like these share many of the characteristics which have motivated proposals to use ‘home-made’ corpora in language learning (cf. 1 above). To summarize:

· Subcorpora can provide small, manageable amounts of data of a more homogeneous nature than is possible with large mixed corpora, thereby facilitating analysis. It is, of course, essential for users to recognize that such subcorpora are neither sufficiently large, nor sufficiently carefully designed, to be considered “representative” samples of the text-types involved, and that inferences made from them should not be treated as definitive. However, as I have stressed elsewhere, language learning appears to be a matter of progressive approximation on the basis of ever-growing experience (Aston 1997). Thus, while a learner who sees the two uses of largely presented in 4.4 above cannot pretend to have fully understood all the potential uses of the word in academic discourse, s/he arguably has formed an idea of two of the main ways in which it can be used, and is well placed to refine this knowledge further in the future.

· Subcorpora can provide a specialized environment for the study of particular texts and instances. As the subcorpora described in this paper were all taken from the BNC Sampler (which contains a mere 2% of the texts in the full BNC), they were extremely small, and also relatively unspecialized. With the new BNC World edition, however, more highly specialized subcorpora can be constructed – not just of written academic discourse, but of written academic discourse in the field of medicine, not just of spoken monologue but of lectures, and so on. Increased specialization entails increased homogeneity, and consequently more precise focussing and reduced dispersiveness in corpus use. Alternatively, much larger subcorpora can be extracted for categories like those discussed in the last section, with a corresponding increase in reliability: there are, for instance, 504 texts classed as written academic prose in the World edition, which include 2348 occurrences of largely. These larger numbers may however be difficult for learners to manage, initially requiring analyses of smaller selections. Consequently there would still seem to be a pedagogic place for small subcorpora such as those illustrated here, as points of initial focus for the learner to generate hypotheses which can then be tested against the larger subcorpora obtainable from the full BNC. Nor should we forget that the complete corpus, with its myriad paths for the motivated learner to adventure down, is always there to be consulted (Bernardini 2000).

· If learners create and select their own subcorpora for particular tasks, they will also acquire practice and experience in corpus design which may be of use to evaluate corpora with which they are unfamiliar, or to create corpora of their own from less structured sources, such as the Web. These skills would appear useful not only for would-be readers, writers and translators of specialized texts, but also for more general-purpose language learners, insofar as the latter need to develop a sensitivity to genre and register variation. It is clear that subcorpora extracted from large mixed corpora like the BNC cannot be expected to satisfy all possible requirements – only a specifically collected corpus of astrophysics is likely to provide enough information concerning white dwarfs or the rhetoric of astronomers; only a corpus of learner texts will satisfy the need to study learner or lingua franca English (Granger 1998, this volume; Seidlhofer 2000). But, one might argue, it is precisely because subcorpora have these limits that they can provide valuable ways of learning to design and use corpora in general.

References

Aston, Guy (1997), “Small and large corpora in language learning,” in: Lewandowska-Tomaszczyk/Melia (1997), 51-62.

Aston, Guy (2000), “The British National Corpus as a language learning resource,” in: Battaner/López (2000), 15-40.

Aston, Guy, ed. (2001a), Learning with corpora, Houston TX: Athelstan.

Aston, Guy (2001b), “Learning with corpora: an overview,” in: Aston (2001a), 7-45.

Barlow, Michael (1998), MonoConc, Houston TX: Athelstan.

Battaner, M. Paz/Carmen López, eds. (2000), VI jornada de corpus linguistics, Barcelona: Institut universitari de lingüística aplicada, Universitat Pompeu Fabra.

Bernardini, Silvia (2000a), Competence, capacity, corpora, Bologna: CLUEB.

Bernardini, Silvia (2000b), “Systematizing serendipity: proposals for concordancing large corpora with learners,” in: Burnard/McEnery (2000), 225-234.

Bernardini, Silvia/Federico Zanettin, eds. (2000), I corpora nella didattica delle lingue, Bologna: CLUEB.

Bertaccini, Franco/Guy Aston (2001), “Going to the Clochemerle: exploring cultural connotations through ad hoc corpora,” in: Aston (2001a), 198-219.

Biber, Doug (1993), “Representativeness in corpus design,” Literary and linguistic computing 8, 243-257.

Bowker, Lynne (1998), “Using specialized monolingual native-language corpora as a translation resource: a pilot study,” Meta 43, 631-651.

The British National Corpus World Edition (2001), Oxford: Oxford University Computing Services.

The BNC Sampler (1998), Oxford: Oxford University Computing Services.

Brodine, Ruey (2001), “Integrating corpus work into an academic reading course,” in: Aston (2001a), 138-176.

Burnard, Lou/Tony McEnery, eds. (2000), Rethinking language pedagogy from a corpus perspective, Frankfurt am Main: Peter Lang.

Collins Cobuild Dictionary (1995, 2nd edition), London: HarperCollins.

Gavioli, Laura (1997), “Exploring texts through the concordancer: guiding the learner,” in: Wichmann/Fligelstone/McEnery/Knowles (1997), 83-99.

Gavioli, Laura (2001), “The learner as researcher: introducing corpus concordancing in the classroom,” in: Aston (2001a), 108-137.

Granger, Sylviane, ed. (1998), Learner English on computer, London: Longman.

Johns, Tim (1991), “Should you be persuaded: two samples of data-driven learning materials,” English language research journal 4, 1-16.

Lewandowska-Tomaszczyk, Barbara/James Melia, eds. (1997), Palc ’97: Practical applications in language corpora, Łódź: Łódź University Press.

Maia, Belinda (2000a), “Making corpora: a learning process,” in: Bernardini/Zanettin (2000), 47-60.

Maia, Belinda (2000b), “Comparable and parallel corpora – and their relationship to terminology work and training,” Paper presented at 2nd Corpus Use and Learning to Translate conference, Bertinoro.

Owen, Charles (1996), “Do corpora require to be consulted?” ELT journal 50, 219-224.

Pearson, Jennifer (1998), Terms in context, Amsterdam: Benjamins.

Pearson, Jennifer (2000), “Surfing the internet: teaching students to choose their texts wisely,” in: Burnard/McEnery (2000), 235-239.

Raffa, Giuliana (2000), The astrophysics research article: a corpus-based analysis. Unpublished dissertation. Forlì: SSLMIT.

Scott, Mike (1999), Wordsmith tools ver. 3.0, Oxford: Oxford University Press.

Seidlhofer, Barbara (2000), “Operationalizing intertextuality: using learner corpora for learning,” in: Burnard/McEnery (2000), 207-223.

Sinclair, John (1991), Corpus concordance collocation, Oxford: Oxford University Press.

Varantola, Krista (2000), “Translators and disposable corpora,” Paper presented at 2nd Corpus Use and Learning to Translate conference, Bertinoro.

Wichmann, Ann/Steve Fligelstone/Tony McEnery/Gerry Knowles, eds. (1997), Teaching and language corpora, London: Longman.

Willis, Dave (1998), “Learners as researchers,” Paper presented at IATEFL 32nd annual conference, UMIST, Manchester.

Zanettin, Federico (2001), “Swimming in words: corpora, translation, and language learning,” in: Aston (2001a), 177-198.

[1] Text 000 in this list is the corpus header file, which must always be included in any subcorpus.

[2] Lee’s categorizations have been added to the metatextual information provided in the text headers of the BNC World edition (2001).