Book review

Gries, S. Th., Wulff, S. and Davies, M. (eds). Corpus-linguistic applications. Current studies, new directions. Amsterdam and New York: Rodopi. [Language and Computers series] pp. 1-260. ISBN: 978-90-420-2800-5

Reviewed by Adriano Ferraresi and Silvia Bernardini

This volume brings together a selection of papers presented at the eighth Conference of the American Association for Corpus Linguistics, held at Brigham Young University, Utah, in 2008. As the title suggests, and as the editors claim in their introduction, the aim of the volume is to answer the question “what are the current studies and new directions? Or, in other words: what’s hot in corpus linguistics in 2008/2009?” (p. 1). Four areas are identified, namely diachronic, function-oriented, register/genre-based and methodological applications. It is debatable whether these areas can in fact be taken to represent what is hot in corpus linguistics in general (one notices for instance the lack of works dealing with multilingual, multimodal or web corpora, to mention but the first that come to our mind), or should not rather be interpreted as providing a snapshot of what is hot in North American corpus linguistics (11 out of 15 articles are authored by scholars affiliated with U.S. or Canadian institutions). This is of course no criticism, quite the contrary: we believe it is among the assets of this volume, and were surprised that the editors did not play it up.

As well as being the subject of a dedicated section, methodological concerns are central to virtually all the works included in this volume, as is the provision of a more thorough theoretical contextualisation for corpus analyses than one is used to finding in (some of) the corpus literature. We return to this point in the reviews of the individual contributions below. From the methodological point of view, the statistical/computational input provided by the volume is especially rich, and should be of great interest to those corpus linguists who wish to extend their research with methods derived from these neighbouring fields. If this “pedagogic” intent was among the objectives of the authors/editors, then a little more detail on the toughest technical parts would have been welcome.

Concerning the introduction, one could take issue with some of the claims made by the editors. For instance, the claim is made that “corpus linguistics was born in an attempt to adequately describe and analyze contemporary language data as opposed to prescriptive grammar books. In 2008, a major emerging trend is to use established corpus-linguistic tools and apply them to diachronic data.” (p. 2) Rather than an emerging trend, the focus on diachronic issues might be better described as a resurgence of interest. Diachronic studies have been with us for a long time, witness corpora such as the Helsinki corpus of English texts (whose construction began in 1984; cf. Kytö (1996)) and volumes like Hickey et al. (1997). Writing in the early 90s, Aarts et al. claimed that diachronic corpora were “enjoying a new popularity” ((1993:i), emphasis added), and quite possibly the same is happening today. Also, the editors mention several “seemingly innocent notions and assumptions” that the articles in the collection problematize, thus making “corpus linguistics grow[…] and prosper[…]” (p. 5). Among these is the notion of “the word as the central unit of investigation” (ibid.). This claim is rather surprising for two reasons. First, some articles in the collection do take single words as their central units of investigation (cf. Miglio on dizque, Rudanko on submit). Secondly, and more importantly, the claim does not do justice to the long-standing concern of corpus linguistics with the search for meaningful units above the word level. If one looks back to the lexicographic tradition that is mentioned elsewhere in the introduction (p. 3), one finds that rejecting the word as a central unit of investigation was the prime concern leading to the construction and exploitation of corpora, as even a brief foray into the work of Sinclair and colleagues will reveal (Sinclair et al. (1970 [2004]); Sinclair (1991), (2004); cf. e.g. Louw (1993) on semantic prosody, Sinclair and Renouf (1991) on collocational frameworks).

Moving on to discuss the different contributions individually, and taking them in the order in which they appear in the volume, the first section (on diachronic applications) opens with Viola G. Miglio’s article on the Spanish adverb dizque. This contribution provides a detailed historical account of this adverb, used mainly in Latin American Spanish. Drawing evidence from four different corpora (the well-known reference corpora of Spanish CREA, CORDE and CDE, plus DLNE, a specialised collection of Mexican texts from the Colonial period), the author describes the evolution of this word both in quantitative terms (from emergence in the 13th century through decline in the 17th and resurgence in the 20th with a different text-type distribution) and qualitative terms (grammaticalization and shift from evidential to mainly epistemic marker). As is often the case, especially with diachronic data, the corpora are not ideal for disentangling variables related to e.g. formality, orality, and geographic origin, which seem to play a role in the diachronic description of this word; but the author honestly acknowledges this and certainly does her best with the available data. The only small criticism one could level at this careful work is the inconsistent provision of normalised frequencies which, while understandable given the small numbers of hits from the corpora, makes direct comparison (across corpora, registers, centuries, geographic origin) less straightforward.

The article by Alfonso Medina Urrea proposes an automatic method to extract what he calls “morphological profiles” from corpora, i.e. sets of grammatical modifiers and their sequences (morphological affixes), which he then uses as reference points for comparing corpora taken to represent different diachronic varieties of Mexican Spanish (16th / 18th / 20th century). The method for morphological segmentation is based on a combination of three statistical measures derived from Information Theory, which are conflated to obtain a measure of what the author calls “affixality”, i.e. the extent to which word fragments “may be joined to other items in order to form other graphical words” (p. 30). These measures allow Medina Urrea to extract from the three corpora the most salient morphological items, which he then compares using Euclidean distance as a measure of similarity. The resulting distances are small but seem to confirm the author’s expectations, i.e. the distances between 16th century Mexican Spanish and the other two varieties are the greatest ones. While the author does make an attempt to check for the non-randomness of these results through a follow-up study comparing results with those obtained for two varieties of Peninsular Spanish (i.e. from the 18th and 20th centuries), some methodological questions remain unanswered, such as why no comparison is made with 16th century Peninsular Spanish, which would have made the claim on the period of emergence of Mexican Spanish as a distinct dialect more compelling; or what the possible effects of corpus composition are (the 16th and 18th century Mexican Spanish corpora are not described; cf. also the discussion of Miglio’s article, above). However, the idea of comparing corpora on the basis of morphological traits rather than lexical bases/words is an extremely interesting one, with wide-ranging applications beyond diachronic research.
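
To make the comparison step concrete, the following is a minimal sketch of how two morphological profiles might be compared by Euclidean distance. It is not Medina Urrea's implementation: the representation of a profile as a mapping from affix to score, the treatment of missing affixes as zero, and the toy figures are all assumptions made purely for illustration.

```python
import math


def euclidean_distance(profile_a, profile_b):
    """Euclidean distance between two profiles given as {affix: score} dicts.

    Assumption of this sketch: an affix absent from one profile
    is treated as having a score of 0.0 in that profile.
    """
    affixes = set(profile_a) | set(profile_b)
    return math.sqrt(sum(
        (profile_a.get(affix, 0.0) - profile_b.get(affix, 0.0)) ** 2
        for affix in affixes
    ))


# Invented toy scores (hypothetical, not taken from the article).
profile_16th = {"-mos": 0.9, "-aba": 0.7, "-ades": 0.6}
profile_18th = {"-mos": 0.9, "-aba": 0.7, "-ais": 0.2}
profile_20th = {"-mos": 0.9, "-aba": 0.7}

for label, other in (("18th", profile_18th), ("20th", profile_20th)):
    print(f"16th vs {label}: {euclidean_distance(profile_16th, other):.3f}")
```

On real data the profiles would contain the most salient affixes extracted from each corpus, and the interest lies in the relative size of the pairwise distances rather than in their absolute values.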

As was the case with Miglio’s article, data sparseness also afflicts Juhani Rudanko’s article on the alternation between the to-infinitive and the to -ing complements of the verb submit (in this case, however, normalised frequencies are usually provided). Despite the scarce evidence (several of the few examples found come from the work of single authors, a problem that the author acknowledges), the contribution presents claims of noteworthy theoretical import regarding grammatical change. A reduction in use of the structure and a shift from prevalence of the to-infinitive to prevalence of the to -ing complement are observed in both British and American English; the same does not seem to apply to Canadian English, even though the evidence available for the latter variety is not sufficient to draw firm conclusions. Furthermore, based on the work of single authors employing both constructions, the author is able to conclude that “it may be necessary to allow for the availability, for individual speakers, of competing grammars that exhibit non-diaglossic variation during a time of grammatical change” (p. 64).

The more methodologically-oriented work by Cristina Mota deals with corpus comparison in a diachronic perspective. She applies the method for corpus comparison proposed by Kilgarriff (2001) to a corpus of journalistic texts in Portuguese (CETEMPúblico), annotated with information on text topic and time of publication. For each of 5 topics and 16 time periods (the corpus spans a relatively short time, from 1991 to 1998, and each year is split into six-month partitions) she calculates scores of homogeneity and similarity, using the chi-square metric normalized by the degrees of freedom. The ultimate aim of the paper is to evaluate empirically how corpus composition may affect the performance of a named entity tagger, and to this end the experiments on within-topic similarity over time are repeated for (non-sentence-initial) uppercase words, a very simple heuristic to identify entities, and lowercase words. Mota’s results suggest that 1) the same-topic corpora become increasingly dissimilar as the time gap increases and the curve of corpus dissimilarity does not seem to stabilize (over a span of 8 years); 2) texts on the same topic tend to become as dissimilar from each other over time as texts on different topics. The aim of this contribution (assessing the possible effects of corpus similarity on the performance of a named entity tagger) may be seen as somewhat marginal to corpus-based studies of diachronic data, and the study leaves some questions open, which the author honestly acknowledges (e.g. the extent to which results are generalizable if experiments are repeated on a non-single-source corpus). Nonetheless, the results are potentially of relevance to all corpus linguists, who sooner or later are bound to find themselves in need of assessing the ways in which the (sub)corpora they work with resemble or differ from each other.
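
For readers unfamiliar with this kind of measure, the sketch below illustrates a chi-square score normalized by the degrees of freedom, in the spirit of Kilgarriff (2001). It is not Mota's code: the function name, the default number of words compared, and the choice of degrees of freedom (number of words minus one, as for a two-column contingency table) are assumptions of this illustration.

```python
from collections import Counter


def chi_by_degrees_of_freedom(tokens_a, tokens_b, top_n=500):
    """Chi-square over the top_n most frequent words of the joint corpus,
    divided by the degrees of freedom. Higher values mean less similar corpora.

    Assumptions of this sketch: corpora are plain token lists, and the
    degrees of freedom are (number of words compared - 1).
    """
    if not tokens_a or not tokens_b:
        raise ValueError("both corpora must contain at least one token")
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    size_a, size_b = len(tokens_a), len(tokens_b)
    joint = counts_a + counts_b
    words = [word for word, _ in joint.most_common(top_n)]

    chi2 = 0.0
    for word in words:
        total = joint[word]
        for observed, size in ((counts_a[word], size_a), (counts_b[word], size_b)):
            # Expected count if the word were spread in proportion to corpus size.
            expected = total * size / (size_a + size_b)
            chi2 += (observed - expected) ** 2 / expected

    degrees_of_freedom = max(len(words) - 1, 1)
    return chi2 / degrees_of_freedom


# Toy corpora, invented for illustration.
corpus_1991 = "o governo anunciou o plano".split()
corpus_1998 = "o clube anunciou o treinador".split()
print(chi_by_degrees_of_freedom(corpus_1991, corpus_1998))
```

Under this kind of scheme, a homogeneity score can be obtained by comparing two halves of the same corpus, and a similarity score by comparing halves drawn from different corpora.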

The section on “function-oriented applications” opens with Georgie Columbus’s contribution on invariant tags as used in three varieties of English. Relying on the relevant sub-corpora of ICE, Columbus provides a detailed manual analysis of four invariant tags (eh, yeah, no and na) used in private dialogue across British, Indian and New Zealand English. Through a painstaking process of co-textual interpretation the author establishes several cross-varietal differences in use that might lead to misunderstandings and should certainly be taken into account in ESOL pedagogy. One might argue that a transcribed multi-purpose corpus like ICE is not ideal for an investigation such as this one (the potential bias due to transcription is not mentioned). The author does try to compensate for the lack of sound or intonational mark-up by going through several readings of concordance lines, yet the perspective would seem to remain partial, not doing justice to the richness of non-verbal cues offered by dialogic data. Also, implications regarding the mutual intelligibility of speakers of different varieties of English, which is claimed to be “not entirely possible at the discourse level” (p. 99), do not seem to be warranted by the data or by the analysis presented in this article, nor indeed by the mono-varietal sub-corpora of ICE: a lingua franca corpus would be in order.

The contribution by Philip Dilts is concerned with the notions of semantic preference and semantic orientation, and the interactions between them. The author devotes a relatively extensive section to the literature review, where he defines the two notions: semantic orientation is the property of a word of arousing good or bad feelings in people when they are presented with that word, a concept that is succinctly captured in the article’s title “Good nouns, bad nouns: […] what native speakers think”. After presenting previous work on the computational methods to extract semantic orientations from corpora, Dilts moves on to explore in more detail the corpus linguistics-derived notion of semantic preference (the “what the corpus says” part of the title), which he relates to that of semantic prosody. It is argued that the present study is primarily concerned with semantic preferences, i.e. the tendency of words to co-occur with other words belonging to a limited set of semantic categories (which in the author’s view may take the value of positivity and negativity), rather than dealing with semantic prosody, which is in fact the notion traditionally accounting for the positivity/negativity of an expression based on its lexical collocates. The author’s position, shared by Bednarek (2008), challenges current views (see e.g. Stewart (2009)); a more extensive discussion would therefore have been welcome. After this introductory part, an experiment is presented in which the author evaluates adjective-noun pairs extracted from the BNC; drawing on ANEW (Bradley and Lang (1999)), a resource containing human ratings of the semantic orientation of given nouns, and on the results of a previous study aimed at weighing the (good/bad) semantic preference of adjective-noun pairs found in the BNC, the author explores how the positive/negative semantic orientation of a noun correlates with positive/negative semantic preferences of the adjective-noun pairs into which it enters. This leads to interesting results: while “good” nouns tend to collocate primarily with adjectives reinforcing their semantic orientation, “bad” nouns collocate both with “bad” adjectives and with a surprisingly high number of “good” adjectives, thus suggesting that semantic orientation and semantic preference do not necessarily coincide. The contribution presents innovative research, opening up very interesting directions for future studies at the boundaries between corpus linguistics and psycholinguistics.

Tatiana Zdorenko studies subject omission in the Russian National Corpus, finding register effects on the frequency of the phenomenon (with informal spoken language displaying the most instances and written genres the fewest) as well as local factors likely to affect use of a null subject, such as different person contexts and lexicalised collocations. This contribution exemplifies how research in corpus linguistics should be done: well-grounded in theory, explicit about assumptions, data set and methodology, bringing to light the added value of the corpus approach, and arriving at non-trivial findings.

The section on “register/genre applications” begins with Phuong Dzung Pho’s contribution on the linguistic realization of (manually identified) rhetorical moves in abstracts and introductions of research articles in applied linguistics and educational technology. The statistical analysis carried out on the data highlights both the prototypical features of each move compared to the others and any differences in the use of these features between the two disciplines. The general conclusions are hardly momentous — the fact that “[l]inguistic features do vary across moves” (p. 149) should not be surprising: if linguistic features did not vary across moves, how could their function vary, and indeed how would we identify moves in the first place? Yet the identification of the prototypical features of moves through a well-designed, clearly circumscribed study can provide input to EAP teaching as well as testifying to the potential of corpus methods for genre and discourse analysis.

Like Pho, Eniko Csomay and Viviana Cortes examine linguistic features (in this case lexical bundles) in previously identified sub-units of discourse. The units are identified bottom-up rather than top-down, using an automated procedure based on lexical cues to topic and orientation change, and the methodology is applied to spoken data, i.e. academic lectures. The findings support previous studies carried out in the same tradition: discourse-organising and stance-expressing bundles feature more prominently in the more interactive (initial) units of discourse and then decline as discourse becomes more informational and monologic, as well as richer in referential expressions. As was the case with Pho’s contribution, the findings are hardly surprising. For instance, the authors claim that “[i]nterestingly enough bundles with personal attributes […] declined while bundles classified as impersonal intention and prediction showed growth” (p. 162); yet this would be the common-sense expectation, since speakers obviously progress from interactive class management to impersonal lecturing, and lexically-defined units are likely to pick up just such differences, as previous work cited confirms. Again, the strong point of this contribution is the solid methodology adopted, which is described in sufficient detail to allow replication and adaptation to novel settings.

Luciana Diniz’s contribution provides a detailed analysis of several means employed to express indirect orders in academic spoken discourse. Target expressions, including modals and lexical verbs, are identified based on previous research, personal experience/intuition and manual searches of the corpus materials (drawn from MICASE); their function is then analysed through manual inspection of concordance lines. The observations made are interesting and have clear pedagogic implications, but one would have valued an attempt at relating them to other genres or teaching contexts, so as to put them into perspective. We can agree with the author that “modals and lexical verbs that are usually labelled as performing a simple advisability function in pedagogical grammar are, in reality, context-sensitive” (p. 178), but the claim would be much more compelling if data from a different context, or from a reference corpus, were provided for the same expressions, showing that the “indirect order” use is indeed more prominent in this setting than in others.

Eileen Fitzpatrick and Joan Bachenko apply corpus methods to first compile a list of deception cues (i.e. words and phrases signalling that a speaker is lying) and then to test their predictive power. This is a fascinating area of study, as well as an extremely challenging one from several points of view, not least the identification of adequate corpus materials for the test phase especially. While the model is found to predict deception with an encouraging overall accuracy of 75%, the authors point out that performance varies substantially depending on similarity between the training and the testing corpora, a big hurdle given the difficulties of assembling large and fine-tuned corpus materials in this field.

The article by Stefan Th. Gries opens the fourth and last section of the book, on “methodology and tools”. As also pointed out by the editors in the introduction, the four contributions included here have been grouped so as to provide examples of how corpus linguistics can be enriched, and is indeed being enriched, by methods derived from disciplines such as statistics and computational linguistics. Gries’s contribution excellently exemplifies the point. The author discusses several statistical measures that are widely used in the corpus linguistics literature for calculating dispersion (i.e. how homogeneously distributed words are across different parts of a corpus) and adjusted frequencies (i.e. methods for penalizing frequencies of words that are attested only in a small part of a corpus). The questions the author asks are: 1) how do these measures perform and compare to each other? and 2) drawing evidence from psycholinguistics, what exactly do (some of) these measures measure? To address the second question, Gries investigates how well the results obtained through these statistics correlate with “word familiarity” in the brain of native speakers. The answers to the questions are not definitive: the author limits himself to providing preliminary indications as to some of the advantages and limitations of the measures taken into account, and advocates the need for further methodological refinements to his study. The main issue he raises, however, is a central one: we know little (and nonetheless take much for granted) about what certain statistical measures do, and we should gain a better understanding of them if we are to assess their significance for corpus analysis.
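
As an illustration of what a dispersion measure computes, the sketch below implements deviation of proportions (DP), one measure from the corpus-linguistic literature. The choice of DP is an assumption made here for illustration only: the review does not specify which measures the chapter compares, and the function name and toy figures are hypothetical.

```python
def deviation_of_proportions(part_sizes, word_counts):
    """Deviation of proportions (DP) for one word across corpus parts.

    part_sizes:  number of tokens in each corpus part
    word_counts: occurrences of the word in each corresponding part

    Returns a value close to 0 when the word is spread across the parts
    in proportion to their sizes, and close to 1 when it is concentrated
    in a small portion of the corpus.
    """
    if len(part_sizes) != len(word_counts):
        raise ValueError("part_sizes and word_counts must have the same length")
    total_size = sum(part_sizes)
    total_count = sum(word_counts)
    if total_size == 0 or total_count == 0:
        raise ValueError("corpus and word frequency must be non-zero")
    return 0.5 * sum(
        abs(count / total_count - size / total_size)
        for size, count in zip(part_sizes, word_counts)
    )


# Invented toy figures: four equally sized corpus parts.
sizes = [1000, 1000, 1000, 1000]
print(deviation_of_proportions(sizes, [5, 5, 5, 5]))   # evenly spread word
print(deviation_of_proportions(sizes, [20, 0, 0, 0]))  # word confined to one part
```

Both toy words occur 20 times in total, yet they receive very different scores, which is exactly the information that a raw frequency count hides and that adjusted frequencies try to factor in.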

The article by Christopher Cox is concerned with the construction of a corpus of Mennonite Low German, and in particular with issues related to resource requirements and accuracy of (probabilistic/statistical) part-of-speech tagging. The author aims at evaluating the effects in terms of tagging accuracy and time expenditure of three possibly interrelated factors during the training of the probabilistic tagger Qtag: orthographic normalization of the input data, which is often a key issue when dealing with minority languages, the size of the training corpus, and the choice of the tagset (ranging from few broad morpho-syntactic labels to more complex and fine-grained annotation schemes). Methodologically, this is a neat, solid contribution, which seeks to demonstrate through a case study how different tagging choices may impact on the final quality of the tagged corpus, and to provide indications for the processing of minority language data in general. The results are somewhat predictable though (e.g., if input data are orthographically normalized as part of the tagging process tagging accuracy increases, but so do time requirements), as are some of the issues to which the author gives centre stage — e.g. the need to assess the trade-off between effort required and desired accuracy of the resulting tagging.

Like other contributions in the book (cf. Mota and Medina Urrea), the work by Elke Teich and Peter Fankhauser is concerned with methods for corpus comparison, which the authors apply to a corpus of disciplinary writing (DaSciTex) from the humanities, science and engineering. Teich and Fankhauser are particularly interested in shedding light on register variation both between their corpus (taken to represent disciplinary writing as such) and a general purpose corpus (FLOB), and across various sub-corpora, representing a) computer science, b) a “pure” discipline like linguistics, biology etc. and c) a “mixed” discipline at the boundaries between a) and b), such as computational linguistics or bio-informatics. Methods derived from data mining are employed, including feature ranking, clustering and classification, and features for comparison are derived from systemic functional linguistics. While some of the choices could have been justified more explicitly (e.g. the use of the relative number of nouns, lexical verbs and adverbs as potential indicators of “abstract language”), the method leads to very interesting insights, such as the ability of we+verb patterns (corresponding to ways in which authors “represent” themselves) to discriminate between different registers, e.g. between computer science, computational linguistics and linguistics. By applying data mining techniques to corpus comparison, which presents a degree of originality in itself, the authors find that the mixed disciplines are more similar to their “mother discipline” than they are to computer science.

Finally, Bloom and Argamon present a sophisticated computational method for extracting appraisal expressions from corpora, i.e. linguistic realizations of opinions, a notion grounded in systemic functional linguistics. The method involves several steps, starting with shallow parsing of the input data to extract attitudes (the textual realization of the opinion itself) and targets (to which the attitude refers); the subsequent step consists in identifying the so-called “linkages”, i.e. the possible syntactic structures which connect an attitude to a target; the system finally scores (and “learns”) each linkage to decide the most likely connection between the extracted attitudes and targets. The system evaluation is carried out on two corpora, one of user-generated product reviews of baby strollers, digital cameras and printers (for each of these products a domain-specific lexicon is constructed), and one of movie reviews: results indicate that the performance of the automatic extraction is comparable to manual results obtained in previous work. The topic of this contribution is perhaps peripheral to mainstream corpus linguistics and closer to NLP; yet the authors present their work in an extremely clear fashion, and thus contribute to bringing together the two research communities, an avowed aim of the book as a whole.

Summing up, despite the minor drawbacks pointed out above and some formal minutiae (e.g., an endnote repeated verbatim in Columbus, a missing section referred to in Diniz, wrong note numbering in Fitzpatrick and Bachenko, and the low quality of some figures, such as the graphs on p. 76), there is no doubting that the book brings together some very high-quality contributions. Taken together, these provide valuable insights (descriptive, methodological, theoretical) of relevance to the whole of corpus linguistics, showing several ways of “compiling, extracting, and evaluating […] data in complex and innovative ways” (p. 6), and in so doing set a high standard for edited volumes in corpus linguistics. In particular, this collection should not be missed by researchers new to the field (thanks to its methodological slant), by those wishing to explore the potential of statistical and computational methods further, and by corpus linguists wishing to become more familiar with the current developments of the discipline in North America.

References




Adriano Ferraresi and Silvia Bernardini (2010) “Book Review of Stefan Th Gries, S Wulff, and M Davies (eds) "Corpus-linguistic applications. Current studies, new directions"”, ELR Journal, 4 (2).