Daniel Gallego Hernández (Universidad de Alicante) & Ramesh Krishnamurthy (Aston University)
This paper describes the initial stages of the COMENEGO project, which is initially creating comparable corpora of Business texts in Spanish and French. The language of business remains a vital field in global activities, and the globalised market requires frequent cross-border, cross-linguistic, and cross-cultural interaction. The need for rapid and accurate translations places increasing demands on the business community, on translation practitioners, and on those who train the translators. High-frequency activities also tend to react, innovate, and adapt more quickly to changes in their environment and practices, and therefore there is a constant need to renew outdated resources. However, this pressure often leads to ad hoc collections. The highly competitive nature of commerce also poses problems of data availability, accessibility, and cost. Translated texts may contain traces of source language interference and other non-native-speaker features, hence especially for advanced and specialised translation purposes, comparable corpora may be more suitable. For Spanish and French, these factors and features are evident in previous corpora. COMENEGO focuses on up-to-date texts, textual variety and balance, and the necessary compromise between more idealised corpus design and practical factors, and will constitute a valuable resource for researchers, translators, and translator trainers and trainees. This paper will discuss not only the process of corpus design and creation, but also the applications of comparable corpora in translation pedagogy.
The acronym COMENEGO stands for ‘Corpus Multilingüe de Economía y Negocios’. ‘Corpus’ may be defined as ‘a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language’ (Sinclair, 1996). ‘Multilingüe’ refers currently only to French and Spanish, but other languages may be added. ‘Economía y Negocios’ is a label (Mateo Martínez, 2007) which includes a wide variety of discourses related to business and economics: from the theoretical language of economics, to the practical worlds of commerce, finance, etc.
COMENEGO is a comparable corpus of business and economics texts which may be used for specialised language research (i.e. as the object of analysis). From the point of view of translation practice, it may also be considered as a set of parallel texts, i.e. texts related to the source text which provide translation practitioners with information on text-type conventions or particularities of field-specific language use. In this sense it can be used as a teaching tool for translator trainers and as a specialised linguistic resource.
It is being created because there seems to be a lack of stable electronic corpora specialised in business for French and Spanish. For example, MLCC Multilingual and Parallel Corpora is reasonable in size (10m words for each language) but is now outdated (early 1990s), lacks textual variety (a single source per language: Expansion for Spanish and Le Monde for French) and is a commercial resource (€450-3600). IULA’s corpus tècnic has variety, but is rather small (1m words of Spanish), contains many translated texts, and has no French (Cabré & Martorell, 2004: 174). The subcorpora EGAL and CONSUMER in Corpus Lingüístico da Universidade de Vigo (CLUVI) are also small (0.4m and 1.8m words of Spanish), and contain translated texts as well as some outdated texts (1998 onwards). Vicente (2007) is diachronic and contrastive (1995 and 2006), contains only press articles (French: Le Monde, Les Echos; Spanish: El Pais, Expansion), and is not publicly available.
Due to this lack of stable corpora, as translator trainers, we have to implement in our courses ad hoc web as/for corpus methodologies applied to the practice of translation, as discussed in Bernardini & Zanettin, eds. (2000); Corpas Pastor (2002); Zanettin et al., eds. (2003); Sánchez Gijón (2004); Beeby et al., eds. (2009); Gallego Hernández (2010a, 2010b, 2011a).
The results of a survey carried out on translator trainees (Gallego Hernández, 2011b) suggest that COMENEGO may complement these kinds of methodologies. Its application in business translation training may take place before translation (discourse analysis of target specialised language and target textual genres), during translation (terminology extraction, chunk extraction, general foreign language patterns), and after translation (translation quality assessment). Of course, its applications can also include bilingual specialised language acquisition.
The compilation of a corpus involves different steps which, as a whole, represent a cyclic process:
Following Atkins et al. (1992), who state that a ‘corpus should be designed and constructed exclusively on external criteria’, the textual resources of COMENEGO have been collected according to different external criteria (URL, text-type, source-type, etc.) and are classified as follows:
|bank products, financial products and insurance; corporate webpages (commercial websites)||COMMERCIAL|
|online courses; guides for consumers, investors and bank clients (webpages of teachers, universities, institutions)||DIDACTIC|
|laws, codes, decrees and legal advice (websites of ministries and agencies)||LEGAL|
|articles of association, regulations, annual meetings, rules (corporate and informational websites)||ORGANIZATIONAL|
|press releases, news, newsletters (corporate websites and newspapers)||PRESS|
|academic papers (informational websites: specialised journals)||SCIENTIFIC|
|financial prospectuses, annual accounts, annual reports, financial results, corporate responsibility reports, management reports, analyses, country-specific and sector-specific reports, marketing plans (corporate and informational websites)||TECHNICAL|
The next table contains some details (number of files, tokens, types, and token-type ratios) related to both Spanish and French texts and also to the different categories:
These figures correspond to converted and partially cleaned TXT files. The Spanish and French corpora each contain around nine million words. As regards copyright permission, some requests have received an affirmative answer, while some texts are not subject to copyright. In any case, this process is still to be completed, and its outcome will change the figures presented in Table 2.
A range of software is available in corpus linguistics, such as AntConc, which was used in this research, or WordSmith Tools. Such tools allow the user to obtain various analytical outputs from corpora: word frequency lists (‘the rank of a word-form in a corpus frequency list has some relationship to the importance of that word-form in the linguistic system’ (Krishnamurthy, 2001)), concordances, collocations, and n-grams. The following sections contain only preliminary analyses, but they serve to illustrate the type of analyses that we are going to undertake in subsequent research. At several points, we have indicated why further steps in the analysis are necessary.
As seen in Table 2, the Spanish (9115352 tokens) and French (9086627 tokens) corpora are very similar in total size. However, we must bear in mind that word counts may be affected by any differences in the morphology of the two languages.
The subcorpora are also roughly equal in size (average 1.3 million words, varying between 1.19 million and 1.37 million), but these figures may be affected by any differences in the text typology of the two languages.
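The counts discussed above (tokens, types, and their ratio) can be illustrated with a minimal sketch. It is not the procedure used by AntConc or by the project: the tokeniser (`TOKEN_RE`), the function names, and the sample sentence are all assumptions made for illustration, and the ratio is computed here as tokens per type, which is only one possible reading of "token-type ratio".

```python
import re

# Hypothetical tokeniser: a token is a run of letters (accented letters included).
# Digits and punctuation are ignored, so "l’article" yields two tokens.
TOKEN_RE = re.compile(r"[^\W\d_]+")

def tokenize(text):
    """Split text into lower-cased word tokens."""
    return [t.lower() for t in TOKEN_RE.findall(text)]

def corpus_stats(text):
    """Return token count, type count, and tokens per type for a text."""
    tokens = tokenize(text)
    types = set(tokens)
    ratio = len(tokens) / len(types) if types else 0.0
    return {"tokens": len(tokens), "types": len(types), "token_type_ratio": ratio}

# Invented sample sentence, not taken from the corpus.
sample_es = "La cuenta corriente y la cuenta de ahorro"
stats = corpus_stats(sample_es)
print(stats)
```

Different tokenisation decisions (for example, how elided articles in French are split) would change these counts, which is one concrete way in which morphology can affect cross-language comparisons of corpus size.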
However, some interesting differences between the Spanish and French corpora are already noticeable:
The most frequent words in any language tend to be the grammatical words (determiners, prepositions, etc). However, precisely because of their exceptionally high frequency in any corpus, their polyfunctionality, and the complexity of their usage, they are best analysed later in the research.
We therefore begin with an analysis of the most frequent non-grammatical words, sometimes called content words or vocabulary words. Here is a table containing the most frequent 15 content words in the Spanish and French corpora (note that we have chosen to treat any initial capital forms, e.g. the first word in a sentence, or proper nouns like France, as lower case forms). Such frequency lists highlight the words that translators will need to know in a wide range of contexts, meanings, and usages, as well as indicating the large number of rarely-occurring words that they need not focus so much of their attention on.
We first notice a number of similar items: 7 out of the most frequent 15 words are cognates, and the first equivalents given in most bilingual dictionaries (however, we must again bear in mind that they may have a different range of meanings and usages in the two languages). There are one or two minor variations: the Spanish singular form cuenta is paralleled by both singular and plural French forms (compte, comptes). Of course we must look at the full Spanish list, as cuentas may occur only slightly lower down. Similarly, the Spanish plural form empresas is represented by the French singular form entreprise. In subsequent research, it will be important to compare a much longer list of items, and to analyse the differences in rank more precisely.
Next, we consider the items that occur in only one list. In the Spanish list: información, mercado, valores, millones, general, valor, riesgo. In the French list: assurance, actions, conseil, france, conditions, ans. Again, we would need to consult the full lists to establish whether these are significant differences, or merely reflect minor variations in frequency in the two corpora.
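As a rough illustration of how a content-word frequency list of this kind might be produced, the sketch below lower-cases all tokens (as described above) and filters out grammatical words. The stop list `STOPWORDS_ES` is a tiny invented example and the sample text is made up; a real study would need a full list of grammatical words for each language.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[^\W\d_]+")

# Tiny illustrative stop list (an assumption, not the list used in the project).
STOPWORDS_ES = {"la", "el", "de", "en", "y", "las", "los", "con", "una", "un", "del"}

def content_word_frequencies(text, stopwords, top_n=15):
    """Return the top_n most frequent non-grammatical words, lower-cased."""
    tokens = (t.lower() for t in TOKEN_RE.findall(text))
    counts = Counter(t for t in tokens if t not in stopwords)
    return counts.most_common(top_n)

# Invented sample text, not taken from the corpus.
sample = ("La empresa tiene una cuenta. La cuenta de la empresa y el mercado. "
          "Cuenta con valores.")
top = content_word_frequencies(sample, STOPWORDS_ES, top_n=3)
print(top)
```

Note that lower-casing merges the sentence-initial Cuenta with cuenta, which is the treatment of initial capitals described above.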
As stated earlier, concordances allow us to see any of the items in the frequency list exactly as they occur in texts, with the contexts they occur in. This is the most detailed level of corpus analysis, and enables us to investigate every word in every text, if we so wish. It allows us to ascertain word classes, meanings, usages, collocations and phraseologies, grammatical patterns, pragmatic uses, and genre-specific features.
…so de ser cliente de cuenta Nómina) o con Certifi…
…eneral de la Policía cuenta con 350 Oficinas de E…
…so de ser cliente de cuenta Nómina) o con el Cert…
…comendamos tengan en cuenta. Ante cualquier duda,…
…total o parcial, por cuenta de gobiernos o autori…
… capital liberada. A cuenta Complementario Total …
…07/07/880,1800,144 A cuenta Complementario Total …
…d, abertis logística cuenta con cerca de 933.000 …
… ad El Grupo abertis cuenta con una plantilla med…
… de los trabajadores cuenta con contrato indefini…
Even in this extremely small sample (10 out of the total 11879 occurrences for cuenta — see previous table) we can see several patterns: 4 lines contain cuenta followed by con, 2 have the long sequence so de ser cliente de cuenta Nómina, 2 have A cuenta Complementario Total, one has tengan en cuenta, and another has por cuenta de. When we look at all the 11879 examples in future research, some of these patterns may prove to be very useful for translators, for example by indicating technical terminology.
…st ajusté pour tenir compte de la modification du…
…inatif pur Ouvrir un compte au nominatif pur Pour…
…’Administration, qui compte désormais 12 membres …
… du dividende. Votre compte sera crédité dans les…
…irectement sur votre compte. Au nominatif adminis…
…e 17 mai 2010. Votre compte sera crédité dans les…
… suivants pour tenir compte des délais de traitem…
…ts et en lui rendant compte de son examen : organ…
… d’une action tenant compte des opérations ayant …
…ier pour la tenue du compte-titres. Ils représent…
…s titres inscrits en compte nominatif pur. Droit …
…umenté au moyen d’un compte rendu écrit. Avec le …
As in the Spanish examples, we can see some patterns in these 12 French examples (out of 10342): 4 are followed by de, 3 of which are preceded by tenir/tenant, and one by rendant (we also have one example of compte rendu); 3 examples of votre compte, two of which are followed by sera and one preceded by sur; one for compte-titres (and titres occurs separately in another); one for ouvrir un compte; and even one for the verb: qui compte. Again, a fuller investigation will reveal which of these features are significant. The almost equally high frequency of the singular form compte (10342) and the plural form comptes (9149) may indicate distinct uses of this form in French.
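Concordance lines like those shown above follow the keyword-in-context (KWIC) layout. The following is a minimal sketch of that layout, not a description of how AntConc works internally; the function name `kwic`, the `width` parameter, and the sample text are assumptions for illustration.

```python
import re

def kwic(text, node, width=20):
    """Return keyword-in-context lines for every whole-word match of `node`.

    Each line shows `width` characters of left context, the matched word,
    and `width` characters of right context, padded so the node is aligned.
    """
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append(f"{left:>{width}}{m.group(0)}{right:<{width}}")
    return lines

# Invented sample text, not taken from the corpus.
sample = "El Grupo cuenta con una plantilla media. Tengan en cuenta los plazos."
lines = kwic(sample, "cuenta", width=10)
for line in lines:
    print(line)
```

Because the match is restricted to whole words, searching for compte in this way would not return comptes, so singular and plural forms can be examined separately.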
While we can obtain greater certainty about the features and behaviours of a word by looking at concordances, there are some corpus software tools that can speed up our initial detection procedures, for example the collocation tool. However, these tools often require a greater degree of linguistic knowledge and sophistication on the part of the researcher. The collocation tool performs a quantitative analysis of the words that occur in close proximity to the node or keyword (i.e. the one that we are studying). AntConc allows us to choose the exact distance, and previous research (e.g. Sinclair 1970) shows that most significant collocates occur within 4 words of the node/keyword. Of course, we will later have to check these collocates in the concordances.
At first glance, there seem to be far fewer obvious formal correspondences in these lists than in the frequency lists (cf. 4.1): only corriente - courant and depósito - dépôt. On closer inspection, a potential semantically equivalent pair is seen: ahorro - épargne. The fact that so many of the collocates cannot easily be paired suggests that the words cuenta and compte are used in many different contexts and phraseologies in Spanish and French. This underlines the need for caution in future research: words with formal/etymological similarities may turn out to be ‘false friends’.
There are other aspects of the collocation tool that can be used in further research, such as determining exactly which position each collocate occurs in, with relation to the node/keyword.
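The basic idea of counting collocates within a fixed distance of the node can be sketched as follows. This is a simplified illustration under stated assumptions: the function name `collocates`, the `span` parameter, and the sample text are invented, the window is allowed to cross sentence boundaries, and only raw co-occurrence counts are produced, whereas a real analysis would normally rank collocates with a statistical measure.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[^\W\d_]+")

def collocates(text, node, span=4):
    """Count words occurring within `span` tokens to the left or right of `node`."""
    tokens = [t.lower() for t in TOKEN_RE.findall(text)]
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            # Left and right context windows, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Invented sample text, not taken from the corpus.
sample = "Abrir una cuenta corriente. La cuenta corriente tiene un depósito."
counts = collocates(sample, "cuenta", span=2)
print(counts.most_common(3))
```

Keeping the left and right windows separate, rather than merging them as here, would be one way to record the position of each collocate relative to the node.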
Another tool that can be useful in quickly identifying language features that would take much longer to detect by manual inspection of concordances is the n-gram tool. Instead of creating a word frequency list, this tool creates frequency lists of fixed word sequences, for example 2-word sequences (i.e. 2-grams), 3-word sequences (3-grams), etc.
|de millones de euros||dans le cadre de|
|del Mercado de Valores||du Code de commerce|
|de la Ley de||dans les conditions prévues|
|la Ley de de||à compter de la|
|del Consejo de Administración||alinéa de l’article L|
|Nacional del Mercado de||Crédit Agricole S A|
|Comisión Nacional del Mercado||de commerce et d’industrie|
|la Comisión Nacional del||droit préférentiel de souscription|
|el Consejo de Administración||en application de l’article|
|en el caso de||valeurs mobilières donnant accès|
|artículo de la Ley||la mise en place|
|Consejo de Administración de||par lettre recommandée avec|
|el artículo de la||dans la limite de|
|los millones de euros||l’Autorité des marchés financiers|
|lo dispuesto en el||le commissaire aux comptes|
|en el artículo de||des commissaires aux comptes|
|En el caso de||le cadre de la|
|que se refiere el||dispositions de l’article L|
As with the collocation lists (cf. 4.3), the 4-gram lists show more differences than similarities. Only one parallel item from the word frequency lists (cf. 4.1) remains, i.e. artículo - article. Some words remain in one of the lists, but not the other (e.g. euros is still in the Spanish list, but no longer in the French; comptes is still in the French list, and the verb compter has arrived, but compte has disappeared - as has cuenta in the Spanish list). Many new items appear in these fixed word sequences (e.g. cadre) which were not in the word frequency lists. We have also retained capitalization in these lists to show that many proper names of institutions have become apparent. Some new cognate items are also evident, e.g. dispuesto - dispositions.
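A frequency list of fixed word sequences such as the 4-grams above can be sketched in a few lines. The function name `ngram_frequencies` and the sample phrase are assumptions for illustration; capitalization is retained, as in the lists discussed here, and tokenisation is reduced to whitespace splitting for brevity.

```python
from collections import Counter

def ngram_frequencies(tokens, n=4):
    """Count every sequence of n consecutive tokens."""
    # zip over n staggered copies of the token list yields all n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams)

# Invented sample phrase, not taken from the corpus.
tokens = ("el Consejo de Administración de la sociedad y "
          "el Consejo de Administración").split()
freq = ngram_frequencies(tokens, n=4)
print(freq.most_common(1))
```

Overlapping sequences are counted separately, which is why several consecutive slices of one institutional name can each appear as an entry in a 4-gram list.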
These analyses represent the tip of the iceberg, in terms of the type of analyses that can be performed. We did not want to go much further until now, as we were still adjusting and equalising the corpora and their components. Now that we have a stable corpus, we can embark on a more rigorous and comprehensive analysis, using the corpus software tools to their maximum effect. The observations and insights reported so far merely hint at the wide range of results that can be obtained, and their potential benefits for translators.
COMENEGO is still a pilot corpus, so its analysis should help in defining the different categories. Its contents have been collected on the basis of intuition and personal experience. Therefore, surveys concerning the needs of professional translators and companies should help to ensure that, in future, we include textual typologies that reflect the real professional world. The virtual platform for COMENEGO is still under construction, and the process of obtaining copyright permissions has not yet been completed. Consequently, the applications of the corpus are currently restricted to research and training material. The project team is very small and has no funds, hence the construction and development of this resource are proceeding little by little.
Atkins, S; Clear, J and Ostler N (1992) Corpus Design Criteria. Literary and Linguistic Computing, 7, 1-16. Available at http://www.natcorp.ox.ac.uk/archive/vault/tgaw02.pdf.
Beeby A; Rodríguez Inés, P and Sánchez Gijón, P eds. (2009) Corpus Use and Translating. Amsterdam/Philadelphia: John Benjamins.
Bernardini, S and Zanettin, F eds. (2000) I corpora nella didattica della traduzione: Corpus use and learning to translate. Bologna: CLUEB.
Cabré Castellví, M T and Bach Martorell C (2004) El corpus tècnic del IULA: corpus textual especializado plurilingüe. Panacea, 16, 173-176. Available at http://medtrad.org/panacea.html.
Corpas Pastor, G (2002) Traducir con corpus: de la teoría a la práctica. In García Palacios, J. and M.ª T. Fuentes Morán (eds.) Texto, terminología y traducción. Salamanca: Almar: 189-226.
Gallego Hernández, D (2010a) Traducción económica y textos paralelos en internet. Aproximación teórica y metodológica. PhD Thesis, Universidad de Alicante.
Gallego Hernández, D (2010b) Acquiring instrumental sub-competence by building do-it-yourself corpora for business translation. Using Corpora in Contrastive and Translation Studies. Available at http://www.lancs.ac.uk/fass/projects/corpus/UCCTS2010Proceedings/.
Gallego Hernández, D (2011a) Documentación aplicada a la traducción económica, comercial y financiera: estrategias de compilación ad hoc de textos paralelos. Paper presented at the V Congreso de la AIETI (Asociación Ibérica de Estudios de Traducción e Interpretación). Castellón: UJI.
Gallego Hernández, D (2011b) Web for corpus en el aula de traducción económica, financiera y comercial. ¿Se siente el traductor de mañana capacitado para trabajar con corpus ad hoc? Paper presented at the I Congreso Internacional T3L: Tradumática, Tecnologías de la Traducción y Localización. Barcelona: Universitat Autònoma.
Krishnamurthy, R (2001) Size Matters: creating Dictionaries from the World’s Largest Corpus. 8th Annual KOTESOL Conference Proceedings. Taegu: KOTESOL: 169-180.
Mateo Martínez, J (2007) El lenguaje de las ciencias económicas, en Alcaraz Varó, E. et al. (eds.) Las lenguas profesionales y académicas. Barcelona: Ariel: 191-203.
Sánchez Gijón, P (2004) L’ús de corpus en la traducció especialitzada: compilació de corpus ad hoc i extracció de recursos terminològics. Barcelona: Universitat Pompeu Fabra.
Sinclair, J (1996) Preliminary recommendations on Corpus Typology. Available at http://www.ilc.cnr.it/EAGLES/corpustyp/corpustyp.html.
Vicente, C (2007) Lingüística de corpus y traducción especializada: aplicaciones a la traducción francés-español de la economía. Paper presented at XXV Congrès international de linguistique et de Philologie Romanes.
Zanettin, F; Bernardini, S and Stewart, D eds. (2003) Corpora in Translator Education. Manchester/Northampton: St. Jerome.