COMENEGO (Corpus Multilingüe de Economía y Negocios): design, creation and applications

Daniel Gallego Hernández (Universidad de Alicante) & Ramesh Krishnamurthy (Aston University)

Abstract

This paper describes the initial stages of the COMENEGO project, which is currently creating comparable corpora of business texts in Spanish and French. The language of business remains a vital field in global activities, and the globalised market requires frequent cross-border, cross-linguistic, and cross-cultural interaction. The need for rapid and accurate translations places increasing demands on the business community, on translation practitioners, and on those who train translators. High-frequency activities also tend to react, innovate, and adapt more quickly to changes in their environment and practices, and therefore there is a constant need to renew outdated resources. However, this pressure often leads to ad hoc collections. The highly competitive nature of commerce also poses problems of data availability, accessibility, and cost. Translated texts may contain traces of source language interference and other non-native-speaker features; hence, especially for advanced and specialised translation purposes, comparable corpora may be more suitable. For Spanish and French, these factors and features are evident in previous corpora. COMENEGO focuses on up-to-date texts, textual variety and balance, and the necessary compromise between more idealised corpus design and practical factors, and will constitute a valuable resource for researchers, translators, and translator trainers and trainees. This paper will discuss not only the process of corpus design and creation, but also the applications of comparable corpora in translation pedagogy.

1. COMENEGO: the acronym

The acronym COMENEGO stands for ‘Corpus Multilingüe de Economía y Negocios’. ‘Corpus’ may be defined as ‘a collection of pieces of language that are selected and ordered according to explicit linguistic criteria in order to be used as a sample of the language’ (Sinclair, 1996). ‘Multilingüe’ currently refers only to French and Spanish, but other languages may be added. ‘Economía y Negocios’ is a label (Mateo Martínez, 2007) which includes a wide variety of discourses related to business and economics: from the theoretical language of economics, to the practical worlds of commerce, finance, etc.

2. COMENEGO: justification and aims

COMENEGO is a comparable corpus of business and economics texts which may be used for specialised language research (i.e. as the object of analysis). From the point of view of the practice of translation, it may also be considered as a set of parallel texts, i.e. texts related to the source text which provide translation practitioners with information on text-type conventions or particularities of field-specific language use. In this sense it can be used as a teaching tool for translator trainers and as a specialised linguistic resource.

It is being created because there seems to be a lack of stable electronic corpora specialised in business French and Spanish. For example, MLCC Multilingual and Parallel Corpora is reasonable in size (10m words for each language) but is now outdated (early 1990s), lacks textual variety (single sources: Expansion for Spanish and Le Monde for French) and is a commercial resource (€450-3600). IULA’s corpus tècnic has variety, but is rather small (1m words of Spanish), contains many translated texts, and has no French (Cabré & Martorell, 2004: 174). The subcorpora EGAL and CONSUMER in Corpus Lingüístico da Universidade de Vigo (CLUVI) are also small (0.4m and 1.8m words of Spanish), and contain translated texts as well as some outdated texts (1998 onwards). Vicente (2007) is diachronic and contrastive (1995 and 2006), contains only press articles (French: Le Monde, Les Echos; Spanish: El Pais, Expansion), and is not publicly available.

Due to this lack of stable corpora, as translator trainers, we have to implement in our courses ad hoc web as/for corpus methodologies applied to the practice of translation, as discussed in Bernardini & Zanettin, eds. (2000); Corpas Pastor (2002); Zanettin et al., eds. (2003); Sánchez Gijón (2004); Beeby et al., eds. (2009); Gallego Hernández (2010a, 2010b, 2011a).

The results of a survey carried out on translator trainees (Gallego Hernández, 2011b) suggest that COMENEGO may complement these kinds of methodologies. Its application in business translation training may take place before translation (discourse analysis of target specialised language and target textual genres), during translation (terminology extraction, chunk extraction, general foreign language patterns), and after translation (translation quality assessment). Of course, its applications can also include bilingual specialised language acquisition.

3. Steps involved in the creation of COMENEGO, and its contents

The compilation of a corpus involves different steps which, as a whole, represent a cyclic process:

Following Atkins et al. (1992), who state that a ‘corpus should be designed and constructed exclusively on external criteria’, the textual resources of COMENEGO have been collected according to different external criteria (URL, text-type, source-type, etc.) and are classified as follows:

Table 1: Text-types and categories in COMENEGO
TEXT-TYPES | CATEGORIES
bank products, financial products and insurance; corporate webpages (commercial websites) | COMMERCIAL
online courses; guides for consumers, investors and bank clients (webpages of teachers, universities, institutions) | DIDACTIC
laws, codes, decrees and legal advice (websites of ministries and agencies) | LEGAL
articles of association, regulations, annual meetings, rules (corporate and informational websites) | ORGANIZATIONAL
press releases, news, newsletters (corporate websites and newspapers) | PRESS
academic papers (informational websites: specialised journals) | SCIENTIFIC
financial prospectuses, annual accounts, annual reports, financial results, corporate responsibilities, management reports, analysis, country-specific and sector-specific reports, marketing plans (corporate and informational websites) | TECHNICAL

Tables 2.1 and 2.2 give some details (number of files, tokens, types, and token-type ratios) for the Spanish and French texts respectively, broken down by category:

Table 2.1: Text details (Spanish)
CATEGORY FILES TOKENS TYPES RATIO
COM 5255 1329915 35321 37.65
DID 1491 1276089 40641 31.39
LEG 211 1342698 23077 58.18
ORG 429 1337822 29417 45.47
PRS 2214 1329029 37314 35.61
SCI 99 1311731 34483 38.03
TEC 351 1188068 40777 29.13
TOTAL 10050 9115352 113100 80.59

Table 2.2: Text details (French)
CATEGORY FILES TOKENS TYPES RATIO
COM 3909 1325544 29316 45.21
DID 1121 1304585 35937 36.30
LEG 21 1293704 14772 87.57
ORG 634 1365468 21885 62.39
PRS 2859 1308418 37928 34.49
SCI 203 1301102 32710 39.77
TEC 133 1187806 24646 48.19
TOTAL 8880 9086627 89133 101.94

These figures correspond to converted and partially cleaned TXT files. The Spanish and French corpora each contain around nine million words. As regards copyright permission, some requests have received an affirmative answer, while some texts are not under copyright. In any case, this process is still to be completed, and its outcome will change the figures presented in Tables 2.1 and 2.2.
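
The RATIO column in Tables 2.1 and 2.2 can be recomputed from the TOKENS and TYPES columns. The following is a minimal sketch rather than the procedure actually used for the tables: the dictionary and function names are illustrative, and the choice to truncate (rather than round) to two decimals is an assumption made only because it matches the published values.

```python
# (tokens, types) per category, copied from Tables 2.1 and 2.2.
SPANISH = {
    "COM": (1329915, 35321), "DID": (1276089, 40641),
    "LEG": (1342698, 23077), "ORG": (1337822, 29417),
    "PRS": (1329029, 37314), "SCI": (1311731, 34483),
    "TEC": (1188068, 40777), "TOTAL": (9115352, 113100),
}
FRENCH = {
    "COM": (1325544, 29316), "DID": (1304585, 35937),
    "LEG": (1293704, 14772), "ORG": (1365468, 21885),
    "PRS": (1308418, 37928), "SCI": (1301102, 32710),
    "TEC": (1187806, 24646), "TOTAL": (9086627, 89133),
}


def tokens_per_type(tokens, types):
    """Tokens divided by types, truncated (assumption) to two decimals."""
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (tokens * 100 // types) / 100


for language, table in (("Spanish", SPANISH), ("French", FRENCH)):
    for category, (tokens, types) in table.items():
        print(language, category, tokens_per_type(tokens, types))
```

Note that the TOTAL number of types is smaller than the sum of the category figures, because a word-form that occurs in several categories is counted only once in the whole corpus. This is why the TOTAL ratio is higher than any single category ratio.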

4. Analysis of COMENEGO

A range of software is available in corpus linguistics, such as AntConc, which was used in this research, or WordSmith Tools. Such tools allow the user to obtain various analytical outputs from corpora: word frequency lists (‘the rank of a word-form in a corpus frequency list has some relationship to the importance of that word-form in the linguistic system’ (Krishnamurthy, 2001)), concordances, collocations, and n-grams. The following sections contain only preliminary analyses, but they serve to illustrate the type of analyses that we are going to undertake in subsequent research. At several points, we have indicated why further steps in the analysis are necessary.

As seen in Table 2, the Spanish (9115352 tokens) and French (9086627 tokens) corpora are very similar in total size. However, we must bear in mind that word counts may be affected by any differences in the morphology of the two languages.

The subcorpora are also roughly equal in size (average 1.3 million words, varying between 1.19 million and 1.37 million), but these figures may be affected by any differences in the text typology of the two languages.

Table 3: Tokens in Spanish and French
CATEGORY SPANISH FRENCH
COM 1329915 1325544
DID 1276089 1304585
LEG 1342698 1293704
ORG 1337822 1365468
PRS 1329029 1308418
SCI 1311731 1301102
TEC 1188068 1187806
TOTAL 9115352 9086627

However, some interesting differences between the Spanish and French corpora are already noticeable:

Table 4: Files in Spanish and French
CATEGORY SPANISH FRENCH
COM 5255 3909
DID 1491 1121
LEG 211 21
ORG 429 634
PRS 2214 2859
SCI 99 203
TEC 351 133
TOTAL 10050 8880

Table 7: Average Text Length (Words)
CATEGORY SPANISH FRENCH
COM 253 339
DID 856 1164
LEG 6363 61605
ORG 3118 2154
PRS 600 458
SCI 13250 6409
TEC 3385 8931
AVERAGE 907 1023
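
The average text lengths in Table 7 follow from dividing the token counts by the file counts. The sketch below reproduces that arithmetic under the assumption that values are rounded to the nearest whole word; the variable and function names are illustrative only.

```python
# Per category: ((Spanish files, Spanish tokens), (French files, French tokens)),
# copied from Tables 2.1 and 2.2.
FILES_AND_TOKENS = {
    "COM": ((5255, 1329915), (3909, 1325544)),
    "DID": ((1491, 1276089), (1121, 1304585)),
    "LEG": ((211, 1342698), (21, 1293704)),
    "ORG": ((429, 1337822), (634, 1365468)),
    "PRS": ((2214, 1329029), (2859, 1308418)),
    "SCI": ((99, 1311731), (203, 1301102)),
    "TEC": ((351, 1188068), (133, 1187806)),
    "TOTAL": ((10050, 9115352), (8880, 9086627)),
}


def average_length(files, tokens):
    """Mean number of tokens per file, rounded to the nearest word."""
    return round(tokens / files)


for category, (spanish, french) in FILES_AND_TOKENS.items():
    print(category, average_length(*spanish), average_length(*french))
```

The LEG row shows the largest contrast: roughly the same number of tokens is spread over 211 Spanish files but only 21 French files, so the French legal texts are on average almost ten times longer.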

4.1. Word Frequency Lists

The most frequent words in any language tend to be the grammatical words (determiners, prepositions, etc). However, precisely because of their exceptionally high frequency in any corpus, their polyfunctionality, and the complexity of their usage, they are best analysed later in the research.

We therefore begin with an analysis of the most frequent non-grammatical words, sometimes called content words or vocabulary words. Tables 8.1 and 8.2 list the 15 most frequent content words in the Spanish and French corpora (note that we have chosen to treat any initial capital forms, e.g. the first word in a sentence, or proper nouns like France, as lower case forms). Such frequency lists highlight the words that translators will need to know in a wide range of contexts, meanings, and usages, as well as indicating the large number of rarely-occurring words that need less of their attention.

Table 8.1: Word Frequency Lists — Content Words only (Spanish)
RANK FREQUENCY WORD
29 17540 información
30 17531 mercado
31 17339 artículo
35 15557 euros
36 15015 caso
38 13921 sociedad
39 13582 valores
40 13378 millones
41 12421 capital
42 12407 general
43 11879 cuenta
45 11777 valor
47 11358 empresas
49 11141 grupo
50 10969 riesgo

Table 8.2: Word Frequency Lists – Content Words only (French)
RANK FREQUENCY WORD
28 26843 article
41 15921 cas
42 15294 société
50 11854 capital
52 11535 assurance
55 10898 actions
61 10379 entreprise
62 10342 compte
63 10335 conseil
69 9556 france
70 9532 groupe
73 9149 comptes
74 9027 conditions
76 8821 ans
78 8607 euros

We first notice a number of similar items: 7 out of the most frequent 15 words are cognates, and the first equivalents given in most bilingual dictionaries (however, we must again bear in mind that they may have a different range of meanings and usages in the two languages). There are one or two minor variations: the Spanish singular form cuenta is paralleled by both singular and plural French forms (compte, comptes). Of course we must look at the full Spanish list, as cuentas may occur only slightly lower down. Similarly, the Spanish plural form empresas is represented by the French singular form entreprise. In subsequent research, it will be important to compare a much longer list of items, and to analyse the differences in rank more precisely.

Next, we consider the items that occur in only one list. In the Spanish list: información, mercado, valores, millones, general, valor, riesgo. In the French list: assurance, actions, conseil, france, conditions, ans. Again we would need to consult the full lists to determine whether these are significant differences, or merely reflect minor variations in frequency in the two corpora.
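
The way Tables 8.1 and 8.2 keep the rank from the full frequency list while showing only content words can be illustrated with a short sketch. It is not AntConc's implementation: the tokenisation rule, the tiny stop list of grammatical words, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Tiny illustrative stop list of Spanish grammatical words (an assumption;
# a real list would be much longer).
SPANISH_STOPWORDS = {
    "de", "la", "el", "en", "y", "que", "los", "del",
    "las", "a", "se", "por", "con", "un", "una",
}


def tokenize(text):
    """Lower-case the text and split it into word-forms."""
    return re.findall(r"\w+", text.lower())


def content_word_ranks(text, stopwords, top=15):
    """Return (rank, frequency, word) rows for non-grammatical words.

    The rank is the position in the full frequency list, so gaps appear
    wherever grammatical words have been filtered out.
    """
    ranked = Counter(tokenize(text)).most_common()
    rows = [
        (rank, freq, word)
        for rank, (word, freq) in enumerate(ranked, start=1)
        if word not in stopwords
    ]
    return rows[:top]


sample = ("El mercado de valores y el mercado de capital. "
          "La información del mercado.")
for row in content_word_ranks(sample, SPANISH_STOPWORDS):
    print(*row)
```

Because all forms are lower-cased before counting, sentence-initial forms and proper nouns are merged with their lower-case counterparts, as described above.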

4.2. Concordances

As stated earlier, concordances allow us to see any of the items in the frequency list exactly as they occur in texts, with the contexts they occur in. This is the most detailed level of corpus analysis, and enables us to investigate every word in every text, if we so wish. It allows us to ascertain word classes, meanings, usages, collocations and phraseologies, grammatical patterns, pragmatic uses, and genre-specific features.

…so de ser cliente de cuenta Nómina) o con Certifi…
…eneral de la Policía cuenta con 350 Oficinas de E…
…so de ser cliente de cuenta Nómina) o con el Cert…
…comendamos tengan en cuenta. Ante cualquier duda,…
…total o parcial, por cuenta de gobiernos o autori…
… capital liberada. A cuenta Complementario Total …
…07/07/880,1800,144 A cuenta Complementario Total …
…d, abertis logística cuenta con cerca de 933.000 …
… ad El Grupo abertis cuenta con una plantilla med…
… de los trabajadores cuenta con contrato indefini…

Even in this extremely small sample (10 out of the total 11879 occurrences for cuenta — see previous table) we can see several patterns: 4 lines contain cuenta followed by con, 2 have the long sequence so de ser cliente de cuenta Nómina, 2 have A cuenta Complementario Total, one has tengan en cuenta, and another has por cuenta de. When we look at all the 11879 examples in future research, some of these patterns may prove to be very useful for translators, for example by indicating technical terminology.

…st ajusté pour tenir compte de la modification du…
…inatif pur Ouvrir un compte au nominatif pur Pour…
…’Administration, qui compte désormais 12 membres …
… du dividende. Votre compte sera crédité dans les…
…irectement sur votre compte. Au nominatif adminis…
…e 17 mai 2010. Votre compte sera crédité dans les…
… suivants pour tenir compte des délais de traitem…
…ts et en lui rendant compte de son examen : organ…
… d’une action tenant compte des opérations ayant …
…ier pour la tenue du compte-titres. Ils représent…
…s titres inscrits en compte nominatif pur. Droit …
…umenté au moyen d’un compte rendu écrit. Avec le …

As in the Spanish examples, we can see some patterns in these 12 French examples (out of 10342): 4 are followed by de, 3 of which are preceded by tenir/tenant, and one by rendant (we also have one example of compte rendu); 3 examples of votre compte, two of which are followed by sera and one preceded by sur; one for compte-titres (and titres occurs separately in another); one for ouvrir un compte; and even one for the verb: qui compte. Again, a fuller investigation will reveal which of these features are significant. The almost equally high frequency of the singular form compte (10342) and the plural form comptes (9149) may indicate distinct uses of the two forms in French.
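
Concordance lines of the kind shown above (a node word with a fixed amount of context on each side) can be produced with a few lines of code. This is a minimal sketch of a keyword-in-context display, not the routine used by AntConc; the function name, the character-based context width, and the sample sentence are assumptions.

```python
import re


def kwic(text, node, width=20):
    """Return keyword-in-context lines for every occurrence of `node`.

    Context is measured in characters, so words at the edges may be cut,
    as in the concordance samples above.
    """
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b%s\b" % re.escape(node), flags=re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the node word is aligned in one column.
        lines.append(f"…{left:>{width}}{match.group(0)}{right:<{width}}…")
    return lines


sample = ("El Grupo abertis cuenta con una plantilla media. "
          "Le recomendamos tengan en cuenta las cuentas anuales.")
for line in kwic(sample, "cuenta"):
    print(line)
```

The word-boundary pattern keeps the plural cuentas out of the results for cuenta, but it does match hyphenated compounds such as compte-titres, since a hyphen counts as a boundary.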

4.3. Collocations

While we can obtain greater certainty about the features and behaviours of a word by looking at concordances, there are some corpus software tools that can speed up our initial detection procedures, for example the collocation tool. However, these tools often require a greater degree of linguistic knowledge and sophistication on the part of the researcher. The collocation tool performs a quantitative analysis of the words that occur in close proximity to the node or key word (i.e. the one that we are studying). AntConc allows us to choose the exact distance, and previous research (e.g. Sinclair, 1970) shows that most significant collocates occur within 4 words of the node/keyword. Of course, we will later have to check these collocates in the concordances.

Table 9: Collocates of cuenta and compte
SPANISH FRENCH
nómina unités
corriente bancaire
servicio épargne
naranja titres
vivienda bred
caixa dépôt
condiciones courant
ahorro logement
producto numéro
crédito livret
vista ouverture
tarjeta gestion
depósito titulaire
valores relevés

At first glance, there seem to be far fewer obvious formal correspondences in these lists than in the frequency lists (cf. 4.1): only corriente - courant and depósito - dépôt. On closer inspection, a potential semantically equivalent pair is seen: ahorro - épargne. The fact that so many of the collocates cannot easily be paired suggests that the words cuenta and compte are used in many different contexts and phraseologies in Spanish and French. This underlines the need for caution in future research: words with formal/etymological similarities may turn out to be ‘false friends’.

There are other aspects of the collocation tool that can be used in further research, such as determining exactly which position each collocate occurs in, with relation to the node/keyword.
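
The window-based counting described in this section, including the position of each collocate relative to the node, can be sketched as follows. The sketch uses raw co-occurrence counts only; it does not apply any statistical association measure, and the function names and sample sentence are assumptions for illustration.

```python
from collections import Counter


def positional_collocates(tokens, node, span=4):
    """Count (collocate, offset) pairs within `span` words of the node.

    Negative offsets are to the left of the node, positive to the right.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        for offset in range(-span, span + 1):
            j = i + offset
            if offset != 0 and 0 <= j < len(tokens):
                counts[(tokens[j], offset)] += 1
    return counts


def collapse_positions(counts):
    """Sum the positional counts into one total per collocate."""
    totals = Counter()
    for (word, _offset), n in counts.items():
        totals[word] += n
    return totals


tokens = "abra una cuenta corriente o una cuenta de ahorro".split()
by_position = positional_collocates(tokens, "cuenta", span=1)
print(by_position.most_common())
print(collapse_positions(by_position).most_common())
```

Keeping the offset makes it possible to separate, for example, words that typically precede the node from those that typically follow it, before any of the candidates are checked against the concordances.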

4.4. N-grams

Another tool that can be useful in quickly identifying language features that would take much longer to detect by manual inspection of concordances is the n-gram tool. Instead of creating a word frequency list, this tool creates frequency lists of fixed word sequences, for example 2-word sequences (i.e. 2-grams), 3-word sequences (3-grams), etc.
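
The idea of a frequency list of fixed word sequences can be illustrated with a short sketch. It is a simplified stand-in for an n-gram tool rather than AntConc's own procedure: the function names are illustrative, the input is assumed to be an already tokenised text, and sequences are allowed to run across sentence boundaries.

```python
from collections import Counter


def ngrams(tokens, n):
    """Yield every sequence of n consecutive tokens."""
    return zip(*(tokens[i:] for i in range(n)))


def ngram_frequencies(tokens, n=4):
    """Count how often each n-word sequence occurs."""
    return Counter(" ".join(gram) for gram in ngrams(tokens, n))


tokens = ("la Comisión Nacional del Mercado de Valores y "
          "la Comisión Nacional del Mercado").split()
for sequence, freq in ngram_frequencies(tokens, n=4).most_common(3):
    print(freq, sequence)
```

Capitalisation is deliberately left untouched here, so that sequences containing proper names of institutions remain recognisable.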

Table 10: Most frequent Spanish and French 4-grams
SPANISH | FRENCH
de millones de euros | dans le cadre de
del Mercado de Valores | du Code de commerce
de la Ley de | dans les conditions prévues
la Ley de de | à compter de la
del Consejo de Administración | alinéa de l’article L
Nacional del Mercado de | Crédit Agricole S A
Comisión Nacional del Mercado | de commerce et d’industrie
la Comisión Nacional del | droit préférentiel de souscription
el Consejo de Administración | en application de l’article
en el caso de | valeurs mobilières donnant accès
artículo de la Ley | la mise en place
Consejo de Administración de | par lettre recommandée avec
el artículo de la | dans la limite de
los millones de euros | l’Autorité des marchés financiers
lo dispuesto en el | le commissaire aux comptes
en el artículo de | des commissaires aux comptes
En el caso de | le cadre de la
que se refiere el | dispositions de l’article L

As with the collocation lists (cf. 4.3), the 4-gram lists show more differences than similarities. Only one parallel item from the word frequency lists (cf. 4.1) remains, i.e. artículo - article. Some words remain in one of the lists, but not the other (e.g. euros is still in the Spanish list, but no longer in the French; comptes is still in the French list, and the verb compter has arrived, but compte has disappeared - as has cuenta in the Spanish list). Many new items appear in these fixed word sequences (e.g. cadre) which were not in the word frequency lists. We have also retained capitalization in these lists to show that many proper names of institutions have become apparent. Some new cognate items are also evident, e.g. dispuesto - dispositions.

These analyses represent the tip of the iceberg, in terms of the type of analyses that can be performed. We did not want to go much further until now, as we were still adjusting and equalising the corpora and their components. Now that we have a stable corpus, we can embark on a more rigorous and comprehensive analysis, using the corpus software tools to their maximum effect. The observations and insights reported so far merely hint at the wide range of results that can be obtained, and their potential benefits for translators.

5. Conclusions

COMENEGO is still a pilot corpus, so its analysis should help in defining the different categories. Its contents have been collected by intuition and personal experience. Therefore, different surveys concerning the needs of professional translators and companies should help to ensure that in future, we include different textual typologies according to the real professional world. The virtual platform for COMENEGO is still under construction, and the obtaining of copyright permissions has not yet been completed. Consequently the applications of the corpus are currently restricted to research and training material. The project team is very small and has no funds, hence the construction and development of this resource is being conducted little by little.

References




Daniel Gallego Hernández and Ramesh Krishnamurthy (2013) “COMENEGO (Corpus Multilingüe de Economía y Negocios): design, creation and applications”, ELR Journal, 7 (1).