Yasunori Nishina, University of Birmingham
This paper investigates a number of text corpora belonging to different genres, and applies various statistical methods to features extracted from them. Through this empirical analysis we can identify internal criteria which support the assignment of genres to texts based on external ones.
New methods in corpus linguistics enable us to reassess and discover detailed linguistic characteristics and differences across text genres which are yet to be described. The combined use of corpus-driven methods and statistics allows us to match anonymous texts to genres automatically, based on the frequency and distribution of content words. The present paper shows that an exhaustive corpus-driven approach, combined with statistics, is the most effective and sophisticated analytical method for comparing texts across genres; it does this by re-evaluating the achievements of its forerunners and by discovering new facts from an exclusively empirical viewpoint. In order to achieve these aims, I compare texts of different genres, specifically literary, newspaper and academic texts, from the viewpoint of statistical data, text levels, vocabularies, phrases, personality, passives, hedges, nominalization and multivariate analysis. These linguistic features, which are extracted by a corpus-driven method, enable us to take a new step in evaluating the characteristics and patterns of language use in a specific discourse and text style. This paper is also intended as an investigation of the methodologies which allow such an evaluation.
The emergence of genre as a research topic has its origins in the 1960s and 70s, in the work of such researchers as Barber (1962), Herbert (1965), Ewer and Latorre (1969), Ewer (1971), Ewer and Hughes-Davies (1972), Lackstrom et al. (1972) and others. The basic description of ‘genre’ is given by Biber (1988: 70) as follows:
I use the term ‘genre’ to refer to categorizations assigned on the basis of external criteria. I use the term ‘text type’, on the other hand, to refer to groupings of texts that are similar with respect to their linguistic form, irrespective of genre categories1.
More specifically, Biber (1993) defined appropriate descriptions of text genre and type. To summarise his idea, ‘genre’ is the variety of texts contained within a culture, such as scientific writing, science fiction, letters, press periodicals, conversation etc. Text types, on the other hand, differ at the linguistic level. Although genre is a more ambiguous concept than text type, it can be said that genre includes text type, or that genre is the superordinate concept of text type. However, numerous other researchers (e.g. Swales 1990) have also worked with the term ‘genre’, often with different conceptions of its nature. I believe that ‘genre’ can be based on internal rather than external criteria using a corpus-driven approach, and can be identified and codified using vocabulary, patterns and style. Thus, as an alternative to conceptions of genre as a priori listings of textual variety, genre can emerge as a topic for quantitative research in linguistics. This paper develops and investigates this idea: analysing sets of co-occurring linguistic features across genres enables us to ascertain the differences between genres and the characteristics of specific genres, particularly from a quantitative viewpoint.
For example, Biber (1988) adopted multidimensional analysis (hereafter MD) in order to make distinctions between genres and to discover their linguistic characteristics. MD analysis identifies sets of linguistic features that often co-occur in texts; these sets are grouped into 6 dimensions or factors (strictly 7), numbered 1-6 (e.g., factor 2 includes ‘past tense’, ‘third person pronoun’ and ‘public verbs’). In addition, the MD approach is based on the idea that if some linguistic features occur frequently in a text, other features will appear less frequently in the same text (Biber 1993). For example, an MD approach revealed that conversational texts are more interactive and involved than academic texts, largely because the former are produced with less time available for high information production, whereas the latter are produced with ample time to achieve a high information content and are highly controlled. Although the MD approach is one of the most well-organised methods of genre analysis, there are also various approaches suggested by many other scholars (see section 3.1).
Compared to the time when genre analysis originated, large genre-specific corpora are now available to enable us to do empirical and extensive genre analyses (e.g. Flowerdew 2002). Using these resources, I examine the characteristics of each genre in a more specific way, by looking at the nature of both word and phrasal behaviour. Before now, various researchers have attempted to conduct genre and text analyses in a range of texts, in particular academic texts. Table 1 summarises the historical methods used by some researchers:
| Research | Method |
|---|---|
| Ure (1971) | Lexical density |
| Leech and Svartvik (1975) | Passive in ‘impersonal writing’ |
| MacDonald et al. (1982) | Readability statistics: sentence length, type/token ratios and FOG analyses |
| Makaya and Bloor (1987) | Hedging |
| Biber (1988) | Multi-dimensional approach |
| Forsyth and Holmes (1996) | Style markers: letters, most frequent words and digrams, two methods of most frequent substring selection. Stylometry problems: authorship, chronology, subject matter |
| Baayen et al. (1996) | Vocabulary richness and the frequency of the top 50 high frequency words |
| Biber et al. (1998) | Features of academic text |
| Kuo (1999) | Personal Pronouns |
| Hyland (2000), (2002) | Discourse-based features: hedges, boosters, metadiscourse markers, directives |
| Coniam (2004) | Content words, keywords, n-gram, personality, passives, hedges |
| Can and Patton (2004) | Word length, type length and token length |
Another research methodology used for genre and text analysis has been developed by Stamatatos et al. (2001). They adopted a variety of statistical methods in a discriminant approach to evaluate texts for the clarification of authorship; low-level measures, sentence length and punctuation mark count, a set of style markers from natural language processing, percentage of rare or foreign words and a measure indicating the morphological ambiguity were all used. Coniam (2004:288) also suggested other methodologies, yet to be taken up, including:
Following previous research in this area, I adopt a variety of methodologies in order to assess text style within genres:
A general reference corpus includes academic texts, newspaper and literature as significant parts. For example, the Baby-BNC, a 4 million word corpus, is compiled from 4 sections: written academic prose; written fiction; written newspaper; and, spoken demographic; each section is about 1 million running words. Thus, for written language texts, these can be divided into three broad categories: academic, newspaper and literature.
The corpora examined in this research are compiled from 6 pre-existing corpora: MicroConcord Corpora A and B; the Lancaster-Oslo/Bergen Corpus of British English (LOB); the Brown corpus (a standard corpus of present-day edited American English); the Freiburg-LOB Corpus of British English (FLOB); and the Freiburg-Brown Corpus of American English (Frown). The MicroConcord corpus is divided into two parts, A and B. MicroConcord A is a 1 million word corpus consisting of texts from the British newspapers The Independent and The Independent on Sunday, while MicroConcord B is a 1 million word corpus of academic articles published by Oxford University Press. Brown, Frown, LOB and FLOB are well-balanced corpora of written American and British English; each is compiled to the same standard, from 500 texts of 2,000 words representing 15 categories, published between the 1960s and 1990s. They all include academic texts at 16% (that is, 160,000 running words); this material is known as ‘learned’ text (written texts on science and technology) and forms text category J. Newspaper material (or press texts) always accounts for 17.6% (176,000 running words) in text categories A-C, and literary works, generally categorized as ‘imaginative prose’, for 25.2% (252,000 running words) in text categories K-R. I combined the matching parts of these corpora to create separate genre corpora. The sizes of the resulting genre corpora are as follows: academic corpus (MicroConcord B + text category J of the 4 corpora), 1,662,106 running words; newspaper corpus (MicroConcord A + text categories A, B and C of the 4 corpora), 1,760,664 running words; literature corpus (text categories K-R of the 4 corpora), 1,019,254 running words. The size of the general reference corpus derived from mixing the 4 corpora (hereafter referred to as the ‘GR’ corpus) was 4,071,830 running words.
Basic statistical data was used to investigate vocabulary variety and difficulty from an empirical viewpoint, in order to investigate general differences between genres. Basic statistical data from the texts were retrieved and calculated using WordSmith (ver.4.0), EXCEL, and by using manual computation. Table 2 provides: 1. the number of tokens; 2. the number of types; 3. standardised type/token ratio (S-TTR); 4. Guiraud value; 5. average word length; and, 6. the ratio of 1-4 letter words. These metrics should help us to understand the relative variety and difficulty of each text from a specific genre.
| | Academic | Newspaper | Literature | GR |
|---|---|---|---|---|
| Tokens (used for WS4 word list) | 1,662,106 | 1,760,664 | 1,019,254 | 4,071,830 |
| Types (distinct words) | 47,481 | 59,222 | 38,181 | 87,727 |
| S-TTR | 40.76 | 48.27 | 45.02 | 44.54 |
| Guiraud value | 36.82 | 44.63 | 37.81 | 43.47 |
| Mean word length | 4.86 | 4.78 | 4.30 | 4.68 |
| Ratio of 1-4 letter words | 55.43% | 55.20% | 62.87% | 57.51% |
The S-TTR indicates the degree of vocabulary variety in a corpus. In calculating this score, the type/token ratio is calculated for each 1,000 words (this is the standard value) in the entire corpus, and a running average is computed. A low value means that many of the same words are used repeatedly. A high value means that the texts include a variety of words, and fewer words are used repeatedly (cf. Help in WordSmith ver. 4.0). The ranked order of the S-TTR value for each genre corpus is 1. newspaper, 2. literature and 3. academic. In addition, the Guiraud value gives an estimation of the same lexical aspect as S-TTR. The Guiraud value is computed as “the score of types divided by the square root of tokens” (Ishikawa 2005:2). The higher the Guiraud value, the greater the variety of vocabulary included in a text. Using this value to compare corpora, the ranked order is again 1. newspaper, 2. literature and 3. academic. Therefore, both S-TTR and Guiraud values suggest that newspaper English uses the most varied vocabulary, literary English an intermediate one, and academic English the least varied, when these estimators of lexical variety are used.
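The two variety measures described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the computation performed inside WordSmith: the simple regular-expression tokenizer, the function names and the decision to discard the final incomplete chunk are all assumptions introduced here.

```python
import math
import re


def tokenize(text):
    """Lowercase the text and return word tokens (an assumed, very simple tokenizer)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def standardised_ttr(tokens, chunk_size=1000):
    """Mean type/token ratio (as a percentage) over consecutive full chunks.

    Assumption: a trailing chunk shorter than chunk_size is discarded.
    """
    ratios = []
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(100.0 * len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)


def guiraud(n_types, n_tokens):
    """Guiraud value: number of types divided by the square root of the number of tokens."""
    return n_types / math.sqrt(n_tokens)
```

Applied to the type and token counts in Table 2, `guiraud(47481, 1662106)` gives about 36.83 for the academic corpus and `guiraud(59222, 1760664)` about 44.63 for the newspaper corpus, which agrees with the reported values to within rounding.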
On the other hand, other statistical data such as average word length and the ratio of 1-4 letter words provide us with a measure of the difficulty and style of a text from a different point of view: the difficulty of words, rather than their variety, is taken into account. Can and Patton (2004:62-63) recommend that “word length occurrence frequency information is a good measure to use in stylometric investigation”, and that “one of the oldest style markers is word length”. Some researchers, however, have opposed the use of word length, for example in authorship studies: Holmes (1985) criticises the use of word length frequencies because of the characteristic of Zipf’s first law (Zipf 1932; Can and Patton 2004:63). Nevertheless, I consider that word length can be a useful index for investigating text difficulty and stylistics. The higher the average word length, the harder the text is to read. From a solely empirical perspective, the inclusion of longer words is taken to mean that texts contain many difficult words. When comparing the data in each genre using this method, the ranked order is 1. academic, 2. newspaper and 3. literature. Thus, academic is the genre that includes more difficult words than the other genres. This conclusion is broadly supported by the ratio of 1-4 letter words. A low ratio of 1-4 letter words represents a more difficult text. When comparing the genre corpora by this value, the order of difficulty is 1. newspaper, 2. academic and 3. literature. In fact, the values for newspaper (55.20%) and academic (55.43%) show almost no difference. Thus, the values of mean word length and ratio of 1-4 letter words suggest that academic texts are the most difficult and stylised, whilst literary texts are the easiest and the least stylised at the vocabulary level.
Thus, text genres could be distinguished by basic lexical statistics, given the differences described in this section.
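The two difficulty measures used in this section, mean word length and the ratio of 1-4 letter words, can be illustrated with a short sketch. The function names are hypothetical, and the input is assumed to be a list of word tokens that has already been cleaned of numbers and punctuation.

```python
def mean_word_length(tokens):
    """Average number of characters per token."""
    if not tokens:
        raise ValueError("no tokens supplied")
    return sum(len(t) for t in tokens) / len(tokens)


def short_word_ratio(tokens, max_len=4):
    """Percentage of tokens that are max_len characters long or shorter."""
    if not tokens:
        raise ValueError("no tokens supplied")
    short = sum(1 for t in tokens if len(t) <= max_len)
    return 100.0 * short / len(tokens)


# Toy example: word lengths are 3, 8, 2 and 5.
sample = ["the", "analysis", "of", "genre"]
print(mean_word_length(sample))   # 4.5
print(short_word_ratio(sample))   # 50.0
```

A higher mean word length and a lower short-word ratio would both point towards a more difficult text in the sense used above.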
Research by Chujo (2004) used 100 word-span frequency word lists based on the British National Corpus (BNC) in order to measure the vocabulary levels of several texts. Short frequency lists are created by dividing the entire BNC frequency list into 100 word-span multiples from the top, e.g. 1-100 most frequent words, 1-200, 1-300 etc. These lists are then used to measure the cover rate of the types and tokens within each text in question to give an estimate of the level of text difficulty.
I use a method similar to that of Chujo (2004). The procedure followed can be divided into two steps: the first is separating the BNC written frequency list, provided by Adam Kilgarriff, into lists of 1,000 word multiples from the top rank (e.g., 1-1,000, 1-2,000, … , 1-10,000), and the second is comparing and measuring the cover rate of tokens between the BNC 1000 short span list and each genre corpus. By using the ‘match list’ function in WordSmith (ver. 4.0), words not matched with BNC short span lists were erased, and only words matched with the BNC short span list were extracted. Finally, the total number of tokens matched with the BNC short span list was calculated, and the cover rate of each corpus with the BNC short span list was computed using EXCEL. Table 3 shows the results of the investigation into cover rate.
| BNC level | 1,000 | 2,000 | 3,000 | 4,000 | 5,000 | 6,000 | 7,000 | 8,000 | 9,000 | 10,000 |
|---|---|---|---|---|---|---|---|---|---|---|
| Academic | 66.73 | 74.54 | 78.77 | 81.69 | 83.8 | 85.43 | 86.67 | 87.68 | 88.52 | 89.24 |
| Newspaper | 64.74 | 72.22 | 76.78 | 79.75 | 81.95 | 83.63 | 84.98 | 86.05 | 86.94 | 87.68 |
| Literature | 71.83 | 78.11 | 81.81 | 84.42 | 86.14 | 87.51 | 88.63 | 89.49 | 90.29 | 90.93 |
| GR | 67.41 | 74.66 | 78.85 | 81.68 | 83.66 | 85.23 | 86.5 | 87.48 | 88.32 | 89 |
The cover rates in Table 3 show the text difficulty from a vocabulary viewpoint. If the cover rate with the BNC list is higher in corpus A than in corpus B, corpus A is easier than corpus B with regard to vocabulary level. As can be seen in Table 3, when the three genre corpora are ranked by difficulty, the order is 1. newspaper, 2. academic and 3. literature. This result almost matches the data given in the previous section (that is to say, newspaper and academic English are more difficult than literary English). The fact that newspapers emerge as more difficult than academic texts could be explained by the fact that academic texts tend to use more set phrases than newspapers, rather than a variety of vocabulary, because set phrases consist largely of basic words such as function words.
In addition, the cover rate for each genre corpus grows slowly beyond the 2,000-word BNC level. This is largely because the 2,000-word level is the most suitable limit for high-frequency words (Nation 2001:14). The General Service List, the classic vocabulary list created by West (1953), also contains about 2,000 words. This vocabulary size has been supported by many different researchers. Nation and Hwang (1995) maintained that 2,000 words is still the best selection for English learners to memorize. For example, research conducted by Sutarsyah, Nation and Kennedy (1994) showed that the first 2,000 high frequency words attained a text coverage of 82.5% in a single economics textbook. Coxhead (1998) also shows that the first 2,000 words cover 76.1% of an academic corpus. Thus, it might be possible to divide text genres using the vocabulary level.
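The cumulative cover-rate calculation behind Table 3 can be sketched as follows. This is an illustrative reconstruction rather than the WordSmith ‘match list’ and EXCEL procedure itself; the function name `cover_rates` and its parameters are assumptions, and `ranked_words` stands in for a frequency-ranked word list such as the BNC written list.

```python
from collections import Counter


def cover_rates(corpus_tokens, ranked_words, step=1000, max_rank=10000):
    """Return {rank: percentage of corpus tokens covered by the top `rank` words}.

    ranked_words must be ordered from most to least frequent.
    """
    if not corpus_tokens:
        raise ValueError("empty corpus")
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens)
    # Drop repeated entries while keeping the ranking order, then cut at max_rank.
    unique_ranked = list(dict.fromkeys(ranked_words))[:max_rank]
    rates = {}
    covered = 0
    for rank, word in enumerate(unique_ranked, start=1):
        covered += counts.get(word, 0)
        # Record a cumulative rate at each multiple of `step` and at the last rank.
        if rank % step == 0 or rank == len(unique_ranked):
            rates[rank] = 100.0 * covered / total
    return rates


# Toy example with a four-word ranked list and a step of two words.
corpus = ["the", "cat", "sat", "on", "the", "mat"]
ranked = ["the", "on", "cat", "dog"]
print(cover_rates(corpus, ranked, step=2, max_rank=4))
```

In the toy example the top two list words cover three of the six tokens (50%), and the top four cover four of the six. With the default `step=1000` and `max_rank=10000`, the function would produce one value per column of Table 3 for a given corpus.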
In this section, I would like to compare the characteristics of vocabularies occurring in academic, newspaper and literary texts in detail. Following the methodology of Coniam (2004), I extracted only content words. In order to remove function words, I first created a stop list of function words by reference to the website Funcword, and then applied it automatically in WordSmith (ver. 4.0). Table 4 lists the 50 most frequent content words in each corpus.
| Academic (1-25) | Newspaper (1-25) | Literature (1-25) | GR (1-25) | Academic (26-50) | Newspaper (26-50) | Literature (26-50) | GR (26-50) |
|---|---|---|---|---|---|---|---|
| TIME | SAID | SAID | SAID | YEARS | MAKE | ROOM | GO |
| SAME | MR | LIKE | TIME | MAKE | GROUP | DOOR | RIGHT |
| MADE | NEW | TIME | NEW | STATE | LONG | FACE | USED |
| NEW | PAGE | JUST | LIKE | GENERAL | MARKET | HEAD | TAKE |
| WAY | YEAR | MAN | MADE | FOUND | OWN | KNEW | COME |
| WELL | TIME | KNOW | WELL | JUST | FOREIGN | ASKED | GREAT |
| LAW | YEARS | SEE | YEARS | IMPORTANT | GOOD | TOLD | MEN |
| LIKE | PEOPLE | GO | JUST | NUMBER | MAN | HAND | SAY |
| DIFFERENT | LIKE | WAY | PEOPLE | POINT | JOHN | SAY | MR |
| CASE | CENT | WELL | WAY | LONG | TAKE | LEFT | THOUGHT |
| WORK | GOVERNMENT | THOUGHT | MAN | POSSIBLE | DAY | LOOK | PART |
| FORMULA | HOME | LOOKED | GOOD | GOOD | WEST | DAY | HOUSE |
| USED | MADE | LITTLE | WORK | FACT | LONDON | PEOPLE | CAME |
| FORM | NEWS | THINK | SEE | CASES | PUBLIC | MAKE | USE |
| LIFE | CITY | COME | LONG | POLITICAL | HIGH | BECAUSE | FOUND |
| GIVEN | WORLD | EYES | MAKE | HIGH | PRESIDENT | WANT | THINK |
| PEOPLE | PARTY | MADE | YEAR | PARTICULAR | SPORT | TOOK | HOME |
| USE | JUST | RIGHT | KNOW | LARGE | END | TURNED | STATE |
| WORLD | WELL | GOING | OWN | CHANGE | NATIONAL | HOUSE | HIGH |
| EXAMPLE | WORK | WENT | LITTLE | ORDER | COMPANY | TAKE | END |
| PART | BRITISH | OLD | LIFE | LATER | STATE | SEEMED | GOING |
| SOCIAL | OLD | CAME | OLD | DEVELOPMENT | LITTLE | SAW | PLACE |
| OWN | BUSINESS | GOOD | WORLD | FURTHER | HOUSE | OWN | LEFT |
| SEE | WAY | AWAY | DAY | GROUP | LIFE | NIGHT | SMALL |
| SYSTEM | WEEK | LONG | SAME | WHETHER | RIGHT | FELT | WENT |
As can be seen in Table 4, each genre corpus shows the characteristics we might expect. For example, the top 50 content words in academic texts include law, different, case, formula, form, example, system, important, possible, development and others. These words are often used in academic text in an it is construction (e.g. it is important / possible / different to say) and also in multi-word units (e.g. in case of and be different from). The newspaper corpus shows government, news, world, business, market, public, president, national, company and others. These words can be categorised as belonging to business, economics and politics vocabularies. The literature corpus includes know, thought, eyes, room, door, face, head, hand, look, people, house, night, felt and others. They can be categorised as ‘parts of the body’, ‘thoughts had by humans’ and ‘pertaining to the house’. These vocabularies are often used for describing people’s motions, actions and situations.
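The extraction of frequent content words with a stop list can be illustrated by the sketch below. The stop list shown is a tiny hypothetical sample, not the function-word list actually used in this study, and the function name is likewise an assumption.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list for illustration only.
FUNCTION_WORDS = {"the", "of", "and", "a", "in", "to", "is", "it", "that", "was"}


def top_content_words(tokens, stop_words=FUNCTION_WORDS, n=50):
    """Return the n most frequent tokens that are not in the stop list."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


tokens = "the law of the case and the law".split()
print(top_content_words(tokens))  # [('law', 2), ('case', 1)]
```

Running the same function over each genre corpus with `n=50` would give lists comparable in form to the columns of Table 4.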
Next, I would like to examine the keywords occurring in the different genre texts. This is because some researchers have doubts about the MD approach; they comment that keyword analyses can provide the same results as an MD approach: e.g. McEnery and Xiao (2005:63) criticise the MD approach as follows:
MDA is undoubtedly a powerful tool in genre analysis. But associated with this power is complexity. The approach is very demanding both computationally and statistically in that it requires expertise not only in extracting a large number of linguistic features from corpora but also in undertaking sophisticated statistical analysis. … [U]sing the keyword function of WordSmith can achieve approximately the same effect as Biber’s MDA. This approach is less demanding as WordSmith can generate wordlists and extract keywords automatically.
In order to extract more genre specific vocabularies, I use the log-likelihood score. The reason for using this statistical score is, first that the corpora sizes are different, and so raw frequencies cannot be compared directly; second, even if occurrences of a word per 1,000 words are given, comparison of a word between genres is still not useful, largely because it is impossible to say whether any divergence between genres is by chance or a substantive one (cf. Leech, Rayson and Wilson 2001:16). The log-likelihood score shows “how significantly characteristic or distinctive of a given variety of language a word is, when its usage in that variety is compared with its usage in another” (Leech, Rayson and Wilson 2001:16). Table 5 provides the top 50 key content words occurring at keyness value of over 227 in each genre corpus.
| Order | Academic key word | Keyness | Newspaper key word | Keyness | Literature key word | Keyness |
|---|---|---|---|---|---|---|
| 1 | CELLS | 1,004.23 | PAGE | 5,612.59 | SAID | 2,175.37 |
| 2 | LAW | 892.39 | MR | 2,759.54 | LOOKED | 900.62 |
| 3 | GENES | 820.39 | YESTERDAY | 2,607.54 | EYES | 777.93 |
| 4 | FORMULA | 807.02 | NEWS | 1,997.35 | LIKE | 750.2 |
| 5 | INHIBITION | 669.1 | SPORT | 1,806.12 | DOOR | 687.55 |
| 6 | LATENT | 576.08 | CENT | 1,689.98 | KNOW | 685.23 |
| 7 | NUTTY | 569.03 | PER | 1,136.05 | KNEW | 547.71 |
| 8 | STIMULUS | 566.81 | LAST | 984.02 | THOUGHT | 530.62 |
| 9 | CASES | 499.83 | YEAR | 937.85 | GET | 501.5 |
| 10 | CELL | 498.77 | CITY | 905.12 | ROOM | 500.72 |
| 11 | LAUTREC | 490.16 | FOREIGN | 859.66 | FACE | 491.43 |
| 12 | MDASH | 482.33 | SHARES | 859.11 | JUST | 476.77 |
| 13 | DIFFERENT | 466.95 | HONG | 713.49 | TURNED | 448.75 |
| 14 | CONTEXT | 423.14 | MARKET | 697.67 | GO | 441.24 |
| 15 | ENGELS | 419.11 | SAID | 659.62 | GOING | 440.72 |
| 16 | THEORY | 414.38 | PARTY | 627.89 | WENT | 440.62 |
| 17 | BEHAVIOUR | 400.35 | KONG | 618 | MAN | 436.31 |
| 18 | TRUST | 396.76 | DOLLARS | 617.23 | THINK | 434.24 |
| 19 | MARX | 396.69 | BRITISH | 598.7 | HEAD | 433.56 |
| 20 | CS | 395.75 | WEEK | 585.07 | AWAY | 413.04 |
| 21 | CHAPTER | 394.02 | GOVERNMENT | 551.65 | OH | 408.24 |
| 22 | SOLUTION | 369.6 | BUSINESS | 534.41 | TELL | 382.26 |
| 23 | NAILS | 359.36 | RUGBY | 495.28 | ASKED | 377.73 |
| 24 | INFECTION | 357.17 | CHAIRMAN | 472.81 | SAW | 369.97 |
| 25 | EXAMPLE | 355.33 | CONFERENCE | 470.84 | SEEMED | 368.19 |
| 26 | SUBJECTS | 336.02 | FOOTBALL | 467.88 | SEE | 360.45 |
| 27 | FORM | 332.83 | PROFITS | 460.83 | STOOD | 337.52 |
| 28 | CASE | 332.58 | ARCHITECTURE | 444.95 | COME | 333.02 |
| 29 | SINGULARITY | 325.46 | BID | 434.45 | VOICE | 332.36 |
| 30 | TRUSTS | 323.27 | WEST | 432.24 | FELT | 328.58 |
| 31 | CONSENT | 322.36 | TEAM | 417.98 | CAME | 319.42 |
| 32 | PATIENT | 320.26 | WEEKEND | 399.53 | LOOK | 313.79 |
| 33 | EVOLUTION | 318.77 | SOVIET | 388.6 | SOMETHING | 312.82 |
| 34 | HITLER | 310.88 | INTERNATIONAL | 383.79 | SAT | 303.5 |
| 35 | STIMULI | 310.2 | MATCH | 368.72 | TOLD | 295.28 |
| 36 | AUTHORITY | 308.21 | HOME | 368.02 | HAIR | 279.5 |
| 37 | THERAPIST | 307.83 | UK | 363.78 | AROUND | 279.36 |
| 38 | SOLUTIONS | 307.61 | LEAGUE | 363.55 | SMILED | 278.76 |
| 39 | WILFRED | 307.35 | PLAYERS | 362.65 | HAND | 276.91 |
| 40 | PARTICULAR | 299.23 | GAME | 357.57 | WANT | 272.18 |
| 41 | PROPERTY | 295.92 | EAST | 351.06 | WALKED | 269.69 |
| 42 | GENE | 290.06 | GROUP | 348.61 | RIGHT | 254.81 |
| 43 | TESTATOR | 284.45 | BRITAIN | 346.42 | MOMENT | 254.02 |
| 44 | EXPOSURE | 283.3 | EUROPEAN | 338.94 | GIRL | 238.64 |
| 45 | CONDITIONING | 281.88 | CORRESPONDENT | 338.78 | MAYBE | 238.06 |
| 46 | ROMAN | 275.5 | WIN | 337.94 | TOOK | 233.23 |
| 47 | SPECIES | 270.6 | STAKE | 337.4 | MORNING | 228.55 |
| 48 | EMBRYO | 270.01 | SHARE | 331.29 | NIGHT | 227.99 |
| 49 | FIG | 267.69 | CUP | 323.64 | CAR | 227.93 |
| 50 | GONORRHOEA | 262.18 | TALKS | 318.13 | LITTLE | 227.53 |
Compared to Table 4, Table 5 provides more specific and authentic vocabularies for each genre. Our intuition tells us that the words in Table 5 should be more genre-specific than those in Table 4, but intuition alone cannot tell us which specific vocabulary items occur in each genre. Thus, a corpus-driven keyword approach confirms what we already know and gives us more specific knowledge.
Overall, the trends revealed by keyword analysis show that nouns were most often keywords in academic (e.g. stimulus, infection, property, species etc.) and newspaper (e.g. profits, architecture, game, European …) texts, whereas literature had more verbs as keywords than any other word class (e.g. said, looked, turned, went). Among the three lists of keywords in Table 5 (150 words altogether), there are no words common to all three genres. In addition, there is only one word common to two genres (said in newspaper and literature). This suggests that keywords can give us information about the different vocabularies used in each genre.
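To make the keyness comparison concrete, the sketch below computes a log-likelihood score for a single word from its frequency in a study corpus and in a reference corpus. It uses a commonly applied two-term form of the statistic; this is an assumption for illustration, and it is not claimed to be the exact calculation that WordSmith performs. The function and parameter names are hypothetical.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-term log-likelihood score for one word across two corpora.

    freq_*: occurrences of the word; size_*: total running words in each corpus.
    """
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected frequencies if the word were equally likely in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    score = 0.0
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2.0 * score


# Same relative frequency in both corpora: no evidence of a difference.
print(log_likelihood(10, 1000, 20, 2000))   # 0.0
# Much higher relative frequency in the study corpus: a large positive score.
print(log_likelihood(50, 1000, 10, 2000))
```

Because the expected frequencies take corpus size into account, the score remains comparable when the corpora differ in size, which is the first reason given above for preferring it to raw frequencies.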
I would also like to compare multi-word units between genre corpora, in particular 4-word units occurring in each genre corpus. Coniam (2004) used KfNgram (Fletcher 2002) to compute 4-word units occurring in specific genre texts taken from applied linguistics articles. However, not only KfNgram but also some concordancing programs have n-gram functionality, although some of them give it different names (e.g. ‘cluster’ in WordSmith, ‘wordgrams’ in KfNgram and ‘N-Gram’ in AntConc). I used WordSmith (ver. 4.0), developed by Mike Scott at the University of Liverpool, to calculate the most frequent 4-word units for my corpora. The cut-off point for detecting units was set at a frequency of over 30. Also, because raw phrase lists often include phrases containing numbers and error words (e.g. phrases including numbers: # PER CENT AND, A # YEAR OLD, # # AND # etc.), these were removed manually; such examples do not provide information useful for the characterisation of text genres. Table 6 shows the comparable 4-word unit lists for the three genre corpora.
| Order | Academic 4-word unit | Freq. | Newspaper 4-word unit | Freq. | Literature 4-word unit | Freq. |
|---|---|---|---|---|---|---|
| 1 | THE END OF THE | 191 | BUSINESS AND CITY PAGE | 507 | THE REST OF THE | 74 |
| 2 | IN THE CASE OF | 184 | PER CENT OF THE | 199 | AT THE SAME TIME | 72 |
| 3 | AT THE SAME TIME | 174 | FOR THE FIRST TIME | 189 | IN FRONT OF THE | 69 |
| 4 | ON THE OTHER HAND | 169 | THE END OF THE | 175 | FOR THE FIRST TIME | 66 |
| 5 | AT THE END OF | 129 | AT THE END OF | 138 | IN THE MIDDLE OF | 58 |
| 6 | ON THE BASIS OF | 121 | THE REST OF THE | 107 | THE END OF THE | 57 |
| 7 | AS A RESULT OF | 120 | AT THE SAME TIME | 103 | THE EDGE OF THE | 56 |
| 8 | IN TERMS OF THE | 106 | SECRETARY OF STATE FOR | 98 | THE MIDDLE OF THE | 52 |
| 9 | THE NATURE OF THE | 87 | IS ONE OF THE | 92 | AT THE END OF | 47 |
| 10 | AS WELL AS THE | 83 | ONE OF THE MOST | 89 | THE SIDE OF THE | 47 |
| 11 | THAT THERE IS A | 77 | AS A RESULT OF | 86 | THE BACK OF THE | 43 |
| 12 | IN THE ABSENCE OF | 76 | A MEMBER OF THE | 80 | ON THE OTHER SIDE | 42 |
| 13 | ONE OF THE MOST | 75 | THE SECRETARY OF STATE | 73 | THE TOP OF THE | 41 |
| 14 | THE FACT THAT THE | 74 | WILL BE ABLE TO | 71 | WAS ONE OF THE | 41 |
| 15 | IS LIKELY TO BE | 73 | IN THE UNITED STATES | 67 | THE OTHER SIDE OF | 39 |
| 16 | PER CENT OF THE | 73 | VIEW FROM CITY ROAD | 64 | FOR A LONG TIME | 37 |
| 17 | IN THE CONTEXT OF | 72 | BY THE END OF | 63 | HE WAS GOING TO | 36 |
| 18 | IN THE FORM OF | 72 | ON THE OTHER HAND | 62 | OTHER SIDE OF THE | 35 |
| 19 | IN THE UNITED STATES | 72 | THE FIRST TIME IN | 60 | AND THERE WAS A | 34 |
| 20 | THE EXTENT TO WHICH | 69 | THE FACT THAT THE | 59 | I DON’T WANT TO | 34 |
| 21 | FOR THE FIRST TIME | 66 | THE LABOUR PARTY CONFERENCE | 59 | IN FRONT OF HIM | 34 |
| 22 | IT IS POSSIBLE TO | 63 | IN THE FIRST HALF | 52 | IT WOULD HAVE BEEN | 32 |
| 23 | ON THE ONE HAND | 63 | IN THE CASE OF | 50 | ON THE OTHER HAND | 32 |
| 24 | THE WAY IN WHICH | 63 | WAS ONE OF THE | 50 | I WANT YOU TO | 30 |
| 25 | AT THE TIME OF | 62 | IS LIKELY TO BE | 47 | WAS GOING TO BE | 30 |
| 26 | IT IS CLEAR THAT | 59 | IN THE FORM OF | 46 | I DON’T KNOW WHAT | 29 |
| 27 | IN THE COURSE OF | 56 | PER CENT IN THE | 46 | IF YOU WANT TO | 29 |
| 28 | THE REST OF THE | 56 | THE UNITED STATES AND | 46 | THE TWO OF THEM | 29 |
| 29 | IT IS IMPORTANT TO | 55 | IN THE MIDDLE OF | 45 | BUT THERE WAS NO | 28 |
| 30 | IT IS DIFFICULT TO | 53 | ON THE BASIS OF | 45 | ON THE EDGE OF | 28 |
| 31 | AS WE HAVE SEEN | 52 | AS WELL AS THE | 43 | THE BACK OF HIS | 28 |
| 32 | AT THE BEGINNING OF | 52 | ONE OF THE FEW | 43 | AT THE TOP OF | 27 |
| 33 | THE DEVELOPMENT OF THE | 52 | AS ONE OF THE | 42 | FROM TIME TO TIME | 26 |
| 34 | THE CASE OF THE | 51 | IN THE WAKE OF | 42 | IN THE LIVING ROOM | 26 |
| 35 | IN THE PRESENCE OF | 50 | THE BANK OF ENGLAND | 42 | ARE YOU GOING TO | 25 |
| 36 | TO THE EXTENT THAT | 49 | IN THE FACE OF | 41 | NOTHING TO DO WITH | 25 |
| 37 | ON THE PART OF | 48 | PER CENT STAKE IN | 41 | THE BOTTOM OF THE | 25 |
| 38 | THE BEGINNING OF THE | 48 | AT THE AGE OF | 39 | TURNED OUT TO BE | 25 |
| 39 | THE EXISTENCE OF A | 47 | IN THE SECOND HALF | 38 | YOU WANT ME TO | 25 |
| 40 | THE TIME OF THE | 46 | WILL HAVE TO BE | 38 | IN THE FIRST PLACE | 24 |
| 41 | TO BE FOUND IN | 46 | OF THE UNITED STATES | 37 | IT WAS AS IF | 24 |
| 42 | A LARGE NUMBER OF | 45 | AT A TIME WHEN | 36 | HE LOOKED AT HER | 23 |
| 43 | IT IS NECESSARY TO | 45 | IN AN ATTEMPT TO | 36 | IT HAD BEEN A | 23 |
| 44 | TO BE ABLE TO | 45 | IN THE FIRST PLACE | 36 | ON THE BACK OF | 23 |
| 45 | OF THE FÜHRER | 44 | THE BEGINNING OF THE | 36 | SHE WAS GOING TO | 23 |
| 46 | THE BASIS OF THE | 44 | IN THE YEAR TO | 35 | THE FRONT OF THE | 23 |
| 47 | BE FOUND IN THE | 43 | THE HEAD OF THE | 35 | TO BE ABLE TO | 23 |
| 48 | IN THE SENSE THAT | 42 | THE FIRST HALF OF | 34 | AS IF HE WERE | 22 |
| 49 | IN THIS CASE THE | 42 | AT THE BEGINNING OF | 33 | AT THE FAR END | 22 |
| 50 | IS ONE OF THE | 42 | ON BEHALF OF THE | 33 | IN FRONT OF HER | 22 |
N-grams are able to identify the commonest collocations in a discourse far more effectively than a single word analysis. As can be seen in Table 6, there is an overall tendency toward using multi-word fixed units in academic texts as opposed to other genres. One of the specific characteristics that can be seen only in academic texts is the use of the it is construction, such as it is possible to, it is clear that, it is important to and it is difficult to. In addition, academic texts show many more prepositional phrases than the other two genres; for example, when focusing on in * of phrases, academic texts contain many examples, e.g. in the case of, in terms of the, in the absence of, in the context of, in the form of, in the course of and in the presence of. On the other hand, the 4-word units in the newspaper corpus show many government-related and economic phrases, such as secretary of state for, the secretary of state, in the united states, the labour party conference, the united states and, the bank of England and of the united states. Also, the general purpose of newspapers is to function as a medium for relaying facts about what is happening in the world. Such a function can be detected in the phrase the fact that the. By contrast, the 4-word units most frequently occurring in the literature corpus tend to contain colloquial expressions describing people’s thoughts, ideas, feelings, wishes, actions and motions with the use of pronouns. Examples include he was going to, I don’t want to, I want you to, I don’t know what, if you want to, are you going to, you want me to, he looked at her, she was going to, as if he were and in front of her. These differences in vocabulary and phrasal units would also play an important role in categorizing text genre.
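The extraction of frequent 4-word units can be sketched as follows. This is a simplified illustration, not the WordSmith cluster function: it ignores sentence boundaries, the function name is hypothetical, and the automatic filter for units containing numbers (written either as digits or as the placeholder `#`) only approximates the manual clean-up described earlier.

```python
from collections import Counter


def frequent_ngrams(tokens, n=4, min_freq=30):
    """Return (n-gram, frequency) pairs occurring more than min_freq times.

    N-grams containing a digit-only token or the placeholder '#' are dropped.
    """
    grams = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    result = []
    for gram, freq in grams.most_common():
        if freq <= min_freq:
            continue
        if any(tok.isdigit() or tok == "#" for tok in gram):
            continue
        result.append((" ".join(gram), freq))
    return result


tokens = "at the end of the day at the end of the week".split()
print(frequent_ngrams(tokens, n=4, min_freq=1))
# [('at the end of', 2), ('the end of the', 2)]
```

With `min_freq=30` and a full genre corpus as input, the output would take the same form as one pair of columns in Table 6.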
One of the most important investigators of personality in texts is Kuo (1999), who examined the use of personal pronouns in academic texts from an empirical viewpoint. Personal pronouns create an interpersonal interaction between the writer and the readers (Kuo 1999:123). For example, Rounds (1987) shows that teachers tend to avoid third-person pronouns and instead use first-person pronouns in a sense that includes the third person. Personality as a linguistic feature can therefore contribute much to the pragmatic analysis of written texts.
In general, academic texts tend not to use the first-person singular forms I, my and me but instead use we to “reduce personal attribution” (Kuo 1999:125). This is because academic texts argue from an objective rather than a subjective perspective; that is, the data speak for themselves. The use of we falls into two categories depending on the context: inclusive and exclusive. The former includes the target readers (or hearers) while the latter does not (Kuo 1999:126). In addition, the use of we rather than I in academic texts implies that ‘the author’ and ‘the reader’ or ‘other researchers’ agree to follow the process of the argument, and its inclusiveness makes the discussion appear more ‘objective’. Inclusive we thus creates an environment in which there is greater contact and greater solidarity between writer and reader, whereas I creates an environment which is more informal, individual and personal (Coniam 2004:283).
Similarly, in academic texts, the passive construction tends to be overused when compared to the active construction, largely because the former is more impersonal. For example, Kuo (1999:122) tells us that scientific articles are usually thought to be impersonal, and tend to use nominalisation and the passive voice to achieve this effect. Goatly (2000:94) proposed that the impersonal construction, such as passives or nominalisation, creates a stance in which there is a “more distant authorial position with the effect that they reduce personality”, that is to say, they add objectivity.
In order to reveal such subjective and objective stylistic characteristics in genres, I investigate the use of personality and passives in each genre corpus. Because the letter i also occurs in lower case in forms such as i.e., I separated the genuine personal-subject I from other forms by manual editing. Passive constructions were identified by annotating the raw corpus files with POS tags using Brill’s Tagger and then searching for the combination ‘be verb + past participle’. Table 7 shows the number of occurrences and the frequency per 1,000 words of I, we and passives in each genre corpus (see also the word-list rankings of I and we).
| | I | We | Passive |
|---|---|---|---|
| Academic | 3,028 (55th) / 1.82 per 1,000 | 4,641 (35th) / 2.79 per 1,000 | 28,613 / 17.21 per 1,000 |
| Newspaper | 4,450 (40th) / 2.52 per 1,000 | 3,182 (52nd) / 1.80 per 1,000 | 17,400 / 9.88 per 1,000 |
| Literature | 14,508 (9th) / 14.23 per 1,000 | 2,645 (53rd) / 2.59 per 1,000 | 13,561 / 13.30 per 1,000 |
| GR | 24,447 (20th) / 6.00 per 1,000 | 11,095 (40th) / 2.72 per 1,000 | 61,827 / 15.18 per 1,000 |
The frequency ratios in Table 7 show that literature (14.23) uses I far more often than academic texts (1.82) and newspapers (2.52), while academic texts (2.79) and literature (2.59) use we more than newspapers (1.80). Academic texts (17.21) use the passive voice much more than literature (13.30) and newspapers (9.88). Thus, academic texts tend overall to be more impersonal than the other genres: we and the passive voice are used more often, whilst I is used less than in the other two genres. Literary texts can be said to be more personal because they use I far more often than the other genres. Newspaper texts make comparatively little use of I, we or the passive voice, largely because the style of newspapers is impersonal but direct, using the active voice to report news and events happening in daily life. The passive counts in particular show that a tagged corpus opens up possibilities for text analysis that are unavailable with plain text alone.
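The search for ‘be verb + past participle’ in a tagged corpus can be sketched as follows. This is an illustration under stated assumptions, not the script used for Table 7: it assumes word/TAG output with Penn Treebank style tags (VBN for past participles, RB for adverbs), and the function names, the optional adverb gap and the sample sentence are all hypothetical.

```python
BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}


def parse_tagged(text):
    """Split 'word/TAG word/TAG ...' into (word, tag) pairs."""
    pairs = []
    for item in text.split():
        word, _, tag = item.rpartition("/")
        pairs.append((word, tag))
    return pairs


def find_passives(pairs, max_adverbs=0):
    """Return (be form, participle) pairs where a form of 'be' is followed by a
    VBN-tagged word, allowing up to max_adverbs RB-tagged words in between."""
    hits = []
    for i, (word, _) in enumerate(pairs):
        if word.lower() not in BE_FORMS:
            continue
        j = i + 1
        skipped = 0
        while j < len(pairs) and pairs[j][1] == "RB" and skipped < max_adverbs:
            j += 1
            skipped += 1
        if j < len(pairs) and pairs[j][1] == "VBN":
            hits.append((word, pairs[j][0]))
    return hits


# Invented example in word/TAG format; not taken from the corpora.
tagged = ("The/DT data/NNS were/VBD collected/VBN and/CC the/DT results/NNS "
          "are/VBP clearly/RB shown/VBN ./. She/PRP was/VBD happy/JJ ./.")
pairs = parse_tagged(tagged)
strict = find_passives(pairs)                 # be form immediately followed by VBN
loose = find_passives(pairs, max_adverbs=1)   # one intervening adverb allowed
print(strict)
print(loose)
```

A pattern of this kind will also match adjectival participles (e.g. was tired), so the resulting counts are best read as an upper estimate of true passives.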
Both intuition and the data in Table 7 suggest that academic texts are usually more impersonal and formal than popular writing such as newspapers and magazines. However, some linguists maintain that this is not true of all academic writing styles (e.g. Coniam 2004:274). Ivanic and Simpson (1992:167) showed how writers evolve their own academic styles. Swales (1990:128) also proposed that well-organised and high-quality academic writing does not always follow expected or accepted linguistic or rhetorical conventions. For example, Ard (1983) pointed out that Chomsky’s writing has changed, using the first-person pronoun much more often in his later than in his earlier writing. Therefore, the results in Table 7 can be seen as general tendencies in the use of personality in each genre, and may not always be applicable to individual works. These significant differences in the use of personality show the quite different styles used in each text genre, suggesting that the occurrence of personality is also a key factor for categorising text genres.
Hyland (2000:188-189) provides 108 hedges indicating doubt or certainty. I now compare the three corpora by investigating the occurrence of these hedges in each genre corpus. The 108 hedges can be divided into two groups: single words (93 items) and two-word units (15 items). Table 8 shows the occurrence of all 93 single-word hedges per 1,000 words in each corpus.
| Corpus | Amount of hedging |
|---|---|
| Academic | 25336 (15.24/1,000) |
| Newspaper | 18405 (10.45/1,000) |
| Literature | 13518 (13.26/1,000) |
| General Reference | 51572 (12.66/1,000) |
The data given by Coniam (2004) on the use of hedges raise some doubts. His research reports that hedges occur 1.85 and 0.76 times per 1,000 words respectively in two academic corpora compiled from applied linguistics articles. However, Hyland’s hedges include many high-frequency words such as about, may, doubt, seem and suggest. For example, about occurs 2,504 times in the academic texts used in this research, so this single word already accounts for 1.50 occurrences per 1,000 words. On this evidence, the figures reported in Coniam (2004) may have been miscalculated.
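The arithmetic behind this comparison can be checked with a short calculation. The sketch below uses only figures quoted in the text; the size of the academic corpus is not stated in this section, so it is inferred from the count and rate in Table 8, which is an assumption, and the function names are illustrative.

```python
def implied_corpus_size(count, rate_per_thousand):
    """Recover an approximate corpus size from a raw count and its per-1,000 rate."""
    return count / rate_per_thousand * 1000


def per_thousand(count, corpus_size):
    """Normalise a raw count to occurrences per 1,000 words."""
    return count / corpus_size * 1000


# Figures quoted in the text: Table 8 (academic corpus) and the count for 'about'.
academic_hedges, academic_rate = 25336, 15.24
about_count = 2504

# Assumption: the corpus size is inferred from Table 8, not taken from the paper.
size = implied_corpus_size(academic_hedges, academic_rate)
about_rate = per_thousand(about_count, size)
print(round(size), about_rate)  # about_rate comes out at roughly 1.5 per 1,000
```

On these figures the single word about is already more frequent than the lower of the two totals reported by Coniam (0.76 per 1,000) for all hedges combined.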
Next, I show the occurrences of 15 hedges, each consisting of two words, using AntConc (ver. 3.2.0) as follows:
| | Academic | Newspaper | Literature | GR |
|---|---|---|---|---|
| a certain | 172 | 75 | 57 | 288 |
| certain extent | 6 | 0 | 0 | 9 |
| consistent with | 50 | 13 | 0 | 58 |
| general sense | 4 | 0 | 0 | 3 |
| I believe | 37 | 50 | 51 | 168 |
| I claim | 2 | 1 | 0 | 3 |
| in general | 165 | 40 | 9 | 201 |
| in theory | 28 | 18 | 4 | 39 |
| more or less | 69 | 32 | 16 | 95 |
| not always | 82 | 36 | 10 | 108 |
| not necessarily | 89 | 32 | 6 | 103 |
| open to question | 4 | 1 | 0 | 4 |
| our belief | 1 | 2 | 0 | 3 |
| provided that | 34 | 6 | 2 | 37 |
| seen as | 124 | 81 | 4 | 110 |
| Total | 867 (0.52/1,000) | 387 (0.21/1,000) | 159 (0.15/1,000) | 1229 (0.30/1,000) |
As Table 9 shows, academic texts also tend to use two-word hedges more than the other two genre corpora. The total for academic texts (0.52 / 1,000) is over twice that of newspapers (0.21 / 1,000), and over three times that of literary texts (0.15 / 1,000). Researchers such as Salager-Meyer (1994) and Hyland (1994) suggest that hedging is often used in academic discourse, and the results in Tables 8 and 9 support this idea. This is largely because, by “showing modesty by tentative statements and inviting readers to draw inferences by themselves, hedging assists writers to avoid overstating an assertion and to establish a relationship with readers” (Kuo 1999:133).
However, there are exceptions. For example, this cannot be said of I believe, which occurs at 0.022 / 1,000 in academic texts, 0.028 / 1,000 in newspapers, 0.05 / 1,000 in literature and 0.041 / 1,000 in the general reference corpus. The low score in academic texts is closely connected with the use of I in this hedge. Moreover, there may be flaws in the methodology used to investigate hedging, because some hedges listed by Hyland (2000), for example might or wrongly, “may well have be used in contexts where they have meanings other that those for hedging purposes” (Coniam 2004:287). Therefore, unless semantic annotation is adopted, it is difficult to obtain accurate counts of ‘real’ hedges from corpora automatically. On the whole, however, these significant differences in the frequency of hedges can serve as one of a set of indices of the differences amongst text genres and of the categorisation of text genres and styles.
In this section, I focus on the distribution of nominalization in each genre corpus. Biber et al. (1998:58) suggest that, “studying a morphological characteristic in a corpus can teach us both about the frequency and distribution of the characteristic and about the differing functions of particular variants”. They examined the use of nominalization in academic prose, fiction and speech by comparing, 1. the frequency of nominalization per one million words in each genre, and, 2. the proportion of nominalizations formed with each suffix in each genre. Following this methodology, I investigate the use of nominalization in the genre corpora. Here, I follow Biber et al. (1998) and assume that nominalization creates forms ending in -tion / -sion, -ness, -ment and -ity, including plural forms. The endings -tion and -sion are searched for rather than -ion because, for example, “a search for all words ending in -ion would locate many words that were not nominalizations (e.g. cushion, dandelion). In contrast, searching for -sion provides a much more accurate identification of nominalizations (e.g., decision, division, discussion, expansion, extension, submission), although a few inaccurate items will still be included (e.g. mansion)” (Biber et al. 1998:59). The following table shows the frequency distribution of nominalizations across the three genres and the GR corpus.
| | Academic | Newspaper | Literature | GR |
|---|---|---|---|---|
| -tion | 29160 (55%) | 17491 (43%) | 4001 (41%) | 45735 (50%) |
| -sion | 4390 (8%) | 4224 (10%) | 998 (10%) | 8112 (9%) |
| -ness | 2081 (4%) | 2717 (7%) | 1339 (14%) | 5452 (6%) |
| -ment | 8036 (15%) | 8757 (22%) | 1998 (20%) | 16344 (18%) |
| -ity | 9222 (18%) | 7381 (18%) | 1476 (15%) | 16146 (17%) |
| Total | 52889 (31.82 per 1,000) | 40570 (23.04 per 1,000) | 9812 (9.62 per 1,000) | 91789 (22.54 per 1,000) |
Not only the overall frequency but also the distribution of the different nominalization types in each genre is important when considering linguistic characteristics. As can be seen in Table 10, -tion and -sion together account for more than half of the nominalizations in each register: 63% in academic texts; 53% in newspapers; 51% in literary works; and 59% in general reference texts. However, the other forms of nominalization reveal differences across the genres. Academic texts use nominalizations ending in (in ranked order) -ity and -ment, and, at a much lower frequency, -ness. Newspapers show a similar pattern to academic texts, except that the -ment form surpasses the -ity form. For example, the -ment nominalizations in newspaper texts include management, government, argument, investment, readjustment, replacement and others. In fact, these words rarely occur in other genres. Literary works, on the other hand, use these three nominalizations almost equally, and the -ness suffix in particular is more important in this genre than in the others. Biber et al. (1998:65) touched on this point as follows:
The -ness ending generally converts adjectives into nouns that often describe personal qualities. Fiction uses a number of these -ness nouns that are rarely found in the other registers: awareness, bitterness, darkness, goodness, happiness, politeness, weakness.
Moreover, the overall frequency of nominalization shows that academic texts use nominalizations more than three times as often as literature does. Newspapers show almost the same frequency as the general reference corpus, whereas literature uses them much less than the other genres. The comparison of academic, fiction and speech texts by Biber et al. (1998:60) gave a similar result: they found that, “while fiction and speech have similar frequencies, academic prose has a frequency almost four times greater.” Therefore, the amount of nominalization also plays an important role in identifying and representing the different styles of text genres.
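The suffix-based search used in this section can be sketched as a set of end-of-word patterns. The patterns, function names and word list below are illustrative assumptions (the words are examples mentioned in the text), and the percentage check uses the newspaper column of Table 10.

```python
import re
from collections import Counter

# One pattern per suffix named in the text, each allowing a plural ending.
SUFFIXES = {
    "-tion": r"tions?$",
    "-sion": r"sions?$",
    "-ness": r"ness(?:es)?$",
    "-ment": r"ments?$",
    "-ity": r"it(?:y|ies)$",
}


def nominalization_profile(tokens):
    """Count tokens by the first suffix pattern they match; other tokens are ignored."""
    counts = Counter()
    for token in tokens:
        word = token.lower()
        for label, pattern in SUFFIXES.items():
            if re.search(pattern, word):
                counts[label] += 1
                break
    return counts


def shares(counts):
    """Express each suffix count as a rounded percentage of all nominalizations."""
    total = sum(counts.values())
    return {label: round(100 * value / total) for label, value in counts.items()}


# Example words mentioned in the text, used here only to exercise the patterns.
words = ["decision", "division", "mansion", "cushion", "dandelion",
         "darkness", "happiness", "management", "governments", "activities"]
profile = nominalization_profile(words)
print(profile)

# Newspaper column of Table 10.
newspaper = {"-tion": 17491, "-sion": 4224, "-ness": 2717, "-ment": 8757, "-ity": 7381}
newspaper_shares = shares(newspaper)
print(newspaper_shares)
```

As the quotation from Biber et al. anticipates, cushion and dandelion are skipped while mansion is still counted, so a purely orthographic search of this kind needs some manual checking.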
In this section, I automatically categorize the sub-divided text types into three genres from the perspectives of vocabulary, text analysis and statistics. Various kinds of text analysis exist depending on the research purpose, e.g. authorship attribution, stylistics, text typology and variation studies such as register variation, regional variation, social variation, authorial variation and chronological variation. For example, Burrows (1987) conducted critical research on Jane Austen’s novels by using multivariate analysis of the 12-60 most common words. This examination investigated Austen’s narrative style, character differentiation through idiolects and free indirect discourse. The results showed that multivariate approaches enable us to use computers to assist with literary criticism and with literary and linguistic stylistics, in identifying stylistic ‘fingerprints’, authorship attribution, stylistic imitation and register variation (Tabata 2002). This section focuses on the investigation of genre (or register variation) using multivariate analyses.
The GR corpus is divided into 15 text categories: A (Press: Reportage); B (Press: Editorial); C (Press: Reviews); D (Religion); E (Skills, Trades and Hobbies); F (Popular Lore); G (Belles Lettres, Biography and Essays); H (Miscellaneous: Government Documents, Industrial Reports, etc.); J (Learned and Scientific Writings); K (General Fiction); L (Mystery and Detective Fiction); M (Science Fiction); N (Adventure and Western Fiction); P (Romance and Love Story); R (Humour). Table 11 gives the number of tokens in each text category of the GR corpus.
| | Text | Words | | Text | Words | | Text | Words |
|---|---|---|---|---|---|---|---|---|
| 1 | A | 676,470 | 6 | F | 702,362 | 11 | L | 193,191 |
| 2 | B | 217,394 | 7 | G | 947,602 | 12 | M | 48,275 |
| 3 | C | 137,788 | 8 | H | 243,692 | 13 | N | 233,556 |
| 4 | D | 137,018 | 9 | J | 1,929,776 | 14 | P | 233,920 |
| 5 | E | 297,667 | 10 | K | 233,856 | 15 | R | 72,533 |
Now I categorize the 15 text categories into three genres: academic texts, newspaper and literature using a corpus-driven multivariate analysis. A general significance reference test like the chi-squared test cannot compute the characteristics based on the complex inter-relationships across large numbers of texts, and so we need to adopt a multivariate approach to perform such computationally intensive studies. McEnery and Wilson (2001:88) summarise the necessity of multivariate analyses as follows:
[It] would not be possible using tests such as the chi-squared test to examine the vocabulary relations between five different genres, except on a word-by-word basis. To perform such holistic comparisons for large numbers of variables we need a different type of statistical technique — a multivariate one.
It is significant that Biber’s MD approach is based on hyper-textual levels, and that a dimension score is calculated for each text on each dimension. The mean dimension score for each genre is then calculated in order to characterize any given text or genre. As a result, the output shows that genres can be very similar on one dimension while markedly different on others. Although multivariate analysis is similar to the MD approach, the two differ, strictly speaking, in their theoretical basis. Multivariate analyses are computed from values derived from cross-tabulation, and their purpose is to show the statistical similarities and differences across the various sample categories. Thus, a multivariate analysis is an analysis of a large number of linguistic features across many texts and text types using statistical techniques. It is used for various purposes, such as the linguistic analysis of texts, text types, styles or genres (or registers). The underlying assumption is that different kinds of text differ in their functions at the linguistic level, and multivariate analyses make it possible to examine these differences quantitatively. As noted, Biber’s MD approach is similar to a multivariate analysis in being based on the assumption that multiple parameters of variation will be operative in any discourse domain. However, the MD approach, being divided into 6 dimensions, is more complex and awkward to compute. For this reason, this paper uses multivariate analysis. Various multivariate analyses are available, such as principal components analysis, factor analysis, correspondence analysis and cluster analysis.
In conducting the multivariate analysis, the present research uses the top 100 high-frequency content words. Content words are used because function words tend to occur at high rank in every genre and so do not provide sufficiently significant differences across text genres. Table 12 shows the top 100 content words occurring in the 4-million-word GR corpus.
| Rank | Word | Rank | Word | Rank | Word | Rank | Word | Rank | Word |
|---|---|---|---|---|---|---|---|---|---|
| 1 | SAID | 21 | WORLD | 41 | HIGH | 61 | LATER | 81 | WHITE |
| 2 | TIME | 22 | DAY | 42 | END | 62 | GENERAL | 82 | WOMEN |
| 3 | NEW | 23 | SAME | 43 | GOING | 63 | YOUNG | 83 | FACE |
| 4 | LIKE | 24 | GO | 44 | PLACE | 64 | CALLED | 84 | IMPORTANT |
| 5 | MADE | 25 | RIGHT | 45 | LEFT | 65 | AMERICAN | 85 | SYSTEM |
| 6 | YEARS | 26 | USED | 46 | SMALL | 66 | LOOK | 86 | NIGHT |
| 7 | PEOPLE | 27 | TAKE | 47 | WENT | 67 | NEED | 87 | EYES |
| 8 | WAY | 28 | COME | 48 | COURSE | 68 | POINT | 88 | HALF |
| 9 | MAN | 29 | GREAT | 49 | WAR | 69 | ASKED | 89 | THINGS |
| 10 | GOOD | 30 | MEN | 50 | GOVERNMENT | 70 | CHILDREN | 90 | DIFFERENT |
| 11 | WORK | 31 | SAY | 51 | HAND | 71 | WANT | 91 | LOCAL |
| 12 | SEE | 32 | THOUGHT | 52 | PUT | 72 | ROOM | 92 | BEST |
| 13 | LONG | 33 | PART | 53 | NUMBER | 73 | FIND | 93 | POWER |
| 14 | MAKE | 34 | HOUSE | 54 | TOLD | 74 | HEAD | 94 | DAYS |
| 15 | YEAR | 35 | CAME | 55 | FACT | 75 | SCHOOL | 95 | NATIONAL |
| 16 | KNOW | 36 | USE | 56 | SET | 76 | LARGE | 96 | SIDE |
| 17 | OWN | 37 | FOUND | 57 | PUBLIC | 77 | WATER | 97 | SOCIAL |
| 18 | LITTLE | 38 | THINK | 58 | CASE | 78 | BETTER | 98 | FORM |
| 19 | LIFE | 39 | HOME | 59 | GIVEN | 79 | GIVE | 99 | POSSIBLE |
| 20 | OLD | 40 | STATE | 60 | TOOK | 80 | LOOKED | 100 | EARLY |
Next, in order to conduct the multivariate analyses, I compute the raw frequency and the frequency per 1,000 words of each of the top 100 content words in Table 12 for each text category. Correspondence analysis can be computed with raw frequency data because it is based on a pattern matching system. However, principal components analysis and cluster analysis should be based on frequency ratios. Tomoji Tabata (personal communication, 12/11/2006), Osaka University, Japan, comments that, as a correspondence analysis maximizes the inter-correlation matrix in its computation, raw frequencies and frequency ratios (e.g., per 1,000 or per 1 million) give almost the same outcome for high-frequency linguistic items. On the other hand, as principal components analysis and cluster analysis are based on correlation coefficients and covariance coefficients, these two analyses should not be computed from raw frequencies. This is largely because differences in corpus size affect the outcome of the computation, although this does not apply to equal-sized corpora, as in this case. Table 13 shows the raw frequencies of the top 100 content words, and Table 14 shows their frequency per 1,000 words.
| | 1 | 2 | 3 | 4 | 5 | 6 | 7-98 | 99 | 100 |
|---|---|---|---|---|---|---|---|---|---|
| | said | time | new | like | made | years | … | possible | early |
| A | 2,018 | 516 | 790 | 313 | 387 | 455 | … | 90 | 125 |
| B | 184 | 344 | 451 | 252 | 195 | 293 | … | 80 | 50 |
| C | 64 | 173 | 267 | 221 | 137 | 153 | … | 22 | 45 |
| D | 116 | 165 | 325 | 100 | 130 | 113 | … | 54 | 74 |
| E | 139 | 500 | 594 | 350 | 342 | 313 | … | 132 | 117 |
| F | 238 | 663 | 521 | 401 | 367 | 473 | … | 114 | 176 |
| G | 486 | 950 | 934 | 779 | 618 | 797 | … | 194 | 276 |
| H | 135 | 350 | 485 | 74 | 403 | 337 | … | 120 | 80 |
| J | 143 | 829 | 561 | 372 | 623 | 416 | … | 408 | 260 |
| K | 1,093 | 449 | 170 | 703 | 261 | 182 | … | 43 | 49 |
| L | 1,069 | 387 | 101 | 459 | 237 | 112 | … | 59 | 35 |
| M | 184 | 107 | 39 | 124 | 46 | 39 | … | 5 | 11 |
| N | 1,254 | 443 | 145 | 562 | 271 | 135 | … | 41 | 40 |
| P | 1,133 | 444 | 155 | 654 | 276 | 195 | … | 44 | 66 |
| R | 304 | 148 | 76 | 181 | 74 | 76 | … | 15 | 11 |
| | 1 | 2 | 3 | 4 | 5 | 6 | 7-98 | 99 | 100 |
|---|---|---|---|---|---|---|---|---|---|
| | said | time | new | like | made | years | … | possible | early |
| A | 2.98 | 0.76 | 1.17 | 0.46 | 0.57 | 0.67 | … | 0.13 | 0.18 |
| B | 0.85 | 1.58 | 2.07 | 1.16 | 0.9 | 1.35 | … | 0.37 | 0.23 |
| C | 0.46 | 1.26 | 1.94 | 1.6 | 0.99 | 1.11 | … | 0.16 | 0.33 |
| D | 0.85 | 1.2 | 2.37 | 0.73 | 0.95 | 0.82 | … | 0.39 | 0.54 |
| E | 0.47 | 1.68 | 2 | 1.18 | 1.15 | 1.05 | … | 0.44 | 0.39 |
| F | 0.34 | 0.94 | 0.74 | 0.57 | 0.52 | 0.67 | … | 0.16 | 0.25 |
| G | 0.51 | 1 | 0.99 | 0.82 | 0.65 | 0.84 | … | 0.2 | 0.29 |
| H | 0.55 | 1.44 | 1.99 | 0.3 | 1.65 | 1.38 | … | 0.49 | 0.33 |
| J | 0.07 | 0.43 | 0.29 | 0.19 | 0.32 | 0.22 | … | 0.21 | 0.13 |
| K | 4.67 | 1.92 | 0.73 | 3.01 | 1.12 | 0.78 | … | 0.18 | 0.21 |
| L | 5.53 | 2 | 0.52 | 2.38 | 1.23 | 0.58 | … | 0.31 | 0.18 |
| M | 3.81 | 2.22 | 0.81 | 2.57 | 0.95 | 0.81 | … | 0.1 | 0.23 |
| N | 5.37 | 1.9 | 0.62 | 2.41 | 1.16 | 0.58 | … | 0.18 | 0.17 |
| P | 4.84 | 1.9 | 0.66 | 2.8 | 1.18 | 0.83 | … | 0.19 | 0.28 |
| R | 4.19 | 2.04 | 1.05 | 2.5 | 1.02 | 1.05 | … | 0.21 | 0.15 |
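The relation between Tables 11, 13 and 14 is a simple normalisation: each raw frequency is divided by the token count of its text category and multiplied by 1,000. The sketch below reproduces a few cells of Table 14 from the published counts; the function and variable names are illustrative.

```python
# Token counts from Table 11 and raw frequencies of 'said' and 'time' from Table 13.
tokens = {"A": 676_470, "E": 297_667, "J": 1_929_776, "K": 233_856}
raw = {
    "A": {"said": 2018, "time": 516},
    "E": {"said": 139, "time": 500},
    "J": {"said": 143, "time": 829},
    "K": {"said": 1093, "time": 449},
}


def per_thousand(raw_counts, token_counts, digits=2):
    """Convert raw frequencies into rates per 1,000 words for every text category."""
    return {
        text: {word: round(1000 * freq / token_counts[text], digits)
               for word, freq in words.items()}
        for text, words in raw_counts.items()
    }


rates = per_thousand(raw, tokens)
for text, words in rates.items():
    print(text, words)
```

The normalised values agree with the corresponding cells of Table 14, which shows how strongly the very large category J would otherwise dominate any comparison of raw counts.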
According to McEnery and Wilson (2001:89-90), correspondence analysis attempts to “summarise the similarities between larger sets of variables and samples in terms of a smaller number of ‘best fit’ axes”. Tabata (2006) used correspondence analysis to ascertain the similarities and differences between the inaugural addresses of U.S. presidents. His research showed that, 1. the presidents before 1900 form a group on the left side of the plot, whereas those after 1900 cluster together on the right side, 2. the I-style shifts to a we-style diachronically, 3. embedded relative clauses decrease, and, 4. genitive markers increase. Goto (2006) also attempted to classify sub-corpora in the BNC using correspondence analysis, based on the frequency table of nouns, verbs and adjectives. Figure 1 shows the text type plot, and Figure 2 the word plot, based on Table 13.
Figure 1: Text category plot based on 100 most common content words (verbs, adjectives and nouns)
Figure 2: The top 100 content word plot based on the 15 text types of the GR corpus
As can be seen in Figure 1, text J (Learned and Scientific Writing, that is to say, academic texts) is obviously dissimilar to the other 14 text categories. In addition, the data points in Figures 1 and 2 correspond to each other: text J lies in the top right quadrant of the text category plot (Figure 1), and the same quadrant of the content word plot (Figure 2) contains academic words such as different, form, system and used. It is also clear that text A (Press: Reportage) shows distinct features as opposed to the other text categories. However, although some text categories such as texts K, L, N and P cluster together, the 15 categories cannot be consistently divided into specific groups by correspondence analysis. Thus, I use principal components analysis and cluster analysis in the next section.
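The computation behind plots of this kind can be sketched with a standard singular value decomposition formulation of correspondence analysis. This is not necessarily the software or procedure used to produce Figures 1 and 2: the sketch assumes NumPy, uses only four text categories and the six leading words printed in Table 13, and will therefore not reproduce the published plots.

```python
import numpy as np


def correspondence_analysis(counts, n_axes=2):
    """Simple correspondence analysis of a contingency table via SVD.

    Returns row and column principal coordinates on the first n_axes axes
    and the principal inertias (squared singular values) of all axes."""
    N = np.asarray(counts, dtype=float)
    P = N / N.sum()                       # correspondence matrix
    r = P.sum(axis=1)                     # row masses
    c = P.sum(axis=0)                     # column masses
    expected = np.outer(r, c)
    S = (P - expected) / np.sqrt(expected)  # standardised residuals
    U, sigma, Vt = np.linalg.svd(S, full_matrices=False)
    rows = (U * sigma) / np.sqrt(r)[:, None]
    cols = (Vt.T * sigma) / np.sqrt(c)[:, None]
    return rows[:, :n_axes], cols[:, :n_axes], sigma ** 2


# Raw counts from Table 13: rows A, J, K, L; columns said, time, new, like, made, years.
labels = ["A", "J", "K", "L"]
table = [
    [2018, 516, 790, 313, 387, 455],
    [143, 829, 561, 372, 623, 416],
    [1093, 449, 170, 703, 261, 182],
    [1069, 387, 101, 459, 237, 112],
]
row_coords, col_coords, inertia = correspondence_analysis(table)
for label, xy in zip(labels, row_coords):
    print(label, np.round(xy, 3))
```

Plotting the row and column coordinates on the same pair of axes gives the paired text and word maps: texts with similar word profiles fall close together, and words lie in the direction of the texts that use them disproportionately often.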
According to Horvath (1985), principal components analysis (PCA) is a statistical method for arranging large arrays of data into interpretable patterns (Oakes 1998:103). The principal components are computed from the matrix of correlations between the variables, yielding “their eigenvalues (the amount of variance accounted for by each component) and the component loadings (how the variables correlate with the principal component)” (Oakes 1998:103). The analysis then plots the items in a two-dimensional space (as in correspondence analysis), so that more closely related items are plotted closer to each other than less closely related items. For example, Burrows and Hassal (1988) examined disputed authorship attribution using PCA, taking the top 50 high-frequency words as variables. They showed a relationship between the collocation of the top 50 high-frequency words and texts by creating a scatter plot from the PCA data.
Figure 3 shows the plot of the 15 text types produced by principal components analysis.
Figure 3: Text type plot based on principal components analysis
The result shows that the 15 text types can be divided into three major categories on the basis of closeness: 1. K, L, M, N, P, R; 2. A, F, G, J; and, 3. B, C, D, E, H. Category 1 consists of imaginative prose linked to literary works. Categories 2 and 3 can be categorised as informative prose. Thus, we can first distinguish automatically between informative and imaginative prose (that is, between academic and newspaper texts on the one hand and literature on the other). Informative prose is then divided into categories 2 and 3. Surprisingly, though, the three newspaper press texts in the GR corpus (A: Press Reportage; B: Press Editorial; and, C: Press Reviews) are split: text A is grouped in category 2, while texts B and C are in category 3. As category 2 also contains the academic text (text J), it is difficult to distinguish text A from text J using a corpus-driven approach. Thus, principal components analysis shows that the boundary between academic and newspaper texts is a grey zone compared with the boundary between literature and the other text genres. Texts H and J do not lie close to the other members of their categories and can therefore be treated as independent; strictly, the categories are, 1. K, L, M, N, P, R; 2. A, F, G; 3. B, C, D, E; 4. H; 5. J.
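The steps described by Oakes (correlation matrix, eigenvalues, loadings, and a two-dimensional plot of scores) can be sketched as follows. This is a minimal illustration assuming NumPy, using six text categories and four of the words printed in Table 14; it is not the computation behind Figure 3 and will not reproduce it.

```python
import numpy as np


def pca_correlation(X, n_components=2):
    """PCA computed from the correlation matrix of the columns of X.

    Returns all eigenvalues (largest first), the loadings of the variables on
    the first n_components components, and the scores of the rows."""
    X = np.asarray(X, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardise each variable
    R = np.corrcoef(X, rowvar=False)                   # correlations between variables
    eigvals, eigvecs = np.linalg.eigh(R)               # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    top = eigvecs[:, :n_components]
    loadings = top * np.sqrt(np.clip(eigvals[:n_components], 0, None))
    scores = Z @ top
    return eigvals, loadings, scores


# Rates per 1,000 words from Table 14: columns said, time, new, like.
texts = ["A", "B", "J", "K", "L", "N"]
rates = [
    [2.98, 0.76, 1.17, 0.46],
    [0.85, 1.58, 2.07, 1.16],
    [0.07, 0.43, 0.29, 0.19],
    [4.67, 1.92, 0.73, 3.01],
    [5.53, 2.00, 0.52, 2.38],
    [5.37, 1.90, 0.62, 2.41],
]
eigvals, loadings, scores = pca_correlation(rates)
print(np.round(eigvals / eigvals.sum(), 3))   # share of variance per component
for text, xy in zip(texts, scores):
    print(text, np.round(xy, 2))
```

Working from the correlation matrix rather than the covariance matrix gives every word equal weight regardless of its overall frequency, which is one reason rates rather than raw counts are the natural input here.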
Cluster analysis is a statistical methodology used to categorise individual objects into groups. According to Oakes (1998:116), cluster analysis starts with the overall cluster including all documents, and then sequentially subdivides it into the individual ones. The result of a cluster analysis is visualized as a tree diagram or dendrogram, which links similar texts by “draw[ing] vertical lines upwards from each node or document, then connect[ing] these vertical lines by a horizontal line at the point of similarity at the time the nodes are merged”. This method is known as “single linkage because clusters are joined at each stage by the single shortest or strongest link between them” (Oakes 1998:118-9; Anderberg 1973). Figure 4 shows the result of the cluster analysis of the 15 text types.
Figure 4: Cluster analysis of 15 text types based on 100 common words
The dendrogram in Figure 4 shows that the 15 text types can be divided into three large categories, 1. K, L, M, N, P, R; 2. A, F, G, J; 3. B, C, D, E, H. This result is the same as the result given by principal components analysis.
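Single-linkage clustering can be sketched in a few lines. The version below is a bottom-up (agglomerative) illustration with Euclidean distance, which is an assumption about the distance measure; it uses six text categories and the six leading words printed in Table 14 rather than all 100 words, so it does not reproduce Figure 4.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def single_linkage(points, n_clusters):
    """Repeatedly merge the two clusters joined by the shortest link between
    any pair of their members until n_clusters remain.

    Returns the final clusters and the list of merges with their link length."""
    clusters = [{label} for label in points]
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            link = min(euclidean(points[p], points[q])
                       for p in clusters[a] for q in clusters[b])
            if best is None or link < best[0]:
                best = (link, a, b)
        link, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), round(link, 3)))
        clusters[a] = clusters[a] | clusters[b]
        del clusters[b]          # b > a, so index a is unaffected
    return clusters, merges


# Rates per 1,000 words from Table 14: said, time, new, like, made, years.
rates = {
    "A": [2.98, 0.76, 1.17, 0.46, 0.57, 0.67],
    "J": [0.07, 0.43, 0.29, 0.19, 0.32, 0.22],
    "K": [4.67, 1.92, 0.73, 3.01, 1.12, 0.78],
    "L": [5.53, 2.00, 0.52, 2.38, 1.23, 0.58],
    "N": [5.37, 1.90, 0.62, 2.41, 1.16, 0.58],
    "P": [4.84, 1.90, 0.66, 2.80, 1.18, 0.83],
}
three, merges = single_linkage(rates, n_clusters=3)
two, _ = single_linkage(rates, n_clusters=2)
print(three)
print(merges)
```

The recorded merge heights are what a dendrogram draws: each merge becomes a horizontal line at the height of the link that joined the two clusters.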
Overall, the results of the correspondence analysis, principal components analysis and the cluster analysis allow us to categorise each text type into a genre using the corpus-driven method. Taking into account the common result from the three separate multivariate analyses, we can conclude that,
Thus, multivariate analyses such as CA, PCA and cluster analysis can make the elusive and abstract concept of ‘genre’ clearly visible, quantifying similarities and dissimilarities among text types.
Overall, this paper has shown that combining the corpus-driven approach with current statistical methods can reveal the linguistic components forming ‘genres’, show that the concept of ‘genre’ is based on internal criteria, and make the otherwise invisible similarities, dissimilarities and styles of text genres visible. More specifically, I have examined two main aspects of genre analysis and variation. The first is what sort of linguistic features can be revealed in the three genres through vocabulary and phrasal approaches; this analysis focused on the linguistic differences or features of the genre texts. The second is how these three major written genres can be identified, from a vocabulary viewpoint, on the basis of linguistic differences within 15 sub-divided text categories. Both approaches were conducted using a corpus-driven methodology.
The results of the current research can be divided into 6 summary points. First, basic statistics, lexical density and measurement of the vocabulary level revealed that newspaper English uses the most varied vocabulary, literary English an intermediate one, and academic English the smallest, but that academic and newspaper English include more difficult words than literary English.
Second, high frequency vocabularies, keyword analysis and n-gram analysis show that nouns were most often keywords in academic and newspaper texts, whereas literature had more verbs as keywords than any other language component; also, academic texts use a much more formal written style evidenced by the use of the it is construction and the fixed prepositional phrase, but literature texts use much more fixed, colloquial expressions expressing personal motions, actions, feelings, wishes and situations.
Third, the results of the investigations into the use of personality and passives support the general ideas stated by Kuo (1999), Coniam (2004) and Goatly (2000). However, to assess the idea maintained by Ivanic and Simpson (1992), Swales (1990) and Ard (1983), it would be necessary to examine specific individual papers rather than whole corpora.
Fourth, the result of investigation on hedging partly supports but also partly contradicts the theories given by Kuo (1999), Hyland (1994), Salager-Meyer (1994) and Coniam (2004). Specifically, the results of the present research present doubts on, 1) the data (of average frequency per 1,000 words) given by Coniam (2004), 2) the fact that the personal two-word hedge I believe shows an exception different from the ideas of Kuo (1999), Hyland (1994) and Salager-Meyer (1994), and 3) the method taken by the current research and by Coniam (2004) due to the fact that it is impossible to divide real hedging and non-hedging uses of hedging words in the list given by Hyland (2000).
Fifth, the result of my investigation of nominalization basically supports Biber et al. (1998), in that academic texts show nominalization at a higher ratio than the other genres, and the -ness form is used predominantly in the literature genre. However, as newspapers were not examined in Biber’s research, one new aspect of this genre could be revealed: while the newspaper corpus shows an overall ratio and distribution of nominalization similar to those of the GR corpus, the -ment form has its highest ratio in the newspaper corpus. The -ment nominalizations in newspaper texts include management, government, argument, investment, readjustment, replacement and others, and these words rarely occur in other genres. In addition, in common with academic texts, the -ity form is also important in newspaper texts.
Finally, the multivariate analysis shows quite a new aspect on obvious similarities and differences across 15 text types:
In summary, I have shown that these 3 genres have different vocabularies, phrases, text levels, varieties, styles and discourses. Therefore, it can be said that such differences form text genres and that a corpus-driven approach is a valid linguistic approach for the analysis of genre.
In later work, Biber changed the term ‘genre’ to ‘register’.