A Corpus-Driven Approach to Genre Analysis: The Reinvestigation of Academic, Newspaper and Literary Texts

Yasunori Nishina, University of Birmingham

Abstract

This paper investigates a number of text corpora belonging to different genres and applies various statistical methods to features extracted from them. Through this empirical analysis we can identify internal criteria which support the assignment of genres to texts, an assignment conventionally made on external grounds.

1. Introduction

New methods in corpus linguistics enable us to reassess, and to discover, the detailed linguistic characteristics of and differences across text genres which are yet to be described. The combination of corpus-driven methods and statistics allows automatic text matching across various anonymous texts, based on the frequency and distribution of content words. The present paper argues that an exhaustive corpus-driven approach, combined with statistics, is the most effective and sophisticated analytical method for comparing texts across genres; it does this by re-evaluating the achievements of its forerunners and by uncovering new facts from an exclusively empirical viewpoint. To achieve these aims, I compare texts of different genres, specifically literary, newspaper and academic texts, from the viewpoint of basic statistical data, text levels, vocabulary, phrases, personality, passives, hedges, nominalization and multivariate analysis. These linguistic features, extracted by a corpus-driven method, enable us to take a new step in evaluating the characteristics and patterns of language use in a specific discourse and text style. The paper is also intended as an investigation of the methodologies which make such an evaluation possible.

2. The Conception of Genre

The emergence of genre as a research topic has its origins in the 1960s and 70s, in the work of such researchers as Barber (1962), Herbert (1965), Ewer and Latorre (1969), Ewer (1971), Ewer and Hughes-Davies (1972), Lackstrom et al. (1972) and others. The basic description of ‘genre’ is given by Biber (1988: 70) as follows:

I use the term ‘genre’ to refer to categorizations assigned on the basis of external criteria. I use the term ‘text type’, on the other hand, to refer to groupings of texts that are similar with respect to their linguistic form, irrespective of genre categories.

More specifically, Biber (1993) defined appropriate descriptions of text genre and type. To summarise his idea, ‘genre’ is the variety of texts contained within a culture, such as scientific writing, science fiction, letters, press periodicals, conversation etc. Text types, on the other hand, differ at the linguistic level. Although genre is a more ambiguous concept than text type, it can be said that genre includes text type, or that genre is the superordinate concept of text type. However, numerous other researchers (e.g. Swales 1990) have also worked with the term ‘genre’, often with different conceptions of its nature. I believe that ‘genre’ can be based on internal rather than external criteria using a corpus-driven approach, and can be identified and codified through vocabulary, pattern and style. Thus, it could be said that, as an alternative to conceptions of genre as a priori listings of textual variety, genre can emerge as a topic for quantitative research in linguistics. This paper develops and investigates this idea: analysing sets of co-occurring linguistic features across genres enables us to ascertain the differences between genres and the characteristics of specific genres, particularly from a quantitative viewpoint.

For example, Biber (1988) adopted multidimensional analysis (hereafter MD) in order to make distinctions between genres and to discover their linguistic characteristics. Dimensions are sets of linguistic features that often co-occur in texts; Biber identifies six (strictly seven) such dimensions (or factors), numbered 1-6 (e.g., factor 2 includes ‘past tense’, ‘third person pronoun’ and ‘public verbs’). In addition, the MD approach is based on the idea that if some linguistic features occur frequently in a text, other features will appear less frequently in the same text (Biber 1993). For example, an MD analysis reveals that conversational texts are far more interactive and involved than academic texts, largely because the former allow little time for high information production, whereas the latter are produced with ample time for a high information content and are highly controlled. Although the MD approach is one of the most well-organised methods of genre analysis, various other approaches have also been suggested by other scholars (see section 3.1).

3. Research Procedures

3.1. Historical Methodology

Compared to the time when genre analysis originated, large genre-specific corpora are now available, enabling empirical and extensive genre analyses (e.g. Flowerdew 2002). Using these resources, I examine the characteristics of each genre in a more specific way, by looking at the behaviour of both words and phrases. Various researchers have already attempted genre and text analyses across a range of texts, in particular academic texts. Table 1 summarises the methods used by some of these researchers:

Table 1: Historical methods of text analysis
Research | Method
Ure (1971) | Lexical density
Leech and Svartvik (1975) | Passive in ‘impersonal writing’
MacDonald et al. (1982) | Readability statistics: sentence length, type:token ratios and FOG analyses
Makaya and Bloor (1987) | Hedging
Biber (1988) | Multi-dimensional approach
Forsyth and Holmes (1996) | Style markers: letters, most frequent words and digrams, two methods of most frequent substring selection; stylometry problems: authorship, chronology, subject matter
Baayen et al. (1996) | Vocabulary richness and the frequency of the top 50 high frequency words
Biber et al. (1998) | Features of academic text
Kuo (1999) | Personal pronouns
Hyland (2000), (2002) | Discourse-based features: hedges, boosters, metadiscourse markers, directives
Coniam (2004) | Content words, keywords, n-grams, personality, passives, hedges
Can and Patton (2004) | Word length, type length and token length

Another research methodology used for genre and text analysis was developed by Stamatatos et al. (2001). They adopted a variety of statistical measures in a discriminant approach to texts for the clarification of authorship: low-level measures such as sentence length and punctuation mark counts, a set of style markers from natural language processing, the percentage of rare or foreign words, and a measure indicating morphological ambiguity. Coniam (2004:288) also suggested other methodologies, yet to be taken up, including:

  1. word level: word counts, word frequency analysis, type token ratios and phrasal verb use;
  2. phrase level: verb tense, nominalization, modal group length and adverbial intensifying;
  3. sentence and discourse level: fronted subordinate clauses, use of passive, hedging, directives, transition features, approximates and author personality.

Following previous research in this area, I adopt a variety of methodologies in order to assess text style within genres:

  1. basic statistical data of lexical density and vocabulary level;
  2. high frequency vocabularies, keyword and n-gram;
  3. personality and passives;
  4. hedging;
  5. nominalization; and
  6. multivariate analysis.

3.2. Corpus Compilation

A general reference corpus includes academic texts, newspaper and literature as significant parts. For example, the Baby-BNC, a 4 million word corpus, is compiled from four sections: written academic prose, written fiction, written newspaper and spoken demographic, each of about 1 million running words. Written language texts can thus be divided into three broad categories: academic, newspaper and literature.

The corpora examined in this research are compiled from 6 pre-existing corpora: MicroConcord Corpora A and B; the Lancaster-Oslo/Bergen Corpus of British English (LOB); the Brown corpus (a standard corpus of present-day edited American English); the Freiburg-LOB Corpus of British English (FLOB); and the Freiburg-Brown Corpus of American English (Frown). The MicroConcord corpus is divided into two categories, A and B: MicroConcord A is a 1 million word corpus consisting of the British newspapers The Independent and The Independent Sunday, while MicroConcord B is a 1 million word corpus of academic articles published by the Oxford University Press. Brown, Frown, LOB and FLOB are well-balanced written American and British English corpora; each is compiled to the same standard, from 500 texts of 2,000 words representing 15 categories published between the 1960s and 1990s. They all include academic texts at 16% (that is, 160,000 running words); this material, known as ‘learned’ text (written texts on science and technology), forms text category J. Newspaper material (or press texts) always accounts for 17.6% (176,000 running words), in text categories A-C, and literary works, generally categorized as ‘imaginative prose’, for 25.2% (252,000 running words), in text categories K-R. I combined the matching parts of these corpora to create separate genre corpora. The sizes of the resulting genre corpora are as follows: academic corpus (MicroConcord B + text category J of the 4 corpora), 1,662,106 running words; newspaper corpus (MicroConcord A + text categories A, B and C of the 4 corpora), 1,760,664 running words; literature corpus (text categories K-R of the 4 corpora), 1,019,254 running words. The size of the general reference corpus derived from mixing the 4 corpora (hereafter referred to as the ‘GR’ corpus) was 4,071,830 running words.

4. Basic Statistical Data

Basic statistical data were used to investigate vocabulary variety and difficulty from an empirical viewpoint, in order to identify general differences between genres. The basic statistical data were retrieved from the texts and calculated using WordSmith (ver. 4.0), EXCEL and manual computation. Table 2 provides: 1. the number of tokens; 2. the number of types; 3. the standardised type/token ratio (S-TTR); 4. the Guiraud value; 5. the average word length; and, 6. the ratio of 1-4 letter words. These metrics should help us to understand the relative variety and difficulty of texts from a specific genre.

Table 2: Basic statistical data from each genre corpus
 | Academic | Newspaper | Literature | GR
Tokens (used for WS4 word list) | 1,662,106 | 1,760,664 | 1,019,254 | 4,071,830
Types (distinct words) | 47,481 | 59,222 | 38,181 | 87,727
S-TTR | 40.76 | 48.27 | 45.02 | 44.54
Guiraud value | 36.82 | 44.63 | 37.81 | 43.47
Mean word length | 4.86 | 4.78 | 4.30 | 4.68
Ratio of 1-4 letter words | 55.43% | 55.20% | 62.87% | 57.51%

The S-TTR indicates the degree of vocabulary variety in a corpus. In calculating this score, the type/token ratio is calculated for each 1,000 words (the standard value) across the entire corpus, and a running average is computed. A low value means that many of the same words are used repeatedly; a high value means that the texts include a variety of words, and fewer words are used repeatedly (cf. Help in WordSmith ver. 4.0). The ranked order of the S-TTR value for the genre corpora is, 1. newspaper, 2. literature and 3. academic. In addition, the Guiraud value estimates the same lexical aspect as the S-TTR: it is computed as “the score of types divided by the square root of tokens” (Ishikawa 2005:2), and the higher the Guiraud value, the greater the variety of vocabulary included in a text. Using this value to compare the corpora, the ranked order is again, 1. newspaper, 2. literature and 3. academic. Therefore, if these estimators of lexical density are used, both the S-TTR and the Guiraud value suggest that newspaper English uses the most varied vocabulary, literary English an intermediate one, and academic English the least varied.
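
As an illustration of how these two measures can be computed, the following is a minimal sketch in Python (not the WordSmith implementation itself); the simple tokenisation and the corpus file name are assumptions made for the example.

    import re

    def tokens_from(path):
        # crude tokenisation: lower-cased alphabetic words (an assumption,
        # not WordSmith's own word definition)
        with open(path, encoding="utf-8") as f:
            return re.findall(r"[a-z]+(?:'[a-z]+)?", f.read().lower())

    def s_ttr(tokens, chunk=1000):
        # mean type/token ratio over successive complete 1,000-word chunks,
        # expressed as a percentage (cf. the standardised TTR described above)
        ratios = [len(set(tokens[i:i + chunk])) / chunk
                  for i in range(0, len(tokens) - chunk + 1, chunk)]
        return 100 * sum(ratios) / len(ratios)

    def guiraud(tokens):
        # types divided by the square root of tokens (Ishikawa 2005)
        return len(set(tokens)) / (len(tokens) ** 0.5)

    toks = tokens_from("academic_corpus.txt")   # hypothetical file name
    print(f"S-TTR: {s_ttr(toks):.2f}  Guiraud: {guiraud(toks):.2f}")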

On the other hand, other statistical data such as the average word length and the ratio of 1-4 letter words provide a measure of the difficulty and style of a text from a different point of view: the difficulty of words, rather than their variety, is taken into account. Can and Patton (2004:62-63) recommend that “word length occurrence frequency information is a good measure to use in stylometric investigation”, and note that “one of the oldest style markers is word length”. Some researchers, however, have opposed the use of word length, for example in authorship studies: Holmes (1985) criticises the use of word length frequencies because of the characteristics of Zipf’s first law (Zipf 1932; Can and Patton 2004:63). Nevertheless, I consider that word length can be a useful index for investigating text difficulty and stylistics. The higher the average word length, the less readable the text; the inclusion of longer words is taken to mean, from a solely empirical perspective, that a text contains many difficult words. When comparing the genres on this measure, the ranked order is, 1. academic, 2. newspaper and 3. literature. Thus, academic is the genre that includes more difficult words than the other genres. This conclusion is also supported by the ratio of 1-4 letter words: a low value of this ratio represents a more difficult text. When comparing the genre corpora by this value, the order of difficulty is, 1. newspaper, 2. academic and 3. literature, although in fact the values for newspaper (55.20%) and academic (55.43%) texts show almost no difference. Thus, the values of mean word length and the ratio of 1-4 letter words suggest that academic texts are the most difficult and stylized, whilst literary texts are the easiest and least stylised at the vocabulary level. Text genres could therefore be categorized by such basic statistics of lexical density, given the differences described in this section.
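
The remaining measures in Table 2 can be sketched in the same way; the snippet below assumes the token list toks produced in the previous sketch.

    def mean_word_length(tokens):
        # average number of letters per word token
        return sum(len(t) for t in tokens) / len(tokens)

    def short_word_ratio(tokens, max_len=4):
        # percentage of tokens that are 1-4 letters long
        return 100 * sum(1 for t in tokens if len(t) <= max_len) / len(tokens)

    print(f"Mean word length: {mean_word_length(toks):.2f}")
    print(f"1-4 letter words: {short_word_ratio(toks):.2f}%")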

5. Measuring the Vocabulary Level

Research by Chujo (2004) used 100 word-span frequency word lists based on the British National Corpus (BNC) in order to measure the vocabulary levels of several texts. Short frequency lists are created by dividing the entire BNC frequency list into 100 word-span multiples from the top, e.g. 1-100 most frequent words, 1-200, 1-300 etc. These lists are then used to measure the cover rate of the types and tokens within each text in question to give an estimate of the level of text difficulty.

I use a method similar to that of Chujo (2004). The procedure followed can be divided into two steps: the first is separating the BNC written frequency list, provided by Adam Kilgarriff, into lists of 1,000 word multiples from the top rank (e.g., 1-1,000, 1-2,000, … , 1-10,000), and the second is comparing and measuring the cover rate of tokens between the BNC 1000 short span list and each genre corpus. By using the ‘match list’ function in WordSmith (ver. 4.0), words not matched with BNC short span lists were erased, and only words matched with the BNC short span list were extracted. Finally, the total number of tokens matched with the BNC short span list was calculated, and the cover rate of each corpus with the BNC short span list was computed using EXCEL. Table 3 shows the results of the investigation into cover rate.

Table 3: The cover rate of BNC frequency cut list in each genre corpus
BNC level | 1,000 | 2,000 | 3,000 | 4,000 | 5,000 | 6,000 | 7,000 | 8,000 | 9,000 | 10,000
Academic | 66.73 | 74.54 | 78.77 | 81.69 | 83.80 | 85.43 | 86.67 | 87.68 | 88.52 | 89.24
Newspaper | 64.74 | 72.22 | 76.78 | 79.75 | 81.95 | 83.63 | 84.98 | 86.05 | 86.94 | 87.68
Literature | 71.83 | 78.11 | 81.81 | 84.42 | 86.14 | 87.51 | 88.63 | 89.49 | 90.29 | 90.93
GR | 67.41 | 74.66 | 78.85 | 81.68 | 83.66 | 85.23 | 86.50 | 87.48 | 88.32 | 89.00
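
The coverage computation behind Table 3 can be sketched as follows; this is only an approximation of WordSmith’s ‘match list’ function, the frequency-list file name is an assumption, and toks is the token list from the earlier sketches.

    from collections import Counter

    def load_ranked_list(path):
        # one word per line, ordered by descending BNC frequency (assumed format)
        with open(path, encoding="utf-8") as f:
            return [line.split()[0].lower() for line in f if line.strip()]

    def coverage(tokens, ranked_words, top_n):
        # percentage of corpus tokens covered by the top_n reference words
        allowed = set(ranked_words[:top_n])
        freqs = Counter(tokens)
        matched = sum(n for w, n in freqs.items() if w in allowed)
        return 100 * matched / len(tokens)

    bnc_words = load_ranked_list("bnc_written_freqlist.txt")   # hypothetical file
    for level in range(1000, 11000, 1000):
        print(level, round(coverage(toks, bnc_words, level), 2))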

The cover rates in Table 3 show text difficulty from a vocabulary viewpoint. If the cover rate against the BNC list is higher for corpus A than for corpus B, corpus A is easier than corpus B with regard to vocabulary level. As can be seen in Table 3, when comparing the three genre texts, the ranked order is, 1. newspaper, 2. academic and 3. literature. This result largely matches the data given in the previous section (that is, newspaper and academic English are more difficult than literary English). The fact that newspapers emerge as more difficult than academic texts could be explained by the tendency of academic texts to use more set phrases than newspapers, rather than a wide variety of vocabulary, because the words used in set phrases consist largely of basic, high-frequency words such as function words.

In addition, the cover rate for each genre corpus shows slow growth after the 2,000-word BNC level. This is largely because the 2,000-word level is the most suitable limit for high-frequency words (Nation 2001:14). The size of the General Service List, the classic vocabulary list created by West (1953), is also 2,000 words, and this vocabulary size has been supported by many different researchers. Nation and Hwang (1995) maintained that a size of 2,000 words is still the best selection for English learners to memorize. For example, research conducted by Sutarsyah, Nation and Kennedy (1994) showed that the first 2,000 high-frequency words attained a text coverage of 82.5% in a single economics textbook, and Coxhead (1998) shows that the first 2,000 words cover 76.1% of an academic corpus. Thus, it might be possible to divide text genres using the vocabulary level.

6. Empirical Lexical Studies

6.1. High Frequency Vocabularies

In this section, I would like to compare in detail the characteristics of the vocabulary occurring in academic, newspaper and literary texts. Following the methodology of Coniam (2004), I extracted only content words. In order to remove function words, I first created a stop list of function words by reference to the website Funcword, and then applied it by automatic computation in WordSmith (ver. 4.0).
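
A minimal sketch of this content-word extraction is given below; the stop-list file stands in for the Funcword list and is an assumption, and toks is the token list used in the earlier sketches.

    from collections import Counter

    def load_stoplist(path):
        # one function word per line (assumed format)
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def top_content_words(tokens, stoplist, n=50):
        # frequency-ranked content words after removing function words
        counts = Counter(t for t in tokens if t not in stoplist)
        return counts.most_common(n)

    stoplist = load_stoplist("function_words.txt")   # hypothetical file
    for word, freq in top_content_words(toks, stoplist):
        print(word.upper(), freq)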

Table 4: Top 50 frequent content words from academic, newspaper and literature texts
Academic Newspaper Literature GR Academic Newspaper Literature GR
TIME SAID SAID SAID YEARS MAKE ROOM GO
SAME MR LIKE TIME MAKE GROUP DOOR RIGHT
MADE NEW TIME NEW STATE LONG FACE USED
NEW PAGE JUST LIKE GENERAL MARKET HEAD TAKE
WAY YEAR MAN MADE FOUND OWN KNEW COME
WELL TIME KNOW WELL JUST FOREIGN ASKED GREAT
LAW YEARS SEE YEARS IMPORTANT GOOD TOLD MEN
LIKE PEOPLE GO JUST NUMBER MAN HAND SAY
DIFFERENT LIKE WAY PEOPLE POINT JOHN SAY MR
CASE CENT WELL WAY LONG TAKE LEFT THOUGHT
WORK GOVERNMENT THOUGHT MAN POSSIBLE DAY LOOK PART
FORMULA HOME LOOKED GOOD GOOD WEST DAY HOUSE
USED MADE LITTLE WORK FACT LONDON PEOPLE CAME
FORM NEWS THINK SEE CASES PUBLIC MAKE USE
LIFE CITY COME LONG POLITICAL HIGH BECAUSE FOUND
GIVEN WORLD EYES MAKE HIGH PRESIDENT WANT THINK
PEOPLE PARTY MADE YEAR PARTICULAR SPORT TOOK HOME
USE JUST RIGHT KNOW LARGE END TURNED STATE
WORLD WELL GOING OWN CHANGE NATIONAL HOUSE HIGH
EXAMPLE WORK WENT LITTLE ORDER COMPANY TAKE END
PART BRITISH OLD LIFE LATER STATE SEEMED GOING
SOCIAL OLD CAME OLD DEVELOPMENT LITTLE SAW PLACE
OWN BUSINESS GOOD WORLD FURTHER HOUSE OWN LEFT
SEE WAY AWAY DAY GROUP LIFE NIGHT SMALL
SYSTEM WEEK LONG SAME WHETHER RIGHT FELT WENT

As can be seen in Table 4, each genre corpus shows the characteristics we might expect. For example, the top 50 content words in academic texts include law, different, case, formula, form, example, system, important, possible, development and others. These words are often used in academic text in an it is construction (e.g. it is important / possible / different to say) and also in multi-word units (e.g. in case of and be different from). The newspaper corpus shows government, news, world, business, market, public, president, national, company and others, which can be categorised as business, economics and politics vocabulary. The literature corpus includes know, thought, eyes, room, door, face, head, hand, look, people, house, night, felt and others, which can be categorised as ‘parts of the body’, ‘thoughts had by humans’ and ‘pertaining to the house’. These vocabularies are often used for describing people’s motions, actions and situations.

6.2. Keyword Analysis

Next, I would like to examine the keywords occurring in the different genre texts. This is partly because some researchers have doubts about the MD approach, commenting that keyword analyses can provide much the same results; e.g. McEnery and Xiao (2005:63) criticise the MD approach as follows:

MDA is undoubtedly a powerful tool in genre analysis. But associated with this power is complexity. The approach is very demanding both computationally and statistically in that it requires expertise not only in extracting a large number of linguistic features from corpora but also in undertaking sophisticated statistical analysis. … [U]sing the keyword function of WordSmith can achieve approximately the same effect as Biber’s MDA. This approach is less demanding as WordSmith can generate wordlists and extract keywords automatically.

In order to extract more genre-specific vocabulary, I use the log-likelihood score. The reason for using this statistic is, first, that the corpora differ in size, so raw frequencies cannot be compared directly; and second, that even if the occurrences of a word per 1,000 words are given, a comparison of the word between genres is still not sufficient, largely because it is impossible to say whether any divergence between genres is due to chance or is substantive (cf. Leech, Rayson and Wilson 2001:16). The log-likelihood score shows “how significantly characteristic or distinctive of a given variety of language a word is, when its usage in that variety is compared with its usage in another” (Leech, Rayson and Wilson 2001:16). Table 5 provides the top 50 key content words, each occurring at a keyness value of over 227, in each genre corpus.
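
Before turning to the table, the keyness statistic itself can be illustrated with a minimal sketch of the standard two-corpus log-likelihood (G2) computation; the example counts below are illustrative, not the paper’s exact figures.

    import math

    def log_likelihood(freq_study, size_study, freq_ref, size_ref):
        # two-corpus log-likelihood (G2) keyness score
        total = size_study + size_ref
        expected_study = size_study * (freq_study + freq_ref) / total
        expected_ref = size_ref * (freq_study + freq_ref) / total
        ll = 0.0
        if freq_study:
            ll += freq_study * math.log(freq_study / expected_study)
        if freq_ref:
            ll += freq_ref * math.log(freq_ref / expected_ref)
        return 2 * ll

    # keyness of a word occurring 350 times in the 1.66m-word academic corpus
    # against 120 occurrences in the 4.07m-word GR corpus (illustrative counts)
    print(round(log_likelihood(350, 1_662_106, 120, 4_071_830), 2))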

Table 5: The top 50 key content-words in academic, newspaper and literature texts (P < 0.05)
Academic Newspaper Literature
Order Key word Keyness Key word Keyness Key word Keyness
1  CELLS 1,004.23  PAGE 5,612.59  SAID 2,175.37 
2  LAW 892.39  MR 2,759.54  LOOKED 900.62 
3  GENES 820.39  YESTERDAY 2,607.54  EYES 777.93 
4  FORMULA 807.02  NEWS 1,997.35  LIKE 750.2 
5  INHIBITION 669.1  SPORT 1,806.12  DOOR 687.55 
6  LATENT 576.08  CENT 1,689.98  KNOW 685.23 
7  NUTTY 569.03  PER 1,136.05  KNEW 547.71 
8  STIMULUS 566.81  LAST 984.02  THOUGHT 530.62 
9  CASES 499.83  YEAR 937.85  GET 501.5 
10  CELL 498.77  CITY 905.12  ROOM 500.72 
11  LAUTREC 490.16  FOREIGN 859.66  FACE 491.43 
12  MDASH 482.33  SHARES 859.11  JUST 476.77 
13  DIFFERENT 466.95  HONG 713.49  TURNED 448.75 
14  CONTEXT 423.14  MARKET 697.67  GO 441.24 
15  ENGELS 419.11  SAID 659.62  GOING 440.72 
16  THEORY 414.38  PARTY 627.89  WENT 440.62 
17  BEHAVIOUR 400.35  KONG 618  MAN 436.31 
18  TRUST 396.76  DOLLARS 617.23  THINK 434.24 
19  MARX 396.69  BRITISH 598.7  HEAD 433.56 
20  CS 395.75  WEEK 585.07  AWAY 413.04 
21  CHAPTER 394.02  GOVERNMENT 551.65  OH 408.24 
22  SOLUTION 369.6  BUSINESS 534.41  TELL 382.26 
23  NAILS 359.36  RUGBY 495.28  ASKED 377.73 
24  INFECTION 357.17  CHAIRMAN 472.81  SAW 369.97 
25  EXAMPLE 355.33  CONFERENCE 470.84  SEEMED 368.19 
26  SUBJECTS 336.02  FOOTBALL 467.88  SEE 360.45 
27  FORM 332.83  PROFITS 460.83  STOOD 337.52 
28  CASE 332.58  ARCHITECTURE 444.95  COME 333.02 
29  SINGULARITY 325.46  BID 434.45  VOICE 332.36 
30  TRUSTS 323.27  WEST 432.24  FELT 328.58 
31  CONSENT 322.36  TEAM 417.98  CAME 319.42 
32  PATIENT 320.26  WEEKEND 399.53  LOOK 313.79 
33  EVOLUTION 318.77  SOVIET 388.6  SOMETHING 312.82 
34  HITLER 310.88  INTERNATIONAL 383.79  SAT 303.5 
35  STIMULI 310.2  MATCH 368.72  TOLD 295.28 
36  AUTHORITY 308.21  HOME 368.02  HAIR 279.5 
37  THERAPIST 307.83  UK 363.78  AROUND 279.36 
38  SOLUTIONS 307.61  LEAGUE 363.55  SMILED 278.76 
39  WILFRED 307.35  PLAYERS 362.65  HAND 276.91 
40  PARTICULAR 299.23  GAME 357.57  WANT 272.18 
41  PROPERTY 295.92  EAST 351.06  WALKED 269.69 
42  GENE 290.06  GROUP 348.61  RIGHT 254.81 
43  TESTATOR 284.45  BRITAIN 346.42  MOMENT 254.02 
44  EXPOSURE 283.3  EUROPEAN 338.94  GIRL 238.64 
45  CONDITIONING 281.88  CORRESPONDENT 338.78  MAYBE 238.06 
46  ROMAN 275.5  WIN 337.94  TOOK 233.23 
47  SPECIES 270.6  STAKE 337.4  MORNING 228.55 
48  EMBRYO 270.01  SHARE 331.29  NIGHT 227.99 
49  FIG 267.69  CUP 323.64  CAR 227.93 
50  GONORRHOEA 262.18  TALKS 318.13  LITTLE 227.53 

Compared with Table 4, Table 5 provides the more specific and authentic vocabulary occurring in each genre. Intuition suggests that the words in Table 5 should be more genre-specific and authentic than those in Table 4, but intuition alone cannot tell us which particular items occur in each genre. Thus, a corpus-driven keyword approach both confirms what we already sense and supplies the more specific knowledge.

Overall, the trends revealed by keyword analysis show that nouns were most often keywords in academic (e.g. stimulus, infection, property, species etc.) and newspaper (e.g. profit, architecture, game, European etc.) texts, whereas literature had more verbs as keywords than any other word class (e.g. said, looked, turned, went). Among the three lists of keywords in Table 5 (150 words altogether), there are no words common to all three genres, and only one word (said, in newspaper and literature) is shared by two genres. This suggests that keywords can give us information about the different vocabularies used in each genre text.

6.3. N-Gram Analysis

I would also like to compare multi-word units across the genre corpora, in particular the 4-word units occurring in each genre corpus. Coniam (2004) used KfNgram (Fletcher 2002) to compute 4-word units occurring in specific genre texts taken from applied linguistics articles. However, not only KfNgram but also some concordancing programs offer n-gram functionality, although under different names (e.g. ‘cluster’ in WordSmith, ‘wordgrams’ in KfNgram and ‘N-Gram’ in AntConc). I used WordSmith (ver. 4.0), developed by Mike Scott at the University of Liverpool, to calculate the most frequent 4-word units for my corpora. The cut-off point for detecting units was set at over 30 occurrences. Also, because raw phrase lists often include phrases containing numbers and error words (e.g. # PER CENT AND, A # YEAR OLD, # # AND # etc.), these were removed manually; such examples provide no information useful for the characterisation of text genres. Table 6 shows the comparable 4-word unit lists for the three genre corpora.
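
Before the table, a minimal sketch of the 4-word unit extraction is given below; the cut-off of 30 follows the text, the token list toks is assumed from the earlier sketches, and the digit filter mirrors the manual removal of number-containing phrases.

    from collections import Counter

    def four_grams(tokens, cutoff=30):
        # count contiguous 4-word sequences and keep the frequent, number-free ones
        grams = Counter(zip(tokens, tokens[1:], tokens[2:], tokens[3:]))
        kept = [(" ".join(g).upper(), n) for g, n in grams.items()
                if n > cutoff and not any(ch.isdigit() for ch in " ".join(g))]
        return sorted(kept, key=lambda x: -x[1])

    for phrase, freq in four_grams(toks)[:50]:
        print(phrase, freq)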

Table 6: The top 50 four-word units in each genre corpus
 Academic Newspaper Literature
Order Word Freq. Word Freq. Word Freq.
1  THE END OF THE 191  BUSINESS AND CITY PAGE 507  THE REST OF THE 74 
2  IN THE CASE OF 184  PER CENT OF THE 199  AT THE SAME TIME 72 
3  AT THE SAME TIME 174  FOR THE FIRST TIME 189  IN FRONT OF THE 69 
4  ON THE OTHER HAND 169  THE END OF THE 175  FOR THE FIRST TIME 66 
5  AT THE END OF 129  AT THE END OF 138  IN THE MIDDLE OF 58 
6  ON THE BASIS OF 121  THE REST OF THE 107  THE END OF THE 57 
7  AS A RESULT OF 120  AT THE SAME TIME 103  THE EDGE OF THE 56 
8  IN TERMS OF THE 106  SECRETARY OF STATE FOR 98  THE MIDDLE OF THE 52 
9  THE NATURE OF THE 87  IS ONE OF THE 92  AT THE END OF 47 
10  AS WELL AS THE 83  ONE OF THE MOST 89  THE SIDE OF THE 47 
11  THAT THERE IS A 77  AS A RESULT OF 86  THE BACK OF THE 43 
12  IN THE ABSENCE OF 76  A MEMBER OF THE 80  ON THE OTHER SIDE 42 
13  ONE OF THE MOST 75  THE SECRETARY OF STATE 73  THE TOP OF THE 41 
14  THE FACT THAT THE 74  WILL BE ABLE TO 71  WAS ONE OF THE 41 
15  IS LIKELY TO BE 73  IN THE UNITED STATES 67  THE OTHER SIDE OF 39 
16  PER CENT OF THE 73  VIEW FROM CITY ROAD 64  FOR A LONG TIME 37 
17  IN THE CONTEXT OF 72  BY THE END OF 63  HE WAS GOING TO 36 
18  IN THE FORM OF 72  ON THE OTHER HAND 62  OTHER SIDE OF THE 35 
19  IN THE UNITED STATES 72  THE FIRST TIME IN 60  AND THERE WAS A 34 
20  THE EXTENT TO WHICH 69  THE FACT THAT THE 59  I DON’T WANT TO 34 
21  FOR THE FIRST TIME 66  THE LABOUR PARTY CONFERENCE 59  IN FRONT OF HIM 34 
22  IT IS POSSIBLE TO 63  IN THE FIRST HALF 52  IT WOULD HAVE BEEN 32 
23  ON THE ONE HAND 63  IN THE CASE OF 50  ON THE OTHER HAND 32 
24  THE WAY IN WHICH 63  WAS ONE OF THE 50  I WANT YOU TO 30 
25  AT THE TIME OF 62  IS LIKELY TO BE 47  WAS GOING TO BE 30 
26  IT IS CLEAR THAT 59  IN THE FORM OF 46  I DON’T KNOW WHAT 29 
27  IN THE COURSE OF 56  PER CENT IN THE 46  IF YOU WANT TO 29 
28  THE REST OF THE 56  THE UNITED STATES AND 46  THE TWO OF THEM 29 
29  IT IS IMPORTANT TO 55  IN THE MIDDLE OF 45  BUT THERE WAS NO 28 
30  IT IS DIFFICULT TO 53  ON THE BASIS OF 45  ON THE EDGE OF 28 
31  AS WE HAVE SEEN 52  AS WELL AS THE 43  THE BACK OF HIS 28 
32  AT THE BEGINNING OF 52  ONE OF THE FEW 43  AT THE TOP OF 27 
33  THE DEVELOPMENT OF THE 52  AS ONE OF THE 42  FROM TIME TO TIME 26 
34  THE CASE OF THE 51  IN THE WAKE OF 42  IN THE LIVING ROOM 26 
35  IN THE PRESENCE OF 50  THE BANK OF ENGLAND 42  ARE YOU GOING TO 25 
36  TO THE EXTENT THAT 49  IN THE FACE OF 41  NOTHING TO DO WITH 25 
37  ON THE PART OF 48  PER CENT STAKE IN 41  THE BOTTOM OF THE 25 
38  THE BEGINNING OF THE 48  AT THE AGE OF 39  TURNED OUT TO BE 25 
39  THE EXISTENCE OF A 47  IN THE SECOND HALF 38  YOU WANT ME TO 25 
40  THE TIME OF THE 46  WILL HAVE TO BE 38  IN THE FIRST PLACE 24 
41  TO BE FOUND IN 46  OF THE UNITED STATES 37  IT WAS AS IF 24 
42  A LARGE NUMBER OF 45  AT A TIME WHEN 36  HE LOOKED AT HER 23 
43  IT IS NECESSARY TO 45  IN AN ATTEMPT TO 36  IT HAD BEEN A 23 
44  TO BE ABLE TO 45  IN THE FIRST PLACE 36  ON THE BACK OF 23 
45  OF THE F HRER 44  THE BEGINNING OF THE 36  SHE WAS GOING TO 23 
46  THE BASIS OF THE 44  IN THE YEAR TO 35  THE FRONT OF THE 23 
47  BE FOUND IN THE 43  THE HEAD OF THE 35  TO BE ABLE TO 23 
48  IN THE SENSE THAT 42  THE FIRST HALF OF 34  AS IF HE WERE 22 
49  IN THIS CASE THE 42  AT THE BEGINNING OF 33  AT THE FAR END 22 
50  IS ONE OF THE 42  ON BEHALF OF THE 33  IN FRONT OF HER 22 

N-grams are able to identify the commonest collocations in a discourse far more effectively than a single-word analysis. As can be seen in Table 6, there is an overall tendency toward using multi-word fixed units in academic texts as opposed to the other genres. One of the specific characteristics that can be seen only in academic texts is the use of the it is construction, such as it is possible to, it is clear that, it is important to and it is difficult to. In addition, academic texts show many prepositional phrases as opposed to the other two genres; for example, when focusing on in * of phrases, academic texts contain many examples, e.g. in the case of, in terms of the, in the absence of, in the context of, in the form of, in the course of and in the presence of. The 4-word units in the newspaper corpus, on the other hand, include many government-associated and economic phrases, such as secretary of state for, the secretary of state, in the united states, the labour party conference, the united states and, the bank of England and of the united states. Also, the general purpose of newspapers is to function as a medium for relaying facts about what is happening in the world; such a function can be detected in the phrase the fact that the. The 4-word units most frequently occurring in the literature corpus, by contrast, tend to contain colloquial expressions describing people’s thoughts, ideas, feelings, wishes, actions and motions with the use of pronouns, in phrases such as he was going to, I don’t want to, I want you to, I don’t know what, if you want to, are you going to, you want me to, he looked at her, she was going to, as if he were and in front of her. These differences in vocabulary and phrasal units would also play an important role in categorizing text genre.

7. The Examination of Personality: I and We and Passives

One of the principal investigators of the use of personality in texts is Kuo (1999), who researched the use of personal pronouns in academic texts from an empirical viewpoint. The use of a personal pronoun creates an interpersonal interaction between the writer and the readers (Kuo 1999:123). For example, Rounds (1987) shows that teachers tend not to use third-person pronouns but rather first-person pronouns used in a sense that includes the third person. Personality as a linguistic aspect can thus contribute much to the pragmatic analysis of written texts.

In general, academic texts tend not to use personal constructions such as I (and my and me) but instead use we to “reduce personal attribution” (Kuo 1999:125). This is because they discuss and argue from an objective, not a subjective, perspective; that is, the data speak for themselves. The function of we can be divided into two categories depending on the context, inclusive and exclusive: the former includes the target readers (or hearers) while the latter does not (Kuo 1999:126). In addition, the use of we as opposed to I in academic texts implies that ‘the author’ and ‘the reader’ or ‘other researchers’ agree to follow the process of the argument, and it provides a more ‘objective’ discussion through being inclusive. This provides an environment in which there is greater contact and greater solidarity between writer and reader, whereas the use of I creates an environment which is more informal, individual and personal (Coniam 2004:283).

Similarly, in academic texts the passive construction tends to be overused relative to the active construction, largely because the former is more impersonal. For example, Kuo (1999:122) notes that scientific articles are usually thought to be impersonal and tend to use nominalisation and the passive voice to achieve this effect. Goatly (2000:94) proposed that impersonal constructions, such as passives or nominalisation, create a “more distant authorial position with the effect that they reduce personality”; that is to say, they add objectivity.

In order to reveal such subjective and objective stylistic characteristics in the genres, I investigate the use of personality and passives in each genre corpus. Since the word I sometimes appears in lower case, as the i in i.e., I distinguished the pure I used as a personal subject from other forms by manual editing. The passive construction was examined by annotating the raw corpus files with POS tags, using Brill’s tagger, and then searching for the combination of a ‘be verb + verb past participle’. Table 7 shows the number of occurrences and the frequency ratios per 1,000 words for I, we and passives in each genre corpus (also see the I and we word list rankings).
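
A minimal sketch of the passive search is given below; NLTK’s default part-of-speech tagger stands in here for Brill’s tagger (so the counts would differ somewhat from Table 7), and toks is the token list assumed in the earlier sketches.

    import nltk   # requires the 'averaged_perceptron_tagger' model to be downloaded

    BE_FORMS = {"be", "am", "is", "are", "was", "were", "been", "being"}

    def count_passives(tokens):
        # count sequences of a 'be' form immediately followed by a past participle (VBN)
        tagged = nltk.pos_tag(tokens)          # Penn Treebank tags
        return sum(1 for (w1, _), (_, t2) in zip(tagged, tagged[1:])
                   if w1.lower() in BE_FORMS and t2 == "VBN")

    passives = count_passives(toks)
    print(f"passives: {passives} ({1000 * passives / len(toks):.2f} per 1,000)")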

Table 7: The difference of the use of personality and passives
 | I | We | Passive
Academic | 3,028 (55th) / 1.82 per 1,000 | 4,641 (35th) / 2.79 per 1,000 | 28,613 / 17.21 per 1,000
Newspaper | 4,450 (40th) / 2.52 per 1,000 | 3,182 (52nd) / 1.80 per 1,000 | 17,400 / 9.88 per 1,000
Literature | 14,508 (9th) / 14.23 per 1,000 | 2,645 (53rd) / 2.59 per 1,000 | 13,561 / 13.30 per 1,000
GR | 24,447 (20th) / 6.00 per 1,000 | 11,095 (40th) / 2.72 per 1,000 | 61,827 / 15.18 per 1,000

The frequency ratios in Table 7 show that literature (14.23) tends to overuse I, while academic texts (1.82) and newspapers (2.52) tend to under-use it; academic texts (2.79) and literature (2.59) use we more than newspapers (1.80). For the passive voice, academic texts (17.21) use it much more than literature (13.30) and newspapers (9.88). Thus, there is an overall tendency for academic texts to be more impersonal in nature than the other genres, because we and the passive voice are used much more often whilst I is used much less than in the other two genres. Literary texts can be said to be more personal in nature because they use I more often than the other genres do. Moreover, newspaper texts tend not to use I, we or the passive voice, largely because the style of newspapers is impersonal but direct, favouring the active voice in order to report news and events happening in daily life. The tagged corpus therefore opens up possibilities for text analysis that are impossible with plain text alone.

From intuition, and from the data in Table 7, it is well accepted that academic texts are usually more impersonal and formal than popular writing such as newspapers and magazines. However, some linguists maintain that this does not hold across the variety of academic writing styles (e.g. Coniam 2004:274). Ivanic and Simpson (1992:167) showed how writers evolve their own academic styles. Swales (1990:128) also proposed that well-organised and high-quality academic writing does not always follow expected or accepted linguistic or rhetorical conventions. For example, Ard (1983) pointed out that Chomsky’s writing has changed, using the first-person pronoun much more often in his later than in his earlier writing. Therefore, the results in Table 7 should be seen as general aspects of the use of personality in each genre, and may not always be applicable to individual works. Nevertheless, these significant differences in the use of personality show the quite different styles used in each text genre, suggesting that the occurrence of personality is also a key factor for categorising text genres.

8. Hedging

Hyland (2000:188-189) provides 108 hedges indicating doubt or certainty. I now compare the three corpora using these hedges and investigate their occurrence in each genre corpus. The 108 hedges can be divided into two groups: single words (93 words) and two-word units (15 units). Table 8 shows the occurrence of all 93 single-word hedges per 1,000 words in each corpus.

Table 8: The occurrence of single word hedges in each corpus
Corpus Amount of hedging
Academic 25336 (15.24/1,000)
Newspaper 18405 (10.45/1,000)
Literature 13518 (13.26/1,000)
General Reference 51572 (12.66/1,000)
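
The hedge counts in Tables 8 and 9 amount to counting a list of words and short phrases per 1,000 words; a minimal sketch follows, using only a small illustrative sample of Hyland’s 108 items rather than the full list, and the token list toks from the earlier sketches.

    def count_phrase(tokens, phrase):
        # occurrences of a (possibly multi-word) phrase in the token list
        words = phrase.lower().split()
        n = len(words)
        return sum(1 for i in range(len(tokens) - n + 1)
                   if tokens[i:i + n] == words)

    HEDGES = ["about", "may", "seem", "suggest",            # single-word examples
              "a certain", "in general", "more or less",    # multi-word examples
              "not necessarily", "seen as"]

    total = sum(count_phrase(toks, h) for h in HEDGES)
    print(f"{total} hedges ({1000 * total / len(toks):.2f} per 1,000)")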

There are some doubts about the data given by Coniam (2004) on the use of hedges. His research shows that the occurrence of hedges in two academic corpora compiled from applied linguistics articles is 1.85 and 0.76 per 1,000 words respectively. However, Hyland’s hedges include many high-frequency words such as about, may, doubt, seem, suggest and others. For example, about occurs 2,504 times in the academic texts used in this research, which by itself already amounts to 1.50 occurrences per 1,000 words. Thus, the figures reported in Coniam (2004) may be miscalculated in this respect.

Next, I show the occurrences of 15 hedges, each consisting of two words, using AntConc (ver. 3.2.0) as follows:

Table 9: The occurrence of 15 two-word hedges in each genre corpus
  Academic Newspaper Literature GR
a certain 172  75  57  288 
certain extent
consistent with 50  13  58 
general sense
I believe 37  50  51  168 
I claim
in general 165  40  201 
in theory 28  18  39 
more or less 69  32  16  95 
not always 82  36  10  108 
not necessarily 89  32  103 
open to question
our belief
provided that 34  37 
seen as 124  81  110 
Total 867(0.52/1,000) 387(0.21/1,000) 159(0.15/1,000) 1229(0.30/1,000)

As Table 9 shows, academic texts tend to overuse two-word hedges as well, compared with the other two genre corpora. The total score for academic texts (0.52 / 1,000) is over twice that of newspapers (0.21 / 1,000), and over three times that of literary texts (0.15 / 1,000). As researchers such as Salager-Meyer (1994) and Hyland (1994) suggest, hedging is often used in academic discourse, and the results given in Tables 8 and 9 support this idea. This is largely because, as Kuo (1999:133) puts it, by “showing modesty by tentative statements and inviting readers to draw inferences by themselves, hedging assists writers to avoid overstating an assertion and to establish a relationship with readers”.

However, there are exceptions. For example, this cannot be said of I believe, because this hedge occurs at 0.022 / 1,000 in academic texts, 0.028 / 1,000 in newspapers, 0.05 / 1,000 in literature and 0.041 / 1,000 in the general reference corpus. The low score for academic texts is closely connected with the use of I in this hedge. Moreover, there may be flaws in the methodology used to investigate hedging, because some hedges listed by Hyland (2000), for example might or wrongly, “may well have been used in contexts where they have meanings other than those for hedging purposes” (Coniam 2004:287). Therefore, as long as semantic annotation is not adopted, it is difficult to obtain accurate data on ‘real’ hedges from corpora automatically. On the whole, however, these significant differences in the frequency of hedges can be used as one of a set of indices informing us about the differences amongst text genres and the categorisation of text genres and styles.

9. Nominalization

In this section, I focus on the distribution of nominalization in each genre corpus. Biber et al. (1998:58) suggest that “studying a morphological characteristic in a corpus can teach us both about the frequency and distribution of the characteristic and about the differing functions of particular variants”. They examined the use of nominalization in academic prose, fiction and speech by comparing, 1. the frequency of nominalization per one million words in each genre, and, 2. the proportion of nominalizations formed with each suffix in each genre. Following this methodology, I investigate the use of nominalization in the genre corpora. Here, I follow Biber et al. (1998) and treat as nominalizations the forms ending in -tion / -sion, -ness, -ment and -ity, including plural forms. This is largely because, for example, “a search for all words ending in -ion would locate many words that were not nominalizations (e.g. cushion, dandelion). In contrast, searching for -sion provides a much more accurate identification of nominalizations (e.g., decision, division, discussion, expansion, extension, submission), although a few inaccurate items will still be included (e.g. mansion)” (Biber et al. 1998:59). The following table shows the frequency distribution of nominalizations across the three genres.
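
Before the table, a minimal sketch of these suffix counts is given below; as the quotation above notes, a few false hits (e.g. mansion) will remain, and the token list toks is assumed from the earlier sketches.

    import re
    from collections import Counter

    SUFFIX_PATTERNS = {
        "-tion": r"tions?$",
        "-sion": r"sions?$",
        "-ness": r"ness(es)?$",
        "-ment": r"ments?$",
        "-ity":  r"(ity|ities)$",
    }

    def nominalization_counts(tokens):
        # count tokens ending in each nominalizing suffix (plurals included)
        counts = Counter()
        for t in tokens:
            for label, pattern in SUFFIX_PATTERNS.items():
                if re.search(pattern, t):
                    counts[label] += 1
                    break
        return counts

    counts = nominalization_counts(toks)
    total = sum(counts.values())
    for label, n in counts.items():
        print(label, n, f"{100 * n / total:.0f}%")
    print("total per 1,000:", round(1000 * total / len(toks), 2))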

Table 10: The comparison of nominalization across registers on occurrence and proportion
  Academic Newspaper Literature GR
-tion 29160 (55%) 17491 (43%) 4001 (41%) 45735 (50%)
-sion 4390 (8%) 4224 (10%) 998 (10%) 8112 (9%)
-ness 2081 (4%) 2717 (7%) 1339 (14%) 5452 (6%)
-ment 8036 (15%) 8757 (22%) 1998 (20%) 16344 (18%)
-ity 9222 (18%) 7381 (18%) 1476 (15%) 16146 (17%)
Total 52889 (31.82 per 1,000) 40570 (23.04 per 1,000) 9812 (9.62 per 1,000) 91789 (22.54 per 1,000)

Not only the overall frequency but also the number of different nominalizations occurring in each genre is important information when considering linguistic characteristics. As can be seen in Table 10, the -tion / -sion forms together account for more than half of the nominalizations in each register: 63% in academic texts, 53% in newspapers, 51% in literary works and 59% in general reference texts. When looking at the other forms of nominalization, however, differences across the genres emerge. Academic texts tend to use nominalizations ending with (in ranked order) -ity and -ment, but -ness at a much lower frequency. Newspapers show a similar use of nominalization to academic texts, except that the -ment form surpasses the -ity form. For example, the -ment nominalizations in newspaper texts include management, government, argument, investment, readjustment, replacement and others; in fact, these words rarely occur in the other genres. Literary works, on the other hand, use these three nominalizations almost equally, and the -ness suffix in particular is more important in this genre than in the others. Biber et al. (1998:65) touched on this point as follows:

The -ness ending generally converts adjectives into nouns that often describe personal qualities. Fiction uses a number of these -ness nouns that are rarely found in the other registers: awareness, bitterness, darkness, goodness, happiness, politeness, weakness.

Moreover, the occurrence of nominalization across genres also shows that academic texts use nominalizations more than three times as much as literature does, while newspapers show almost the same trend as the general reference corpus; literature, by contrast, tends not to use them much compared to the other genres. The comparison between academic, fiction and speech texts by Biber et al. (1998:60) gave a similar result to this analysis: they found that “while fiction and speech have similar frequencies, academic prose has a frequency almost four times greater.” Therefore, the amount of nominalization also plays an important role in identifying and representing the different styles of text genres.

10. Multivariate Analyses

In this section, I categorize the sub-divided text types into the three genres automatically, from a vocabulary, text-analytic and statistical perspective. Various kinds of text analysis exist depending on the research purpose, e.g. authorship attribution, stylistics, text typology and variation studies such as register variation, regional variation, social variation, authorial variation, chronological variation etc. For example, Burrows (1987) conducted critical research on Jane Austen’s novels by using multivariate analysis of the 12-60 most common words; this examination investigated Austen’s narrative style, character differentiation through idiolects and free indirect discourse. Such work shows that multivariate approaches enable us to use computers to assist with literary criticism and literary and linguistic stylistics, and with identifying stylistic ‘fingerprints’, authorship attribution, stylistic imitation and register variation (Tabata 2002). This section focuses on the investigation of genre (or register variation) using multivariate analyses.

10.1. Multivariate Analysis

The GR corpus is divided into 15 text categories: A (Press: Reportage); B (Press: Editorial); C (Press: Reviews); D (Religion); E (Skills, Trades and Hobbies); F (Popular Lore); G (Belles Lettres, Biography and Essays); H (Miscellaneous: Government Documents, Industrial Reports, etc.); J (Learned and Scientific Writings); K (General Fiction); L (Mystery and Detective Fiction); M (Science Fiction); N (Adventure and Western Fiction); P (Romance and Love Story); R (Humour). Table 11 gives the token counts for each text category in the GR corpus.

Table 11: The tokens for each of the 15 text categories in the GR corpus
  Text Words   Text Words   Text Words
1  A 676,470  6  F 702,362  11  L 193,191 
2  B 217,394  7  G 947,602  12  M 48,275 
3  C 137,788  8  H 243,692  13  N 233,556 
4  D 137,018  9  J 1,929,776  14  P 233,920 
5  E 297,667  10  K 233,856  15  R 72,533 

Now I categorize the 15 text categories into the three genres (academic, newspaper and literature) using a corpus-driven multivariate analysis. A general significance test such as the chi-squared test cannot capture characteristics based on the complex inter-relationships across large numbers of texts, and so we need to adopt a multivariate approach to perform such computationally intensive studies. McEnery and Wilson (2001:88) summarise the necessity of multivariate analyses as follows:

[It] would not be possible using tests such as the chi-squared test to examine the vocabulary relations between five different genres, except on a word-by-word basis. To perform such holistic comparisons for large numbers of variables we need a different type of statistical technique — a multivariate one.

It is significant that Biber’s MD approach works at a hyper-textual level: dimension scores are calculated for each text on each dimension, and the mean of each dimension score for each genre is then computed to enable the characterization of any given text or genre. As a result, the output shows that genres can be very similar on one dimension while markedly different on others. Although multivariate analysis is similar to the MD approach, strictly speaking the two differ theoretically. Multivariate analyses are computed from a (calculated) value derived from a cross-tabulation, and their purpose is to show the statistical similarities and differences across the various sample categories. Thus, a multivariate analysis is an analysis of a large number of linguistic features across many texts and text types using statistical techniques. It is used for various purposes, such as the linguistic analysis of texts, genres, text types, styles or registers; the underlying assumption is that different kinds of text differ in their functions at the linguistic level, and multivariate analyses make it possible to examine these differences from a quantitative aspect. As noted, Biber’s MD approach is similar to a multivariate analysis in being based on the assumption that multiple parameters of variation will be operative in any discourse domain. However, the MD approach, being divided into six dimensions, is more complex and awkward to compute; for this reason, this paper utilises a multivariate analysis. Various multivariate analyses are available, such as principal components analysis, factor analysis, correspondence analysis and cluster analysis.

10.2. The Basic Data

In conducting the multivariate analysis, the top 100 high-frequency content words are used in the present research. Content words are used because of the general linguistic tendency for function words to occur at high ranks in any genre, so that function words do not provide significant differences across text genres. Table 12 shows the top 100 content words occurring in the 4-million-word GR corpus.

Table 12: The top 100 content words (verbs, nouns and adjectives) in the GR corpus
1  SAID 21  WORLD 41  HIGH 61  LATER 81  WHITE
2  TIME 22  DAY 42  END 62  GENERAL 82  WOMEN
3  NEW 23  SAME 43  GOING 63  YOUNG 83  FACE
4  LIKE 24  GO 44  PLACE 64  CALLED 84  IMPORTANT
5  MADE 25  RIGHT 45  LEFT 65  AMERICAN 85  SYSTEM
6  YEARS 26  USED 46  SMALL 66  LOOK 86  NIGHT
7  PEOPLE 27  TAKE 47  WENT 67  NEED 87  EYES
8  WAY 28  COME 48  COURSE 68  POINT 88  HALF
9  MAN 29  GREAT 49  WAR 69  ASKED 89  THINGS
10  GOOD 30  MEN 50  GOVERNMENT 70  CHILDREN 90  DIFFERENT
11  WORK 31  SAY 51  HAND 71  WANT 91  LOCAL
12  SEE 32  THOUGHT 52  PUT 72  ROOM 92  BEST
13  LONG 33  PART 53  NUMBER 73  FIND 93  POWER
14  MAKE 34  HOUSE 54  TOLD 74  HEAD 94  DAYS
15  YEAR 35  CAME 55  FACT 75  SCHOOL 95  NATIONAL
16  KNOW 36  USE 56  SET 76  LARGE 96  SIDE
17  OWN 37  FOUND 57  PUBLIC 77  WATER 97  SOCIAL
18  LITTLE 38  THINK 58  CASE 78  BETTER 98  FORM
19  LIFE 39  HOME 59  GIVEN 79  GIVE 99  POSSIBLE
20  OLD 40  STATE 60  TOOK 80  LOOKED 100  EARLY

Now I compute the raw frequencies and the ratios per 1,000 words of the top 100 content words in Table 12 for each text category, in order to conduct the multivariate analyses. Correspondence analysis can be computed with raw frequency data because it is based on a pattern-matching system. However, principal components analysis and cluster analysis should be based on the ratio of frequency. Tomoji Tabata (personal communication, 12/11/2006), Osaka University, Japan, comments that, as correspondence analysis maximizes the inter-correlation matrix in its computation, the results gained from raw frequencies and from ratios of frequency (e.g. per 1,000 or per 1 million) are almost identical with respect to high-frequency linguistic items. On the other hand, as principal components analysis and cluster analysis are based on correlation and covariance coefficients, these two analyses should not be computed on raw frequencies; this is largely because differing corpus sizes affect the outcome of the computation, a problem which would not arise with equal-sized corpora. Table 13 shows the raw frequencies of the top 100 content words, and Table 14 the ratios of frequency of the top 100 content words per 1,000 words.

Table 13: The cross-tabulation of the raw frequency of the top 100 content words
  said (1)  time (2)  new (3)  like (4)  made (5)  years (6)  possible (99)  early (100)   [columns for ranks 7-98 omitted]
A 2,018  516  790  313  387  455  90  125 
B 184  344  451  252  195  293  80  50 
C 64  173  267  221  137  153  22  45 
D 116  165  325  100  130  113  54  74 
E 139  500  594  350  342  313  132  117 
F 238  663  521  401  367  473  114  176 
G 486  950  934  779  618  797  194  276 
H 135  350  485  74  403  337  120  80 
J 143  829  561  372  623  416  408  260 
K 1,093  449  170  703  261  182  43  49 
L 1,069  387  101  459  237  112  59  35 
M 184  107  39  124  46  39  5  11 
N 1,254  443  145  562  271  135  41  40 
P 1,133  444  155  654  276  195  44  66 
R 304  148  76  181  74  76  15  11 

Table 14: The cross-tabulation of the ratio of frequency of the top 100 content words per 1,000 words
  said (1)  time (2)  new (3)  like (4)  made (5)  years (6)  possible (99)  early (100)   [columns for ranks 7-98 omitted]
A 2.98  0.76  1.17  0.46  0.57  0.67  0.13  0.18 
B 0.85  1.58  2.07  1.16  0.9  1.35  0.37  0.23 
C 0.46  1.26  1.94  1.6  0.99  1.11  0.16  0.33 
D 0.85  1.2  2.37  0.73  0.95  0.82  0.39  0.54 
E 0.47  1.68  2  1.18  1.15  1.05  0.44  0.39 
F 0.34  0.94  0.74  0.57  0.52  0.67  0.16  0.25 
G 0.51  1  0.99  0.82  0.65  0.84  0.2  0.29 
H 0.55  1.44  1.99  0.3  1.65  1.38  0.49  0.33 
J 0.07  0.43  0.29  0.19  0.32  0.22  0.21  0.13 
K 4.67  1.92  0.73  3.01  1.12  0.78  0.18  0.21 
L 5.53  2  0.52  2.38  1.23  0.58  0.31  0.18 
M 3.81  2.22  0.81  2.57  0.95  0.81  0.1  0.23 
N 5.37  1.9  0.62  2.41  1.16  0.58  0.18  0.17 
P 4.84  1.9  0.66  2.8  1.18  0.83  0.19  0.28 
R 4.19  2.04  1.05  2.5  1.02  1.05  0.21  0.15 
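
The normalisation that takes Table 13 to Table 14 is simply a division by the category’s token count (Table 11), scaled to 1,000 words; a minimal sketch with a small excerpt of the actual figures is given below.

    # illustrative excerpt of Table 13 (word -> {text category: raw frequency})
    raw = {
        "said": {"A": 2018, "J": 143, "K": 1093},
        "possible": {"A": 90, "J": 408, "K": 43},
    }
    tokens = {"A": 676470, "J": 1929776, "K": 233856}   # token counts from Table 11

    per_1000 = {w: {c: round(1000 * f / tokens[c], 2) for c, f in row.items()}
                for w, row in raw.items()}
    print(per_1000["said"]["A"])   # 2.98, matching Table 14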

10.3. Correspondence Analysis

According to McEnery and Wilson (2001:89-90), correspondence analysis is designed to attempt to “summarise the similarities between larger sets of variables and samples in terms of a smaller number of ‘best fit’ axes”. Tabata (2006) used correspondence analysis to ascertain the similarities and differences between the inaugural addresses of U.S. presidents. His research showed that, 1. the presidents before 1900 form a group on the left side of the plot, while those after 1900 cluster on the right side, 2. an I-style shifts diachronically into a we-style, 3. embedded relative clauses decrease, and, 4. genitive markers increase. Goto (2006) also attempted to classify sub-corpora of the BNC using correspondence analysis, based on a frequency table of nouns, verbs and adjectives. Figure 1 shows the text type plot, and Figure 2 the word plot, based on Table 13.

Figure 1: Text category plot based on 100 most common content words (verbs, adjectives and nouns)

Figure 2: The top 100 content word plot based on the 15 text types of the GR corpus

As can be seen in Figure 1, text J (Learned and Scientific Writing, that is to say, academic texts) is clearly dissimilar to the other 14 text categories. In addition, the data points in Figures 1 and 2 correspond to each other, so when looking at the top right quadrant of the content word plot (Figure 2), it can be seen that academic words such as different, form, system and used appear there because text J lies in the same quadrant of the text category plot (Figure 1). It is also clear that text A (Press: Reportage) shows distinct features compared with the other text categories. However, although some text categories such as K, L, N and P cluster together, these 15 categories cannot be consistently divided into specific groups by correspondence analysis alone. Thus, I use principal components analysis and cluster analysis in the next sections.
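
For completeness, a correspondence analysis of a frequency table like Table 13 can be sketched with the standard singular value decomposition of standardised residuals; the sketch below uses numpy, and the exported data file is a hypothetical name. Plotting the first two columns of each coordinate matrix gives maps like Figures 1 and 2.

    import numpy as np

    def correspondence_analysis(table):
        P = table / table.sum()                     # correspondence matrix
        r, c = P.sum(axis=1), P.sum(axis=0)         # row and column masses
        S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))   # standardised residuals
        U, sv, Vt = np.linalg.svd(S, full_matrices=False)
        row_coords = (U * sv) / np.sqrt(r)[:, None]       # principal row coordinates
        col_coords = (Vt.T * sv) / np.sqrt(c)[:, None]    # principal column coordinates
        return row_coords[:, :2], col_coords[:, :2]

    table = np.loadtxt("table13_raw_counts.txt")    # hypothetical export: rows = categories
    cats, words = correspondence_analysis(table)
    print(cats)   # coordinates of the 15 text categories on the first two axes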

10.4. Principal Component Analysis

According to Horvath (1985), principal components analysis (PCA) is a statistical method for arranging large arrays of data into an interpretable pattern (Oakes 1998:103). The principal components are computed from the matrix of correlations between the variables, outputting “their eigenvalues (the amount of variance accounted for by each component) and the component loadings (how the variables correlate with the principal component)” (Oakes 1998:103). The analysis then plots and arranges these variables in a two-dimensional space (as in correspondence analysis), with more closely related items plotted closer to each other than less closely related items. For example, Burrows and Hassal (1988) examined disputed authorship attribution using PCA, setting the variables as the top 50 high-frequency words; they showed a relationship between the collocation of the top 50 high-frequency words and the texts by creating a scatter plot from the PCA data.
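
A minimal sketch of such a PCA plot, applied to the per-1,000 table of section 10.2 (rows = the 15 text categories, columns = the 100 content words), is given below; scikit-learn is one convenient implementation, and the exported file name is an assumption.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.loadtxt("table14_per_1000.txt")       # hypothetical export of Table 14
    X_std = StandardScaler().fit_transform(X)    # standardise, i.e. PCA on the correlation matrix
    coords = PCA(n_components=2).fit_transform(X_std)
    for label, (pc1, pc2) in zip("ABCDEFGHJKLMNPR", coords):
        print(label, round(pc1, 2), round(pc2, 2))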

Figure 3 shows the plot of the 15 text types based on principal components analysis.

Figure 3: Text type plot based on principal components analysis

The result shows that the 15 text types can be divided into three major categories from the viewpoint of closeness: 1. K, L, M, N, P, R; 2. A, F, G, J; and, 3. B, C, D, E, H. Category 1 consists of imaginative prose, that is, literary works. Categories 2 and 3 can be categorised as informative prose. Thus, a distinction between informative and imaginative prose (or between academic and newspaper texts on the one hand and literature on the other) can first be made automatically, after which informative prose is further divided into categories 2 and 3. Surprisingly, though, although texts A, B and C are all newspaper press texts in the GR corpus (A: Press Reportage; B: Press Editorial; and, C: Press Reviews), text A is grouped in category 2 while texts B and C fall into category 3. As category 2 also contains an academic text (text J), it is difficult to make a distinction between text A and text J using a corpus-driven approach. Thus, principal components analysis suggests that the boundary between academic and newspaper texts is a grey zone, whereas literature is clearly separated from the other text genres. Texts H and J do not cluster closely with their neighbours, and can therefore be treated as independent; strictly, the categorisation is: 1. K, L, M, N, P, R; 2. A, F, G; 3. B, C, D, E; 4. H; 5. J.

10.5. Cluster Analysis

Cluster analysis is a statistical methodology used to categorise individual objects into groups. According to Oakes (1998:116), cluster analysis starts with the overall cluster containing all documents and then sequentially subdivides it down to the individual ones. The result of cluster analysis is visualized as a tree-diagram or dendrogram. Cluster analysis links similar texts by “draw[ing] vertical lines upwards from each node or document, then connect[ing] these vertical lines by a horizontal line at the point of similarity at the time the nodes are merged”. This method is known as “single linkage because clusters are joined at each stage by the single shortest or strongest link between them” (Oakes 1998:118-9; Anderberg 1973). Figure 4 shows the result of cluster analysis of the 15 text types.
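
A minimal Python sketch of this procedure, using SciPy’s hierarchical clustering with single linkage on an invented word-frequency matrix, is given below; the text-type labels and the values are hypothetical, and the dendrogram it draws is of the same kind as Figure 4.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Hypothetical feature matrix: rows = text types, columns = frequencies
# (per 1,000 words) of common words; all values are invented for illustration.
text_types = ["A", "B", "J", "K", "L"]
X = np.array([[62.1, 35.4,  8.2,  3.1],
              [60.5, 37.2,  7.4,  2.8],
              [70.3, 42.8,  1.2,  0.9],
              [48.5, 20.1, 12.7, 15.6],
              [51.2, 22.4, 10.9, 13.8]])

# Single linkage: clusters are joined at each stage by the shortest link between them
Z = linkage(X, method="single", metric="euclidean")

# The dendrogram draws a vertical line for each text type and joins them with
# horizontal lines at the distance at which the clusters are merged
dendrogram(Z, labels=text_types)
plt.ylabel("distance at merge")
plt.tight_layout()
plt.show()
```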

Figure 4: Cluster analysis of 15 text types based on 100 common words

The dendrogram in Figure 4 shows that the 15 text types can be divided into three large categories: 1. K, L, M, N, P, R; 2. A, F, G, J; and, 3. B, C, D, E, H. This result is the same as that given by principal components analysis.

Overall, the results of the correspondence analysis, principal components analysis and cluster analysis allow us to categorise each text type into a genre using the corpus-driven method. Taking into account the result common to the three separate multivariate analyses, we can conclude that:

  1. there is an obvious difference between literary and non-literary texts;
  2. academic texts have by far the most distinct linguistic features compared with the other genres;
  3. the style of press (or newspaper) texts differs sharply between reportage and non-reportage (editorials and reviews).

Thus, multivariate analyses such as correspondence analysis (CA), PCA and cluster analysis can make the elusive and abstract concept of ‘genre’ clearly visible, quantifying similarities and dissimilarities among text types.

11. Conclusion

Overall, this paper has shown that the combined use of a corpus-driven approach and current statistical methods can reveal the linguistic components that form ‘genres’, demonstrate that the concept of ‘genre’ can be grounded in internal criteria, and make the otherwise invisible similarities, dissimilarities and styles of text genres visible. More specifically, I have examined two main aspects of genre analysis and variation. The first is what sort of linguistic features can be revealed in the three genres from the viewpoint of vocabulary and phrasal approaches; this analysis focused on the linguistic differences and features of the genre texts. The second is how these three major written genres can be distinguished, from a vocabulary viewpoint, by linguistic differences among the 15 sub-divided text categories. Both approaches were conducted using a corpus-driven methodology.

The results of the current research can be divided into six summary points. First, basic statistics, lexical density and the measurement of vocabulary level revealed that newspaper English uses the most varied vocabulary, literary English an intermediate one and academic English the least varied, but that academic and newspaper English include more difficult words than literary English.

Second, high-frequency vocabulary, keyword analysis and n-gram analysis showed that nouns were most often keywords in academic and newspaper texts, whereas literary texts had more verbs as keywords than any other word class. In addition, academic texts use a much more formal written style, evidenced by the it is construction and fixed prepositional phrases, while literary texts use many more fixed, colloquial expressions expressing personal motions, actions, feelings, wishes and situations.

Third, the results of the investigations into the use of personality and passives support the general ideas stated by Kuo (1999), Coniam (2004) and Goatly (2000). To assess the ideas maintained by Ivanic and Simpson (1992), Swales (1990) and Ard (1983), however, it would be necessary to examine specific individual papers rather than the genres comprehensively.

Fourth, the results of the investigation of hedging partly support but also partly contradict the theories given by Kuo (1999), Hyland (1994), Salager-Meyer (1994) and Coniam (2004). Specifically, the present research casts doubt on: 1) the data (average frequency per 1,000 words) given by Coniam (2004); 2) the personal two-word hedge I believe, which behaves as an exception to the ideas of Kuo (1999), Hyland (1994) and Salager-Meyer (1994); and 3) the method taken both in the current research and by Coniam (2004), since it cannot separate genuine hedging uses from non-hedging uses of the hedging words in the list given by Hyland (2000).

Fifth, the results of my investigation of nominalization basically support Biber et al. (1998), in that academic texts show nominalization at a higher ratio than the other genres, and the -ness form is predominantly used in the literary genre. However, as newspapers were not examined in Biber’s research, one new aspect of this genre could be revealed: while the newspaper corpus shows an overall ratio and distribution of nominalization similar to the GR corpus, the -ment form occurs at its highest ratio across the genres in the newspaper corpus. The -ment nominalizations in newspaper texts include management, government, argument, investment, readjustment, replacement and others, and these words rarely occur in the other genres. In addition, in common with academic texts, the -ity form is also important in newspaper texts.

Finally, the multivariate analyses reveal new aspects of the similarities and differences across the 15 text types:

  1. according to CA, the linguistic features of academic texts (text J) and press reportage texts (text A) differ markedly from those of the other 13 text types;
  2. nevertheless, the 15 text types can be categorised into three groups on the basis of PCA and cluster analysis;
  3. most importantly, while all imaginative prose texts fall automatically into the same group, informative prose, particularly the press (or newspaper) texts, is clearly divided into two categories representing different text features.

In summary, I have shown that these three genres have different vocabularies, phrases, text levels, varieties, styles and discourses. Therefore, it can be said that such differences form text genres and that a corpus-driven approach is a valid linguistic approach for the analysis of genre.


  1. In later work, Biber changed the term ‘genre’ to ‘register’. 

References



