Exploring the composition of the searchable web: a corpus-based taxonomy of web registers. Issue 1 (April 2015)
- Record Type:
- Journal Article
- Title:
- Exploring the composition of the searchable web: a corpus-based taxonomy of web registers. Issue 1 (April 2015)
- Main Title:
- Exploring the composition of the searchable web: a corpus-based taxonomy of web registers
- Authors:
- Biber, Douglas
Egbert, Jesse
Davies, Mark
- Abstract:
- One major challenge for Web-As-Corpus research is that a typical Web search provides little information about the register of the documents that are searched. Previous research has attempted to address this problem (e.g., through the Automatic Genre Identification initiative), but with only limited success. As a result, we currently know surprisingly little about the distribution of registers on the web.
In this study, we tackle this problem through a bottom-up user-based investigation of a large, representative corpus of web documents. We base our investigation on a much larger corpus than those used in previous research (48,571 web documents), obtained through random sampling from across the full range of documents that are publicly available on the searchable web. Instead of relying on individual expert coders, we recruit typical end-users of the Web for register coding, with each document in the corpus coded by four different raters. End-users identify basic situational characteristics of each web document, coded in a hierarchical manner. Those situational characteristics lead to general register categories, which eventually lead to lists of specific sub-registers. By working through a hierarchical decision tree, users are able to identify the register category of most Internet texts with a high degree of reliability. After summarising our methodological approach, this paper documents the register composition of the searchable web. Narrative registers are found to be the most prevalent, while Opinion and Informational Description/Explanation registers are also found to be extremely common. One of the major innovations of the approach adopted here is that it permits an empirical identification of 'hybrid' documents, which integrate characteristics from multiple general register categories (e.g., opinionated-narrative). These patterns are described and illustrated through sample Internet documents.
- Is Part Of:
- Corpora. Volume 10:Issue 1(2015)
- Journal:
- Corpora
- Issue:
- Volume 10:Issue 1(2015)
- Issue Display:
- Volume 10, Issue 1 (2015)
- Year:
- 2015
- Volume:
- 10
- Issue:
- 1
- Issue Sort Value:
- 2015-0010-0001-0000
- Page Start:
- 11
- Page End:
- 45
- Publication Date:
- 2015-04
- Subjects:
- hybrid registers -- informational registers -- Internet language -- Mechanical Turk -- narrative -- opinion -- Web-As-Corpus -- web registers
Corpora (Linguistics) -- Periodicals
410.188
- Journal URLs:
- http://www.euppublishing.com/journal/cor
http://www.euppublishing.com/journals
- DOI:
- 10.3366/cor.2015.0065
- Languages:
- English
- ISSNs:
- 1749-5032
- Deposit Type:
- Legal deposit
- View Content:
- Available online (eLD content is only available in our Reading Rooms)
- Physical Locations:
- British Library DSC - BLDSS-3PM
British Library HMNTS - ELD Digital store
- Ingest File:
- 5036.xml