This webpage provides a selected list of corpora of computer-mediated communication. The list supplements the following article:
Computer-Mediated Communication (CMC) is the research field that explores the social, communicative and linguistic impact of communication technologies, which have continually evolved in connection with the use of computer networks. CMC research focuses mainly on Internet-based technologies and their genres: e-mail, mailing lists, discussion groups (forums and bulletin boards), Internet Relay Chat (IRC) and webchats, Instant Messaging (ICQ, AIM, etc.), MUDs, Voice-over-IP applications (Skype etc.), Web-based videoconferencing, weblogs and hypertext (incl. wikis).
We differentiate the following types of CMC corpora:
The CoSy:50 Corpus (Simeon Yates) | 50 submissions from 152 computer conferences (see Yates, Simeon J. 1996, Oral and written linguistic aspects of computer conferencing, in: Herring, Susan C. (ed), Computer-Mediated Communication. Linguistic, Social and Cross-Cultural Perspectives. Amsterdam/Philadelphia, pp. 29-46) |
The Swiss German webchat corpora (Beat Siebenhaar) | (see e.g. Siebenhaar, Beat, Die dialektale Verankerung regionaler Chats in der deutschsprachigen Schweiz, in: Eggers, Eckhard/Stellmacher, Dieter/Schmidt, Jürgen Erich (eds): Tagungsband IGDD-Kongress Marburg. Stuttgart) |
The contrastive German-Swedish IRC-Corpus (Christiane Pankow) | (see e.g. Pankow, Christiane 2003, Zur Darstellung nonverbalen Verhaltens in deutschen und schwedischen IRC-Chats. Eine Korpusuntersuchung, in: Linguistik online 15) |
The Netscan Usenet Database | |
The Enron Email Dataset | (> 0.5 million business-related e-mail messages) |
The SpamAssassin Public Corpus | (approx. 6,000 e-mail messages from the Apache SpamAssassin Project) |
Korpus deutschsprachiger Newsgroups | (see Feldweg, Helmut/Kibiger, Ralf/Thielen, Christine 1995, Zum Sprachgebrauch in deutschen Newsgruppen, in: Schmitz, Ulrich (ed), Neue Medien (Osnabrücker Beiträge zur Sprachtheorie 50), 143-154) |
WWE-2006 weblog dataset | this corpus of weblog posts was temporarily available to the participants of the 3rd Annual Workshop on the Weblogging Ecosystem (see |
E-Mail corpus from the COSMA project | 160 e-mail messages with appointment arrangements (see Declerck, Thierry/Klein, Judith 1997, Ein Email-Korpus zur Entwicklung und Evaluierung der Analysekomponente eines Terminvereinbarungssystems) |
Website corpus from the Hypnotic project | (see Rehm, Georg, 2001, Hypertextsorten: Definition, Struktur, Klassifikation) |
The Düsseldorf CMC Corpus | (corpus resource at Düsseldorf University, maintained by Dieter Stein; no online access) |
The Dortmund Chat Corpus | (> 500 annotated chatlogs and a retrieval tool available online) |
List compiled by Michael Beißwenger and Angelika Storrer, Technical University of Dortmund.
Last revised: 2008-02-09