PSLW is a corpus of first-year written texts from Purdue’s English 106i course. The course requires five written assignments each semester, with three drafts of each assignment. The processed corpus contains over 3 million words (from Fall 2014 and Spring 2015) and is available as plain text to Purdue students and researchers. The corpus continues to grow each semester.

How To Access PSLW
SLS User Agreement

  • The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English online. Created by Mark Davies, Professor of Linguistics at Brigham Young University, it is the only large, balanced corpus of contemporary American English and is also one of the most frequently used, with more than 40,000 unique visitors each month. It contains more than 450 million words of text, equally divided among spoken, fiction, popular magazine, newspaper and academic texts. It includes 20 million words for each year from 1990 to 2012 and is updated regularly (most recently in the summer of 2012). With COCA, one can search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these. One can also find collocates within a 10-word window, limit searches by frequency, compare frequencies, and conduct semantically based queries. Because of these features, COCA is perhaps the only corpus of English that is suitable for looking at current, ongoing changes in the language.

  • The Corpus of Historical American English (COHA) is the largest structured corpus of historical English. Created by Mark Davies, Professor of Linguistics at Brigham Young University, with funding from the US National Endowment for the Humanities, it allows one to search more than 400 million words of text of American English from the years 1810-2009. With COHA, one can see how words, phrases, and grammatical constructions have changed in frequency over time; how word meanings have changed over time; and how stylistic changes have occurred over time.

  • The Corpus do Português allows one to quickly and easily search more than 45 million words in almost 57,000 Portuguese texts from the 1300s to the 1900s. One can search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these, as well as collocates within a 10-word window. One can also compare the frequency and distribution of two related words, phrases or grammatical constructions across texts by register, dialect and historical period. Semantically based queries can also be conducted with this corpus.

  • The Corpus of Global Web-Based English (GloWbE) comprises 1.9 billion words from 1.8 million webpages in 20 different English-speaking countries. Released in 2013, it was created by Mark Davies, Professor of Linguistics at Brigham Young University (BYU), and is related to COCA and COHA, two other corpora available through BYU. In combination with COCA and COHA, one can use GloWbE to examine variation in English by dialect, by genre and over time. GloWbE also allows one to search a corpus that is more than four times the size of COCA and to see the frequency of any word, phrase, or grammatical construction in each of the 20 countries. One can compare features of two sets of dialects or limit a search to one or two countries. Ultimately, GloWbE allows researchers to study an extremely wide range of phenomena.

  • The Corpus del Español allows one to quickly and easily search more than 100 million words in more than 20,000 Spanish texts from the 1200s to the 1900s. One can search for exact words or phrases, wildcards, lemmas, parts of speech, or any combination of these, as well as collocates within a 10-word window. One can also compare the frequency and distribution of two related words, phrases and grammatical constructions across texts by register and historical period. Semantically-based queries can also be conducted with this corpus.

  • The TIME Magazine Corpus contains more than 100 million words of text of American English from 1923 to the present, as found in TIME magazine. Created by Mark Davies, Professor of Linguistics at Brigham Young University, it allows one to see how words, phrases and grammatical constructions have changed over time, in frequency and in meaning.

  • The Corpus of American Soap Operas contains 100 million words in more than 22,000 transcripts from ten American soap operas between the years 2001 and 2012. Released in 2012, it was created by Mark Davies, Professor of Linguistics at Brigham Young University. Although soap operas are scripted, the data gleaned from them provides a useful and insightful look into informal, colloquial American speech.

  • The British National Corpus (BNC) was created by Oxford University Press between the early 1980s and the early 1990s and exists in various forms online. The version housed by Brigham Young University (BYU) was updated in 2012 and uses the CLAWS 7 tagset. The BNC contains 100 million words from the 1970s to 1993. In the BYU-BNC, one can search by exact word or phrase, wildcard or part of speech, or combinations of these, as well as for collocates within a 10-word window. One can also search within different registers (e.g. spoken, poetry) for the frequency of words and phrases in any combination, and compare data between two different registers. Semantically based queries are also possible in this corpus.

  • The Michigan Corpus of Academic Spoken English (MICASE) is a collection of nearly 1.8 million words of transcribed speech (almost 200 hours of recordings) from the University of Michigan-Ann Arbor (U-M). Compiled between 1997 and 2002 and released in 2002 (with a new interface released in 2007), it was created by researchers and students at the U-M English Language Institute (ELI). MICASE contains data from a wide range of speech events (including lectures, classroom discussions, lab sections, seminars and advising sessions) and locations across the university.

  • The Michigan Corpus of Upper-Level Student Papers (MICUSP) is a collection of around 830 A-grade papers (roughly 2.6 million words) from a range of disciplines across four academic divisions (Humanities and Arts, Physical Sciences, Social Sciences, Biological and Health Sciences) of the University of Michigan-Ann Arbor (U-M). MICUSP was created by a team of researchers and students at the U-M English Language Institute (ELI). Comprising papers written between 2002 and 2009, MICUSP was compiled from 2004 to 2009 and released in 2009.

  • The International Corpus of English (ICE), which was first created in 1990, aims to collect material for comparative studies of English worldwide. Twenty-six research teams from around the world collect data in order to create electronic corpora of their own national or regional varieties of English. There are currently fourteen individual corpora housed within ICE, each containing one million words of spoken and written English produced after 1989. They are: Canada, East Africa, Great Britain, Hong Kong, India, Ireland & SPICE, Ireland, Jamaica, New Zealand, Nigeria (written), The Philippines, Singapore, Sri Lanka (written) and USA (written).

  • The British Academic Written English (BAWE) Corpus was created between 2004 and 2007. It contains 2761 pieces of proficient assessed student writing, ranging in length from about 500 words to about 5000 words. Holdings are distributed across four disciplinary areas (Arts and Humanities, Social Sciences, Life Sciences and Physical Sciences) and across four levels of study (undergraduate and graduate); thirty-five disciplines are represented. Each file is headed by factual information about the author, such as gender and year of birth, as well as other research findings, such as genre family.

  • The British Academic Spoken English (BASE) Corpus was developed by Hilary Nesi, with Paul Thompson, between 2000 and 2005. The BASE Corpus consists of 160 lectures and 40 seminars recorded in a variety of departments. It contains 1,644,942 tokens in total (lectures and seminars). Holdings are distributed across four broad disciplinary groups, each represented by 40 lectures and 10 seminars.

  • The Vienna-Oxford International Corpus of English (VOICE) consists of transcripts of naturally occurring, non-scripted face-to-face interactions in English as a lingua franca (ELF). VOICE currently comprises 1 million words of spoken ELF interactions, equaling approximately 120 hours of transcribed speech; recordings of 23 of the transcribed speech events can also be listened to. The speakers recorded in VOICE are experienced ELF speakers from a wide range of first language backgrounds. VOICE currently includes 1250 ELF speakers with around 50 different first languages, most of whom are European ELF speakers. The ELF interactions recorded cover a range of speech events in terms of domain (professional, educational, leisure), function (exchanging information, enacting social relationships) and participant roles and relationships (acquainted vs. unacquainted, symmetrical vs. asymmetrical). They are classified into ten speech event types, such as interviews, seminar discussions and question-answer sessions.

  • The Louvain International Database of Spoken English Interlanguage (LINDSEI) was launched in 1995 by the Centre for English Corpus Linguistics of the Université catholique de Louvain. It contains oral data produced by advanced learners of English from several different first language backgrounds. The first component of LINDSEI contained 50 transcripts from native French speakers who were learners of English, for a total of about 100,000 words of learner language; there are currently eleven components available via CD-ROM, with more being developed. Each component features around 50 interviews made up of three tasks: set topic, free discussion and picture description. The interviews are transcribed and marked up according to the same conventions, and each is linked to a profile containing information about the learner, the interviewer and the interview itself. This information makes it possible to study the influence of certain factors on learner language. Lexis, discourse, pragmatics, syntax and phraseology, among other aspects of learner English, can be investigated via LINDSEI.

  • The Hong Kong Corpus of Spoken English (HKCSE-Prosodic) is a large collection of texts representing spoken English in Hong Kong. Created by the Research Centre for Professional Communication in English of the Hong Kong Polytechnic University, it is available via book and CD-ROM through the John Benjamins Publishing Company. The corpus is the first to apply David Brazil’s Discourse Intonation systems (prominence, key, tone and termination) to the study of naturally occurring spoken discourse. It is made up of around one million words in four sub-corpora of equal size (academic, conversation, business and public). The participants are all adults and usually speak Cantonese or English as a first language. The CD-ROM contains the prosodically transcribed corpus together with iConc, the software designed and written specifically to interrogate the corpus.
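
Several of the corpora listed above support searching for collocates within a 10-word window. As a rough illustration of what such a search computes, the Python sketch below counts the words that occur near a chosen node word in a plain-text sample. It is not the implementation used by any of these corpus interfaces: the function names, the simple tokenizer and the sample sentence are hypothetical, and the window is assumed to mean up to `window` tokens on each side of the node word.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(tokens, node, window=10):
    """Count words occurring within `window` tokens on either side of `node`.

    Assumption: the window is symmetric; real corpus interfaces may let
    the user set the left and right spans separately.
    """
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Hypothetical sample text, used only to show the call pattern.
sample = ("Strong tea and strong coffee are popular, "
          "but powerful computers are not strong.")
tokens = tokenize(sample)
top = collocates(tokens, "strong", window=2)
print(top.most_common(3))
```

With a plain-text corpus such as PSLW, the same function could be applied to the tokens of each file, with the resulting counters summed to rank collocates across the whole collection.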