University of Arizona Corpus Lab

Researchers: Shelley Staples, Bradley Dilger, Louis Wyatt, Sherri Craig, Michelle McMullin, Ola Swatek, Lindsey Macdonald, Hadi Banat, Wendy Gao, Terrence Wang

The Crow project is a collaboration between faculty and graduate students in Second Language Studies and Rhetoric and Composition. The project brings together the Purdue Second Language Writing Corpus (PSLW), a collection of first-year composition texts, and a revived Collaborative Online Instructor Network (COIN), a database of professional development resources for writing instruction. We are building a user-friendly website that connects data from the corpus with pedagogical resources (e.g., syllabi, lesson plans) to encourage creative research in Applied Linguistics as well as Rhetoric and Composition.

Presentations: Purdue Languages and Cultures Conference 2016, Computers & Writing 2016, TALC 2016

PSLW is a corpus of first-year written texts from Purdue’s English 106i course. The course comprises five written assignments each semester, with three drafts of each assignment. The processed corpus contains over 3 million words (from Fall 2014 and Spring 2015) and is available as plain text to Purdue students and researchers. The corpus continues to grow each semester, and a number of research projects have been conducted using the PSLW.

The Corpus Linguistics Research Lab is currently working on two projects related to the PSLW:

Novice L2 English academic writers’ use of reporting verbs: A learner corpus study

Researchers: R. Scott Partridge, Heejung Kwon, and Shelley Staples

Presentations: Corpus Linguistics 2015, Second Language Research Forum 2015

Publications: in progress

This study explores the use of reporting verbs in a literature review assignment from first-year composition classes for second language writers. Using the newly created Purdue Second Language Writing Corpus (PSLW-Corpus), researchers investigated the linguistic features of L2 novice writers’ texts in the early stages of learning academic writing.
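As a rough illustration of the kind of count such a study might start from, the sketch below tallies a few reporting verbs in a text and normalizes the counts per 1,000 words. The verb list, the crude inflection matching, and the function names are assumptions made for this example and are not the study's actual method or word list.

```python
import re

# Illustrative verb list only; the study's actual list is not given here.
REPORTING_VERBS = ("argue", "claim", "state", "suggest", "show")

def inflected_forms(lemma):
    """Return a crude set of regular inflections for a verb lemma.

    Irregular forms (e.g., "shown") are missed, and noun uses
    (e.g., "the state") are not filtered out.
    """
    if lemma.endswith("e"):
        return {lemma, lemma + "s", lemma + "d", lemma[:-1] + "ing"}
    return {lemma, lemma + "s", lemma + "ed", lemma + "ing"}

def reporting_verb_rates(text, lemmas=REPORTING_VERBS, per=1000):
    """Count each lemma's inflected forms and normalize per `per` words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    rates = {}
    for lemma in lemmas:
        forms = inflected_forms(lemma)
        hits = sum(1 for tok in tokens if tok in forms)
        rates[lemma] = per * hits / len(tokens) if tokens else 0.0
    return rates

sample = "Smith (2010) argues that learners improve. Jones claimed otherwise."
rates = reporting_verb_rates(sample)
```

Normalizing by text length makes counts comparable across essays of different lengths; a real analysis would more likely rely on part-of-speech tagging than on string matching.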

Examining the effectiveness of data-driven instruction of reporting verbs in L2 writing: a corpus-based study

Researchers: R. Scott Partridge, Shelley Staples and the Corpus Lab and Research Group

Presentations: TALC 2016

This study examines the effectiveness of data-driven learning in a first-year writing (FYW) class for L2 writers. The intervention drew on previous corpus-based findings that highlighted L2 writers’ reporting verb use in a literature review assignment. The students’ pre- and post-intervention essays will be compared with our corpus, which serves as a control group.

Identifying linguistic features of medical interactions: A register analysis

Although medical discourse has been explored from both qualitative and quantitative perspectives since the 1980s, there have been few corpus-based linguistic analyses of doctor-patient and nurse-patient interactions. Such analyses are useful both for identifying the linguistic features characteristic of medical interaction and their functions within the health care context, and for providing insight into the language needed for teaching and training medical providers. This chapter will explore the use of linguistic features associated with medical discourse in two contexts: nurse-patient interaction and doctor-patient interaction. In order to identify distinctive features of spoken medical discourse, the nurse-patient and doctor-patient interactions will be compared to casual face-to-face conversation. Lexico-grammatical features that have been identified as important to effective medical encounters, such as stance devices (e.g., modals), narrative structures (e.g., past tense), personal pronouns, and questions, will be investigated across three corpora. The functions of the linguistic features will be interpreted in relation to the situational characteristics of each register, including the speaker’s role in the interaction. The results show similar patterns across the two medical contexts in comparison with conversation. This lends support to previous research that has compared nurse-patient interactions with conversation (Staples & Biber, 2014). However, there are also important distinctions between the two medical contexts, which are related to the speaker role (doctor vs. nurse) and the setting of the interaction (primary care clinic vs. hospital). The findings have implications for the sociolinguistic study of health care communication as well as the teaching and training of both native speaking and non-native speaking English medical providers. In particular, many of the linguistic features are associated with patient-centered care and the building of patient rapport.

A multi-dimensional comparison of oral proficiency interviews to conversation, academic and professional spoken registers (with Geoffrey LaFlair, University of Kentucky and Jesse Egbert, Brigham Young University)

Oral Proficiency Interviews (OPIs) are widely used to measure speaking ability in a second or foreign language. The Michigan English Language Assessment Battery (MELAB) is an OPI used for academic and professional purposes around the world. According to Kane (2013), an argument-based approach to assessment validity can be bolstered by analytic evidence of a relationship between language use in test conditions and language use in the target domain (the extrapolation inference). In this study we use Multi-Dimensional (MD) analysis, investigating a large number of linguistic features, to determine the extent to which the language of the MELAB speaking performances is similar to conversation and to academic and professional registers of spoken English. The results show that while the MELAB constitutes a distinct register, it has similarities with conversation in its use of stance, and is closely aligned with academic and professional registers in the use of language for informational exchange. These findings provide support for the extrapolation inference of the MELAB OPI. However, the use of narrative features and discussion of future possibilities and suggestions, important aspects of both conversation and academic and professional registers, may be harder to evaluate through the MELAB and other similar OPIs.

Investigating lexico-grammatical complexity as construct validity evidence for the ECPE writing tasks: A multidimensional analysis (with Xun Yan, University of Illinois Urbana-Champaign)

The complexity of lexico-grammatical features is widely recognized as an integral part of writing proficiency in second language (L2) writing assessment. However, a remaining concern for the construct validation of writing tasks lies in the scalability of representative linguistic features in writing performances. Previous research suggests that distinctions across different levels of writing proficiency are not necessarily associated with individual lexico-grammatical features, but rather with the co-occurrence of multiple features (Biber, Gray & Staples, 2014; Friginal, Li & Weigle, 2014; Jarvis, Grant, Bikowski & Ferris, 2003).

In an effort to investigate the scalability of lexico-grammatical complexity, this study used a multidimensional (MD) analysis to examine the saliency and patterns of co-occurrence of 31 lexico-grammatical features in 595 writing performances on a large-scale, advanced-level English language proficiency examination, the Examination for the Certificate of Proficiency in English (ECPE). The linguistic features were classified into four categories: fluency, lexical sophistication, semantic categories for word classes, and general grammatical features, all of which have been found to characterize written discourse and advanced L2 writing proficiency (e.g., Biber, Gray & Poonpon, 2011).

Results of the MD analysis indicate five underlying factors, representing five functional dimensions of lexico-grammatical complexity in ECPE writing performances: literate vs. oral discourse, topic-related content, prompt dependence vs. lexical diversity, overt suggestions, and stance vs. referential discourse. Together, the five dimensions accounted for 35% of the holistic score variance. While factor scores on the prompt dependence vs. lexical diversity dimension did not yield a significant correlation with the holistic ECPE writing scores awarded by human raters, correlations for the other four dimensions were linear and statistically significant. The findings of this study provide supporting evidence for different layers of construct validity of the ECPE writing tasks and suggest the scalability of the ECPE writing scale with respect to lexico-grammatical complexity.

Lexical bundles in L1 Persian and L1 English argumentative essays (with Hesamoddin Shahriari, Ferdowsi University, Mashhad, Iran)

One of the necessary preconditions for the effective teaching of second/foreign language writing is to have a comprehensive understanding of the strengths and weaknesses of one’s target group of learners. Such an understanding would allow instructors to more thoroughly focus on their learners’ mistakes and inaccuracies, while spending less time on features with which learners have relatively fewer problems. Frequently recurring lexical sequences within a text (i.e., lexical bundles) have been used to analyze the discourse of a register from a structural and functional point of view. This study aims to explore the ways in which analyzing lexical bundles in a learner corpus could be used to improve our understanding of learner writing and subsequently inform the process of instruction. To this end, the lexical bundles found in the Iranian sub-corpus of the International Corpus of Learner English (ICLE) were analyzed in terms of their fixedness, form and function; in each case, comparisons were drawn between the target set of bundles and those found in a comparable corpus of native-speaker essays (a sample of the LOCNESS). A detailed discussion of the findings along with relevant pedagogic implications will be presented.
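To make the notion of a lexical bundle concrete, here is a minimal sketch that extracts recurring four-word sequences from a small set of texts. The tokenizer, the function names, the sample essays, and the thresholds (`min_freq`, `min_texts`) are assumptions chosen for this example; they are not the study's actual procedure or cut-offs, which are typically set relative to corpus size.

```python
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z']+", text.lower())

def lexical_bundles(texts, n=4, min_freq=2, min_texts=2):
    """Return n-word sequences that occur at least `min_freq` times
    overall and appear in at least `min_texts` different texts."""
    freq = Counter()                # total occurrences of each sequence
    text_range = defaultdict(set)   # indices of texts containing each sequence
    for idx, text in enumerate(texts):
        tokens = tokenize(text)
        for i in range(len(tokens) - n + 1):
            gram = " ".join(tokens[i:i + n])
            freq[gram] += 1
            text_range[gram].add(idx)
    return {
        gram: count
        for gram, count in freq.items()
        if count >= min_freq and len(text_range[gram]) >= min_texts
    }

# Invented sample sentences, for illustration only.
essays = [
    "On the other hand, students learn more at home.",
    "On the other hand, television can be useful.",
    "Television is bad for children.",
]
bundles = lexical_bundles(essays)
```

Requiring a sequence to appear in several texts, rather than only reaching a raw frequency, keeps one writer's idiosyncratic repetition from being counted as a bundle.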