A journey through learner language: tracking development using POS tag sequences in large-scale learner data
Abstract
This PhD study sits at the crossroads of SLA studies and corpus linguistics methodology, using a bottom-up, data-first approach to shed light on second language development. Taking POS tag n-gram sequences as its starting point, and searching the data from the outermost syntactic layer available in corpus tools, it investigates grammatical development in learner language across the six proficiency levels of the 52-million-word, CEFR-benchmarked, quasi-longitudinal Cambridge Learner Corpus. It takes a mixed-methods approach: first, it examines the frequency and distribution of POS tag sequences by level, identifying convergence and divergence; second, it looks qualitatively at the form-meaning mappings of sequences at different levels. It asks whether there are sequences that characterise levels and that might index the transition between levels. It investigates sequence use at a lexical and functional level and explores whether this can contribute to our understanding of how a generic repertoire of learner language develops. It aims to contribute to the theoretical debate by looking critically at how current theories of language development and description might account for learner language development. It responds to the call to look at large-scale learner data and benefits from privileged access to such quasi-longitudinal data, while acknowledging the limitations of any corpus data and the need to triangulate across different datasets. It seeks to illustrate how L2 language use converges and diverges across proficiency levels and to investigate convergence and divergence between L1 and L2 usage.
Keywords
Learner corpus research
Second language development
POS tag sequences
Usage-based
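
Illustrative sketch
The quantitative step described in the abstract, comparing the frequency and distribution of POS tag sequences by proficiency level, can be pictured with a minimal sketch. This is not the study's actual pipeline: the function names, the toy tag sequences, the coarse tag labels, and the normalisation per 1,000 n-grams are all assumptions made for illustration, and the input is assumed to be already POS-tagged.

```python
from collections import Counter


def pos_ngrams(tags, n):
    """Return all contiguous n-grams of POS tags as tuples."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def relative_frequencies(tagged_texts, n):
    """Count POS n-grams over several tag sequences and
    normalise the counts per 1,000 n-grams (an assumed convention)."""
    counts = Counter()
    for tags in tagged_texts:
        counts.update(pos_ngrams(tags, n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: 1000 * c / total for gram, c in counts.items()}


# Invented toy data: each level maps to a list of already-tagged texts.
# The tag labels are hypothetical, not the tagset used in the study.
corpus_by_level = {
    "A1": [["PRON", "VERB", "DET", "NOUN"], ["PRON", "VERB", "NOUN"]],
    "B2": [["PRON", "AUX", "VERB", "DET", "ADJ", "NOUN"]],
}

# Relative frequency of POS bigrams at each level.
freq_by_level = {
    level: relative_frequencies(texts, 2)
    for level, texts in corpus_by_level.items()
}

# Sequences attested at one level but not the other hint at divergence.
only_in_b2 = set(freq_by_level["B2"]) - set(freq_by_level["A1"])
```

Normalising by the total number of n-grams at each level keeps levels of different sizes comparable; sequences that appear in one level's table but not another's are candidates for the qualitative form-meaning analysis.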