Cambridge learner corpus download




















Diane Nicholls.

The Cambridge Learner Corpus comprises English examination scripts, transcribed retaining all errors, written by learners of English with 86 different mother tongues.

The scripts range across 8 EFL examinations and cover both general and business English. A 6 million-word component of the corpus has been error coded to date, using an error-coding system devised at CUP specifically for the Cambridge Learner Corpus.

The majority of codes are based on a two-letter system in which the first letter represents the general type of error. There are 88 possible codes in all. This paper will describe the coding system and the corpus tools used for analysis of the coded corpus, and will demonstrate the benefits which this coding and analysis provide for both lexicographers and writers of other ELT books at CUP.

Students' examination essays are carefully transcribed, reproducing all errors, checked for inputter-generated errors, and stored in the corpus, along with candidate details and examination scores.

The corpus is growing all the time. At present, the complete corpus contains more than 16 million words. The error-coded component of the corpus currently contains 6 million words.

A profile of each candidate is given for each examination script. This includes information on the first language, age, sex, education history and years of English study of each student. This information can be used to specify the parameters for the creation of subcorpora.

For example, it is possible to isolate for analysis the English of very young learners or a particular examination level, mother tongue or language group. A combination of any of these details can be used to create a subcorpus.
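The subcorpus selection described above can be sketched as a filter over candidate profiles. This is only an illustration: the `Script` record, its field names, the exam labels and the sample data are assumptions invented here, not the corpus's actual schema or tooling.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Script:
    """One examination script with its candidate profile (hypothetical schema)."""
    text: str
    first_language: str
    age: int
    sex: str
    exam: str
    years_of_english: int


def select_subcorpus(scripts: Iterable[Script],
                     *criteria: Callable[[Script], bool]) -> List[Script]:
    """Keep only the scripts that satisfy every criterion."""
    return [s for s in scripts if all(test(s) for test in criteria)]


# Invented sample records, for illustration only.
corpus = [
    Script("I am agree with you.", "Spanish", 11, "F", "exam_a", 2),
    Script("He suggested me to go.", "Polish", 24, "M", "exam_b", 8),
    Script("I have 12 years.", "Spanish", 12, "M", "exam_a", 3),
]

# Combine two profile details: mother tongue and age.
young_spanish = select_subcorpus(
    corpus,
    lambda s: s.first_language == "Spanish",
    lambda s: s.age <= 12,
)
print(len(young_spanish))
```

Passing each condition as a separate predicate mirrors the idea that any combination of profile details can define a subcorpus.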

The coded corpus

Six million words of the Cambridge Learner Corpus have been error coded to date. While error coding is certainly a laborious and time-consuming task, its benefits for our purposes, which I discuss below, have proved to far outweigh the difficulties. The system of error codes and software have been designed in such a way as to overcome, as far as possible, problems with the indeterminacy of some error types. The corpus has also been manually coded by just two coders, with one coder overseeing the work of the second, thus keeping to a minimum any problems with consistency of tagging.

Our aim in coding the corpus is not to create a systematic taxonomy of learner errors but, where possible, to capture under one heading all errors both of omission and commission of a particular type so that they can easily be extracted and analyzed and the information gained passed on to examining bodies, teachers, lexicographers, researchers and ELT authors for use in developing tools for learners of English.

The codes are not an end in themselves, but rather, act as bookmarks to the contexts in which an error repeatedly occurs. These bookmarks can then be referred to for further analysis. Care is taken not to 'interpret' or paraphrase errors and only to add a corrected version where there is relative certainty and only one clear replacement possible.

This measure vastly increases the search potential of the corpus, allowing access to data which would otherwise be present but only indirectly accessible. Correct uses, which are automatically tagged NE (no error), can be deselected and the remaining cites sorted according to error type for easy look-up and analysis.
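Deselecting correct uses and sorting the rest by error type might look like the sketch below. Only the NE tag comes from the text; the `Cite` record, the placeholder codes `X1` and `X2`, and the sample contexts are hypothetical and do not reproduce the real coding scheme or software.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Cite(NamedTuple):
    """A tagged occurrence: its code and the context it was found in."""
    code: str      # "NE" marks a correct use; other values are placeholders
    context: str


def errors_by_type(cites: List[Cite]) -> Dict[str, List[Cite]]:
    """Drop NE-tagged cites and group the remainder by code, in code order."""
    grouped = defaultdict(list)
    for cite in cites:
        if cite.code != "NE":
            grouped[cite.code].append(cite)
    return {code: grouped[code] for code in sorted(grouped)}


# Invented sample data, for illustration only.
cites = [
    Cite("X2", "context one"),
    Cite("NE", "context two"),
    Cite("X1", "context three"),
    Cite("X2", "context four"),
]

for code, group in errors_by_type(cites).items():
    print(code, len(group))
```

Keeping the NE cites in the input rather than discarding them at coding time is what makes the later comparison of correct and incorrect uses possible.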

This also allows the possibility of comparing what learners get right (an often neglected area in ELT) with what they get wrong. After searching through a concordance for 'at' in an uncoded corpus, for example, it is possible to locate errors such as the unnecessary use of the preposition.
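For readers unfamiliar with concordance searches, a minimal keyword-in-context sketch for a word such as 'at' follows. It is a generic illustration, not the corpus tools described in the paper; the function name, the `width` parameter and the sample sentence are assumptions. Note that it simply returns every occurrence, correct or not, which is why locating errors in an uncoded corpus means reading through the hits by hand.

```python
import re
from typing import List, Tuple


def concordance(text: str, keyword: str,
                width: int = 30) -> List[Tuple[str, str, str]]:
    """Return (left context, match, right context) for each whole-word hit."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    hits = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        hits.append((left, m.group(0), right))
    return hits


# Invented sample text, for illustration only.
sample = "We arrived at the station at noon. At home I looked at it."
hits = concordance(sample, "at")
for left, word, right in hits:
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>30} [{word}] {right}")
```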

The documents in the Cambridge English Readability Dataset are harvested from all the tasks in the past reading papers for each of the exams. The Cambridge English Exams are designed specifically for L2 learners, and the A2 to C2 levels assigned to each reading paper can be treated as the level of reading difficulty of the documents for L2 learners.
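Treating the exam level as a difficulty label amounts to placing the five levels from A2 to C2 on an ordinal scale. The sketch below shows one way to do that; the `(level, text)` tuple layout and the sample documents are assumptions, since the dataset's actual file format is not described here.

```python
from typing import Dict, List, Tuple

# The CEFR levels between A2 and C2, from easiest to hardest.
CEFR_LEVELS: Tuple[str, ...] = ("A2", "B1", "B2", "C1", "C2")
LEVEL_RANK: Dict[str, int] = {level: rank for rank, level in enumerate(CEFR_LEVELS)}


def sort_by_difficulty(docs: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Order (level, text) pairs from easiest to hardest reading level."""
    return sorted(docs, key=lambda doc: LEVEL_RANK[doc[0]])


# Invented sample documents, for illustration only.
docs = [
    ("C1", "placeholder text c"),
    ("A2", "placeholder text a"),
    ("B2", "placeholder text b"),
]

ordered = sort_by_difficulty(docs)
print([level for level, _ in ordered])
```

An integer rank like this is convenient when readability is modelled as ranking or regression rather than as unordered classification.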

You may download the Cambridge English Readability Dataset if you agree to the licence below.

Licence

The Dataset is released for non-commercial research and educational purposes under the following licence agreement: By downloading this dataset and licence, this licence agreement is entered into, effective this date, between you, the Licensee, and the University of Cambridge, the Licensor. Copyright of the entire licensed dataset is held by the Licensor. No ownership or interest in the dataset is transferred to the Licensee.

The Licensor hereby grants the Licensee a non-exclusive non-transferable right to use the licensed dataset for non-commercial research and educational purposes. Non-commercial purposes exclude without limitation any use of the licensed dataset or information derived from the dataset for or as part of a product or service which is sold, offered for sale, licensed, leased or rented.

The Licensee may publish excerpts of less than words from the licensed dataset pursuant to clause 3. Licensor does not make, and expressly disclaims, any express or implied warranties, representations or endorsements of any kind whatsoever. This Agreement shall be governed by and construed in accordance with the laws of England and the English courts shall have exclusive jurisdiction.


