Corpus Linguistics

A bibliographic reference section focusing on corpus linguistics and the use of corpora in language teaching.

  • Texts: Corpora, Newspapers and News Sites
  • Chinese
  • Czech
  • Danish
  • Dutch
  • English
  • English-Miscellaneous
  • Estonian
  • Ethiopic
  • French
  • Gaelic
  • German
  • Hebrew
  • Italian
  • Malay
  • Norwegian
  • Polish
  • Portuguese
  • Russian
  • Scandinavian
  • Spanish
  • Swedish
  • Turkish
  • Learner corpora
  • Corpus searches
  • Word lists and Stop lists
  • Software
  • Text analysis
  • Taggers
  • Online papers, theses, etc. related to CL.
  • Courses in Corpus Linguistics
  • Useful Sites and Home Pages

    Texts: Corpora, Newspapers and News Sites

    Chinese

    Mandarin corpus Big 5 encoding

    Czech

    Institute of the Czech National Corpus

    Danish

    News in Danish. Address has changed.

    Dutch

    The Institute for Dutch Lexicology have several large corpora, which can be accessed for academic research purposes.

    English

    American National Corpus

    Oxford Text Archive: web site and OTA FTP site (mirror FTP site for North America: OTA). A good starting point. Includes British novels (Dickens, Trollope, etc.). The Susanne Corpus is in this archive in the directory pub/ota/public/susanne. For background information, see Susanne.

    OED Online

    Project Gutenberg: (in English)
    Some literary works such as "Moby Dick" and "Through the Looking Glass" are available electronically from Project Gutenberg.

    Corpus of Spoken, Professional American-English. The corpus is available commercially from Athelstan. There is a 50,000-word sample available online.

    British National Corpus. A large (100 million words) corpus of modern English (1990s). BNC World Edition is now available. See also BNC Indexer.

    COBUILD offers access to a large corpus for a fee. Also has a free demo.

    Wellington Corpus of Spoken New Zealand English. CD-ROM. Written New Zealand English is also included. Corpus-Manager@vuw.ac.nz

    Penn-Helsinki Corpus of Middle English

    ICAME, Bergen. This is the FTP site; there is also a web site. ICAME produces an excellent CD-ROM containing the Brown, LOB, London-Lund, and Helsinki corpora, among others, and is the home of the Corpora news list.

    The Bergen Corpus of London Teenage Language

    The TRAINS Spoken Dialogue Corpus

    CCAT Archive Gopher site at U. Penn. A good site for classical, historical, and religious texts.

    Voice of America News (Gopher)

    CBC Canadian broadcasting. Includes sound files.

    Marx & Engels Online Library

    English-Miscellaneous

    Ftp site for Red Dwarf scripts
    O.J. Simpson trial transcripts. Another transcript source. And another good source.
    Neologisms
    Proper names Ftp site.
    Russian novel Gopher

    Estonian

    Estonian Law (in Estonian!)

    Ethiopic

    Thesaurus Linguae Aethiopicae

    French

    Louisiana French MOVED ??

    French novels

    Dictionnaire de l'Académie française

    Radio France Internationale

    Gaelic

    Manx

    German

    Mannheimer Corpora. A very large, growing, online German corpus archive (778 million words in August 2000). A copyright-free portion of the archive (379 million words in August 2000) is freely searchable. Invited guests have access to the whole archive. Partially tagged.

    German newspapers -- tagged corpus with syntactic structure annotated.

    Hebrew

    Spoken Israeli Hebrew. Description of the project.

    Indo-European

    Comparative Indo-European

    Italian

    Italian literature (LiberLiber)

    Malay

    Malay Classical literature. Searchable online.

    Norwegian

    Oslo Corpus of Tagged Norwegian Texts

    Polish

    Polish Newspaper

    Portuguese (Brazilian)

    News from Brazil

    Russian

    Russian literature

    Russian foreign affairs articles. I have not had much luck with this.

    Russian word list gopher.

    Scandinavian

    Language Bank of Swedish Texts

    Project Runeberg (Scandinavian classics)

    Swedish

    Spanish

    South American oral and written texts available via ftp from lola.lllf.uam.es.

    Spanish Syntax Research Group, University of Santiago de Compostela. Information about ARTHUS (1.5 million words of modern Spanish) and a syntactic database (BDS, 160,000 analysed clauses of ARTHUS). In progress: a medieval and classical Spanish corpus ("ARTHUS Medieval y Clasico").

    "Maria" corpus Acquisition of Spanish.

    Mexican Newspapers: El Nacional, La Jornada, etc.

    Swedish

    Bank of Swedish

    Turkish

    Turkish

    Learner corpora

    Learner corpora Extensive information from Yukio Tono

    Hungarian EFL Student Writing

    ICLE - Brazilian Portuguese Sub-Corpus

    Corpus searches

    COSMAS search Institut für Deutsche Sprache, Mannheim, Germany.

    IMS Stuttgart (Penn Treebank) search -- OLD LINK??

    Cobuild Corpus Sampler

    University of Michigan Middle English Collection

    Michigan Early Modern English Materials

    Web-based analysis of Gutenberg texts by Ron Reck. See also Corpus Access at the University of Essex.

    VISL Project Denmark. English and German corpora can be searched.

    Concordance of Great Books

    British National Corpus Simple search

    Word lists and Stop lists

    French Stop list from Jean.

    Zipped file of n-grams from the Brown Corpus

    Software

    Text analysis

    COSMAS - A corpus analysis toolbox, accessible online since 1995; see COSMAS. 778 million words online, virtual corpus composition, complex query language, concordancing, collocation analysis, etc.

    MonoConc Pro. Commercial Windows concordance program (produced by me). See the Athelstan site.

    MonoConc, a Mac/Windows concordance program that allows sorts (2R,1R,2L,1L) and provides simple frequency information. 
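
To illustrate what a sorted concordance of this kind looks like, here is a minimal Python sketch. It is not MonoConc's implementation; the function names `kwic` and `sort_hits`, the simple word tokenizer, and the sample text are all assumptions made for the example. It collects keyword-in-context hits and sorts them by the first or second word to the right (1R, 2R) or left (1L, 2L) of the search word.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left, node, right) context tuples for each hit of keyword."""
    tokens = re.findall(r"\w+", text.lower())  # crude tokenizer (assumption)
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            hits.append((left, tok, right))
    return hits

def sort_hits(hits, position):
    """Sort hits by one context word: '1R', '2R', '1L' or '2L'."""
    n, side = int(position[0]), position[1].upper()

    def key(hit):
        left, _, right = hit
        if side == "R":
            return right[n - 1] if len(right) >= n else ""
        return left[-n] if len(left) >= n else ""

    return sorted(hits, key=key)

# Hypothetical sample text, for illustration only.
sample = "The cat sat on the mat. The dog sat by the door. A cat ran to the dog."
for left, node, right in sort_hits(kwic(sample, "the"), "1R"):
    print(f"{' '.join(left):>20}  [{node}]  {' '.join(right)}")
```

Sorting on 1R groups the lines by the word immediately following the search word, which makes recurring right-hand collocates easy to spot; the same hits sorted on 1L would group them by the preceding word instead.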

    ParaConc, a Mac/Windows concordance program for parallel texts. A version is available for free for research purposes (under license). 

    Conc, a Mac concordance program, is available via ftp from SIL. Also available by anonymous-ftp from clr.nmsu.edu (/clr.nmsu.edu:/CLR/tools/concordances).
    Indiana University LETRS Conc QuickGuide.

    Free Text, a Mac concordance program, should be available from the U. of Michigan site. Also available from ftp://nora.hd.uib.no/pub/mac/

    HUM, developed by William Tuthill, is available by anonymous-ftp from clr.nmsu.edu (/clr.nmsu.edu:/CLR/tools/concordances).

    Perl: Dan Melamed's Perl tools

    Tact. Available via ftp from University of Toronto (epas.utoronto.ca).
    Indiana University LETRS TACT QuickGuide
    World Wide Web implementation of TACT: TACTWeb. "TACTweb connects TACT to the World Wide Web -- making a TACT TDB database accessible to the entire WWW community." See also Elisabeth Burr's site.

    LEXA Corpus processing Software version 6 (for DOS) is available via ftp. This is a suite of programs for tagging, lemmatization, word frequency counts, etc.

    TextAnalyst. Commercial software that produces a semantic network on the basis of text input. The company, Megaputer, also produces a data mining tool, PolyAnalyst.

    ShoeBox Fieldwork oriented program. Information available from SIL.

    VisualText A suite of commercial text analysis tools.

    Word Cruncher Info available from WPT

    Paai's text utilities: A set of utilities consisting of Unix scripts and C programs for frequency counts and lexical cohesion.

    Taggers

    Eric Brill's program Ftp site.

    TOSCA/LOB tagger for DOS. Downloadable.

    AMALGAM Email tagging, conversion of tagsets, ...

    TreeTagger Language-independent HMM tagger. Parameter files for English, French, German.

    CRATER report. Discussion of a modified version of the Xerox Tagger.

    The Corpus Linguistics Group at the University of Birmingham has an experimental email tagger, QTAG. Texts can be sent via email to tagger@clg.bham.ac.uk.

    CoreLex -- a tagset and database for semantic tagging based on WordNet

    Online Papers, Theses, etc. Related to CL

    Theses

    Torbjörn Lager. Thesis: A Logical Approach to Computational Corpus Linguistics

    Books

    Pattern Grammar: A corpus-driven approach to the lexical grammar of English. Susan Hunston and Gill Francis. Studies in Corpus Linguistics 4.

    Patterns and Meanings: Using corpora for English language research and teaching. Alan Partington. Studies in Corpus Linguistics 2.

    Terms in Context. Jennifer Pearson. Studies in Corpus Linguistics 1.

    Text and Technology: In honour of John Sinclair. Mona Baker, Gill Francis and Elena Tognini-Bonelli (eds.). John Benjamins.

    Courses in Corpus Linguistics

    Eugene Charniak: Statistical course

    Elisabeth Burr: Korpuslinguistik course

    Tony Berber Sardinha: Corpus Linguistics courses: 1998-1999; 2000

    Mark Davies: History of the Spanish Language; Assignments and projects

    Chris Brew: Statistical NLP

    Javier Perez-Guerra: English linguistics (written in Galician)

    Useful Sites and Home Pages

    Centres and Departments

    Corpus Linguistics at Birmingham University, England.

    Center for Electronic Texts in the Humanities.

    Centre for English Corpus Linguistics, Louvain

    CTI Centre for Modern Languages Based in Hull, England. Newsletter, language software guide, info on language teaching.

    Oxford Text Archive

    Tuscan Word Centre

    Other Useful Sites

    Alex gopher site
    Alex allows users to find and retrieve the full-text of documents on the Internet.

    American National Corpus

    Annotation page at Upenn. Describes some 40 tools and formats for creating and managing linguistic annotations.

    Athena Large e-text site

    CHILDES Parent-child interactions.

    Tim Johns Classroom Concordancing Page.

    Collocations page

    Concordancing page

    Corpus Encoding Standards Coordinated by Nancy Ide

    ECI/MCI Multilingual corpus information

    Electronic Text Archive

    English Language Corpora

    the etext pages

    Human Languages Page at Willamette.

    Hong Liang Qiao's web-page

    Index of electronic text projects

    Literature in various languages. University of Virginia ETC. 

    MATE Project Annotation of spoken corpora

    SPIRE Text visualisation analysis

    Survey of English Usage An interesting page.

    TalkBank

    by Michael Barlow

    Published on 25/10/2011
