Many domains of linguistics use collections of language data (written or recorded) to look for patterns in languages or dialects. Such a collection is called a corpus. A corpus can be huge or tiny, and can be created as part of research or already exist as a public or private source. In this resource post, I provide a ton of links to learn more about various types of linguistics that use corpuses/corpora (pick your favourite plural): sociolinguistics, historical linguistics, typology, and language documentation/revitalization/fieldwork.*
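If you're wondering what "looking for patterns" in a corpus can look like in practice, here's a minimal sketch in Python. The three sentences are a toy corpus I made up for illustration (not real data), and `count_variants` is just a hypothetical helper name; a real study would load far more text and be more careful about tokenization.

```python
import re

# Toy corpus invented for illustration; a real study would load many texts.
toy_corpus = [
    "Two corpora were compared in this study.",
    "The corpuses differ in size.",
    "Both corpora contain transcribed speech.",
]

def count_variants(texts, variants):
    """Count how often each variant occurs as a whole word, ignoring case."""
    counts = {variant: 0 for variant in variants}
    for text in texts:
        # Very rough tokenization: lowercase runs of letters and apostrophes.
        for token in re.findall(r"[a-z']+", text.lower()):
            if token in counts:
                counts[token] += 1
    return counts

plural_counts = count_variants(toy_corpus, ["corpora", "corpuses"])
print(plural_counts)
```

Swapping in a different list of variants (say, two spellings or two competing words) gives the same kind of count for any alternation you're curious about.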
Sociolinguistics is about how people who speak the same language speak differently depending on various social and demographic factors (age, gender, social class, ethnicity, etc.). Dialectology tends to focus more on how dialects differ from place to place.
- Sites: basic sociolinguistics examples, terminology, resource page for a documentary on dialects of American English, class page with several links
- The classic “fourth floor” study: overview, original paper
- Concepts: the notion of prestige, politeness theory
- Pdfs: full introductory sociolinguistics textbook, a different full textbook on socioling, first part of a dialectology textbook, chapter on methods in dialectology
- Slides: dialects in the UK, short handout on socioling
- Internet linguistics, list of papers about Facebook, “tu” on Twitter, quotative “like”, “moar” and list of online corpora, Google ngrams
- Video: American vs British politeness
Historical linguistics is about how language changes over time, for example from Beowulf to Modern English, or attempting to reconstruct the original source language that several related languages are derived from. Etymology and philology are subfields of historical linguistics.
- Sites: intro/exercises, diagrams of language families
- Pdfs: overview of historical linguistics, examples of language change in English and Indo-European, examples of change in Chinese, history of historical linguistics
- Comparative reconstruction: description, more examples, exercises, more exercises
- Textbooks (limited previews): older historical linguistics book, slightly newer book, fairly recent book, very recent book
- Etymonline, and see the notes on this post for other etymological dictionaries.
- Fun: Why don’t we say “orangehead” instead of “redhead”?, Savage Chickens comic
Typology is about comparing various related and non-related languages in an attempt to figure out what characteristics are common to all languages, what characteristics are different, and how they vary. Together with historical linguistics, typology forms part of the larger field of comparative linguistics.
- Pdfs: introductory chapter, some differences between languages
- Sites: practice exercises, more exercises
- Textbooks (limited previews): Introduction to Typology, Introduction to Linguistic Typology
- The World Atlas of Language Structures (WALS), a huge online searchable database of properties of the world’s languages
- Ethnologue, information about the languages of the world (related languages/families, countries where they are spoken, number of speakers, etc.)
Language documentation is about recording and describing languages, generally those that have not historically been studied in as much detail and which may be endangered. Sometimes documentation may overlap with language revitalization, if the community that speaks the language is interested in using it more but it’s no longer commonly spoken in homes.
In an academic setting, documentation is generally introduced through field methods courses, where a class works with a speaker of a language that none of them has any previous knowledge of, and learns how to ask the speaker questions to figure out the structure of the language.
- Sites: endangeredlanguages.com, documentation course outline and notes, field methods course resources, language revitalization articles
- Pdfs: Defining Documentary Linguistics, ethical issues in fieldwork, more in-depth discussion of ethics
- Movies: The Linguists (documentation), We Still Live Here (revitalization)
- Books (limited preview): Describing Morphosyntax, Language Documentation: Practice and Values, review of Linguistic Fieldwork: A Practical Guide, How to Keep Your Language Alive
- CoLang, a summer institute on collaborative language research (scroll down for lots of documentation resources)
Documentation/fieldwork is normally learned by doing, so it’s not really possible to learn everything about it just by reading things online. You could also try reading a descriptive grammar (list of grammars) of an interesting-looking language or asking questions of a friend who speaks a language you don’t know, if they’re interested.
Content note: I’m not trying to assert any theoretical differences by putting something in the “corpus” or “experimental” posts, just trying to split up areas so one post doesn’t get too long. Corpus methods and experimental methods often overlap in each sub-field.
*Notes: Some of the links overlap in content, especially chapters and slides. This is deliberate, so if you don’t like how something is explained in one place, try somewhere else. Content is taken from a variety of sources, which may use slightly different theories or simplifications: don’t panic. Introductory linguistics courses vary in how much they cover corpus-related topics: some may not talk about them at all, while others may go into considerable detail in one or more areas. Reading everything would be closer to a full undergrad course in each of these sub-disciplines, so don’t feel like you have to work your way through absolutely everything. If you have questions about what you’re reading, you will probably get a faster response by posting in #linguistics, where multiple people can see you and reply, than by messaging me directly.
This post is part of a series on resources for teaching yourself linguistics. Previously: semantics, syntax, morphology, phonetics/phonology, why “protolinguist”, and my original protolinguist post. Next: experimental, descriptive grammar, philosophy of language/linguistic anthropology. Any comments/feedback very much appreciated, especially if you are trying to learn more about linguistics or if you have more (fun or serious) corpus links to add. Posts will be tagged #linguistics and #protolinguist, and I’ll be checking both.