Urban Digital Linguistics

Jung Hwan Kim
4 min read · Jun 23, 2022

Linguistics is the scientific study of language. Today, with advances in technology, large collections of text are compiled for research in fields such as data science and natural language processing (NLP). Text data are qualitative and unstructured: words carry different meanings and are open to interpretation depending on syntax and semantics.

Language is volatile. New acronyms form, words are coined within niche communities, and social media shapes the way people use words. According to the Global Language Monitor, about 14.7 new English words are created every day (Number of Words in English Archives, 2021).

According to the Linguistic Society of America, language changes because “the needs of its speakers change. New technologies, new products, and new experiences require new words to refer to them clearly and efficiently” (Birner, 2021). So, if researchers were to study the modern use of internet language, which kind of source should they rely on? If new vocabulary forms in a new medium, there should be a new medium in which to find its intended meaning. Many studies point to crowdsourcing.

Crowdsourcing involves contributions from large groups of dispersed participants. Crowdsourced dictionaries and encyclopedias have long been present online, allowing the public to share its knowledge of specific fields, jargon, or even slang. For example, Urban Dictionary is a crowdsourced dictionary of vocabulary used on the internet.

Though the dictionary is best known for defining vulgar language and slang, it shows how internet language is formed. Lauren Squires, a linguistics researcher, pointed out in her paper ‘Enregistering Internet Language’ (2010) that Urban Dictionary is an accumulation of internet language and has the potential to serve as a source for finding discrepancies between proper and improper language.

When constructing a corpus, the semantics of its words matter. To find the meanings of words in a corpus filled with internet language, leaning on crowdsourced dictionaries and encyclopedias may be a viable approach.
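To make the idea concrete, here is a minimal Python sketch of looking up internet terms in a crowdsourced glossary. It assumes the glossary has already been collected into a plain dictionary; the `GLOSSARY` entries and the function names are hypothetical and hand-written for illustration, and nothing here calls Urban Dictionary itself.

```python
import re

# Hypothetical, hand-written glossary standing in for entries
# collected from a crowdsourced dictionary (illustrative only).
GLOSSARY = {
    "lol": "laughing out loud",
    "brb": "be right back",
    "imo": "in my opinion",
}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def annotate(text, glossary):
    """Pair each token with its glossary definition, or None if it has no entry."""
    return [(token, glossary.get(token)) for token in tokenize(text)]


def coverage(text, glossary):
    """Return the fraction of tokens that have a glossary entry."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(token in glossary for token in tokens) / len(tokens)


if __name__ == "__main__":
    for token, meaning in annotate("brb, that was great lol", GLOSSARY):
        print(token, "->", meaning)
```

Tokens without an entry come back paired with `None`, which makes it easy to see which parts of a corpus the glossary does not yet cover.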

Another platform that makes use of crowdsourcing is Wikipedia, a free online encyclopedia written and edited by contributors all around the world. Users can write new entries about technologies, phenomena, and items that they believe belong in the encyclopedia, so many of its articles are up to date. Because of its editing system, content is revised often, and users debate content integrity. Wikipedia is rich in both content and volume.

By the end of 2015, English Wikipedia alone had more than five million articles (Zachte, 2019). Because of its volume and authenticity, the Wikipedia text dump is often used as a corpus to train artificial intelligence models. In natural language processing, research such as ‘BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding’ has used the Wikipedia corpus. The paper states that it used “the BooksCorpus (800M words) and English Wikipedia (2,500M words)” (Devlin et al., 2018). A corpus built through crowdsourcing has rich content and numerous articles and, most importantly, it embodies how language is used by the public.
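As a rough illustration of what measuring a corpus in words involves, the Python sketch below counts articles, word tokens, and distinct word types in a collection of plain-text articles. The `corpus_stats` function and the two sample sentences are assumptions made for this example; a real Wikipedia dump would first need its markup stripped, which is not shown here.

```python
import re
from collections import Counter


def corpus_stats(articles):
    """Count articles, word tokens, and distinct word types.

    `articles` is any iterable of plain-text strings, so a large
    corpus can be streamed instead of loaded into memory at once.
    """
    counts = Counter()
    n_articles = 0
    for text in articles:
        n_articles += 1
        # Naive tokenization: lowercase, then take runs of word characters.
        counts.update(re.findall(r"\w+", text.lower()))
    return {
        "articles": n_articles,
        "tokens": sum(counts.values()),
        "types": len(counts),
        "counts": counts,
    }


if __name__ == "__main__":
    # Two made-up sentences standing in for article text.
    sample = [
        "Language is always changing.",
        "Language change is constant.",
    ]
    stats = corpus_stats(sample)
    print(stats["articles"], stats["tokens"], stats["types"])
```

The tokenizer here is deliberately simple; word counts reported for published corpora depend on the tokenization used, so figures from this sketch would not be directly comparable to them.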

Language is always changing, and new measures have to be taken to respond to that change. Today, crowdsourced platforms like Urban Dictionary and Wikipedia are valuable sources for studying the semantics and usage of language, and of internet language in particular.

Urban Dictionary provides definitions of informal and nonstandard jargon, slang, and acronyms. The platform serves as a key for deciphering the ever-growing internet lexicon. Wikipedia offers a large volume of text data written and edited by enthusiasts. Crowdsourcing has become an important source for constructing a corpus for urban digital linguistics.

References:

[1] Ciarán Ó Duibhín. (2014). Using the IMS Corpus Workbench. IMS. https://www3.smo.uhi.ac.uk/oduibhin/oideasra/interfaces/wincwb.htm

[2] Number of Words in English Archives. (2021, August 15). The Global Language Monitor. https://languagemonitor.com/number-of-words-in-english/

[3] Birner, Betty. (2021). Is English Changing? | Linguistic Society of America. Linguistic Society of America. Retrieved June 12, 2022, from https://www.linguisticsociety.org/content/english-changing

[4] Squires, Lauren. “Enregistering Internet Language.” Language in Society, vol. 39, no. 4, 2010, pp. 457–92. JSTOR, http://www.jstor.org/stable/40925792. Accessed 12 Jun. 2022.

[5] Erik Zachte. (2019). Wikipedia Statistics — Tables — All languages. Wikipedia Statistics. https://stats.wikimedia.org/EN/TablesWikipediaZZ.htm

[6] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (Version 2). arXiv. https://doi.org/10.48550/ARXIV.1810.04805

Jung Hwan Kim

I’m a student studying computer science and cognitive science at Yonsei University. My interest is in NLP, data science, and finance.