Towards Augmenting Lexical Resources for Slang and African American English

Abstract

Researchers in natural language processing have developed large, robust resources for understanding formal Standard American English (SAE), but we lack similar resources for variations of English, such as slang and African American English (AAE). In this work, we use word embeddings and clustering algorithms to group semantically similar words in three datasets, two of which contain a high incidence of slang and AAE. Since high-quality clusters would contain related words, we could also infer the meaning of an unfamiliar word based on the meanings of words clustered with it. After clustering, we compute precision and recall scores using WordNet and ConceptNet as gold standards and show that these scores are unimportant when the given resources do not fully represent slang and AAE. Amazon Mechanical Turk and expert evaluations show that clusters with low precision can still be considered high quality, and we propose the new Cluster Split Score as a metric for machine-generated clusters. These contributions emphasize the gap in natural language processing research for variations of English and motivate further work to close it.
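To make the evaluation problem concrete, here is a minimal sketch of scoring clusters against a gold standard. It assumes a pairwise definition of precision and recall, which the abstract does not specify and which may differ from the paper's actual scoring. The function names, the toy clusters, and the gold pairs are all hypothetical; the gold pairs are not drawn from WordNet or ConceptNet.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Return the set of unordered word pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        # Sort so each pair has one canonical (alphabetical) form.
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision_recall(clusters, gold_pairs):
    """Score clusters against gold related-word pairs (assumed definition)."""
    predicted = cluster_pairs(clusters)
    gold = {tuple(sorted(pair)) for pair in gold_pairs}
    true_positives = predicted & gold
    precision = len(true_positives) / len(predicted) if predicted else 0.0
    recall = len(true_positives) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical machine-generated clusters containing slang terms.
clusters = [["lit", "fire", "dope"], ["car", "whip", "ride"]]

# Hypothetical gold standard that covers only standard senses.
gold_pairs = [("car", "ride"), ("fire", "flame")]

precision, recall = pairwise_precision_recall(clusters, gold_pairs)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

In this toy case both clusters look coherent to a reader who knows the slang senses, yet precision is low because the gold standard lists only one of the six clustered pairs. That is the kind of mismatch the abstract describes when the reference resource does not fully represent slang and AAE.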

Publication
In Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties, and Dialects
Alyssa Hwang
PhD Student

I am a first-year PhD student in the Department of Computer and Information Science at the University of Pennsylvania. I am particularly interested in the intersections of Natural Language Processing, Linguistics, and Psychology, especially expanding NLU resources for nonstandard English. I am supported by the NSF Graduate Research Fellowship Program. I earned my BS in Computer Science at Columbia University, where I conducted research and wrote an undergraduate thesis with Prof. Kathleen McKeown.
