2020
Aguilar, Gustavo; Kar, Sudipta; Solorio, Thamar
LinCE: A Centralized Linguistic Code-Switching Evaluation Benchmark
Proceedings of the Twelfth International Conference on Language Resources and Evaluation, LREC, 2020.
@conference{aguilar20_lince,
title = {LinCE: A Centralized Linguistic Code-Switching Evaluation Benchmark},
author = {Gustavo Aguilar and Sudipta Kar and Thamar Solorio},
url = {https://www.aclweb.org/anthology/2020.lrec-1.223.pdf},
year = {2020},
date = {2020-05-11},
booktitle = {Proceedings of the Twelfth International Conference on Language Resources and Evaluation},
publisher = {European Language Resources Association (ELRA)},
abstract = {Recent trends in NLP research have raised an interest in linguistic code-switching (CS); modern approaches have been proposed to solve a wide range of NLP tasks on multiple language pairs. Unfortunately, these proposed methods are hardly generalizable to different code-switched languages. In addition, it is unclear whether a model architecture is applicable for a different task while still being compatible with the code-switching setting. This is mainly because of the lack of a centralized benchmark and the sparse corpora that researchers employ based on their specific needs and interests. To facilitate research in this direction, we propose a centralized benchmark for \textbf{Lin}guistic \textbf{C}ode-switching \textbf{E}valuation (\textbf{LinCE}) that combines ten corpora covering four different code-switched language pairs (i.e., Spanish-English, Nepali-English, Hindi-English, and Modern Standard Arabic-Egyptian Arabic) and four tasks (i.e., language identification, named entity recognition, part-of-speech tagging, and sentiment analysis). As part of the benchmark centralization effort, we provide an online platform at \texttt{ritual.uh.edu/lince}, where researchers can submit their results while comparing with others in real-time. In addition, we provide the scores of different popular models, including LSTM, ELMo, and multilingual BERT, so that the NLP community can compare against state-of-the-art systems. LinCE is a continuous effort, and we will expand it with more low-resource languages and tasks.},
keywords = {benchmark, Code-Switching},
pubstate = {published},
tppubtype = {conference}
}