LinCE Benchmark

Acknowledgements

This benchmark was made possible by the contributions of many researchers around the globe. To recognize their work, we list the relevant papers along with their BibTeX entries. Please cite them if you use any of the related datasets.

Please log in to download the datasets.

NOTE: The data in LinCE has been stratified differently from the original datasets. Please check our paper for more details.

CALCS Shared Tasks: Machine Translation (MT)

English - Hinglish (ENG - HINGLISH)

[Paper] [Data] [Bibtex]

[License Agreement]: Twitter data is distributed for non-commercial use and for research purposes only, following Twitter's own Developer Agreement and Policy.

English - Spanish (ENG - SPA)

[Paper] [Data] [Bibtex]

[License Agreement]: Twitter data is distributed for non-commercial use and for research purposes only, following Twitter's own Developer Agreement and Policy.

English - Spanglish (ENG - SPANGLISH)

[Paper] [Data] [Bibtex]

[License Agreement]: Twitter data is distributed for non-commercial use and for research purposes only, following Twitter's own Developer Agreement and Policy.

Spanglish - English (SPANGLISH - ENG)

[Paper] [Data] [Bibtex]

[License Agreement]: Twitter data is distributed for non-commercial use and for research purposes only, following Twitter's own Developer Agreement and Policy.

Spanglish - Spanish (SPANGLISH - SPA)

[Paper] [Data] [Bibtex]

[License Agreement]: Twitter data is distributed for non-commercial use and for research purposes only, following Twitter's own Developer Agreement and Policy.

Modern Standard Arabic-Egyptian Arabic - English (MSAEA - ENG)

[Paper] [Data] [Bibtex]

[License Agreement]: Twitter data is distributed for non-commercial use and for research purposes only, following Twitter's own Developer Agreement and Policy.

Modern Standard Arabic-Egyptian Arabic - Spanish (MSAEA - SPA)

[Paper] [Data] [Bibtex]

[License Agreement]: Twitter data is distributed for non-commercial use and for research purposes only, following Twitter's own Developer Agreement and Policy.

English - Modern Standard Arabic-Egyptian Arabic (ENG - MSAEA)

[Paper] [Data] [Bibtex]

[License Agreement]: Twitter data is distributed for non-commercial use and for research purposes only, following Twitter's own Developer Agreement and Policy.

Language Identification (LID)

Spanish - English (SPA - ENG)

Overview for the Second Shared Task on Language Identification in Code-Switched Data (CALCS 2016)

Giovanni Molina, Nicolas Rey-Villamizar, Thamar Solorio, Fahad AlGhamdi, Mahmoud Ghoneim, Abdelati Hawwari, Mona Diab

[Paper] [Data] [Bibtex]

Hindi - English (HIN - ENG)

Language Identification and Analysis of Code-Switched Social Media Text

Deepthi Mave, Suraj Maharjan, Thamar Solorio

[Paper] [Data] [Bibtex]

Nepali - English (NEP - ENG)

Overview for the First Shared Task on Language Identification in Code-Switched Data (CALCS 2014)

Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, Pascale Fung

[Paper] [Data] [Bibtex]

Modern Standard Arabic - Egyptian Arabic (MSA - EA)

Overview for the Second Shared Task on Language Identification in Code-Switched Data (CALCS 2016)

Giovanni Molina, Nicolas Rey-Villamizar, Thamar Solorio, Fahad AlGhamdi, Mahmoud Ghoneim, Abdelati Hawwari, Mona Diab

[Paper] [Data] [Bibtex]

Part-of-Speech Tagging (POS)

Spanish - English (SPA - ENG)

Part of Speech Tagging for Code Switched Data

Fahad AlGhamdi, Giovanni Molina, Mona Diab, Thamar Solorio, Abdelati Hawwari, Victor Soto, Julia Hirschberg

[Paper] [Data] [Bibtex]

Hindi - English (HIN - ENG)

A Twitter Corpus for Hindi-English Code Mixed POS Tagging

Kushagra Singh, Indira Sen, Ponnurangam Kumaraguru

[Paper] [Data] [Bibtex]

Named Entity Recognition (NER)

Spanish - English (SPA - ENG)

Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task

Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona Diab, Julia Hirschberg, Thamar Solorio

[Paper] [Data] [Bibtex]

Modern Standard Arabic - Egyptian Arabic (MSA - EA)

Named Entity Recognition on Code-Switched Data: Overview of the CALCS 2018 Shared Task

Gustavo Aguilar, Fahad AlGhamdi, Victor Soto, Mona Diab, Julia Hirschberg, Thamar Solorio

[Paper] [Data] [Bibtex]

Hindi - English (HIN - ENG)

Language Identification and Named Entity Recognition in Hinglish Code Mixed Tweets

Kushagra Singh, Indira Sen, Ponnurangam Kumaraguru

[Paper] [Data] [Bibtex]

Sentiment Analysis (SA)

Spanish - English (SPA - ENG)

SemEval-2020 Task 9: Overview of Sentiment Analysis of Code-Mixed Tweets

Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, Amitava Das

[Paper] [Data] [Bibtex]