Survey of part-of-speech tagger for mixed-code Indian and foreign language used in social media

Received Apr 29, 2019; Revised Aug 28, 2019; Accepted Oct 6, 2019

A part-of-speech tagger (POS tagger) is a tool that reads text in a given language and assigns a part of speech, such as verb, adjective, or noun, to each word (and other token); computational applications generally use more fine-grained POS tags such as 'noun-plural'. In essence, the goal of a POS tagger is to assign linguistic (mostly grammatical) information to sub-sentential units called tokens, which include words as well as symbols (e.g. punctuation). This paper presents a survey of POS taggers for code-mixed Indian and foreign languages. The methods, procedures, and features required to devise a POS tagger for code-mixed languages, especially Indian ones, are studied, and the related observations are reported.


INTRODUCTION
Communication on social media is often mixed in nature: individuals blend their regional language with English, and this practice has become extremely popular. Natural language processing (NLP) aims to extract information from such texts, and part-of-speech (POS) tagging plays a key role in interpreting the written text. One purpose of POS tagging is to disambiguate homonyms. Taggers draw on several kinds of information, including dictionaries, lexicons, and rules. A word may belong to more than one category, and a lexicon lists the category or categories of each word. For example, the word address is both a verb and a noun. Taggers use probabilistic evidence to resolve this ambiguity for the word in context. A POS tagger can serve as a preprocessor in text processing. Text retrieval and indexing require POS information, speech processing uses POS tags to choose a pronunciation, and POS taggers are also used to build tagged corpora.
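The ambiguity of a word such as address can be illustrated with a minimal sketch. The lexicon, the tag names, and the counts below are invented for illustration and are not taken from any of the surveyed systems; the tagger greedily picks, for each word, the candidate tag seen most often after the previous tag.

```python
# Toy lexicon: each word lists the categories it may belong to (invented data).
LEXICON = {
    "my": ["PRON"],
    "to": ["PART"],
    "address": ["NOUN", "VERB"],
}

# Hypothetical counts of (previous tag, current tag) pairs from a tagged corpus.
TRANSITION_COUNTS = {
    ("PRON", "NOUN"): 90, ("PRON", "VERB"): 10,
    ("PART", "NOUN"): 5, ("PART", "VERB"): 95,
}


def tag(words):
    """Greedy left-to-right tagging that uses the previous tag as evidence."""
    tags = []
    prev = "START"
    for word in words:
        # Unknown words fall back to NOUN (an arbitrary choice for this sketch).
        candidates = LEXICON.get(word.lower(), ["NOUN"])
        best = max(candidates, key=lambda t: TRANSITION_COUNTS.get((prev, t), 0))
        tags.append(best)
        prev = best
    return tags


print(tag(["my", "address"]))  # address is resolved as a noun
print(tag(["to", "address"]))  # address is resolved as a verb
```

Real taggers replace the greedy choice with a search over the whole sentence, but the source of evidence (counts from tagged text) is the same.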
Computational processing of code-switched text was first attempted in the early 1980s [1], whereas code-switching in social media text began to be studied in the late 1990s [2]. Still, code-switching in conventional texts was too rare to attract much interest from the computational linguistics research community, and only recently did it emerge as a research topic in its own right, with a code-switching workshop at EMNLP 2014 [3]. Solorio and Liu [4] proposed a simple but elegant solution: tag mixed-code English-Spanish text twice, once with a tagger for each language, and then combine the output of the language-specific taggers to obtain the optimal word-level tags [5]. A POS tagging system for English-Hindi mixed-code social media content has been presented in [5]. Work has also been done on English-Bengali and English-Hindi data. Nelakuditi et al. [6] performed two kinds of experiments: the first built POS taggers based on machine learning, and the second combined the POS taggers of the individual languages [7].
ISSN: 2252-8814  Survey of part-of-speech tagger for mixed-code Indian and foreign language used in … (Bhushan Nikam)

POS taggers have been designed for various languages, but for code-mixed Indian and foreign languages very little work has been done so far, and the reported accuracies remain unsatisfactory. This paper reviews that work and is organized into the following four sections. Section 2 describes the approaches and techniques used to implement POS taggers for code-mixed Indian languages, and Section 3 those for foreign languages. Section 4 presents the challenges of implementing a code-mixed POS tagger, and Section 5 concludes.

VARIOUS APPROACHES AND TECHNIQUES USED TO IMPLEMENT CODE-MIXED POS TAGGER FOR INDIAN AND FOREIGN DIALECTS
India is home to a large number of languages. Language contact and dialectal variety lead to frequent code-mixing in India. Indians are therefore multilingual by habit and by necessity, and they frequently switch and mix languages on social media, which poses additional problems for the automatic processing of Indian social media text. Code-mixed part-of-speech (CM-POS) tagging is an essential requirement for any NLP application in this context. Accordingly, I report on the approaches and techniques used to implement code-mixed POS taggers for Indian and foreign languages. Jamatia and Das [5] applied four machine learning classification algorithms to the task: Conditional Random Fields (CRF), Sequential Minimal Optimization (SMO), Naïve Bayes (NB), and Random Forests (RF). For CRF they used the MIRALIUM implementation, whereas the other three were the implementations in WEKA. They reported the performance of all the ML methods on the complete dataset (2,583 utterances) after 5-fold cross-validation, using both fine-grained (FG) and coarse-grained (CG) tag sets, and observed that all the ML methods have additional difficulty with HI-EN alternation.
In their machine-learning-based experiment, Nelakuditi et al. [6] built POS taggers with three machine learning techniques, namely Support Vector Machines (SVM), Bayes classification (Bay), and Conditional Random Fields (CRF), with different combinations and variations. In the second experiment, which combined the POS taggers of the individual languages, CMU's Twitter POS tagger was used for English and the POS tagger developed at LTRC, which is part of the shallow parser tool, was used for Telugu; the resulting accuracies were then reported.
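The idea of combining the taggers of the individual languages can be sketched as follows. This is only an illustration under stated assumptions: the two toy taggers stand in for real monolingual taggers (they are not the CMU or LTRC tools), the romanized words, tags, and confidence values are invented, and choosing the higher-confidence tag is one plausible combination rule rather than the one used in [6].

```python
from typing import Callable, List, Tuple

# One (tag, confidence) pair per token.
TaggerOutput = List[Tuple[str, float]]


def toy_english_tagger(tokens: List[str]) -> TaggerOutput:
    """Stand-in for an English tagger: confident only on words it knows."""
    lexicon = {"movie": "NOUN", "was": "VERB", "good": "ADJ"}
    return [(lexicon[t], 0.9) if t in lexicon else ("X", 0.1) for t in tokens]


def toy_telugu_tagger(tokens: List[str]) -> TaggerOutput:
    """Stand-in for a Telugu tagger over romanized text (invented lexicon)."""
    lexicon = {"chala": "ADV", "bagundi": "VERB"}
    return [(lexicon[t], 0.9) if t in lexicon else ("X", 0.1) for t in tokens]


def combine(tokens: List[str],
            taggers: List[Callable[[List[str]], TaggerOutput]]) -> List[str]:
    """Run every tagger on the full sentence and keep the most confident tag."""
    outputs = [tagger(tokens) for tagger in taggers]
    combined = []
    for per_token in zip(*outputs):
        best_tag, _ = max(per_token, key=lambda pair: pair[1])
        combined.append(best_tag)
    return combined


print(combine(["movie", "chala", "bagundi"],
              [toy_english_tagger, toy_telugu_tagger]))
```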
Kamal Sarkar [7] developed an HMM-based POS tagging system built on a trigram Hidden Markov Model that uses information from the vocabulary and some other word-level features to improve the observation probabilities of known as well as unknown tokens. He submitted results for the Hindi-English, Bengali-English, and Tamil-English language pairs. His system was trained and tested on the datasets provided for the ICON 2015 shared task. In the constrained mode, it achieved an average overall accuracy (averaged over all three language pairs) of 75.60%, which is very close to the two participating systems that ranked above it (76.79% for IIITH and 75.79% for AMRITA_CEN). In the unconstrained mode, it achieved an average overall accuracy of 70.65%, which is also close to that of the system with the highest average overall accuracy (72.85% for AMRITA_CEN).
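For reference, the standard trigram HMM tagging objective can be written as below. The notation is generic and is not taken from [7]: $w_i$ is the $i$-th token, $t_i$ its tag, and $t_{-1}$, $t_0$ are assumed sentence-start padding symbols.

```latex
\hat{t}_{1:n} = \arg\max_{t_1, \dots, t_n} \prod_{i=1}^{n} P(t_i \mid t_{i-2}, t_{i-1}) \, P(w_i \mid t_i)
```

The first factor is the tag transition probability and the second is the probability of observing the token given its tag; word-level features are typically used to estimate the second factor for tokens that were not seen in training.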
Vyas et al. [8] conducted three experiments. In the first, POS tagging is performed assuming that the language identities and the normalized/transliterated forms of the words are known. This indicates the accuracy that POS tagging could reach if normalization, transliteration, and language identification were done perfectly. The experiments used two different POS taggers for English: the Stanford POS tagger and the Twitter POS tagger. In the second experiment, only the language identity of the words is assumed to be known; for Hindi, their own model is applied to generate the back-transliterations, and for English, the Twitter POS tagger is applied directly to handle social media text. In the third experiment, nothing is assumed to be known: a language identifier is applied first and, depending on the language detected, either the Hindi transliteration module followed by the Hindi POS tagger or the English tagger is applied. They also noted that, although matrix-language information is not used in any of their experiments, it could be useful for POS tagging and could be explored in future work.
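The third setting, in which nothing is known in advance, amounts to a pipeline of language identification, back-transliteration, and language-specific tagging. The sketch below shows only the control flow; every table and function name in it is hypothetical, and simple dictionary lookups stand in for the trained components used in [8].

```python
# Invented toy resources; real systems use trained models for each step.
HINDI_ROMAN_TO_DEVANAGARI = {"bahut": "बहुत", "accha": "अच्छा"}
HINDI_TAGS = {"बहुत": "ADV", "अच्छा": "ADJ"}
ENGLISH_TAGS = {"the": "DET", "movie": "NOUN", "was": "VERB"}


def identify_language(token):
    """Toy language identifier: Hindi if the romanized form is in our table."""
    return "hi" if token.lower() in HINDI_ROMAN_TO_DEVANAGARI else "en"


def back_transliterate(token):
    """Map a romanized Hindi token back to Devanagari."""
    return HINDI_ROMAN_TO_DEVANAGARI[token.lower()]


def tag_hindi(word):
    return HINDI_TAGS.get(word, "X")


def tag_english(word):
    return ENGLISH_TAGS.get(word.lower(), "X")


def tag_code_mixed(tokens):
    """Route each token through the tagger for its detected language."""
    result = []
    for token in tokens:
        lang = identify_language(token)
        if lang == "hi":
            pos = tag_hindi(back_transliterate(token))
        else:
            pos = tag_english(token)
        result.append((token, lang, pos))
    return result


for row in tag_code_mixed(["movie", "was", "bahut", "accha"]):
    print(row)
```

Because each stage consumes the output of the previous one, an error in language identification propagates to transliteration and tagging, which is why the first experiment with perfect inputs serves as an upper bound.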
For constrained and unconstrained training and result submission, Pimpale and Patel [9] used the Stanford POS tagger and the machine learning algorithms Decision Tree J48, Decision Tree Random Forest, Naive Bayes, and Multilayer Perceptron, respectively. They concluded that the method performs well for the constrained submission, but that the lack of good-quality training data limits what can be achieved with it; using distributed vector representations of words in feature engineering would allow them to exploit unlabeled data for training.
Sequiera et al. [10] explored machine learning approaches to POS tagging of Hindi (Hi)-English (En) code-mixed text from social media, starting by repeating the experiments described in [8] and [4] and reconfirming the results on the dataset. Extending the feature set used by Solorio and Liu [4] and running numerous feature selection experiments, they proposed and conducted a POS-tagging and joint … Kamal Sarkar [11] also proposed a POS tagging system for social media texts. It is based on Conditional Random Fields (CRF) trained with a rich feature set that includes contextual features, orthographic features, punctuation features, and word-length features. He concluded that his system performs well across all three Bengali-English-Hindi language pairs. He expected that a proper choice of features, together with a suitable combination of machine learning algorithms, would improve the performance of his system.
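The four feature groups named above (contextual, orthographic, punctuation, and word length) can be made concrete with a small feature extractor. The specific features and their names are assumptions chosen for illustration; they are not the feature set of [11].

```python
import string


def token_features(tokens, i):
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    return {
        # Contextual features: the token and its neighbours.
        "word.lower": word.lower(),
        "prev.lower": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next.lower": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
        # Orthographic features.
        "is_capitalized": word[:1].isupper(),
        "has_digit": any(c.isdigit() for c in word),
        "starts_with_hash_or_at": word[:1] in {"#", "@"},
        # Punctuation feature: true only if every character is punctuation.
        "is_punctuation": bool(word) and all(c in string.punctuation for c in word),
        # Word-length feature.
        "length": len(word),
    }


sentence = ["Yeh", "movie", "#awesome", "hai", "!"]
print(token_features(sentence, 2))
```

A dictionary of this kind, one per token, is the usual input format for CRF toolkits that accept named features.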
Sharma and Motlani [12] experimented with code-mixed POS tagging of Indian social media text using machine learning techniques. Their constrained POS tagger achieved an accuracy of 75.04% when evaluated on the new test dataset. Their unconstrained system, which uses additional resources, outperformed the constrained one with an accuracy of 80.68%. For training and testing both types of systems they used ten-fold cross-validation, and they determined the best model parameter values by a grid search over all the parameters. Finally, for the other two pairs, BN-EN (Bengali-English) and TA-EN (Tamil-English), the submitted constrained systems achieved accuracies of 79.84% and 75.48%, respectively. Sisodiya [13] followed a pipeline approach in which language identification, back-transliteration, and POS tagging were handled, respectively, by a logistic classifier and CRF, the Google API, and a CRF++-based Hindi POS tagger developed by IIT Kharagpur.
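Ten-fold cross-validation combined with a grid search can be sketched in a few lines. This is a generic illustration, not the setup of [12]: the function names are hypothetical, and `train_and_score` is an assumed callback that trains a model with the given parameters on the training fold and returns its accuracy on the held-out fold.

```python
from itertools import product


def k_fold_indices(n_items, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


def grid_search(param_grid, data, k, train_and_score):
    """Return the parameter combination with the best mean k-fold score."""
    best_params, best_score = None, float("-inf")
    names = sorted(param_grid)
    # Try every combination of the listed parameter values.
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for train_idx, test_idx in k_fold_indices(len(data), k):
            train = [data[i] for i in train_idx]
            test = [data[i] for i in test_idx]
            scores.append(train_and_score(params, train, test))
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Each parameter combination is scored on all k held-out folds, so the selected values are not tuned to a single train/test split.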
Singh and Kanskar [14] employed supervised word-level classification with and without contextual clues and sequence labeling using Conditional Random Fields, in addition to a simple unsupervised dictionary-based method. A simple language-detection-based heuristic is used in which the text is first split into chunks of tokens belonging to the same language; each chunk is then assigned its language and tagged by the POS tagger for that language. In addition, after language identification and transliteration the text is tagged by an English monolingual tagger, and one of the two tags for a word is then selected by heuristics based on the output of several language detection techniques.
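Splitting a sentence into chunks of consecutive same-language tokens is straightforward once each token carries a language label. The sketch below assumes the labels are already available as (token, language) pairs; the function name and the sample data are invented for illustration.

```python
from itertools import groupby


def split_into_language_chunks(labeled_tokens):
    """Group consecutive tokens that share a language label.

    labeled_tokens: list of (token, language) pairs.
    Returns a list of (language, [tokens]) chunks in sentence order.
    """
    return [
        (lang, [token for token, _ in group])
        for lang, group in groupby(labeled_tokens, key=lambda pair: pair[1])
    ]


labeled = [("yeh", "hi"), ("movie", "en"), ("was", "en"),
           ("bahut", "hi"), ("accha", "hi")]
for lang, chunk in split_into_language_chunks(labeled):
    print(lang, chunk)
```

Each chunk can then be passed to the tagger for its language, which preserves local context within the chunk but loses it across switch points.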
Ghosh et al. [15] listed the steps involved in the POS tagging task using the CRF++ toolkit and the Stanford POS tagger, including chunking and lexicons for the dominant languages. They also concluded that the Bengali-English and Hindi-English results are higher than the Tamil-English results because of differences in the labels used in the Tamil-English gold standard files.
Barman [16] divided the experiment into four parts: baselines for POS tagging, pipeline systems, stacking systems, and a joint model. The experiments used five-fold cross-validation on the data and reported the average cross-validation accuracy. They investigated hand-crafted features as well as features obtained from monolingual POS taggers (stacking), and experimented with different combinations of these feature sets. They described a trilingual code-mixed corpus with POS annotation. Performing POS tagging with state-of-the-art methods and investigating a factorial CRF (FCRF)-based joint model, they found that the best stacking method (S2), which uses the joint features, performs better than the joint model (FCRF) and the pipeline systems. They also observed that joint modeling outperforms the pipeline systems in their experiments. FCRF falls behind the best POS tagging system, S2; more training data might help FCRF outperform S2.
Gupta et al. [17] proposed a system that uses a comprehensive set of features for POS tagging. The feature set was used to build a POS model, with a conditional random field (CRF) as the underlying classifier. CRF++, an implementation of CRF, was used to run the experiments. Because CRF++ requires an explicit feature template, a series of cross-validated experiments was run on the training data to find the optimal template. However, they tuned the feature template on the English-Hindi data set only and used the optimal model for all three code-mixed language pairs (English-Hindi, English-Bengali, and English-Telugu). Bhargava et al. [18,19] tried similar approaches to implement POS taggers for the English-Telugu, English-Hindi, and English-Bengali language pairs, with slight variations, and reported the resulting accuracies.
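To give an idea of what a CRF++ feature template looks like, a small illustrative template follows. It is not the tuned template of [17]. It assumes a training file whose column 0 holds the token and whose column 1 holds a per-token language label; that second column is a hypothetical addition for this example. In the `%x[row,col]` macro, `row` is the position relative to the current token and `col` is the column index.

```
# Unigram features over the token column (column 0)
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[-1,0]/%x[0,0]
U06:%x[0,0]/%x[1,0]

# Unigram features over the assumed language-label column (column 1)
U10:%x[0,1]
U11:%x[-1,1]/%x[0,1]

# Bigram feature over adjacent output tags
B
```

Tuning the template means adding or removing lines of this kind and comparing cross-validated accuracy for each variant.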

VARIOUS APPROACHES AND TECHNIQUES USED TO IMPLEMENT CODE-MIXED POS TAGGER FOR FOREIGN LANGUAGES
Few efforts to implement code-mixed POS taggers for foreign languages have been seen so far. Solorio and Liu [4] predicted potential code-switching points as a step toward more accurate systems for processing code-mixed English-Spanish text. Such mixing of languages is rarely found in the world outside India.

CHALLENGES TO IMPLEMENT CODE-MIXED POS TAGGER
Building code-mixed POS (CM-POS) taggers for Indian languages is a particularly interesting problem in computational linguistics because of the lack of accurately annotated training corpora. POS tagging requires more sophisticated language processing techniques that are capable of drawing inferences from subtler linguistic information. From a linguistic perspective, meaning arises from the differences between linguistic units, including words, phrases, and so on. These differences are of two types: paradigmatic (concerning substitution) and syntagmatic (concerning positioning). All of these differences also need to be considered when implementing a code-mixed POS tagger.

CONCLUSION
The survey shows that, in general, researchers use various machine learning techniques together with existing POS taggers to implement CM POS taggers for Indian and foreign languages. Considerably more work has begun on code-mixed Indian languages. However, a working tool for code-mixed POS tagging is not yet available on the internet.