Design and Development of Unsupervised Stemmer for Sindhi Language
Abstract
A stemmer is a fundamental NLP tool that normalizes inflected words, i.e., removes their suffixes. This paper presents a stemmer designed and developed for the Sindhi language using an unsupervised approach. Suffixes are extracted with “Linguistica 5” [22], a tool for unsupervised learning of morphology, from a raw corpus of 10,000 Sindhi sentences. The unsupervised stemmer is evaluated using the direct approach on 1,000 words extracted from a Sindhi dictionary, and its results are compared with an existing rule-based stemmer [32] and a lemmatizer [33].
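
As a rough illustration of the idea, the following minimal Python sketch strips the longest matching suffix from a word and scores the output against gold stems. It is not the implementation described in this paper: the function names, the `min_stem_len` parameter, and the toy English suffixes and word pairs are all hypothetical, and it assumes that direct evaluation means comparing each produced stem with a reference stem. In practice the suffix list would come from an unsupervised learner such as Linguistica 5.

```python
def stem(word, suffixes, min_stem_len=2):
    """Strip the longest matching suffix, keeping at least min_stem_len characters."""
    # Try longer suffixes first so that "ing" wins over a shorter overlapping suffix.
    for suffix in sorted(suffixes, key=lambda s: (-len(s), s)):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: len(word) - len(suffix)]
    return word  # no suffix applies; the word is returned unchanged


def direct_accuracy(pairs, suffixes):
    """Fraction of (word, gold_stem) pairs for which stem() returns the gold stem."""
    correct = sum(1 for word, gold in pairs if stem(word, suffixes) == gold)
    return correct / len(pairs)


# Hypothetical toy data (English, for illustration only; not taken from the paper).
TOY_SUFFIXES = {"s", "es", "ing", "ed"}
TOY_GOLD = [
    ("walking", "walk"),
    ("walked", "walk"),
    ("boxes", "box"),
    ("sing", "sing"),  # protected by min_stem_len
    ("bus", "bus"),    # over-stemmed to "bu", counted as an error
]

if __name__ == "__main__":
    print(direct_accuracy(TOY_GOLD, TOY_SUFFIXES))
```

The `bus` pair is included deliberately: a purely suffix-driven stemmer over-stems it, which is the kind of error a gold-standard comparison is meant to expose.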