Building a morphological analyser for an under-resourced language requires morphological resources. Because few such resources exist in digital format, they are usually created through digitisation, which is time-consuming and tedious. The objective of this work is to develop new steps for creating morphological resources from social media. The first step is crawling blogs and tweets; a limited list of words from the under-resourced language was used to reduce the number of crawled web pages. Next, the crawled pages and tweets were normalised. This step cleaned the informal, noisy crawled data and transformed it into a wordlist for the final step, dictionary lookup validation. The wordlist had to be validated because language mixing makes the spelling standard uncertain. At this stage, the Jaro-Winkler string-similarity measure is applied to assess how closely each word conforms to the spelling standard by comparing it with the dictionary. The findings suggest that a much larger set of dictionary word entries could improve the currently poor accuracy. The developed steps can help other researchers create validated morphological resources, or other language resources, for under-resourced languages.
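The normalisation and dictionary lookup validation steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the cleaning rules, the 0.9 acceptance threshold and the sample words are assumptions made for illustration. The Jaro-Winkler part uses the commonly cited defaults (prefix scaling factor 0.1, prefix capped at four characters), which the abstract does not specify.

```python
import re


def normalise(text):
    """Turn noisy social-media text into a list of lowercase word tokens.

    Assumed cleaning rules: drop URLs, @mentions and #hashtags, then keep
    only runs of letters (digits and punctuation are discarded).
    """
    text = text.lower()
    text = re.sub(r"https?://\S+|[@#]\w+", " ", text)
    return re.findall(r"[^\W\d_]+", text)


def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and within this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1, s2, scaling=0.1):
    """Jaro similarity boosted for a shared prefix of up to four characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * scaling * (1 - base)


def validate(words, dictionary, threshold=0.9):
    """Pair each word with its closest dictionary entry.

    Returns (word, best_entry, score, accepted) tuples; the threshold is an
    assumed cut-off, not a value taken from the paper.
    """
    results = []
    for word in words:
        best = max(dictionary, key=lambda entry: jaro_winkler(word, entry))
        score = jaro_winkler(word, best)
        results.append((word, best, score, score >= threshold))
    return results
```

For example, `validate(normalise("Makn nasi! http://example.com @user"), ["makan", "nasi"])` pairs the misspelt token `makn` with the dictionary entry `makan` and matches `nasi` exactly. Scanning the whole dictionary for every word is linear in dictionary size, which is acceptable for a sketch but would need indexing for a large lexicon.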
Document Type: Research Article
Faculty of Computing and Informatics, Multimedia University, Persiaran Multimedia, 63100 Cyberjaya, Selangor, Malaysia
Department of Information System, Faculty of Computer Science and Information Technology, Universiti Malaysia Sarawak, 94300 Kota Samarahan, Sarawak, Malaysia
Publication date: November 1, 2017
ADVANCED SCIENCE LETTERS is an international peer-reviewed journal with very wide-ranging coverage. It consolidates research activities in all areas of (1) Physical Sciences, (2) Biological Sciences, (3) Mathematical Sciences, (4) Engineering, (5) Computer and Information Sciences, and (6) Geosciences, and publishes original short communications, full research papers and timely brief (mini) reviews, with authors' photos and biographies, covering basic and applied research and current developments in educational aspects of these scientific areas.