Research

Published Papers

Exploring the Impact of Temperature on Large Language Models: A Case Study for Classification Task based on Word Sense Disambiguation

T Sumanathilaka, Nicholas Micallef and Julian Hough

2025 7th International Conference on Natural Language Processing (ICNLP 2025) | Mar 2025

With the advent of Large Language Models (LLMs), Natural Language (NL) related tasks have been evaluated and explored. While the impact of temperature on text generation in LLMs has been explored, its influence on classification tasks remains unexamined despite temperature being a key parameter for controlling response randomness and creativity. In this study, we investigated the effect of the model’s temperature on sense classification tasks for Word Sense Disambiguation (WSD). A carefully crafted few-shot Chain of Thought (COT) prompt was used to conduct the study, and FEWS lexical knowledge was shared for the gloss identification task. GPT-3.5 and GPT-4, Llama-3-70B and Llama-3.1-70B, and Mixtral 8x22B were used as the base models, and evaluations were conducted at 0.2 intervals across the 0 to 1 temperature range. The results demonstrate that temperature significantly affects the performance of LLMs in classification tasks, emphasizing the importance of conducting a preliminary study to select the optimal temperature for a task. The results show that the GPT-3.5-Turbo and Llama-3.1-70B models exhibit a clear performance shift, the Mixtral 8x22B model shows minor deviations, and the GPT-4-Turbo and Llama-3-70B models produce consistent results across temperatures.
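As a rough illustration of why temperature can matter for a classification-style task, the sketch below applies temperature-scaled softmax to made-up scores for three candidate senses and sweeps the same 0 to 1 range in 0.2 steps. The logits and the function name are hypothetical and are not taken from the paper; treating temperature 0 as greedy (argmax) selection is a common convention assumed here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Return a probability distribution over logits at the given temperature."""
    if temperature == 0:
        # Assumed convention: temperature 0 puts all mass on the argmax (greedy).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate senses of an ambiguous word.
sense_logits = [2.0, 1.5, 0.5]

# Sweep 0.0, 0.2, ..., 1.0.
for step in range(6):
    t = step * 0.2
    probs = softmax_with_temperature(sense_logits, t)
    print(f"T={t:.1f}", [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top-scoring sense, while higher temperatures spread it across the alternatives, which is one way a sampled sense label could change between runs.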

GlossGPT: GPT for Word Sense Disambiguation using Few-shot Chain-of-Thought Prompting

T Sumanathilaka, Nicholas Micallef and Julian Hough

2025 8th International Conference on Emerging Data and Industry 4.0 (EDI40) | Procedia Computer Science | Apr 25

Lexical ambiguity is a major challenge in computational linguistic tasks, as limitations in proper sense identification lead to inefficient translation and question answering. General-purpose Large Language Models (LLMs) are commonly utilized for NLP tasks. However, utilizing general-purpose LLMs for specific tasks has been challenging, and fine-tuning has become a critical requirement for task specification. In this work, we craft advanced prompts with different contextual parameters to guide the model's inference towards accurate sense prediction to handle Word Sense Disambiguation (WSD). We present a few-shot Chain of Thought (COT) prompt-based technique using GPT-4-Turbo with a knowledge base as a retriever, which does not require fine-tuning the model for WSD tasks; sense definitions are supported by synonyms to broaden the lexical meaning. Our approach achieves comparable performance on the SemEval and Senseval datasets. More importantly, we set a new state-of-the-art performance with the few-shot FEWS dataset, breaking through the 90% F1 score barrier.

SSL400 - A Comprehensive Word Level Dataset for Sinhala Sign Language Recognition

Yohan Abhishek, T Sumanathilaka

2025 5th International Conference on Advanced Research in Computing (ICARC) - IEEE | Feb 25

Systematic Review of Fine-tuning Approaches for Large Language Models in the Context of Sinhala

Sachin Hansaka, T Sumanathilaka

2025 5th International Conference on Advanced Research in Computing (ICARC) - IEEE | Feb 25

Recent Trends and Challenges in Assistive Applications for Sinhala-Speaking Adults with Dyslexia: A Decade in Review

Peshala Perera, T Sumanathilaka

2025 5th International Conference on Advanced Research in Computing (ICARC) - IEEE | Feb 25

A Hybrid Computational Framework Using NLP and ML for Emotion Analysis in Sinhala Songs

Malinda Peiris, T Sumanathilaka

2025 5th International Conference on Advanced Research in Computing (ICARC) - IEEE | Feb 25

Romanized Sinhala to Sinhala Reverse Transliteration Using BERT

Sameeraa Perera, Lahiru Prabhath, T Sumanathilaka, Isuri Anuradha

2025 The First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages | Jan 25

The Romanized text has become popular with the growth of digital communication platforms, largely due to the familiarity with English keyboards. In Sri Lanka, Romanized Sinhala, commonly referred to as “Singlish” is widely used in digital communications. This paper introduces a novel context-aware back-transliteration system designed to address the ad-hoc typing patterns and lexical ambiguity inherent in Singlish. The proposed system combines dictionary-based mapping for Singlish words, a rule-based transliteration for out-of-vocabulary words and a BERT-based language model for addressing lexical ambiguities. Evaluation results demonstrate the robustness of the proposed approach, achieving high BLEU scores along with low Word Error Rate (WER) and Character Error Rate (CER) across test datasets. This study provides an effective solution for Romanized Sinhala back-transliteration and establishes the foundation for improving NLP tools for similar low-resourced languages.
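For readers unfamiliar with the error-rate metrics named above, the following is a minimal sketch of how Word Error Rate and Character Error Rate are commonly computed from Levenshtein edit distance. It is not the paper's evaluation code, and the Romanized strings are placeholder data used only to keep the example ASCII.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Placeholder strings standing in for a reference and a system output.
reference = "mama gedara yanawa"
hypothesis = "mama gedara yanwa"
print("WER:", wer(reference, hypothesis))
print("CER:", cer(reference, hypothesis))
```

In this example one of three words is wrong and one of eighteen characters is missing, so CER is much lower than WER; the two metrics penalise the same mistake at different granularities.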

Machine Translation and Transliteration for Indo-Aryan Languages: A Systematic Review

Sameeraa Perera, T Sumanathilaka

2025 The First Workshop on Natural Language Processing for Indo-Aryan and Dravidian Languages | Jan 25

With the advent of Web 2.0, digital platforms have become increasingly multilingual. Non-English speakers are rapidly adopting their native languages on social media, highlighting the need for robust translation and transliteration models to facilitate effective communication. This systematic review paper provides an overview of recent machine translation and transliteration developments for Indo-Aryan languages spoken by a large South Asian population. The paper examines advancements in translation and transliteration systems for a few language pairs that have appeared in papers published in the last five years. The review summarizes the current state of these technologies, providing a worthwhile resource for researchers in these fields to understand and find existing systems and techniques for translation and transliteration. The challenges and limitations of current systems are identified, and possible directions are suggested.

Assessing GPT’s Potential for Word Sense Disambiguation: A Quantitative Evaluation on Prompt Engineering Techniques

T Sumanathilaka, Nicholas Micallef and Julian Hough

2024 IEEE 15th Control and System Graduate Research Colloquium (ICSGRC) | IEEE | Aug 24

Modern digital communications (including social media content) often contain ambiguous words due to their potential for multiple related interpretations (polysemy). This ambiguity poses challenges for traditional Word Sense Disambiguation (WSD) methods, which struggle with limited data and lack of contextual understanding. These limitations hinder efficient translation, information retrieval, and question-answering systems, thereby restricting the benefits of computational linguistics techniques when applied to digital communication technologies. Our research investigates the use of Large Language Models (LLMs) to improve WSD using various prompt engineering techniques. We propose and evaluate a novel method that combines a knowledge graph, together with Part-of-Speech (POS) tagging and few-shot prompting to guide LLMs. By utilizing prompt augmentation with human-in-loop on few-shot prompt approaches, this work demonstrates a substantial improvement in WSD. This research advances accurate word interpretation in digital communications, leading to important implications for improved translation systems, better search results, and more intelligent question-answering technology.

Can LLMs assist with Ambiguity? A Quantitative Evaluation of various Large Language Models on Word Sense Disambiguation

T Sumanathilaka, Nicholas Micallef and Julian Hough

2024 First International Conference on Natural Language Processing and Artificial Intelligence for Cyber Security (NLPAICS 2024) | Jul 24

Ambiguous words are often found in modern digital communications, posing challenges to traditional Word Sense Disambiguation (WSD) methods due to limited data. Consequently, the efficiency of translation, information retrieval, and question-answering systems is hindered by these limitations. This study investigates the use of Large Language Models (LLMs) to improve WSD through a novel approach that combines a systematic prompt augmentation mechanism with a knowledge base (KB) of different sense interpretations. The proposed method incorporates a human-in-the-loop approach for prompt augmentation, supported by Part-of-Speech (POS) tagging, synonyms of ambiguous words, aspect-based sense filtering, and few-shot prompting to guide the LLM. Utilizing a few-shot Chain of Thought (COT) prompting-based approach, this work demonstrates a substantial improvement in performance. The evaluation, conducted using FEWS test data and sense tags, advances accurate word interpretation in social media and digital communication.

TAMZi!: Shorthand Romanized Tamil to Tamil Reverse Transliteration Using Novel Hybrid Approach

Anuja Herath Mudiyanselage, T Sumanathilaka

2024 The International Journal on Advances in ICT for Emerging Regions (ICTer)

Transliteration from Tamil to the Roman script holds a crucial place in the realms of effective communication, educational accessibility, and the seamless integration of digital technology. However, this process encounters a significant challenge due to the disparity in the number of vowels between the Tamil script, which encompasses a rich set of 12 vowels, and the Roman script, which is limited to just 5. This incongruity poses a substantial impediment when attempting ad-hoc transliteration of Tamil into the Roman script, especially when vowels are omitted. This paper aims to make a significant academic contribution by conducting an extensive literature review of recent developments in Romanized Tamil to Tamil transliteration, with a particular focus on addressing the absence of vowels. The review involves a meticulous examination of a wide array of methodologies proposed in recent years, ranging from rule-based systems to context-based strategies and machine learning-based approaches. In response to the challenges inherent in Tamil to Roman transliteration, this research work introduces a novel and innovative solution. This solution incorporates a Reverse Transliteration module, which leverages N-gram analysis and a rule-based model. The utilization of a trained trie structure is a key component of this approach, enabling word suggestions that effectively resolve ambiguities during the transliteration process. Remarkably, the proposed solution outperforms existing character-level transliteration methods, achieving an impressive character-level accuracy rate of 0.93. The practical implications of this research are substantial, particularly concerning the fulfilment of the linguistic and transliteration needs of native Tamil speakers within the digital platform, where such accuracy is of utmost importance.

Swa-Bhasha Dataset: Romanized Sinhala to Sinhala Adhoc Transliteration Corpus

T Sumanathilaka, Ruvan Weerasinghe, Nicholas Micallef

2024 4th International Conference on Advanced Research in Computing (ICARC) | IEEE | Feb 24

In the context of a changing society and rapid technological advancements, the prevalence of social media platforms and instant messaging services has significantly strengthened the usage of native languages. In Sri Lanka, Sinhala and Romanized Sinhala have emerged as popular typing languages, owing to the widespread use of informal shorthand-based typing and internet acronyms for quicker communication. However, linguistic support for these languages is limited by the scarcity of available resources, making them low-resource languages. To address this resource deficit, this study proposes the development of a rule-based transliteration tool that can annotate Sinhala words into Romanized Sinhala, accommodating the diverse ad hoc typing patterns used by the community. The research approach involved a comprehensive survey employing a stratified sampling method, considering variables such as age, gender, and language proficiency. 215 participants were presented with an online survey comprising 12 Sinhala sentences to capture the various transliteration patterns of Sinhala characters that are necessary for the annotation process. Analysis of the survey responses led to the formulation of 92 general rules and 26 special rules, encapsulating ad hoc Romanized Sinhala typing patterns. Using these rules, Sinhala dictionaries were annotated, building a large corpus of Sinhala words and their Romanized Sinhala patterns. The annotated dataset was validated using a back-transliteration tool, achieving an 84% word-level accuracy rate. This transliteration annotator can be used to mitigate the resource constraints associated with Sinhala to Romanized Sinhala transliteration.

Swa-Bhasha : Romanized Sinhala to Sinhala Reverse Transliteration using a Hybrid Approach

T Sumanathilaka, Ruvan Weerasinghe, Prasan Yapa

2023 International Conference on Advanced Research in Computing 2023 (ICARC 2023) | Feb 23

With the social and technological revolution, the use of social media platforms and instant messaging services has strengthened native language compatibility in the digital arena. Sinhala and Romanized Sinhala have become the prominent typing languages among the general Sri Lankan community. Informal shorthand-based typing and short net acronyms are used for easier Sinhala typing. However, typing Romanized Sinhala using ad-hoc transliterations and getting the expected output in native Sinhala is often inaccurate and time-consuming. Therefore, this study introduces a novel reverse transliterator that can back-transliterate Romanized Sinhala to Sinhala and suggest words. The transliterator is modelled using a statistical approach with a trigram model and a rule-based model for back-transliteration, and a knowledge-based approach with a Trie data structure for suggestions. The proposed solution can transliterate both formal and informal shorthand Romanized Sinhala. The hybrid model achieves a word-level accuracy of 0.84. The proposed model can be used in digital platforms to make native Sinhala communication more usable and efficient.
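The Trie-based suggestion component mentioned above can be pictured with a small prefix tree. This is a generic sketch rather than the authors' implementation: the class names, the `suggest` method, and the Romanized word list are all hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # maps a character to the next TrieNode
        self.is_word = False  # True if a stored word ends at this node

class Trie:
    def __init__(self, words=()):
        self.root = TrieNode()
        for word in words:
            self.insert(word)

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def suggest(self, prefix, limit=5):
        """Return up to `limit` stored words starting with `prefix`, alphabetically."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        results = []
        stack = [(node, prefix)]
        while stack and len(results) < limit:
            current, text = stack.pop()
            if current.is_word:
                results.append(text)
            # Push children in reverse order so they are popped alphabetically.
            for ch in sorted(current.children, reverse=True):
                stack.append((current.children[ch], text + ch))
        return results

# Illustrative Romanized entries only; a real system would store dictionary words.
trie = Trie(["kiyanna", "kiyala", "kiyawanna", "karanna"])
print(trie.suggest("kiy"))
```

A trie keeps lookup cost proportional to the length of the typed prefix rather than the size of the dictionary, which suits suggestion as the user types.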

Swa Bhasha: Message-Based Singlish to Sinhala Transliteration

Maneesha Athukorala, T Sumanathilaka

2022 International Conference on Innovations in Info-business & Technology (ICIIT2022)

Machine transliteration provides the ability to transliterate a base language into different languages computationally. Transliteration is an important technical process that has recently attracted attention. Sinhala transliteration faces many constraints because of the scarcity of resources for the Sinhala language. Due to these limitations, Sinhala transliteration is highly complex and time-consuming. Therefore, the majority of Sri Lankans use an informal texting language named 'Singlish' to simplify the process. This study focuses on word-level transliteration of Singlish while reducing the complexity of the transliteration. A new coding system has been devised with a rule-based approach that can map to the matching Sinhala words even without the vowels. Various typing patterns were collected from different communities for this purpose. The collected data were analyzed for every Sinhala character, and the unique Singlish patterns related to each were generated. The system introduces a new numeric coding system for Singlish letters, matched against the recognized typing patterns. A fuzzy logic-based implementation is used for the mapping process, and a codified dictionary containing unique numeric values has also been implemented. In this system, each Romanized English letter is assigned a unique numeric code, so that a unique pattern can be constructed for each word. The system can identify the most relevant Sinhala word that matches the pattern of the Singlish word, or it gives the most closely related word suggestions. For example, the words 'kiyanna', 'kianna', 'kynna', 'kynn' and 'kiynna' are all mapped to the correct Sinhala word "kiyanna". These results show that the 'Swa Bhasha' transliteration system can enhance Sinhala users' experience when texting in Singlish.
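To make the idea of a vowel-tolerant numeric code concrete, here is a small sketch. The abstract does not specify the actual coding scheme, so everything here is an assumption: the code drops vowels, collapses consecutive repeated letters in what remains, encodes each letter as a two-digit number, and uses `difflib` string similarity as a stand-in for the fuzzy matching step. The `SI:` value is a placeholder for the native-script word.

```python
from difflib import get_close_matches

VOWELS = set("aeiou")

def skeleton_code(word):
    """Hypothetical code: drop vowels, collapse repeats, two digits per letter."""
    letters = []
    for ch in word.lower():
        if not ("a" <= ch <= "z") or ch in VOWELS:
            continue  # ignore vowels and anything outside a-z
        if letters and letters[-1] == ch:
            continue  # collapse consecutive repeats in the vowel-less skeleton
        letters.append(ch)
    return "".join(f"{ord(ch) - ord('a') + 1:02d}" for ch in letters)

def lookup(word, coded_dictionary, cutoff=0.6):
    """Exact match on the code first, then the closest code above the cutoff."""
    code = skeleton_code(word)
    if code in coded_dictionary:
        return coded_dictionary[code]
    close = get_close_matches(code, list(coded_dictionary), n=1, cutoff=cutoff)
    return coded_dictionary[close[0]] if close else None

# Codified dictionary: numeric code -> placeholder for the native Sinhala word.
coded_dictionary = {skeleton_code("kiyanna"): "SI:kiyanna"}

for variant in ["kiyanna", "kianna", "kynna", "kynn", "kiynna"]:
    print(variant, "->", skeleton_code(variant), "->", lookup(variant, coded_dictionary))
```

Under this assumed scheme four of the five variants share one code exactly, and 'kianna' (which omits the 'y') is recovered only through the similarity fallback, which hints at why the original system pairs its coding with fuzzy matching.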

Swa Bhasha 2.0: Addressing Ambiguities in Romanized Sinhala to Native Sinhala Transliteration Using Neural Machine Translation

Sachithya Dharmasiri, T Sumanathilaka

2024 4th International Conference on Advanced Research in Computing (ICARC) | IEEE | Feb 24

With the growing popularity of social media and instant messaging, it is more important than ever to be able to interact online in one's native language. Among Sinhala speakers, both Romanized and native Sinhala are widely used. However, due to the informal textual abbreviation known as “Singlish”, attempts to convert Romanized Sinhala into native Sinhala via machine transliteration may result in errors. Rule-based transliteration systems may not be compatible with the ad hoc transliterations used in Singlish. To convert Romanized Sinhala back into native Sinhala precisely and consistently, and to address the complexities of casual Romanized Sinhala, a novel hybrid strategy combining rule-based and neural machine translation (NMT) approaches is proposed. This strategy aims to eliminate word selection ambiguity by selecting the best word suggestions from a pool of predicted words using a suggestion algorithm. By combining the strengths of suggestion algorithms and neural machine translation, the proposed transliterator has the potential to considerably enhance reverse transliteration and improve communication in native Sinhala. With the completed GRU model, the BLEU score of the machine translation models improved to 0.8, indicating high word-level translation accuracy. While preliminary test results are promising, additional testing and refinement are required to improve the overall efficacy of the machine translation models.

Emotion Detection Using Bi-directional LSTM with an Effective Text Pre-processing Method

T Sumanathilaka, V Selvarai, U Raj, VP Raiu, J Prakash

2021 12th International Conference On Computing, Communication And Networking Technologies (ICCCNT 2021)

In real-life scenarios, extracting emotion from unstructured text is an active and challenging area of research. It has diverse applications in various aspects of our daily life. To overcome the challenges involved in detecting emotion from text, researchers from diverse fields have applied various machine learning algorithms. However, deep learning methods such as long short-term memory (LSTM) are effective at detecting emotion by maintaining the sequence structure of the text. In this work, we use a bi-directional long short-term memory network with an attention layer for emotion detection to improve prediction accuracy. In addition, we employ a text preprocessing method to further improve the results. We perform experiments on three datasets, and the models are evaluated on classification accuracy.

A Survey on Image Captioning Using Object Detection and NLP

Vathila De Silva, T Sumanathilaka

2024 4th International Conference on Advanced Research in Computing (ICARC) | IEEE | Feb 24

Recent years have seen the emergence of image captioning as a revolutionary technical development that seamlessly blends computer vision and natural language processing. The integration of various domains facilitates the production of captivating captions for pictures, promoting a more profound comprehension of visual material. This study offers a thorough analysis of the rapidly developing field of image captioning, examining its uses in various settings such as social media, online platforms, assistive technology, and content indexing. It emphasizes the critical importance of advanced techniques like You Only Look Once (YOLO) for accurate item detection and Natural Language Processing (NLP) for creating subtle captions. While NLP gives the resulting text a layer of contextual depth, YOLO guarantees precise object detection, which helps with caption accuracy. Using these cutting-edge methods boosts the overall efficacy of image captioning systems and represents a noteworthy trend in the literature. Examining the evolution of image captioning, the review paper highlights how important it is for multimodal comprehension. Image captioning is a transformative tool that enhances the readability of visual content and influences contemporary digital experiences in various contexts. The development of image captioning is summarized in this abstract, which also highlights the tendency to use cutting-edge techniques for more accurate and nuanced caption production.

Exploring Computational Linguistics Techniques for Enhanced Outing Planning: A Comprehensive Review

Faizan Muthaliff, T Sumanathilaka

2024 4th International Conference on Advanced Research in Computing (ICARC) | IEEE | Feb 24

In this day and age, where convenience is king, a new competitor has appeared in the form of chatbots, which require users only to know the language the chatbot converses in. When planning outings, there are a multitude of criteria to consider in determining the ideal solution. This review paper addresses the common issues in extracting meaningful information from the user with modern NLP techniques; the criteria required in addition to user input, such as real-time weather information and communicable diseases pertaining to the outing location, to determine an ideal suggestion; and finally the method by which that suggestion is determined after prioritizing the criteria. The conclusion underscores that while NLP applications might not entirely replace existing menu-based systems for outing planning, they excel in certain scenarios. Notably, NLP's innate capacity to comprehend user preferences and context allows it to offer tailored outing suggestions. Furthermore, the application's unique feature of dynamically adapting to user requests without necessitating a return to the home page adds a distinctive advantage to its usability.

Hybrid Approaches to Emotion Recognition: A Comprehensive Survey of Audio-Textual Methods and Their Application

Sahan Wewalwala, T Sumanathilaka

2024 4th International Conference on Advanced Research in Computing (ICARC) | IEEE | Feb 24

This survey study provides a comprehensive analysis of emotion recognition, documenting its progression from conventional techniques to innovative hybrid algorithms that effectively combine textual and audio information. This work distinguishes itself with its thorough examination of the innovative combination of deep learning-based textual sentiment analysis and sophisticated aural signal processing approaches. This unique combination greatly improves the precision of emotion recognition from intricate sources like social media and consumer feedback. The study addresses the difficulties of processing real-time data and reducing bias in varied datasets. It provides new perspectives on the powers and limitations of neural networks and machine learning in this specific scenario. Our research represents a notable advancement in the utilization of these technologies for human-computer interaction, facilitating the creation of digital interfaces that are more empathic and attuned to human emotions. The final statements underscore the pressing requirement for sophisticated multimodal methodologies and the incorporation of developing technology, underscoring our distinctive contribution to the progress of emotion identification systems. This report offers a thorough analysis of the present condition of the subject and outlines a direction for future advancements, highlighting the crucial significance of emotion identification in improving human-computer interaction.

Deep Learning Based Framework for Reliable Sri Lankan Currency Authentication and Counterfeit Prevention

Yohan Abhishek, N Nadeem, S Delankawala, S Jayakody, T Sumanathilaka

2023 13th International Conference on System Engineering and Technology (ICSET 2023) held at Shah Alam, Malaysia. | IEEE | Oct 23

Digital image processing plays a crucial role in enabling efficient and precise analysis, manipulation, and enhancement of images. In this study, researchers address challenges faced by individuals with visual impairments in recognizing currency denominations and identifying counterfeit banknotes. The researchers propose "Blind Trust," an IoT device that utilizes an Arduino Uno and a camera module to capture images of banknotes. To achieve these objectives, researchers utilize pre-processing techniques using the OpenCV and TensorFlow libraries to extract the notes/coins' characteristics. Custom datasets are developed for training Convolutional Neural Network (CNN) models, which are then used to identify currency denominations and detect counterfeit currency. To enhance the model's performance, various pre-processing techniques are employed, resulting in high accuracy rates for both tasks. The currency notes identification model achieves an impressive 99% accuracy when tested on 25% of the data, while the currency coins identification model achieves 93% accuracy using InceptionV3. Additionally, the counterfeit currency detection model, created using VGG16, achieves an accuracy rate of 97% on a dataset comprising genuine and counterfeit currency images. Moreover, the note placement detection model attains 93% accuracy. "Blind Trust" holds great potential for enhancing financial security and accessibility for individuals with visual impairments. Its accuracy, speed, and ease of use contribute significantly to the development of new technologies aimed at improving their quality of life. Keywords—Arduino, CNN, Currency identification, Fake currency detection.

Deep Learning in Sinhala Sign Language Recognition: A Systematic Decade Review

Yohan Abhishek, T Sumanathilaka

2024 3rd International Conference on Image Processing, Computer Vision and Machine Learning (IEEE), China.

The paper reviews the progress in Sinhala Sign Language recognition (SSL) over the last decade, tracing the techniques from early image processing to advanced deep learning models, indicating methodologies involved, datasets used, and challenges. The paper shows that the evolution has progressed from basic classifiers to sophisticated machine learning techniques that integrate CNNs, RNNs, and transformer architectures. It also presents the implementation of multimodal systems using hand gestures, facial expressions, and body postures. The review identifies the limitations of current recognition in SSL, such as scarce complete datasets, technical difficulties, and a lack of robust continuous recognition. Finally, it points out future research directions, stressing that further work must focus on growing datasets, making lightweight models for edge devices, and collaboration for further growth in SSL recognition technology.

TripTractix: Optimizing Outing Planning with Advanced Computational Techniques

Faizan Muthaliff, T Sumanathilaka

2024 9th International Conference on Information Technology Research | IEEE | Dec 24

Event planning and travel applications often lack critical medical and travel-related information, despite the need to consider factors like weather and infectious disease risks, as highlighted during the COVID-19 pandemic. In Sri Lanka, there is also a scarcity of travel applications that fully utilize Natural Language Processing (NLP) to handle dynamic user requests. To bridge these gaps, this work proposes an outing application that leverages NLP to generate optimal travel recommendations. The system uses Named Entity Recognition (NER) and semantic analysis to extract keywords from user inputs, mapping activity preferences with the help of Large Language Models (LLMs) and zero-shot classification techniques. Recommendations are ranked using the TOPSIS model, while location data is integrated from the Google Place API. This approach aims to deliver personalized, data-driven outing suggestions by combining NLP, advanced recommendation algorithms, and LLMs.
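The TOPSIS ranking step can be sketched in a few lines. This is the standard textbook form of the method (vector normalisation, weighting, distance to ideal best and worst), not the TripTractix code; the criteria, weights, and place data below are invented for illustration, and the sketch assumes no criterion column is all zeros and that the alternatives are not all identical.

```python
import math

def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] per alternative; higher is better.

    matrix  : rows are alternatives, columns are criteria
    weights : one weight per criterion (assumed to sum to 1)
    benefit : True where larger is better, False where the criterion is a cost
    """
    n_criteria = len(weights)
    # 1. Vector-normalise each column and apply its weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_criteria)]
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_criteria)]
                for row in matrix]
    # 2. Ideal best and worst value for each criterion.
    columns = list(zip(*weighted))
    best = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    worst = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]
    # 3. Relative closeness to the ideal solution.
    scores = []
    for row in weighted:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        scores.append(d_worst / (d_best + d_worst))
    return scores

# Hypothetical criteria: rating (benefit), distance in km (cost), rain probability (cost).
places = {"Beach": [4.5, 12.0, 0.6], "Museum": [4.2, 3.0, 0.1], "Park": [3.8, 5.0, 0.3]}
scores = topsis(list(places.values()), weights=[0.5, 0.3, 0.2],
                benefit=[True, False, False])
for name, score in sorted(zip(places, scores), key=lambda pair: -pair[1]):
    print(f"{name}: {score:.3f}")
```

Marking distance and rain probability as cost criteria is what lets a slightly lower-rated but closer and drier option outrank the top-rated one.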

Preprints

IndoNLP 2025: Shared Task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages

T Sumanathilaka, Isuri Anuradha, R Weerasinghe, N Micallef, J Hough | 2025

The paper overviews the shared task on Real-Time Reverse Transliteration for Romanized Indo-Aryan languages. It focuses on the reverse transliteration of low-resourced languages in the Indo-Aryan family to their native scripts. Typing Romanized Indo-Aryan languages using ad-hoc transliterals and achieving accurate native scripts are complex and often inaccurate processes with the current keyboard systems. This task aims to introduce and evaluate a real-time reverse transliterator that converts Romanized Indo-Aryan languages to their native scripts, improving the typing experience for users. Out of 11 registered teams, four teams participated in the final evaluation phase with transliteration models for Sinhala, Hindi and Malayalam. These proposed solutions not only solve the issue of ad-hoc transliteration but also empower low-resource language usability in the digital arena.

EmoScan: Automatic Screening of Depression Symptoms in Romanized Sinhala Tweets

J Hewapathirana, T Sumanathilaka | 2024

This work explores the utilization of Romanized Sinhala social media data to identify individuals at risk of depression. A machine learning-based framework is presented for the automatic screening of depression symptoms by analyzing language patterns, sentiment, and behavioural cues within a comprehensive dataset of social media posts. The research has been carried out to compare the suitability of Neural Networks over the classical machine learning techniques. The proposed Neural Network with an attention layer which is capable of handling long sequence data, attains a remarkable accuracy of 93.25% in detecting depression symptoms, surpassing current state-of-the-art methods. These findings underscore the efficacy of this approach in pinpointing individuals in need of proactive interventions and support. Mental health professionals, policymakers, and social media companies can gain valuable insights through the proposed model. Leveraging natural language processing techniques and machine learning algorithms, this work offers a promising pathway for mental health screening in the digital era. By harnessing the potential of social media data, the framework introduces a proactive method for recognizing and assisting individuals at risk of depression. In conclusion, this research contributes to the advancement of proactive interventions and support systems for mental health, thereby influencing both research and practical applications in the field.

A Survey on Lexical Ambiguity Detection and Word Sense Disambiguation

M Abeysiriwardana, T Sumanathilaka | 2024

This paper explores techniques that focus on understanding and resolving ambiguity in language within the field of natural language processing (NLP), highlighting the complexity of linguistic phenomena such as polysemy and homonymy and their implications for computational models. Focusing extensively on Word Sense Disambiguation (WSD), it outlines diverse approaches ranging from deep learning techniques to leveraging lexical resources and knowledge graphs like WordNet. The paper introduces cutting-edge methodologies like word sense extension (WSE) and neuromyotonic approaches, enhancing disambiguation accuracy by predicting new word senses. It examines specific applications in biomedical disambiguation and language specific optimisation and discusses the significance of cognitive metaphors in discourse analysis. The research identifies persistent challenges in the field, such as the scarcity of sense annotated corpora and the complexity of informal clinical texts. It concludes by suggesting future directions, including using large language models, visual WSD, and multilingual WSD systems, emphasising the ongoing evolution in addressing lexical complexities in NLP. This thinking perspective highlights the advancement in this field to enable computers to understand language more accurately.

Academic Dissertations

Romanized Sinhala to Sinhala Reverse Transliteration using a Hybrid Approach

Thesis for Master in Computer Science | Advisors: Dr Ruvan Weerasinghe, Mr Prasan Yapa | 2022

With the rise of social technology, social media platforms and instant messaging have strengthened the use of native languages in electronic communication. With the arrival of multi-language support in the digital arena, both native Sinhala and Romanized Sinhala became prominent among the general community. Machine transliteration converts text from the alphabet of one language to that of another using computational approaches. The informal shorthand used in texting, also known as “Singlish”, makes texting easier because Sinhala words can be written in English letters with different typing patterns. However, typing Romanized Sinhala with ad hoc transliterations and short net acronyms rarely produces the expected output in native Sinhala. Current rule-based transliterators work at the letter level, with a defined rule for the transliteration schema, and are not compatible with shorthand typing of Romanized Sinhala. The proposed ad hoc schema combines multiple computational approaches (a hybrid approach) to achieve ambiguity-free transliteration. The first phase is statistical: an N-gram tagger feeds the tokens to trigram, bigram, and unigram taggers in that order. Tokens left unknown by the first phase are passed to the second phase, a rule-based algorithm that predicts the corresponding words. The third and final phase uses a suggestion-level model, implemented with a trie and a knowledge base, to find the best word suggestions from the pool of predicted words; this phase resolves ambiguity in word selection. The transliterator was evaluated on the test data and achieved a word-level accuracy of 84%.
Therefore, the proposed transliterator, which back-transliterates Romanized Sinhala to Sinhala using the hybrid approach, can be used to enhance reverse transliteration and thereby increase the use of native Sinhala for communication.
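
The three phases described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: the class and function names (`NgramTagger`, `Trie`, `rule_candidates`, `transliterate`), the tiny training set, the chunk-level rule table, and the knowledge-base contents are all hypothetical toy stand-ins chosen to show the control flow (trigram, then bigram, then unigram lookup; a rule-based fallback for unknown tokens; a trie lookup to pick a known word from the candidate pool).

```python
from collections import Counter, defaultdict
from itertools import product

END = "$end"  # multi-character marker, so it cannot collide with a single letter


class NgramTagger:
    """Phase 1: pick a native word from the token and the n-1 previously chosen words."""

    def __init__(self, n, backoff=None):
        self.n, self.backoff, self.table = n, backoff, {}

    def _key(self, tokens, i, history):
        return (tuple(history[max(0, i - self.n + 1):i]), tokens[i])

    def train(self, sentences):
        counts = defaultdict(Counter)
        for sent in sentences:
            tokens = [rom for rom, _ in sent]
            natives = [nat for _, nat in sent]
            for i, nat in enumerate(natives):
                counts[self._key(tokens, i, natives)][nat] += 1
        self.table = {k: c.most_common(1)[0][0] for k, c in counts.items()}

    def tag(self, tokens, i, history):
        key = self._key(tokens, i, history)
        if key in self.table:
            return self.table[key]
        # Unseen context: fall back to the next shorter n-gram, if any.
        return self.backoff.tag(tokens, i, history) if self.backoff else None


class Trie:
    """Phase 3 helper: membership test over the words of a knowledge base."""

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            if ch not in node:
                return False
            node = node[ch]
        return END in node


def rule_candidates(token, rules):
    """Phase 2: greedy longest-match chunking; a chunk may have several spellings."""
    options, i = [], 0
    while i < len(token):
        for size in (2, 1):
            chunk = token[i:i + size]
            if chunk in rules:
                options.append(rules[chunk])
                i += len(chunk)
                break
        else:
            return []  # no rule covers this position
    return ["".join(parts) for parts in product(*options)]


def transliterate(tokens, tagger, rules, trie):
    history = []
    for i, token in enumerate(tokens):
        word = tagger.tag(tokens, i, history)
        if word is None:
            pool = rule_candidates(token, rules)
            known = [cand for cand in pool if cand in trie]
            word = known[0] if known else (pool[0] if pool else token)
        history.append(word)
    return history


# Toy data (assumed for illustration only).
train = [
    [("mama", "මම"), ("gedara", "ගෙදර"), ("yanawa", "යනවා")],
    [("mm", "මම"), ("gdr", "ගෙදර"), ("ynw", "යනවා")],
]
unigram = NgramTagger(1)
bigram = NgramTagger(2, backoff=unigram)
trigram = NgramTagger(3, backoff=bigram)
for tagger in (unigram, bigram, trigram):
    tagger.train(train)

rules = {"ma": ["ම"], "la": ["ල", "ළ"]}  # toy rule table with one ambiguous chunk
knowledge_base = Trie(["මම", "ගෙදර", "යනවා", "මල"])

result = transliterate("mm gedara ynw mala".split(), trigram, rules, knowledge_base)
```

In this sketch, `mala` is unseen in training, so the rule phase proposes two spellings and the trie keeps the one present in the toy knowledge base. A real system would presumably rank several surviving candidates (for example by frequency) rather than take the first.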

Natural and Emotional Linguistic Text to Speech Synthesis

Thesis for Bachelor in Technology Computer Science and Engineering | Advisor: Dr Jay Prakash | 2019

A real-time platform to synthesize emotional and natural voice from unstructured text was built. The system has two parts. The first part extracts emotion from text using text mining and LSTMs, which are the cutting-edge technology in the current scenario. The resulting model was flexible enough to obtain high accuracy over multiple datasets. The second part identifies the key features of speech and modifies those attributes in neutral speech to give it an emotional and human-like sound. To identify these features, an SER (Speech Emotion Recognition) model was also built.

Data Resources

Swa-Bhasha Dataset

This dataset contains Romanized Sinhala to Sinhala data records in ad hoc transliteration. It is currently the largest available dataset for Romanized Sinhala ad hoc transliteration.

Transliteration Evaluation Dataset

This test dataset was created and augmented specifically for the IndoNLP Shared Task 2025. Please note that some data records combine existing, publicly available datasets for the respective languages. The augmentation process involved generating new data samples based on these existing resources while ensuring data diversity and relevance to the task.