Hi, I am currently pursuing a Master’s degree in Data Science and Artificial Intelligence at Saarland University, Germany. I hold a Bachelor’s degree in Computer Engineering from IOE, Pulchowk Campus, and have gained valuable experience as a research assistant at NAAMII, a leading research institute. My research interests lie in machine learning, with a focus on semi-supervised learning, multimodal learning, and natural language processing.
At NAAMII, I developed and implemented advanced machine learning models for real-world applications in natural language processing and medical imaging. My strong foundation in algorithms, data structures, and mathematics equips me to tackle complex challenges, and I am passionate about creating impactful solutions through technology.
I am eager to expand my machine learning and AI expertise through academic and research opportunities, with the long-term goal of pursuing a Ph.D. My research experience and passion for applying machine learning techniques to societal challenges enable me to contribute effectively to interdisciplinary research teams.
MS in Data Science and AI, 2024
Saarland University, Saarland Informatics Campus
Bachelor’s in Computer Engineering, 2017
Tribhuvan University, Institute of Engineering, Pulchowk Campus
High School in Physical Sciences, 2015
SOS Hermann Gmeiner School Bharatpur, Bharatpur, Nepal
Supervisor: Bishesh Khanal, Ph.D.
Supervisor: Binod Bhattarai, Ph.D.
Foundation Vision-Language Models (VLMs) trained on large-scale open-domain image-text pairs have recently been adapted to build Vision-Language Segmentation Models (VLSMs) that accept text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, they could aid medical professionals in many clinical tasks where they must spend substantial time delineating the target structure of interest. Because annotated medical image datasets are scarce, VLSMs for medical images resort to fine-tuning a base VLM or VLSM pretrained on open-domain natural image datasets; this fine-tuning is resource-intensive and expensive, as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed for VLMs that keep the pretrained model frozen and train only the adapters during fine-tuning, substantially reducing the computing resources required. We introduce a novel adapter, VLSM-Adapter, that can fine-tune pretrained vision-language segmentation models using transformer encoders. Our experiments with widely used CLIP-based segmentation models show that, with only 3 million trainable parameters, the VLSM-Adapter outperforms the state of the art and is comparable to the upper bound of end-to-end fine-tuning. The source code is available at https://github.com/naamiinepal/vlsm-adapter.
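To make the adapter idea concrete, here is a minimal PyTorch sketch of a generic bottleneck adapter with a residual connection, plus a helper that freezes the pretrained backbone so that only adapter parameters receive gradients. This is an illustrative sketch of the general adapter recipe, not the actual VLSM-Adapter architecture; the class name, helper name, and dimensions are my own assumptions.

```python
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Generic adapter block: down-project, nonlinearity, up-project, residual add."""

    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual connection lets the adapter start close to an identity map.
        return x + self.up(self.act(self.down(x)))


def mark_only_adapters_trainable(model: nn.Module) -> None:
    """Freeze every pretrained parameter; keep only 'adapter' modules trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
```

As a rough sense of scale (again, illustrative numbers): with 512-dimensional features and a bottleneck of 64, each adapter adds about 2 × 512 × 64 ≈ 66k parameters, which is how a handful of adapters can stay in the low millions while the frozen backbone holds orders of magnitude more.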
Medical image segmentation allows quantifying the size and shape of target structures, aiding in disease diagnosis, prognosis, surgical planning, and comprehension. Building on recent advances in foundation Vision-Language Models (VLMs) trained on natural image-text pairs, several studies have proposed adapting them into Vision-Language Segmentation Models (VLSMs) that accept language text as an additional input to the segmentation model. Introducing auxiliary information via text, with human-in-the-loop prompting during inference, opens up unique opportunities such as open-vocabulary segmentation and potentially more robust segmentation against out-of-distribution data. Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint vision-language representation in segmentation problems remains underexplored. This work presents the first systematic study on transferring VLSMs to 2D medical images, using 11 carefully curated datasets spanning diverse modalities, together with insightful language prompts and experiments. Our findings show that although VLSMs achieve competitive performance compared to image-only models after fine-tuning on limited medical image datasets, not all VLSMs utilize the additional information from language prompts, with image features playing a dominant role. While VLSMs exhibit enhanced performance on pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the auxiliary information available through language prompts. The code and datasets are available at https://github.com/naamiinepal/medvlsm.
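For readers unfamiliar with how a text prompt enters such a pipeline, the sketch below shows a CLIP backbone embedding an image and a prompt via Hugging Face transformers. It only covers the shared embedding step; how those features are fused inside a segmentation decoder varies across the VLSMs studied in the paper, so that part is left as a comment. The checkpoint name is a standard public CLIP model, while the file name and prompt are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a standard CLIP backbone (publicly available checkpoint).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example_scan.png").convert("RGB")  # placeholder image path
prompt = "a polyp in an endoscopy image"               # placeholder text prompt

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    image_emb = clip.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = clip.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# A VLSM would fuse (per-patch) image features with the text embedding inside a
# segmentation decoder to produce a mask; here we only show the embedding step.
print(image_emb.shape, text_emb.shape)  # e.g. torch.Size([1, 512]) each
```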