Rabin Adhikari

Saarbrücken, Germany

Hi! I’m Rabin, a Master’s student in Data Science and Artificial Intelligence at Saarland University. I also work part-time as a Research Assistant at the Max Planck Institute for Software Systems (MPI-SWS), where I’m part of the Machine Teaching group led by Prof. Adish Singla.

My main focus these days is trustworthy AI: I’m interested in building systems that are more robust, reliable, and aligned with how people actually use them. Before joining MPI-SWS, I worked as a Data Scientist (Working Student) at QuantPi, where I evaluated the behavior of large, production-grade models from NVIDIA, examining their biases, robustness, and overall performance in real-world deployment settings.

Before coming to Germany, I was a Research Assistant at NAAMII in Nepal, where I worked on a mix of machine learning domains, including NLP, multimodal learning, and medical imaging. I’ve also had hands-on experience as a full-stack developer, which keeps me grounded in the practical side of things.

Outside of work and studies, I enjoy hacking on personal projects, diving into research papers, or just chatting about the quirks and challenges of modern AI systems.

If you’d like to know more about my background, my CV is available here.

News

Sep 30, 2024	Moved to Germany to start my Master’s studies in the Department of Computer Science at Saarland University.
Sep 25, 2024	TuneVLSeg has been accepted for oral presentation at ACCV 2024 in Hanoi, Vietnam.
Jun 17, 2024	VLSM-Adapter has been accepted to the main conference of MICCAI 2024 in Marrakesh, Morocco.
Jun 06, 2024	Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models has been accepted for oral presentation at MIDL 2024 in Paris, France.

Latest Posts

May 07, 2021	Training of Word Embeddings
Oct 30, 2020	Importing OpenVPN Configuration

Selected Publications

ACCV
TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models

Rabin Adhikari, Safal Thapaliya, Manish Dhakal, and Bishesh Khanal

In Proceedings of the Asian Conference on Computer Vision (ACCV), Dec 2024

Orally presented at the conference

Vision-Language Models (VLMs) have shown impressive performance in vision tasks, but adapting them to new domains often requires expensive fine-tuning. Prompt tuning techniques, including textual, visual, and multimodal prompting, offer efficient alternatives by leveraging learnable prompts. However, their application to Vision-Language Segmentation Models (VLSMs) and their evaluation under significant domain shifts remain unexplored. This work presents an open-source benchmarking framework, TuneVLSeg, that integrates various unimodal and multimodal prompt tuning techniques into VLSMs, making prompt tuning usable for downstream segmentation datasets with any number of classes. We test various prompt tuning techniques on 8 diverse medical datasets, including 3 radiology datasets (breast tumor, echocardiograph, chest X-ray pathologies) and 5 non-radiology datasets (polyp, ulcer, skin cancer), as well as two natural-domain segmentation datasets. Our study found that textual prompt tuning struggles under significant domain shifts, such as the shift from natural-domain images to medical data. Furthermore, visual prompt tuning, which has fewer hyperparameters than multimodal prompt tuning, often achieves performance competitive with multimodal approaches, making it a valuable first attempt. Our work advances the understanding and applicability of different prompt tuning techniques for robust domain-specific segmentation.
@inproceedings{adhikari2024tunevlseg,
  title        = {TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models},
  author       = {Adhikari, Rabin and Thapaliya, Safal and Dhakal, Manish and Khanal, Bishesh},
  booktitle    = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  volume       = {15474},
  pages        = {126--144},
  month        = dec,
  year         = {2024},
  organization = {Springer Nature Singapore},
  doi          = {10.1007/978-981-96-0908-6_3},
  note         = {Orally presented at the conference}
}

MICCAI
VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks

Manish Dhakal, Rabin Adhikari, Safal Thapaliya, and Bishesh Khanal

In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Oct 2024

Foundation Vision-Language Models (VLMs) trained on large-scale open-domain image-text pairs have recently been adapted into Vision-Language Segmentation Models (VLSMs) that accept text prompts during inference to guide image segmentation. If robust and powerful VLSMs can be built for medical images, they could aid medical professionals in many clinical tasks where they must spend substantial time delineating the target structure of interest. Because annotated medical image datasets are scarce, VLSMs for medical images resort to fine-tuning a base VLM or VLSM pretrained on open-domain natural image datasets; this fine-tuning is resource-intensive and expensive, as it usually requires updating all or a significant fraction of the pretrained parameters. Recently, lightweight blocks called adapters have been proposed for VLMs; they keep the pretrained model frozen and train only the adapters during fine-tuning, substantially reducing the required computing resources. We introduce a novel adapter, VLSM-Adapter, that can fine-tune pretrained vision-language segmentation models using transformer encoders. Our experiments with widely used CLIP-based segmentation models show that, with only 3 million trainable parameters, VLSM-Adapter outperforms the state of the art and is comparable to the upper bound of end-to-end fine-tuning.
@inproceedings{dhakal2024vlsm,
  title        = {VLSM-Adapter: Finetuning Vision-Language Segmentation Efficiently with Lightweight Blocks},
  author       = {Dhakal, Manish and Adhikari, Rabin and Thapaliya, Safal and Khanal, Bishesh},
  booktitle    = {International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI)},
  pages        = {712--722},
  month        = oct,
  year         = {2024},
  organization = {Springer Nature Switzerland},
  doi          = {10.1007/978-3-031-72114-4_68}
}

MIDL
Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models

Kanchan Poudel*, Manish Dhakal*, Prasiddha Bhandari*, Rabin Adhikari*, Safal Thapaliya*, and Bishesh Khanal

In Medical Imaging with Deep Learning (MIDL), Jun 2024

Orally presented at the conference

Medical image segmentation allows quantifying the size and shape of target structures, aiding in disease diagnosis, prognosis, surgery planning, and comprehension. Building upon recent advancements in foundation Vision-Language Models (VLMs) trained on natural image-text pairs, several studies have proposed adapting them into Vision-Language Segmentation Models (VLSMs) that accept language text as an additional input to segmentation models. Introducing auxiliary information via text with human-in-the-loop prompting during inference opens up unique opportunities, such as open-vocabulary segmentation and potentially more robust segmentation models against out-of-distribution data. Although transfer learning from natural to medical images has been explored for image-only segmentation models, the joint vision-language representation in segmentation problems remains underexplored. This work presents the first systematic study on transferring VLSMs to 2D medical images, using 11 carefully curated datasets spanning diverse modalities, along with insightful language prompts and experiments. Our findings demonstrate that although VLSMs show performance competitive with image-only models for segmentation after fine-tuning on limited medical image datasets, not all VLSMs utilize the additional information from language prompts, with image features playing a dominant role. While VLSMs exhibit enhanced performance in handling pooled datasets with diverse modalities and show potential robustness to domain shifts compared to conventional segmentation models, our results suggest that novel approaches are required to enable VLSMs to leverage the various auxiliary information available through language prompts.
@inproceedings{poudel2024exploring,
  title         = {Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models},
  author        = {Poudel, Kanchan and Dhakal, Manish and Bhandari, Prasiddha and Adhikari, Rabin and Thapaliya, Safal and Khanal, Bishesh},
  booktitle     = {Medical Imaging with Deep Learning},
  month         = jun,
  year          = {2024},
  archiveprefix = {arXiv},
  url           = {https://openreview.net/forum?id=sN3sDKkGeN},
  note          = {Orally presented at the conference}
}