
Transformer-Based Encoder-Decoder Model for Medical Image Captioning with Concept Embedding

School of Electrical Engineering and Informatics, Institut Teknologi Bandung, Indonesia

Received: 25 Jun 2025; Revised: 15 Oct 2025; Accepted: 21 Oct 2025; Published: 8 Jan 2026.
Open Access. Copyright (c) 2026 The authors. Published by the Department of Informatics, Universitas Diponegoro.
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Abstract
This research presents a Transformer-based encoder-decoder model for medical image captioning that incorporates semantic medical knowledge through Concept Unique Identifiers (CUIs) from the Unified Medical Language System (UMLS). The proposed architecture employs a Swin Transformer as the visual encoder and GPT-2 as the language decoder, with CUI integration applied during both caption preprocessing and decoding. Experiments were conducted on the ROCOv2 dataset under two scenarios: baseline (raw captions) and enhanced (CUI-enriched captions). Quantitative evaluation using BLEU, ROUGE, CIDEr, and BERT-based metrics demonstrates that the CUI-integrated model outperforms several baselines, including CNN-LSTM, ViT-BioMedLM, and DeepSeek-VL, achieving a BLEU-1 score of 0.371, ROUGE-L of 0.305, CIDEr of 0.275, and PubMedBERTScore-F1 of 0.893. These results represent a 20.1% improvement in BLEU-1 and a 39.9% increase in ROUGE-L compared to the best-performing model before caption preprocessing (ViT-GPT2 with BLEU-1 = 0.309, ROUGE-L = 0.218). Qualitative assessment by expert radiologists further confirms enhanced diagnostic accuracy, descriptive completeness, and clinical relevance. This study introduces a novel integration of medical semantic knowledge into captioning models, offering a scalable solution for clinical decision support in resource-limited settings such as Indonesia.
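
As a concrete illustration of the architecture described in the abstract, the minimal Python sketch below wires a Swin Transformer encoder to a GPT-2 decoder using Hugging Face's VisionEncoderDecoderModel and registers a few UMLS CUIs as extra decoder tokens so that CUI-enriched captions tokenize cleanly. The checkpoint names, placeholder CUIs, and image path are illustrative assumptions added for exposition; this is not the authors' implementation, and the paper's exact CUI-integration mechanism during preprocessing and decoding may differ.

```python
# Minimal sketch (not the authors' code): Swin Transformer encoder + GPT-2 decoder
# assembled with Hugging Face's VisionEncoderDecoderModel, with example UMLS CUIs
# added to the decoder vocabulary so CUI-enriched captions can be tokenized.
from PIL import Image
from transformers import AutoImageProcessor, AutoTokenizer, VisionEncoderDecoderModel

# Illustrative checkpoints; the paper does not name specific pretrained weights.
ENCODER_CKPT = "microsoft/swin-base-patch4-window7-224"
DECODER_CKPT = "gpt2"

image_processor = AutoImageProcessor.from_pretrained(ENCODER_CKPT)
tokenizer = AutoTokenizer.from_pretrained(DECODER_CKPT)

# Hypothetical CUI tokens (placeholders) of the kind that caption preprocessing
# might append, e.g. "... bilateral pleural effusion [C0032227]".
tokenizer.add_tokens(["[C0032227]", "[C0004144]"])
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no dedicated pad token

# Build the encoder-decoder; cross-attention layers are added to GPT-2 automatically,
# and a projection layer bridges the Swin (1024-d) and GPT-2 (768-d) hidden sizes.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(ENCODER_CKPT, DECODER_CKPT)
model.decoder.resize_token_embeddings(len(tokenizer))
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# Caption generation for one radiology image (file path is a placeholder).
image = Image.open("example_radiograph.png").convert("RGB")
pixel_values = image_processor(images=image, return_tensors="pt").pixel_values
output_ids = model.generate(pixel_values, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Fine-tuning such a model on ROCOv2 would pair each image's pixel values with its tokenized (CUI-enriched) caption as labels, and the generated text could then be scored with BLEU, ROUGE-L, CIDEr, and a PubMedBERT-based BERTScore as in the reported evaluation.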
Keywords: Transformer, Medical Image Captioning, Concept Unique Identifier, Unified Medical Language System
Funding: Lembaga Pengelola Dana Pendidikan (LPDP)


  1. H. Liu et al., “Artificial Intelligence and Radiologist Burnout,” JAMA Netw. open, vol. 7, no. 11, p. e2448714, 2024, doi: 10.1001/jamanetworkopen.2024.48714
  2. M. Chen et al., “Impact of human and artificial intelligence collaboration on workload reduction in medical image interpretation,” npj Digit. Med., vol. 7, no. 1, pp. 1–10, 2024, doi: 10.1038/s41746-024-01328-w
  3. I. Adamchic, “Enhancing Intracranial Aneurysm Detection with Artificial Intelligence in Radiology,” vol. 9, pp. 5–10, 2025, doi: 10.29245/2572.942X/2025/1.1310
  4. A. B. Jing, N. Garg, J. Zhang, and J. J. Brown, “AI solutions to the radiology workforce shortage,” pp. 23–28, 2025, doi: 10.1038/s44401-025-00023-6
  5. O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, “Show and tell: A neural image caption generator,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 07-12-June, pp. 3156–3164, 2015, doi: 10.1109/CVPR.2015.7298935
  6. J. Aneja, A. Deshpande, and A. G. Schwing, “Convolutional Image Captioning,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., pp. 5561–5570, 2018, doi: 10.1109/CVPR.2018.00583
  7. A. Vaswani et al., “Attention is all you need,” Adv. Neural Inf. Process. Syst., vol. 2017-December, pp. 5999–6009, 2017
  8. Z. Liu et al., “Swin Transformer: Hierarchical Vision Transformer using Shifted Windows,” Proc. IEEE Int. Conf. Comput. Vis., pp. 9992–10002, 2021, doi: 10.1109/ICCV48922.2021.00986
  9. A. Dosovitskiy et al., “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” ICLR 2021 - 9th Int. Conf. Learn. Represent., 2021
  10. A. Radford et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019
  11. O. Bodenreider, “The Unified Medical Language System (UMLS): Integrating biomedical terminology,” Nucleic Acids Res., vol. 32, no. DATABASE ISS., pp. 267–270, 2004, doi: 10.1093/nar/gkh061
  12. V. T. Phan and K. T. Nguyen, “MedBLIP: Multimodal Medical Image Captioning Using BLIP,” EasyChair Preprint, 2024
  13. S. Wu, B. Yang, Z. Ye, H. Wang, H. Zheng, and T. Zhang, “MAKEN: Improving Medical Report Generation with Adapter Tuning and Knowledge Enhancement in Vision-Language Foundation Models,” Proc. - Int. Symp. Biomed. Imaging, pp. 1–5, 2024, doi: 10.1109/ISBI56570.2024.10635421
  14. A. Nicolson, J. Dowling, and B. Koopman, “A Concise Model for Medical Image Captioning,” CEUR Workshop Proc., vol. 3497, pp. 1611–1619, 2023
  15. D. R. Beddiar, M. Oussalah, T. Seppänen, and R. Jennane, “ACapMed: Automatic Captioning for Medical Imaging,” Appl. Sci., vol. 12, no. 21, pp. 1–24, 2022, doi: 10.3390/app122111092
  16. F. A. Zahra and R. J. Kate, “Obtaining clinical term embeddings from SNOMED CT ontology,” J. Biomed. Inform., vol. 149, 2024, doi: 10.1016/j.jbi.2023.104560
  17. Z. Huang, X. Zhang, and S. Zhang, “KiUT: Knowledge-injected U-Transformer for Radiology Report Generation,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 2023-June, pp. 19809–19818, 2023, doi: 10.1109/CVPR52729.2023.01897
  18. J. Rückert et al., “ROCOv2: Radiology Objects in COntext Version 2, an Updated Multimodal Image Dataset,” Sci. Data, vol. 11, no. 1, pp. 1–15, 2024, doi: 10.1038/s41597-024-03496-6
  19. K. Xu et al., “Show, attend and tell: Neural image caption generation with visual attention,” in International conference on machine learning, 2015, pp. 2048–2057
  20. X. Mei, L. Yang, D. Gao, X. Cai, J. Han, and T. Liu, “PhraseAug: An Augmented Medical Report Generation Model with Phrasebook,” IEEE Trans. Med. Imaging, 2024, doi: 10.1109/TMI.2024.3416190
  21. P. Singh and S. Singh, “ChestX-Transcribe: a multimodal transformer for automated radiology report generation from chest x-rays,” Front. Digit. Health, vol. 7, pp. 1–11, 2025, doi: 10.3389/fdgth.2025.1535168
  22. E. Bolton et al., “BioMedLM: A 2.7B Parameter Language Model Trained On Biomedical Text,” 2024, [Online]. Available: http://arxiv.org/abs/2403.18421
  23. F. Liu, E. Shareghi, Z. Meng, M. Basaldella, and N. Collier, “Self-Alignment Pretraining for Biomedical Entity Representations,” NAACL-HLT 2021 - 2021 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. Proc. Conf., pp. 4228–4238, 2021, doi: 10.18653/v1/2021.naacl-main.334
  24. G. Michalopoulos, Y. Wang, H. Kaka, H. Chen, and A. Wong, “UmlsBERT: Clinical Domain Knowledge Augmentation of Contextual Embeddings Using the Unified Medical Language System Metathesaurus,” NAACL-HLT 2021 - 2021 Conf. North Am. Chapter Assoc. Comput. Linguist. Hum. Lang. Technol. Proc. Conf., pp. 1744–1753, 2021, doi: 10.18653/v1/2021.naacl-main.139
  25. X. Zhang, C. Wu, Y. Zhang, W. Xie, and Y. Wang, “Knowledge-enhanced visual-language pre-training on chest radiology images,” Nat. Commun., vol. 14, no. 1, pp. 1–12, 2023, doi: 10.1038/s41467-023-40260-7
  26. A. L. Beam et al., “Clinical concept embeddings learned from massive sources of multimodal medical data,” Pacific Symp. Biocomput., vol. 25, pp. 295–306, 2020, doi: 10.1142/9789811215636_0027
  27. Z. Kraljevic et al., “Multi-domain clinical natural language processing with MedCAT: The Medical Concept Annotation Toolkit,” Artif. Intell. Med., vol. 117, no. May, 2021, doi: 10.1016/j.artmed.2021.102083
  28. A. Pal and M. Sankarasubbu, “Gemini Goes to Med School: Exploring the Capabilities of Multimodal Large Language Models on Medical Challenge Problems & Hallucinations,” Clin. 2024 - 6th Work. Clin. Nat. Lang. Process. Proc. Work., pp. 21–46, 2024, doi: 10.18653/v1/2024.clinicalnlp-1.3
  29. Gemini Team et al., “Gemini: A Family of Highly Capable Multimodal Models,” pp. 1–90, 2025, [Online]. Available: http://arxiv.org/abs/2312.11805
  30. Z. Yuan, Y. Liu, C. Tan, S. Huang, and F. Huang, “Improving Biomedical Pretrained Language Models with Knowledge,” Proc. 20th Work. Biomed. Lang. Process. BioNLP 2021, pp. 180–190, 2021, doi: 10.18653/v1/2021.bionlp-1.20
  31. K. Papineni, S. Roukos, T. Ward, and W. J. Zhu, “BLEU: A method for automatic evaluation of machine translation,” Proc. Annu. Meet. Assoc. Comput. Linguist., vol. 2002-July, no. July, pp. 311–318, 2002, doi: 10.3115/1073083.1073135
  32. C. Y. Lin, “Rouge: A package for automatic evaluation of summaries,” Proc. Work. text Summ. branches out (WAS 2004), no. 1, pp. 25–26, 2004, [Online]. Available: https://aclanthology.org/W04-1013/
  33. R. Vedantam, C. L. Zitnick, and D. Parikh, “CIDEr: Consensus-based image description evaluation,” Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit., vol. 07-12-June, pp. 4566–4575, 2015, doi: 10.1109/CVPR.2015.7299087
  34. T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” 8th Int. Conf. Learn. Represent. ICLR 2020, pp. 1–43, 2020
  35. A. Ben Abacha, W. W. Yim, G. Michalopoulos, and T. Lin, “An Investigation of Evaluation Metrics for Automated Medical Note Generation,” Proc. Annu. Meet. Assoc. Comput. Linguist., pp. 2575–2588, 2023, doi: 10.18653/v1/2023.findings-acl.161
  36. P. Wang et al., “Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution,” pp. 1–52, 2024, [Online]. Available: http://arxiv.org/abs/2409.12191
  37. J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation,” Proc. Mach. Learn. Res., vol. 162, pp. 12888–12900, 2022
  38. H. Lu et al., “DeepSeek-VL: Towards Real-World Vision-Language Understanding,” pp. 1–33, 2024, [Online]. Available: http://arxiv.org/abs/2403.05525
