
Systematic Literature Review on Medical Image Captioning Using CNN-LSTM and Transformer-Based Models

School of Electrical Engineering and Informatics, Indonesia

Received: 10 May 2025; Revised: 20 May 2025; Accepted: 21 May 2025; Available online: 27 May 2025; Published: 31 May 2025.
Editor(s): Prajanto Adi
Open Access Copyright (c) 2025 The authors. Published by the Department of Informatics, Universitas Diponegoro
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Abstract
Medical image captioning, the automatic generation of descriptive text from medical images to support diagnosis and treatment planning, is a critical task in healthcare. This study presents a systematic literature review of medical image captioning techniques, with an emphasis on Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) and Transformer-based models. The review examines model architectures, datasets, evaluation metrics, and the challenges encountered in practical implementations. The results indicate that Transformer-based models, particularly the Swin Transformer and Vision Transformer (ViT), outperform CNN-LSTM-based models in terms of BLEU, ROUGE, CIDEr, and METEOR scores, yielding more accurate and clinically relevant captions. Nevertheless, several issues remain, including interpretability, computational requirements, and data limitations. This paper also discusses potential future directions for improving the efficacy and practical adoption of medical image captioning systems, including hybrid model approaches, data augmentation techniques, and enhanced explainability methodologies.
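To make the reviewed architecture families concrete, the sketch below shows a minimal CNN-LSTM encoder-decoder captioner in PyTorch. It is an illustrative sketch only, not a model from any of the reviewed studies; the ResNet-50 backbone, the embedding and hidden sizes, and the vocabulary size are assumptions chosen for demonstration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Encode an image into a fixed-size feature vector with a pretrained CNN."""
    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the convolutional backbone.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # backbone frozen here; fine-tune if data permits
            features = self.backbone(images).flatten(1)
        return self.fc(features)

class LSTMDecoder(nn.Module):
    """Generate a caption token by token, conditioned on the image feature."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image feature as the first "token" of the input sequence.
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(embeddings)
        return self.fc(hiddens)  # per-position vocabulary logits

# Example wiring with dummy data (all dimensions are illustrative):
encoder = CNNEncoder(embed_size=256)
decoder = LSTMDecoder(embed_size=256, hidden_size=512, vocab_size=10000)
images = torch.randn(4, 3, 224, 224)          # e.g., a batch of X-ray crops
captions = torch.randint(0, 10000, (4, 20))   # tokenized report prefixes
logits = decoder(encoder(images), captions)   # shape: (4, 21, 10000)
```

The n-gram metrics named in the review can likewise be computed directly from generated and reference text. Below is a minimal BLEU example with NLTK, using hypothetical report sentences; smoothing is applied because short sentences may have no higher-order n-gram matches.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["no acute cardiopulmonary abnormality is seen".split()]
candidate = "no acute cardiopulmonary abnormality".split()
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```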
Keywords: Medical image captioning, convolutional neural network, transformer, healthcare AI, automatic report generation
Funding: Lembaga Pengelola Dana Pendidikan (LPDP)


