
Systematic Literature Review on Medical Image Captioning Using CNN-LSTM and Transformer-Based Models

School of Electrical Engineering and Informatics, Indonesia

Received: 10 May 2025; Revised: 20 May 2025; Accepted: 21 May 2025; Available online: 27 May 2025; Published: 31 May 2025.
Editor(s): Prajanto Adi
Open Access Copyright (c) 2025 The authors. Published by the Department of Informatics, Universitas Diponegoro
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Abstract
Medical image captioning, the automatic generation of descriptive text from medical images to support diagnosis and treatment planning, is a critical task in healthcare. This study presents a systematic literature review of medical image captioning techniques, with an emphasis on Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) and Transformer-based models. The review examines model architectures, datasets, evaluation metrics, and the challenges encountered in practical implementations. The results indicate that Transformer-based models, particularly the Swin Transformer and Vision Transformer (ViT), outperform CNN-LSTM-based models in terms of BLEU, ROUGE, CIDEr, and METEOR scores, yielding more accurate and clinically relevant captions. Nevertheless, several issues remain, including interpretability, computational requirements, and data limitations. This paper also discusses potential future directions for improving the efficacy and practical adoption of medical image captioning systems, including hybrid model approaches, data augmentation techniques, and enhanced explainability methodologies.
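To make the reviewed architecture families concrete, the sketch below shows a minimal CNN-LSTM encoder-decoder captioner in PyTorch. It is an illustrative sketch only, not a model from any of the reviewed studies; the ResNet-50 backbone, the embedding and hidden sizes, and the vocabulary size are assumptions chosen for demonstration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CNNEncoder(nn.Module):
    """Encode an image into a fixed-size feature vector with a pretrained CNN."""
    def __init__(self, embed_size: int):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        # Drop the classification head; keep the convolutional backbone.
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])
        self.fc = nn.Linear(resnet.fc.in_features, embed_size)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():  # backbone frozen here; fine-tune if data permits
            features = self.backbone(images).flatten(1)
        return self.fc(features)

class LSTMDecoder(nn.Module):
    """Generate a caption token by token, conditioned on the image feature."""
    def __init__(self, embed_size: int, hidden_size: int, vocab_size: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_size)
        self.lstm = nn.LSTM(embed_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, vocab_size)

    def forward(self, features: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Prepend the image feature as the first "token" of the input sequence.
        embeddings = torch.cat([features.unsqueeze(1), self.embed(captions)], dim=1)
        hiddens, _ = self.lstm(embeddings)
        return self.fc(hiddens)  # per-position vocabulary logits

# Example wiring with dummy data (all dimensions are illustrative):
encoder = CNNEncoder(embed_size=256)
decoder = LSTMDecoder(embed_size=256, hidden_size=512, vocab_size=10000)
images = torch.randn(4, 3, 224, 224)          # e.g., a batch of X-ray crops
captions = torch.randint(0, 10000, (4, 20))   # tokenized report prefixes
logits = decoder(encoder(images), captions)   # shape: (4, 21, 10000)
```

The n-gram metrics named in the review can likewise be computed directly from generated and reference text. Below is a minimal BLEU example with NLTK, using hypothetical report sentences; smoothing is applied because short sentences may have no higher-order n-gram matches.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["no acute cardiopulmonary abnormality is seen".split()]
candidate = "no acute cardiopulmonary abnormality".split()
score = sentence_bleu(reference, candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```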
Keywords: Medical image captioning, convolutional neural network, transformer, healthcare AI, automatic report generation
Funding: Lembaga Pengelola Dana Pendidikan (LPDP)


