


Auto Project Narrator – Text to Video Converter


Authors : Venkata Lakshmi G.; Susmitha N.; Deva Sai Praneetha V.; Anusha P.; Aswitha S.

Volume/Issue : Volume 11 - 2026, Issue 4 - April


Google Scholar : https://tinyurl.com/4jadr8nx

Scribd : https://tinyurl.com/yppvekvz

DOI : https://doi.org/10.38124/ijisrt/26apr1128



Abstract : The growing demand for clearer explanations of how technical projects are built has created a need for intelligent systems that automatically generate multimedia (e.g., video) explanations of a project from text alone. Preparing such a presentation traditionally requires manually assembling slides, creating images, recording narration, and producing a video demonstration, all of which take considerable time and effort. This project proposes an artificial intelligence based system that automatically creates a video explanation of a project using NLP, Retrieval Augmented Generation (RAG), Stable Diffusion, and neural TTS technologies. The system takes a user-provided project description and builds structured documentation using a combination of retrieval systems and Large Language Models (LLMs). The generated documentation is converted into a multi-scene storyboard whose content defines each scene of the video. Each visual prompt derived from the storyboard is fed to a Stable Diffusion model running in a high-performance GPU cloud environment to generate the image for that scene, while neural TTS synthesis produces clear, natural narration to accompany it. The generated images and synthesized audio are then synchronized and assembled with a video-processing pipeline into the final output: an explanatory video accompanied by a project report summary. Empirical evaluation indicates that the system automates a substantial portion of multimedia project presentation, reducing the hours of manual production required while improving the clarity and visual appeal of the result. The framework is designed to scale easily, enabling automated generation of educational media and intelligent multimedia documentation.

Keywords : Text to Video Generation, NLP, Retrieval Augmented Generation (RAG), Stable Diffusion, Generative AI, Storyboard Generation, Multimedia Automation, Neural TTS, Automated Video Generation.
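The paper does not include source code; the sketch below is only an illustration of the pipeline described in the abstract (storyboard scenes → Stable Diffusion images → TTS narration → assembled video). It assumes the Hugging Face diffusers library as the Stable Diffusion backend, gTTS as a simple stand-in for the neural TTS stage, and moviepy for assembly; the model name, file names, and scene data are illustrative assumptions rather than details taken from the paper.

```python
# Illustrative sketch only (not the authors' implementation):
# storyboard scenes -> Stable Diffusion images -> TTS narration -> assembled video.
import torch
from diffusers import StableDiffusionPipeline      # assumed text-to-image backend
from gtts import gTTS                               # stand-in for the neural TTS stage
from moviepy.editor import ImageClip, AudioFileClip, concatenate_videoclips

# Hypothetical storyboard scenes, as the LLM/RAG documentation step might produce them.
scenes = [
    {"prompt": "architecture diagram of a text-to-video pipeline, clean vector style",
     "narration": "The system first converts the project description into structured documentation."},
    {"prompt": "GPU servers rendering images from text prompts, isometric illustration",
     "narration": "Each storyboard prompt is rendered into a scene image with Stable Diffusion."},
]

# Load Stable Diffusion on a GPU, mirroring the cloud GPU environment described above.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

clips = []
for i, scene in enumerate(scenes):
    image_path, audio_path = f"scene_{i}.png", f"scene_{i}.mp3"
    pipe(scene["prompt"]).images[0].save(image_path)   # generate the scene image
    gTTS(scene["narration"]).save(audio_path)          # synthesize the narration
    audio = AudioFileClip(audio_path)
    # Hold each image for the length of its narration and attach the audio track.
    clips.append(ImageClip(image_path).set_duration(audio.duration).set_audio(audio))

# Concatenate the narrated scenes into the final explanatory video.
concatenate_videoclips(clips, method="compose").write_videofile("project_explainer.mp4", fps=24)
```

In a complete system, the hard-coded scene list would come from the storyboard generated by the RAG/LLM documentation step, and the gTTS call would be replaced by the neural TTS model described in the paper.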


