Authors :
Shalini Bhaskar Bajaj
Volume/Issue :
Volume 10 - 2025, Issue 9 - September
Google Scholar :
https://tinyurl.com/336pmz9d
Scribd :
https://tinyurl.com/yvax2r4s
DOI :
https://doi.org/10.38124/ijisrt/25sep899
Note : A published paper may take 4-5 working days from the publication date to appear in PlumX Metrics, Semantic Scholar, and ResearchGate.
Note : Google Scholar may take 30 to 40 days to display the article.
Abstract :
This paper investigates and compares the performance of two language models, phi4 and qwen, using a
comprehensive evaluation framework designed to assess them on multiple metrics: generated text length,
token count, response time, and readability. To ensure the evaluation is robust, we employ an array of
statistical techniques: ANOVA, Welch’s t-test, and Levene’s test, as well as the non-parametric
Mann-Whitney U and Kruskal-Wallis tests. This multi-layered approach allows for a detailed comparison of
the models, highlighting subtle differences in their output behaviors and performance profiles. The
analysis reveals that phi4 generates detailed and varied responses, as evidenced by its high text lengths
and token counts, indicating its strength in applications that require comprehensive, in-depth
information. By contrast, qwen consistently demonstrates significantly lower latency and higher
readability, making it well suited to real-time conversations where speed and clarity are paramount.
These distinct characteristics highlight a trade-off between richness and efficiency, suggesting that the
optimal model choice depends on the specific needs of the task: phi4 may be advantageous for generating
reports or explanatory content, while qwen is more appropriate for virtual-assistant applications where
quick, clear responses are required.
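The statistical battery named above can be sketched with standard tooling. The following is a minimal, hypothetical Python sketch using NumPy and SciPy: the generate function is a placeholder for a real model call, the prompts and timings are illustrative rather than the paper's data, and whitespace splitting is only a crude stand-in for a real tokenizer. Readability scoring (e.g., Flesch reading ease via a package such as textstat) would follow the same per-response pattern.

```python
# Hypothetical sketch of the evaluation pipeline described in the abstract.
# `generate` is a stand-in for a real model call; names and numbers are
# illustrative, not the paper's actual measurements.
import time
import numpy as np
from scipy import stats

def generate(model: str, prompt: str) -> str:
    # Placeholder: a real implementation would query phi4 or qwen here.
    time.sleep(0.01 if model == "qwen" else 0.03)
    return f"{model} answer to: {prompt}"

def measure(model, prompts):
    """Collect per-response text length, rough token count, and latency."""
    lengths, tokens, latencies = [], [], []
    for p in prompts:
        start = time.perf_counter()
        text = generate(model, p)
        latencies.append(time.perf_counter() - start)
        lengths.append(len(text))          # text length in characters
        tokens.append(len(text.split()))   # whitespace split: crude token proxy
    return np.array(lengths), np.array(tokens), np.array(latencies)

def compare(a, b):
    """Run the parametric and non-parametric tests named in the abstract."""
    return {
        "Levene (equal variances)": stats.levene(a, b).pvalue,
        "Welch's t-test": stats.ttest_ind(a, b, equal_var=False).pvalue,
        "One-way ANOVA": stats.f_oneway(a, b).pvalue,
        "Mann-Whitney U": stats.mannwhitneyu(a, b).pvalue,
        "Kruskal-Wallis": stats.kruskal(a, b).pvalue,
    }

prompts = [f"question {i}" for i in range(30)]
*_, lat_phi4 = measure("phi4", prompts)
*_, lat_qwen = measure("qwen", prompts)
for name, p in compare(lat_phi4, lat_qwen).items():
    print(f"{name}: p = {p:.4g}")
```

Running Levene’s test first is the usual rationale for preferring Welch’s t-test, which does not assume equal variances; the Mann-Whitney U and Kruskal-Wallis tests provide distribution-free checks on the same comparisons.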
Keywords :
Large Language Model, Statistics, Comparison, Virtual Voice Assistant, Hardware, ANOVA, Welch’s t-Test, Levene’s Test, Kruskal-Wallis, Mann-Whitney, Natural Language Processing (NLP).
References :
- H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, "A comprehensive overview of large language models," arXiv preprint, arXiv:2307.06435, Jul. 2023.
- Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, et al., "A survey on evaluation of large language models," ACM Trans. Intell. Syst. Technol., vol. 15, no. 3, pp. 1-45, Mar. 2024.
- K. Ewers, D. Baier, and N. Höhn, "Siri, do I like you? Digital voice assistants and their acceptance by consumers," SMR-J. Serv. Manag. Res., vol. 4, no. 1, pp. 52–68, 2020.
- X. Lei, G.-H. Tu, A. X. Liu, C.-Y. Li, and T. Xie, "The Insecurity of Home Digital Voice Assistants - Vulnerabilities, Attacks and Countermeasures," in Proc. 2018 IEEE Conf. Communications and Network Security (CNS), Beijing, China, 2018, pp. 1-9, doi: 10.1109/CNS.2018.8433167.
- C. Bălan, "Chatbots and voice assistants: digital transformers of the company–customer interface—a systematic review of the business research literature," J. Theor. Appl. Electron. Commer. Res., vol. 18, no. 2, pp. 995–1019, 2023.
- A. Maedche, C. Legner, A. Benlian, B. Berger, H. Gimpel, T. Hess, O. Hinz, S. Morana, and M. Söllner, "AI-based digital assistants: Opportunities, threats, and research perspectives," Bus. Inf. Syst. Eng., vol. 61, pp. 535–544, 2019.
- L. H. Acosta and D. Reinhardt, "A survey on privacy issues and solutions for Voice-controlled Digital Assistants," Pervasive Mob. Comput., vol. 80, Art. no. 101523, 2022.
- J. Kirmayr, L. Stappen, P. Schneider, F. Matthes, and E. André, "CarMem: Enhancing Long-Term Memory in LLM Voice Assistants through Category-Bounding," arXiv preprint, arXiv:2501.09645, 2025.
- L. Liu, H. An, P. Chen, and L. Ye, "A Contemporary Overview: Trends and Applications of Large Language Models on Mobile Devices," arXiv preprint, arXiv:2412.03772, 2024.
- E. C. Ling, I. Tussyadiah, A. Tuomi, J. Stienmetz, and A. Ioannou, "Factors influencing users’ adoption and use of conversational agents: A systematic review," Psychol. Mark., vol. 38, no. 8, pp. 1031–1051, 2021.
- S. I. Lei, H. Shen, and S. Ye, "A comparison between chatbot and human service: Customer perception and reuse intention," Int. J. Contemp. Hosp. Manag., vol. 33, no. 12, pp. 3977–3995, 2021.
- A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, "Improving language understanding by generative pre-training," OpenAI, 2018.
- J. S. Samaan, Y. H. Yeo, N. Rajeev, L. Hawley, S. Abel, W. H. Ng, N. Srinivasan, J. Park, M. Burch, R. Watson, et al., "Assessing the accuracy of responses by the language model ChatGPT to questions regarding bariatric surgery," Obes. Surg., vol. 33, pp. 1790–1796, 2023.
- N. Xie, M. L. Francisco, and P. P. Y. Wong, "AI NPCs in an Educational Metaverse: Evaluating the Effectiveness of Prompt Templates for Contextual Interactions," Innovating Education with AI, vol. 53, pp. 53–74, 2025.
- E. Mieczkowski, R. Mon-Williams, N. Bramley, C. G. Lucas, N. Velez, and T. L. Griffiths, "Predicting Multi-Agent Specialization via Task Parallelizability," arXiv preprint, arXiv:2503.15703, 2025.
- J. Dhamala, T. Sun, V. Kumar, S. Krishna, Y. Pruksachatkun, K.-W. Chang, and R. Gupta, "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation," in Proc. ACM Conf. Fairness, Accountability, and Transparency (FAccT '21), Virtual Event, Canada, 2021, pp. 862–872.