Authors :
Shifa Shah; Kumkum Mishra; Devesh Kumar Gola
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/ye9mk7wu
Scribd :
https://tinyurl.com/mr44pzvz
DOI :
https://doi.org/10.38124/ijisrt/26apr1597
Abstract :
Prompt engineering is a primary method of steering large language models (LLMs) across natural language processing (NLP) tasks without fine-tuning. However, existing research often studies prompting techniques in isolation, which limits understanding of their generalizability. This paper presents a multi-task benchmark that evaluates ten prompt engineering strategies on three NLP tasks: requirement classification, sentiment analysis, and topic classification. The tested approaches include zero-shot, few-shot, chain-of-thought (CoT), role-based, and structured prompting. Performance is measured with accuracy and F1 scores, while efficiency and reliability are measured with latency and invalid-output rate. Findings indicate that prompt design significantly affects model performance: structured and CoT prompting consistently outperform zero-shot prompting, whereas few-shot prompting does not always do so. Task-specific analysis shows that simple tasks are less sensitive to prompt variations, while complex tasks depend more heavily on prompt design. In addition, a clear trade-off between performance and efficiency is observed. These results inform the development of effective, robust, and scalable prompt engineering for real-world LLM applications.
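The abstract does not include code, so the following is only a minimal sketch of how the described evaluation could be wired together: prompt-strategy templates (zero-shot, few-shot, CoT, structured) and the reported metrics (accuracy, macro-F1, latency, invalid-output rate) for one task such as sentiment analysis. The prompt wording, label set, and the call_llm() helper are illustrative assumptions, not the authors' actual benchmark code; scikit-learn is used for the metrics.

```python
# Hypothetical evaluation sketch; not the authors' benchmark implementation.
import time
from sklearn.metrics import accuracy_score, f1_score

LABELS = ["positive", "negative", "neutral"]  # assumed sentiment label set

# Assumed prompt templates for four of the strategies named in the abstract.
PROMPTS = {
    "zero_shot": "Classify the sentiment of this review as positive, negative, or neutral.\nReview: {text}\nLabel:",
    "few_shot": (
        "Review: The battery lasts all day.\nLabel: positive\n"
        "Review: The screen cracked within a week.\nLabel: negative\n"
        "Review: {text}\nLabel:"
    ),
    "chain_of_thought": (
        "Classify the sentiment of this review as positive, negative, or neutral. "
        "Think step by step, then give only the final label on the last line.\nReview: {text}"
    ),
    "structured": (
        'Return JSON of the form {{"label": "<positive|negative|neutral>"}}.\nReview: {text}'
    ),
}

def call_llm(prompt: str) -> str:
    """Placeholder for the model API call; replace with a real client."""
    return "neutral"

def evaluate(strategy: str, texts, gold):
    preds, latencies, invalid = [], [], 0
    for text in texts:
        start = time.perf_counter()
        raw = call_llm(PROMPTS[strategy].format(text=text)).strip().lower()
        latencies.append(time.perf_counter() - start)   # per-call latency
        # Use the last line so CoT rationales do not mask the final answer.
        answer = raw.splitlines()[-1] if raw else ""
        label = next((l for l in LABELS if l in answer), None)
        if label is None:        # response could not be parsed into a valid label
            invalid += 1
            label = "neutral"    # fallback so accuracy/F1 remain defined
        preds.append(label)
    return {
        "accuracy": accuracy_score(gold, preds),
        "macro_f1": f1_score(gold, preds, average="macro", labels=LABELS, zero_division=0),
        "mean_latency_s": sum(latencies) / len(latencies),
        "invalid_rate": invalid / len(texts),
    }
```

In this sketch the invalid-output rate simply counts responses from which no label could be recovered, which is one plausible reading of the reliability metric described above.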
Keywords :
Prompt Engineering, Chain-of-Thought Prompting, Few-Shot Learning, Zero-Shot Learning, NLP Classification, Macro-F1 Score, Latency Analysis, Green AI, Structured Prompting.
References :
- T. Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
- J. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv preprint arXiv:2201.11903, 2022.
- T. Kojima et al., “Large Language Models are Zero-Shot Reasoners,” arXiv preprint arXiv:2205.11916, 2022.
- X. Wang et al., “Self-Consistency Improves Chain-of-Thought Reasoning in Language Models,” arXiv preprint arXiv:2203.11171, 2023.
- S. Min et al., “Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?” in EMNLP, 2022.
- P. Liu et al., “Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in NLP,” ACM Computing Surveys, 2023.
- L. Reynolds and K. McDonell, “Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm,” arXiv preprint arXiv:2102.07350, 2021.
- T. Schick and H. Schütze, “Exploiting Cloze Questions for Few-Shot Text Classification,” in NAACL, 2021.
- A. Holtzman et al., “The Curious Case of Neural Text Degeneration,” in ICLR, 2020.
- R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” arXiv preprint arXiv:2108.07258, 2021.
- OpenAI, “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023.
- S. Bubeck et al., “Sparks of Artificial General Intelligence: Early Experiments with GPT-4,” arXiv preprint arXiv:2303.12712, 2023.
- W. X. Zhao et al., “A Survey of Large Language Models,” arXiv preprint arXiv:2303.18223, 2023.
- D. Hendrycks et al., “Measuring Massive Multitask Language Understanding,” in ICLR, 2021.
- C. Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” JMLR, 2020.
- E. Strubell et al., “Energy and Policy Considerations for Deep Learning in NLP,” in ACL, 2019.
- R. Schwartz et al., “Green AI,” Communications of the ACM, 2020.
- V. Sanh et al., “Multitask Prompted Training Enables Zero-Shot Task Generalization,” in ICLR, 2022.
- X. Li and P. Liang, “Prefix-Tuning: Optimizing Continuous Prompts for Generation,” in ACL, 2021.
- B. Lester et al., “The Power of Scale for Parameter-Efficient Prompt Tuning,” in EMNLP, 2021.