Authors :
Shifa Shah; Kumkum Mishra; Devesh Kumar Gola
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/ye9mk7wu
Scribd :
https://tinyurl.com/mr44pzvz
DOI :
https://doi.org/10.38124/ijisrt/26apr1597
Abstract :
Prompt engineering is a primary method of steering large language models (LLMs) across natural language processing (NLP) tasks without fine-tuning. However, existing research often studies prompting techniques in isolation, which limits understanding of their generalizability. This paper presents a multi-task benchmark that evaluates ten prompt engineering strategies on three NLP tasks: requirement classification, sentiment analysis, and topic classification. The tested approaches include zero-shot, few-shot, chain-of-thought (CoT), role-based, and structured prompting. Performance is measured with accuracy and F1 scores, while efficiency and reliability are measured with latency and invalid-output rate. Findings indicate that prompt design significantly affects model performance: structured and CoT prompting consistently outperform zero-shot prompting, whereas few-shot prompting does not always do so. Task-specific analysis shows that simple tasks are less sensitive to prompt variations, while complex tasks depend more heavily on prompt design. In addition, a clear trade-off between performance and efficiency is observed. These results inform the development of effective, robust, and scalable prompt engineering for real-world LLM applications.
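The abstract does not include code, so the following is only a minimal sketch of how the described evaluation could be wired together: prompt-strategy templates (zero-shot, few-shot, CoT, structured) and the reported metrics (accuracy, macro-F1, latency, invalid-output rate) for one task such as sentiment analysis. The prompt wording, label set, and the call_llm() helper are illustrative assumptions, not the authors' actual benchmark code; scikit-learn is used for the metrics.

```python
# Hypothetical evaluation sketch; not the authors' benchmark implementation.
import time
from sklearn.metrics import accuracy_score, f1_score

LABELS = ["positive", "negative", "neutral"]  # assumed sentiment label set

# Assumed prompt templates for four of the strategies named in the abstract.
PROMPTS = {
    "zero_shot": "Classify the sentiment of this review as positive, negative, or neutral.\nReview: {text}\nLabel:",
    "few_shot": (
        "Review: The battery lasts all day.\nLabel: positive\n"
        "Review: The screen cracked within a week.\nLabel: negative\n"
        "Review: {text}\nLabel:"
    ),
    "chain_of_thought": (
        "Classify the sentiment of this review as positive, negative, or neutral. "
        "Think step by step, then give only the final label on the last line.\nReview: {text}"
    ),
    "structured": (
        'Return JSON of the form {{"label": "<positive|negative|neutral>"}}.\nReview: {text}'
    ),
}

def call_llm(prompt: str) -> str:
    """Placeholder for the model API call; replace with a real client."""
    return "neutral"

def evaluate(strategy: str, texts, gold):
    preds, latencies, invalid = [], [], 0
    for text in texts:
        start = time.perf_counter()
        raw = call_llm(PROMPTS[strategy].format(text=text)).strip().lower()
        latencies.append(time.perf_counter() - start)   # per-call latency
        # Use the last line so CoT rationales do not mask the final answer.
        answer = raw.splitlines()[-1] if raw else ""
        label = next((l for l in LABELS if l in answer), None)
        if label is None:        # response could not be parsed into a valid label
            invalid += 1
            label = "neutral"    # fallback so accuracy/F1 remain defined
        preds.append(label)
    return {
        "accuracy": accuracy_score(gold, preds),
        "macro_f1": f1_score(gold, preds, average="macro", labels=LABELS, zero_division=0),
        "mean_latency_s": sum(latencies) / len(latencies),
        "invalid_rate": invalid / len(texts),
    }
```

In this sketch the invalid-output rate simply counts responses from which no label could be recovered, which is one plausible reading of the reliability metric described above.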
Keywords :
Prompt Engineering, Chain-of-Thought Prompting, Few-Shot Learning, Zero-Shot Learning, NLP Classification, Macro-F1 Score, Latency Analysis, Green AI, Structured Prompting.
References :
- T. Brown et al., “Language Models are Few-Shot Learners,” in Advances in Neural Information Processing Systems (NeurIPS), 2020.
- J. Wei et al., “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv preprint arXiv:2201.11903, 2022.
- T. Kojima et al., “Large Language Models are Zero-Shot Reasoners,” arXiv preprint arXiv:2205.11916, 2022.
- X. Wang et al., “Self-Consistency Improves Chain-of-Thought Reasoning in Language Models,” arXiv preprint arXiv:2203.11171, 2023.
- S. Min et al., “Rethinking the Role of Demonstrations: What Makes In-Context Learning Work?” in EMNLP, 2022.
- P. Liu et al., “Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in NLP,” ACM Computing Surveys, 2023.
- L. Reynolds and K. McDonell, “Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm,” arXiv preprint arXiv:2102.07350, 2021.
- T. Schick and H. Schütze, “Exploiting Cloze Questions for Few-Shot Text Classification,” in NAACL, 2021.
- A. Holtzman et al., “The Curious Case of Neural Text Degeneration,” in ICLR, 2020.
- R. Bommasani et al., “On the Opportunities and Risks of Foundation Models,” arXiv preprint arXiv:2108.07258, 2021.
- OpenAI, “GPT-4 Technical Report,” arXiv preprint arXiv:2303.08774, 2023.
- S. Bubeck et al., “Sparks of Artificial General Intelligence: Early Experiments with GPT-4,” arXiv preprint arXiv:2303.12712, 2023.
- W. X. Zhao et al., “A Survey of Large Language Models,” arXiv preprint arXiv:2303.18223, 2023.
- D. Hendrycks et al., “Measuring Massive Multitask Language Understanding,” in ICLR, 2021.
- C. Raffel et al., “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer,” JMLR, 2020.
- E. Strubell et al., “Energy and Policy Considerations for Deep Learning in NLP,” in ACL, 2019.
- R. Schwartz et al., “Green AI,” Communications of the ACM, 2020.
- V. Sanh et al., “Multitask Prompted Training Enables Zero-Shot Task Generalization,” in ICLR, 2022.
- X. Li and P. Liang, “Prefix-Tuning: Optimizing Continuous Prompts for Generation,” in ACL, 2021.
- B. Lester et al., “The Power of Scale for Parameter-Efficient Prompt Tuning,” in EMNLP, 2021.