Authors :
Kato Samuel Namuene; Egbe Enow Andrew
Volume/Issue :
Volume 11 - 2026, Issue 4 - April
Google Scholar :
https://tinyurl.com/9jw59tdf
Scribd :
https://tinyurl.com/2hh5kr2y
DOI :
https://doi.org/10.38124/ijisrt/26apr1365
Abstract :
Artificial Intelligence (AI) can speed up statistical data analysis in R by generating R code that is then executed in R (vibe data analysis), reducing the time an analyst would otherwise spend writing that code by hand. This study developed and validated a reproducible, AI-assisted framework for bivariate and multivariate statistical analysis of forestry count data, integrating vibe data analysis with conventional manual methods. The data comprised four disturbance observations (snapping, windthrow, branch fall, and dead standing) across 73 species drawn from 183 treefall gaps in Korup National Park, Cameroon. Using Claude.ai to generate R statistical code through structured prompt engineering, we systematically applied classical parametric approaches alongside non-parametric alternatives across five analytical stages: exploratory data analysis, bivariate correlation and regression, multivariate correlation matrix analysis, dimensionality reduction and clustering, and multiple linear regression. All disturbance count variables exhibited extreme positive skewness (1.776 to 8.367) and severe excess kurtosis (5.554 to 71.014), violating the distributional assumptions of the parametric methods and establishing non-parametric methods as co-primary analytical tools. The bivariate analysis revealed a strong positive linear association between snapping and gap size (Pearson r = 0.865, p < 0.001; R² = 0.7483). Non-parametric methods corroborated the positive direction of this association, although at moderate strength (Spearman ρ = 0.455, p < 0.001; Kendall τ = 0.366, p < 0.001), indicating that species associated with larger canopy openings tend to record higher snapping frequencies.
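The gap reported above between the Pearson coefficient and the two rank-based coefficients is typical of heavily right-skewed counts, where a few large values can dominate a product-moment correlation. The following is a minimal illustrative sketch in Python (standard library only), not the R code generated in the study. The ten data pairs are invented for illustration, and the function names (`shape_stats`, `pearson`, `average_ranks`, `spearman`, `kendall_tau_b`) are hypothetical. It computes moment-based skewness and excess kurtosis, then compares the three correlation coefficients on the same data.

```python
import math


def shape_stats(x):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(x):
    """Rank values 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    ranks = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall tau-b, which adjusts the denominator for ties."""
    conc = disc = tie_x = tie_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0 and dy == 0:
                continue          # tied on both variables: ignored
            elif dx == 0:
                tie_x += 1        # tied on x only
            elif dy == 0:
                tie_y += 1        # tied on y only
            elif dx * dy > 0:
                conc += 1
            else:
                disc += 1
    denom = math.sqrt((conc + disc + tie_x) * (conc + disc + tie_y))
    return (conc - disc) / denom


# Invented example data (NOT from the study): mostly small counts
# plus a single extreme observation in both variables.
gap_size = [1, 2, 2, 3, 3, 4, 5, 5, 6, 40]
snapping = [0, 1, 0, 0, 2, 1, 0, 1, 1, 30]

skew, excess_kurt = shape_stats(snapping)
r = pearson(gap_size, snapping)
rho = spearman(gap_size, snapping)
tau = kendall_tau_b(gap_size, snapping)

print(f"skewness={skew:.3f}  excess kurtosis={excess_kurt:.3f}")
print(f"Pearson r={r:.3f}  Spearman rho={rho:.3f}  Kendall tau-b={tau:.3f}")
```

In this toy data set the single extreme pair pulls the Pearson coefficient close to 1, while the rank-based coefficients, which depend only on ordering, stay moderate. This is one reason to report parametric and non-parametric measures side by side for skewed count data.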
Keywords :
Bivariate Analysis, Multivariate Analysis, Correlation, Regression, PCA, Cluster Analysis, K-means, Vibe Data Analysis, R Statistics, Artificial Intelligence, Ecological Disturbance Data.