Bivariate and Multivariate Vibe Analysis of Forestry Data with AI and R Statistics

Journal : International Journal of Innovative Science and Research Technology

Authors : Kato Samuel Namuene; Egbe Enow Andrew

Volume/Issue : Volume 11 - 2026, Issue 4 - April

Google Scholar : https://tinyurl.com/9jw59tdf

Scribd : https://tinyurl.com/2hh5kr2y

DOI : https://doi.org/10.38124/ijisrt/26apr1365

Abstract : Artificial Intelligence (AI) can speed up data analysis in R by generating R code that the analyst then executes (vibe data analysis), reducing the time needed to write the code by hand. This study developed and validated a reproducible, AI-assisted framework for bivariate and multivariate statistical analysis of forestry count data, integrating vibe data analysis with conventional manual methods. The data comprised four disturbance observations (snapping, windthrow, branch fall, and dead standing) across 73 species drawn from 183 treefall gaps in Korup National Park, Cameroon. Using Claude.ai to generate R statistical code through structured prompt engineering, we systematically applied classical parametric approaches alongside non-parametric alternatives across five analytical stages: exploratory data analysis, bivariate correlation and regression, multivariate correlation matrix analysis, dimensionality reduction and clustering, and multiple linear regression. All disturbance count variables exhibited extreme positive skewness (1.776 to 8.367) and severe excess kurtosis (5.554 to 71.014), fundamentally violating parametric assumptions and designating non-parametric methods as co-primary analytical tools. The bivariate analysis revealed a strong positive association between snapping and gap size (Pearson r = 0.865, p < 0.001; R² = 0.7483), corroborated by non-parametric methods (Spearman ρ = 0.455, p < 0.001; Kendall τ = 0.366, p < 0.001), indicating that species associated with larger canopy openings tend to record higher snapping frequencies.
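The distribution checks and the three correlation coefficients named in the abstract can be illustrated with a minimal base R sketch. This is not the authors' code. The two vectors are hypothetical toy values rather than the Korup data, and the moment-based skewness and excess kurtosis formulas are an assumption, since the abstract does not say which estimator was used.

```r
# Hypothetical toy data, NOT the Korup dataset: one extreme observation
# stands in for the heavy right tail described in the abstract.
gap_size <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)
snapping <- c(0, 1, 0, 2, 1, 0, 3, 1, 2, 40)

# Moment-based shape statistics (assumed estimator; base R has no built-in).
skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

shape <- c(
  skewness        = skewness(snapping),
  excess_kurtosis = excess_kurtosis(snapping)
)

# The same pair of variables under one parametric and two rank-based coefficients.
coefs <- c(
  pearson  = cor(gap_size, snapping, method = "pearson"),
  spearman = cor(gap_size, snapping, method = "spearman"),
  kendall  = cor(gap_size, snapping, method = "kendall")
)

# Rank-based significance test; exact = FALSE because the counts contain ties.
spearman_test <- cor.test(gap_size, snapping, method = "spearman", exact = FALSE)

print(round(shape, 3))
print(round(coefs, 3))
print(spearman_test$p.value)
```

On these toy values the single extreme point dominates the Pearson coefficient, while the rank-based coefficients are smaller. That is one way a heavily skewed count variable can produce a gap between Pearson r and Spearman ρ or Kendall τ, which is why reporting the rank-based coefficients alongside Pearson is informative for data of this kind.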

Keywords : Bivariate Analysis, Multivariate Analysis, Correlation, Regression, PCA, Cluster Analysis, K-means, Vibe Data Analysis, R Statistics, Artificial Intelligence, Ecological Disturbance Data.


