Engineering and Technology
Open Access
Integrated Framework for Reliable Work Zone Crash Classification: Combining Data Validation, Machine Learning Ensembles, and Natural Language Methods
Dr. Mateo Alvarez, Global Institute for Transport Safety, University of Lisbon
Abstract
This paper investigates reliable work zone crash classification and risk prediction using an integrated pipeline that emphasizes rigorous data validation, modern machine learning ensembles, and natural language processing of crash narratives. Work zones are high-risk environments on road networks, and accurate identification and classification of work zone crashes are essential for targeted safety interventions, resource allocation, and reliable research (Yang et al., 2015; Blackman et al., 2020). Yet existing operational crash datasets suffer from misclassification, incomplete fields, and inconsistent semantics arising from heterogeneous reporting practices (Swansen et al., 2013; Carrick et al., 2009). We argue that improving data quality through systematic validation and hybrid AI-augmented checks is a prerequisite for robust predictive modeling (Van Der Loo & De Jonge, 2020; Redman, 1998). Building on advances in ensemble learning and hyperparameter optimization (Almahdi et al., 2023; Asadi & Wang, 2023), together with text-mining approaches for narrative analysis (Sayed et al., 2021), we design and describe an end-to-end methodology: (1) a layered data validation and correction module that uses deterministic rules and large language model-assisted anomaly detection; (2) a multimodal feature engineering strategy that integrates structured traffic and environmental data with unstructured narrative-derived features; (3) an ensemble classifier framework that uses stacked learners with hyperparameter tuning to achieve robust classification across varying traffic conditions; and (4) a human-in-the-loop verification stage that captures residual errors and provides continuous feedback for model retraining (Malviya & Parate, 2025; OpenAI, 2023).
We present a descriptive analysis of modeled experimental outcomes and sensitivity studies, discuss theoretical implications, address limitations, and outline future research directions. The findings indicate that combining principled data validation with ensemble learning and narrative text mining materially reduces misclassification rates, produces better-calibrated crash-risk scores, and yields interpretability benefits valuable for practitioners and policymakers (Pande et al., 2011; Sayed et al., 2021). This article contributes a detailed procedural blueprint and theoretical rationale for transportation researchers seeking reliable, defensible analytics for work zone safety.
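As a rough illustration of the first and fourth stages named in the abstract (deterministic validation rules, a narrative cross-check, and a queue for human review), the following minimal sketch may help. It is not the author's implementation. The record fields, the speed-limit range, the keyword list, and all function names are hypothetical assumptions, and simple keyword matching stands in for the text-mining and language-model checks the abstract describes.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical keyword list (assumption); a real system would learn terms
# from labeled narratives rather than hard-code them.
WORK_ZONE_TERMS = ("work zone", "construction", "flagger", "lane closure", "road work")


@dataclass
class CrashRecord:
    """Hypothetical crash record; field names are illustrative only."""
    crash_id: str
    work_zone_flag: bool            # structured field as reported
    speed_limit_mph: Optional[int]  # None when the field was left blank
    narrative: str = ""


def rule_violations(rec: CrashRecord) -> List[str]:
    """Apply deterministic rules and return a list of issues found."""
    issues = []
    if not rec.crash_id:
        issues.append("missing crash_id")
    if rec.speed_limit_mph is None:
        issues.append("missing speed_limit_mph")
    elif not 5 <= rec.speed_limit_mph <= 85:  # assumed plausible range
        issues.append("speed_limit_mph out of range")
    return issues


def narrative_mentions_work_zone(narrative: str) -> bool:
    """Crude keyword check standing in for narrative text mining."""
    text = narrative.lower()
    return any(term in text for term in WORK_ZONE_TERMS)


def review_queue(records: List[CrashRecord]) -> List[Tuple[str, List[str]]]:
    """Return (crash_id, reasons) pairs that should go to a human reviewer."""
    queue = []
    for rec in records:
        reasons = rule_violations(rec)
        if narrative_mentions_work_zone(rec.narrative) != rec.work_zone_flag:
            reasons.append("narrative disagrees with work_zone_flag")
        if reasons:
            queue.append((rec.crash_id, reasons))
    return queue


records = [
    CrashRecord("A1", False, 55, "Vehicle struck a barrel near the lane closure."),
    CrashRecord("A2", True, 45, "Rear-end collision in the construction area."),
    CrashRecord("A3", False, None, "Driver ran off the road on a curve."),
]
for crash_id, reasons in review_queue(records):
    print(crash_id, reasons)
```

In this toy run, record A1 is queued because its narrative mentions a lane closure while the structured flag says the crash was not in a work zone, and A3 is queued for a missing speed limit. Keeping the rules as plain functions makes each flagged reason traceable, which is one way a human reviewer's corrections could later feed back into retraining.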
Keywords
Work zone safety, crash classification, data validation, ensemble learning
References
Planning Stage Work Zone Configurations Using an Artificial Neural Network. Transp. Res. Rec. 2022, 2676, 377–384.
Yang, H.; Ozbay, K.; Ozturk, O.; Xie, K. Work Zone Safety Analysis and Modeling: A State-of-the-Art Review. Traffic Inj. Prev. 2015, 16, 387–396.
Blackman, R.; Debnath, A.K.; Haworth, N. Understanding Vehicle Crashes in Work Zones: Analysis of Workplace Health and Safety Data as an Alternative to Police-Reported Crash Data in Queensland, Australia. Traffic Inj. Prev. 2020, 21, 222–227.
Sayed, M.A.; Qin, X.; Kate, R.J.; Anisuzzaman, D.M.; Yu, Z. Identification and Analysis of Misclassified Work-Zone Crashes Using Text Mining Techniques. Accid. Anal. Prev. 2021, 159, 106211.
Almahdi, A.; Al Mamlook, R.E.; Bandara, N.; Almuflih, A.S.; Nasayreh, A.; Gharaibeh, H.; Alasim, F.; Aljohani, A.; Jamal, A. Boosting Ensemble Learning for Freeway Crash Classification under Varying Traffic Conditions: A Hyperparameter Optimization Approach. Sustainability 2023, 15, 15896.
Pande, A.; Das, A.; Abdel-Aty, M.; Hassan, H. Estimation of Real-Time Crash Risk. Transp. Res. Rec. 2011, 2237, 60–66.
OpenAI. GPT-3.5 Turbo Fine-Tuning and API Updates; OpenAI: San Francisco, CA, USA, 2023.
Swansen, E.; Mckinnon, I.A.; Knodler, M.A. Integration of Crash Report Narratives for Identification of Work Zone-Related Crash Classification. In Proceedings of the Transportation Research Board 92nd Annual Meeting, Washington, DC, USA, 13–17 January 2013.
Carrick, G.; Heaslip, K.; Srinivasan, S.; Brady, B. A Case Study in Spatial Misclassification of Work Zone Crashes. In Proceedings of the 88th Transportation Research Board Annual Meeting, National Academy of Sciences, Washington, DC, USA, 11–15 January 2009.
Asadi, H.; Wang, J. An Ensemble Approach for Predicting Crash Severity in Work Zones Using Machine Learning. Sustainability 2023, 15, 1201.
Van Der Loo, M.P.; De Jonge, E. Data Validation. arXiv 2020, arXiv:2012.12028.
Malviya, S.; Parate, V. AI-Augmented Data Quality Validation in P&C Insurance: A Hybrid Framework Using Large Language Models and Rule-Based Agents. International Journal of Computational and Experimental Science and Engineering 2025, 11(3). https://doi.org/10.22399/ijcesen.3613
Redman, T.C. The Impact of Poor Data Quality on the Typical Enterprise. Communications of the ACM 1998, 41(2), 79–82.
Pipino, L.L.; Lee, Y.W.; Wang, R.Y. Data Quality Assessment. Communications of the ACM 2002, 45(4), 211–218.
Great Expectations. 2021. greatexpectations.io
Madnick, S.; Wang, R.; Xian, X. The Design and Implementation of a Corporate Householding Knowledge Processor to Improve Data Quality. Journal of Management Information Systems 2003, 20(3), 41–70.
Achiam, J.; Adler, S.; Agarwal, S.; Ahmad, L.; Akkaya, I.; Aleman, F.L.; Almeida, D.; Altenschmidt, J.; Altman, S.; Anadkat, S.; et al. GPT-4 Technical Report. arXiv 2023, arXiv:2303.08774.
Touvron, H.; Lavril, T.; Izacard, G.; Martinet, X.; Lachaux, M.-A.; Lacroix, T.; Rozière, B.; Goyal, N.; Hambro, E.; Azhar, F.; et al. LLaMA: Open and Efficient Foundation Language Models. arXiv 2023, arXiv:2302.13971.
Batini, C.; Cappiello, C.; Francalanci, C.; Maurino, A. Methodologies for Data Quality Assessment and Improvement. ACM Computing Surveys 2009, 41(3), 1–52.
Sculley, D.; Holt, G.; Golovin, D.; Davydov, E.; Phillips, T.; Ebner, D.; Chaudhary, V.; Young, M. Machine Learning: The High Interest Credit Card of Technical Debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop), Cambridge, MA, USA, 2014; Volume 8.
Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W.; et al. LoRA: Low-Rank Adaptation of Large Language Models. ICLR, 2022.
Copyright License
Copyright (c) 2025 Dr. Mateo Alvarez

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which licenses unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.

