Automated Log Intelligence System using Machine Learning Techniques
Ananya Srivastava, Department of Computer Science & Engineering
Shikha Singh, Amity School of Engineering & Technology
Vineet Singh, Amity University, Uttar Pradesh, Lucknow, India
Abstract
Every application, server, and network device generates logs whenever it performs an action, making these logs a valuable source of information for determining whether a system is operating normally or has a fault that could cause a failure. In large distributed systems, the volume of generated logs is extremely high, often reaching millions of entries per day, which makes manual analysis almost impossible. Traditional rule-based log management methods require engineers to anticipate every possible fault and define a corresponding alert, making them ineffective at detecting new types of anomalies as systems evolve. This paper presents a machine learning based log intelligence system that processes raw, unstructured log data and performs anomaly detection through a complete pipeline of log parsing, preprocessing, model training, and evaluation. The dataset used is the HDFS dataset from the LogHub repository, which contains about 11 million log entries. We use three approaches to log analysis: Isolation Forest (an unsupervised algorithm), Random Forest (a supervised algorithm), and LSTM (a deep learning model). Each model is trained and then evaluated using precision, recall, accuracy, and F1 score. Random Forest achieves the highest F1 score, 0.97, followed by LSTM at 0.89 and Isolation Forest at 0.79. These results demonstrate that machine learning and deep learning can be used to detect log anomalies in real-world data for IT operations.
Keywords
Log analysis, machine learning, HDFS Dataset, Isolation Forest
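The preprocessing and evaluation stages named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that parsed log lines are grouped by HDFS block ID (the `blk_...` token) into per-block event-count vectors, a common convention for this dataset that the abstract does not confirm, and that anomalies are labeled 1 and normal blocks 0. The function names `event_count_vectors` and `binary_metrics`, and the template IDs in the usage note, are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Assumption: each HDFS log line carries a block ID such as blk_-1608999687919862906.
BLOCK_ID = re.compile(r"blk_-?\d+")


def event_count_vectors(parsed_lines, templates):
    """Build one event-count vector per block.

    parsed_lines: iterable of (raw_line, template_id) pairs, where template_id
                  is the output of a log parser (hypothetical IDs such as "E1").
    templates:    ordered list of all template IDs; fixes the vector layout.
    """
    counts = defaultdict(Counter)
    for raw_line, template_id in parsed_lines:
        match = BLOCK_ID.search(raw_line)
        if match:  # lines without a block ID are skipped in this sketch
            counts[match.group()][template_id] += 1
    return {blk: [c[t] for t in templates] for blk, c in counts.items()}


def binary_metrics(y_true, y_pred):
    """Precision, recall, F1, and accuracy with 1 = anomaly, 0 = normal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}
```

As a usage example, `event_count_vectors([("Receiving block blk_1 src: /10.0.0.1", "E1")], ["E1", "E2"])` yields one vector for `blk_1`, and `binary_metrics(labels, predictions)` can score the output of any of the three models once its predictions are mapped to 0 and 1. The resulting vectors are the kind of fixed-length input that tree-based models such as Isolation Forest and Random Forest accept, whereas a sequence model such as LSTM would typically consume the ordered template IDs instead.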
Copyright License
Copyright (c) 2026 Ananya Srivastava, Shikha Singh, Vineet Singh

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which licenses unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.

Applied Sciences