Reliability-Aware Error Budget Governance and Resource Orchestration for Large-Scale Language Model Serving Infrastructures
Open Access
Kaniel Verhoeven, Department of Computer Science, Delft University of Technology, Netherlands
Abstract
The rapid industrialization of large-scale language models has transformed contemporary digital services into reliability-critical socio-technical systems whose failure modes propagate across economic, institutional, and cognitive domains. While traditional Site Reliability Engineering (SRE) frameworks have long governed the stability of web-scale platforms, the arrival of transformer-based generative models introduces unprecedented variability in workload, latency, cost, and quality. This article develops a unified theoretical and methodological framework that integrates classical error budget management with the emerging operational realities of large language model (LLM) serving infrastructures. Anchored in the reliability-centered principles articulated by Dasari (2025), this study expands the concept of error budgets from static service-level constructs into adaptive, learning-driven governance mechanisms capable of handling bursty inference traffic, heterogeneous hardware topologies, and stochastic model behaviors. Drawing on recent advances in transformer architectures, distributed inference engines, resource-aware scheduling, and LLM evaluation frameworks, the article constructs a multi-layered analytical model that explains how reliability objectives can be decomposed, monitored, enforced, and renegotiated across the computing continuum. The methodology combines systems-theoretic reasoning, socio-technical risk analysis, and interpretive synthesis of contemporary research to derive a set of reliability patterns for AI-native infrastructures. The results demonstrate that error budgets, when reinterpreted through probabilistic service envelopes and adaptive feedback control, can serve as the central coordination primitive between offline training, online inference, and user-facing service-level objectives. 
The discussion situates these findings within broader debates on cloud servitization, cyber-physical monitoring, causal modeling, and digital transformation, revealing how reliability engineering becomes the institutional backbone of trustworthy AI. The article concludes by outlining a future research agenda in which reliability budgets evolve into market-like governance instruments mediating trade-offs between performance, cost, sustainability, and societal trust.
Keywords
Site reliability engineering, error budget management, large language model serving, distributed inference
References
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2023). Attention Is All You Need. arXiv:1706.03762.
Kryvinska, N., and Bickel, L. (2020). Scenario-based analysis of IT enterprises servitization as a part of digital transformation of modern economy. Applied Sciences, 10, 1076.
Dasari, H. (2025). Site reliability engineering practices for error budget management in large-scale systems. International Journal of Applied Mathematics, 38(5s), 991–1001.
Yu, G. I., Jeong, J. S., Kim, G. W., Kim, S., and Chun, B. G. (2022). Orca: A distributed serving system for transformer-based generative models. Proceedings of the USENIX Symposium on Operating Systems Design and Implementation.
Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., and Zhang, H. (2024). DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. arXiv:2401.09670.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971.
Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2024). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.
Wang, Z., Li, S., Li, X., Zhou, Y., Zhang, Z., Wang, Z., Gu, R., Tian, C., Yang, K., and Zhong, S. (2025). Echo: Efficient co-scheduling of hybrid online-offline tasks for large language model serving. arXiv:2504.03651.
Zhao, Y., Yang, S., Zhu, K., Zheng, L., Kasikci, B., Zhou, Y., Xing, J., and Stoica, I. (2024). BlendServe: Optimizing offline inference for auto-regressive large models with resource-aware batching. arXiv:2411.16102.
Wake, A., Chen, B., Lv, C. X., Li, C., Huang, C., Cai, C., Zheng, C., Cooper, D., Zhou, F., Hu, F., Wang, G., Ji, H., Qiu, H., Zhu, J., Tian, J., Su, K., Zhang, L., Li, L., Song, M., Li, M., Liu, P., Hu, Q., Wang, S., Zhou, S., Yang, S., Li, S., Zhu, T., Xie, W., He, X., Chen, X., Hu, X., Ren, X., Niu, X., Li, Y., Zhao, Y., Luo, Y., Xu, Y., Sha, Y., Yan, Z., Liu, Z., Zhang, Z., and Dai, Z. (2024). Yi-Lightning Technical Report. arXiv:2412.01253.
Yang, A., et al. (2025). Qwen2.5 Technical Report. arXiv:2412.15115.
Canizo, M., Conde, A., Charramendieta, S., Minon, R., Cid-Fuentes, R. G., and Onieva, E. (2019). Implementation of a large-scale platform for cyber-physical system real-time monitoring. IEEE Access, 7, 52455–52466.
Habeeb, R. A. A., Nasaruddin, F., Gani, A., Hashem, I. A. T., Ahmed, E., and Imran, M. (2019). Real-time big data processing for anomaly detection: A survey. International Journal of Information Management, 45, 289–307.
Chen, Y., Iyer, S., Liu, X., Milojicic, D., and Sahai, A. (2007). SLA decomposition: Translating service level objectives to system level thresholds. Proceedings of the International Conference on Autonomic Computing.
Poniszewska-Maranda, A., Matusiak, R., Kryvinska, N., and Yasar, A. U. H. (2020). A real-time service system in the cloud. Journal of Ambient Intelligence and Humanized Computing, 11, 961–977.
Sedlak, B., Casamayor Pujol, V., Donta, P. K., and Dustdar, S. (2024). Markov blanket composition of SLOs. Proceedings of the IEEE International Conference on Edge Computing and Communications.
Sedlak, B., Casamayor Pujol, V., Donta, P. K., and Dustdar, S. (2024). Equilibrium in the computing continuum through active inference. Future Generation Computer Systems.
Kitson, N. K., Constantinou, A. C., Guo, Z., Liu, Y., and Chobtham, K. (2023). A survey of Bayesian network structure learning. Artificial Intelligence Review, 56, 8721–8814.
Ankan, A., and Textor, J. (2023). pgmpy: A Python toolkit for Bayesian networks.
Copyright License
Copyright (c) 2026 Kaniel Verhoeven

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which licenses unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.

