Reliability-Aware Error Budget Governance and Resource Orchestration for Large-Scale Language Model Serving Infrastructures

Kaniel Verhoeven, Department of Computer Science, Delft University of Technology, Netherlands

Abstract

The rapid industrialization of large-scale language models has transformed contemporary digital services into reliability-critical socio-technical systems whose failure modes propagate across economic, institutional, and cognitive domains. While traditional Site Reliability Engineering (SRE) frameworks have long governed the stability of web-scale platforms, the arrival of transformer-based generative models introduces unprecedented variability in workload, latency, cost, and quality. This article develops a unified theoretical and methodological framework that integrates classical error budget management with the emerging operational realities of large language model (LLM) serving infrastructures. Anchored in the reliability-centered principles articulated by Dasari (2025), this study expands the concept of error budgets from static service-level constructs into adaptive, learning-driven governance mechanisms capable of handling bursty inference traffic, heterogeneous hardware topologies, and stochastic model behaviors. Drawing on recent advances in transformer architectures, distributed inference engines, resource-aware scheduling, and LLM evaluation frameworks, the article constructs a multi-layered analytical model that explains how reliability objectives can be decomposed, monitored, enforced, and renegotiated across the computing continuum. The methodology combines systems-theoretic reasoning, socio-technical risk analysis, and interpretive synthesis of contemporary research to derive a set of reliability patterns for AI-native infrastructures. The results demonstrate that error budgets, when reinterpreted through probabilistic service envelopes and adaptive feedback control, can serve as the central coordination primitive between offline training, online inference, and user-facing service-level objectives. 
The discussion situates these findings within broader debates on cloud servitization, cyber-physical monitoring, causal modeling, and digital transformation, revealing how reliability engineering becomes the institutional backbone of trustworthy AI. The article concludes by outlining a future research agenda in which reliability budgets evolve into market-like governance instruments mediating trade-offs between performance, cost, sustainability, and societal trust.

Keywords

Site reliability engineering, error budget management, large language model serving, distributed inference

References

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. (2023). Attention Is All You Need. arXiv:1706.03762.

Kryvinska, N., and Bickel, L. (2020). Scenario-based analysis of IT enterprises servitization as a part of digital transformation of modern economy. Applied Sciences, 10, 1076.

Dasari, H. (2025). Site reliability engineering practices for error budget management in large-scale systems. International Journal of Applied Mathematics, 38(5s), 991–1001.

Yu, G. I., Jeong, J. S., Kim, G. W., Kim, S., and Chun, B. G. (2022). Orca: A distributed serving system for transformer-based generative models. Proceedings of the USENIX Symposium on Operating Systems Design and Implementation.

Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., and Zhang, H. (2024). DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. arXiv:2401.09670.

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M. A., Lacroix, T., Roziere, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971.

Zheng, L., Chiang, W. L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. (2024). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36.

Wang, Z., Li, S., Li, X., Zhou, Y., Zhang, Z., Wang, Z., Gu, R., Tian, C., Yang, K., and Zhong, S. (2025). Echo: Efficient co-scheduling of hybrid online-offline tasks for large language model serving. arXiv:2504.03651.

Zhao, Y., Yang, S., Zhu, K., Zheng, L., Kasikci, B., Zhou, Y., Xing, J., and Stoica, I. (2024). BlendServe: Optimizing offline inference for auto-regressive large models with resource-aware batching. arXiv:2411.16102.

Wake, A., Chen, B., Lv, C. X., Li, C., Huang, C., Cai, C., Zheng, C., Cooper, D., Zhou, F., Hu, F., Wang, G., Ji, H., Qiu, H., Zhu, J., Tian, J., Su, K., Zhang, L., Li, L., Song, M., Li, M., Liu, P., Hu, Q., Wang, S., Zhou, S., Yang, S., Li, S., Zhu, T., Xie, W., He, X., Chen, X., Hu, X., Ren, X., Niu, X., Li, Y., Zhao, Y., Luo, Y., Xu, Y., Sha, Y., Yan, Z., Liu, Z., Zhang, Z., and Dai, Z. (2024). Yi-Lightning Technical Report. arXiv:2412.01253.

Yang, A., et al. (2025). Qwen2.5 Technical Report. arXiv:2412.15115.

Canizo, M., Conde, A., Charramendieta, S., Minon, R., Cid-Fuentes, R. G., and Onieva, E. (2019). Implementation of a large-scale platform for cyber-physical system real-time monitoring. IEEE Access, 7, 52455–52466.

Habeeb, R. A. A., Nasaruddin, F., Gani, A., Hashem, I. A. T., Ahmed, E., and Imran, M. (2019). Real-time big data processing for anomaly detection: A survey. International Journal of Information Management, 45, 289–307.

Chen, Y., Iyer, S., Liu, X., Milojicic, D., and Sahai, A. (2007). SLA decomposition: Translating service level objectives to system level thresholds. Proceedings of the International Conference on Autonomic Computing.

Poniszewska-Maranda, A., Matusiak, R., Kryvinska, N., and Yasar, A. U. H. (2020). A real-time service system in the cloud. Journal of Ambient Intelligence and Humanized Computing, 11, 961–977.

Sedlak, B., Casamayor Pujol, V., Donta, P. K., and Dustdar, S. (2024). Markov blanket composition of SLOs. Proceedings of the IEEE International Conference on Edge Computing and Communications.

Sedlak, B., Casamayor Pujol, V., Donta, P. K., and Dustdar, S. (2024). Equilibrium in the computing continuum through active inference. Future Generation Computer Systems.

Kitson, N. K., Constantinou, A. C., Guo, Z., Liu, Y., and Chobtham, K. (2023). A survey of Bayesian network structure learning. Artificial Intelligence Review, 56, 8721–8814.

Ankan, A., and Textor, J. (2023). pgmpy: A Python toolkit for Bayesian networks.

How to Cite

Kaniel Verhoeven. (2026). Reliability-Aware Error Budget Governance and Resource Orchestration for Large-Scale Language Model Serving Infrastructures. The American Journal of Interdisciplinary Innovations and Research, 8(01), 155–160. Retrieved from https://www.theamericanjournals.com/index.php/tajiir/article/view/7437