
Optimizing Cloud-Native LLM Workloads with Serverless GPU Orchestration and Token-Aware Scheduling
Pradeep Rao Vennamaneni, Senior Data Engineer - Lead, USA
Abstract
Cloud-native LLM inference faces bursty, size-variable demand that leads to head-of-line blocking, cold-start overheads, and highly variable tail latency. This study proposes an end-to-end design that integrates token-aware scheduling with serverless GPU orchestration to meet TTFT and end-to-end latency SLOs at reduced cost. The architecture combines a feasibility-aware admission controller, prefill- and decode-aware micro-batching, KV-cache paging with watermarks, warm pools that eliminate cold starts, and autoscaling driven by queue- and token-level signals; placement spans full-GPU, MIG, and MPS modes under per-tenant policies. Deployed on Kubernetes with 7B-70B decoder-only models served through vLLM, TensorRT-LLM, and Triton backends, the system targets heterogeneous H100/A100/L4 fleets and chat/RAG workloads with heavy-tailed token-length distributions. In experiments, the approach improved cluster throughput by 31.7 percent over the strongest baseline, reduced P95 TTFT to 420 ms and P99 latency to 1.3 s, increased SM and memory-bandwidth utilization, and cut cost per one million output tokens by 26.8 percent while preserving comparable per-tier fairness. The contributions include a production-ready control-plane/data-plane design; SLO-aware admission, degradation, and routing; token-aware batching with KV-cache accounting that avoids memory-driven stalls; and a reproducible evaluation recipe with KPIs covering TTFT, P95/P99 latency, tokens/s, utilization, and $/1M output tokens. Together, these findings chart a scalable deployment path toward predictable latency and efficiency, and the blueprint reflects the operational realities operators face today.
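To make the token-aware admission and micro-batching idea in the abstract concrete, the following is a minimal, hypothetical Python sketch, not the paper's implementation: requests are packed into a prefill micro-batch only while a projected TTFT stays within its SLO budget and the KV cache sits below a high watermark. All names and constants (Request, TTFT_SLO_MS, PREFILL_TOKENS_PER_MS, KV_HIGH_WATERMARK) are illustrative assumptions, not values from the study.

from dataclasses import dataclass, field
from typing import List

TTFT_SLO_MS = 420.0           # hypothetical time-to-first-token budget (ms)
PREFILL_TOKENS_PER_MS = 900   # assumed prefill throughput of one GPU worker
KV_HIGH_WATERMARK = 0.85      # pause admission above this KV-cache fill ratio

@dataclass
class Request:
    request_id: str
    prompt_tokens: int         # prompt length is known at admission time
    max_new_tokens: int        # decode budget reserved up front

@dataclass
class MicroBatch:
    requests: List[Request] = field(default_factory=list)

    @property
    def prefill_tokens(self) -> int:
        return sum(r.prompt_tokens for r in self.requests)

def admit(queue: List[Request], kv_fill_ratio: float) -> MicroBatch:
    """Greedily pack FIFO requests while the projected TTFT stays within
    the SLO budget and the KV cache is below its high watermark."""
    batch = MicroBatch()
    if kv_fill_ratio >= KV_HIGH_WATERMARK:
        return batch                      # defer admission to avoid memory-driven stalls
    for req in list(queue):
        projected_tokens = batch.prefill_tokens + req.prompt_tokens
        est_ttft_ms = projected_tokens / PREFILL_TOKENS_PER_MS
        if est_ttft_ms > TTFT_SLO_MS:
            break                         # admitting this request would blow the TTFT budget
        batch.requests.append(req)
        queue.remove(req)
    return batch

if __name__ == "__main__":
    pending = [Request("a", 1_200, 256), Request("b", 400_000, 256), Request("c", 800, 128)]
    batch = admit(pending, kv_fill_ratio=0.60)
    print([r.request_id for r in batch.requests])   # -> ['a']; 'b' would exceed the TTFT budget

In the full design summarized above, such an admission test would sit behind the serverless control plane alongside decode-aware batching, warm-pool management, and autoscaling signals; this sketch covers only the prefill-side budget check.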
Keywords
Cloud-native LLM serving, Serverless GPU orchestration, Token-aware scheduling, SLO-aware admission control, KV-cache management
Copyright License
Copyright (c) 2024 Pradeep Rao Vennamaneni

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors retain the copyright of their manuscripts, and all Open Access articles are disseminated under the terms of the Creative Commons Attribution License 4.0 (CC-BY), which permits unrestricted use, distribution, and reproduction in any medium, provided that the original work is appropriately cited. The use of general descriptive names, trade names, trademarks, and so forth in this publication, even if not specifically identified, does not imply that these names are not protected by the relevant laws and regulations.