Optimizing Large Language Model Deployment with Scalable Inference and Ensemble Techniques


Gurupriya Adurthy

Abstract

The rapid expansion of complex system logs in modern infrastructures has heightened the need for accurate, interpretable, and low-latency risk analysis. These logs contain high-dimensional, context-rich data that is essential for operational reliability, cybersecurity, and compliance. While conventional machine learning models are efficient, they often overlook the nuanced semantic relationships in sequential log data, limiting predictive reliability. Conversely, large language models (LLMs) offer deeper contextual understanding but are computationally intensive, making them unsuitable for real-time, large-scale deployment. This study presents a deployment-optimised pipeline that balances semantic depth with computational efficiency for log-based risk prediction. The architecture integrates lightweight MiniLM embeddings with an XGBoost classifier to produce interpretable, high-quality predictions at reduced computational cost. Key optimisations include class balancing to address dataset skew, model quantisation to lower memory usage, and batched inference to increase throughput, enabling cost-effective CPU-only execution without GPUs. A structured evaluation examined accuracy, latency, and memory trade-offs across production scenarios. Testing on representative log datasets showed notable gains over a TF-IDF baseline: classification accuracy improved from 21.4% to 57.1%, weighted F1-scores rose accordingly, and inference latency decreased with negligible loss in predictive strength. By combining transformer-based dense embeddings with gradient-boosted decision trees, this approach delivers a practical balance of semantic expressiveness, interpretability, and deployment efficiency. The framework supports scalable, real-time risk prediction for cybersecurity monitoring, compliance auditing, and IT operations, bridging the gap between advanced language modelling and real-world infrastructure constraints.



How to Cite

[1] Gurupriya Adurthy, "Optimizing Large Language Model Deployment with Scalable Inference and Ensemble Techniques", IJEAT, vol. 15, no. 2, pp. 9–14, Dec. 2025, doi: 10.35940/ijeat.A4692.15021225.

