Optimizing Large Language Model Deployment with Scalable Inference and Ensemble Techniques
Abstract
The rapid expansion of complex system logs in modern infrastructures has heightened the need for accurate, interpretable, and low-latency risk analysis. These logs contain high-dimensional, context-rich data that is essential for operational reliability, cybersecurity, and compliance. While conventional machine learning models are efficient, they often overlook the nuanced semantic relationships in sequential log data, limiting predictive reliability. Conversely, large language models (LLMs) offer deeper contextual understanding but are computationally intensive, making them unsuitable for real-time, large-scale deployment. This study presents a deployment-optimized pipeline that balances semantic depth with computational efficiency for log-based risk prediction. The architecture integrates lightweight MiniLM embeddings with an XGBoost classifier to produce interpretable, high-quality predictions at reduced computational cost. Key optimizations include class balancing to address dataset skew, model quantization to lower memory usage, and batched inference to increase throughput, enabling cost-effective CPU-only execution without GPUs. A structured evaluation examined accuracy, latency, and memory trade-offs across production scenarios. Testing on representative log datasets showed notable gains over a TF-IDF baseline: classification accuracy improved from 21.4% to 57.1%, weighted F1-scores rose accordingly, and inference latency decreased with negligible loss in predictive strength. By combining transformer-based dense embeddings with gradient-boosted decision trees, this approach delivers a practical balance of semantic expressiveness, interpretability, and deployment efficiency. The framework supports scalable, real-time risk prediction for cybersecurity monitoring, compliance auditing, and IT operations, bridging the gap between advanced language modeling and real-world infrastructure constraints.
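The batched-inference optimization described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `stub_embed` function is a hypothetical stand-in for the MiniLM encoder (which would normally be called via a library such as sentence-transformers), and all names and dimensions are illustrative.

```python
from typing import Callable, List, Sequence

def batched_embed(
    logs: Sequence[str],
    embed_fn: Callable[[Sequence[str]], List[List[float]]],
    batch_size: int = 32,
) -> List[List[float]]:
    """Embed log lines in fixed-size batches to amortize per-call
    model overhead and raise throughput on CPU-only hosts."""
    embeddings: List[List[float]] = []
    for start in range(0, len(logs), batch_size):
        batch = logs[start:start + batch_size]
        embeddings.extend(embed_fn(batch))
    return embeddings

def stub_embed(batch: Sequence[str]) -> List[List[float]]:
    """Hypothetical stand-in for a MiniLM encoder: maps each log line
    to a small fixed-length vector via character hashing."""
    dim = 8
    out = []
    for line in batch:
        vec = [0.0] * dim
        for ch in line:
            vec[ord(ch) % dim] += 1.0
        out.append(vec)
    return out

logs = [f"auth failure host-{i}" for i in range(100)]
vectors = batched_embed(logs, stub_embed, batch_size=32)
print(len(vectors), len(vectors[0]))  # prints "100 8"
```

In the full pipeline, the resulting dense vectors would be passed to a gradient-boosted classifier (here, XGBoost) for risk prediction; batching the embedding step is what keeps end-to-end latency low at scale.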
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.