Enhancing GPU-HBM Data Transfer Efficiency Using Markov Chains and Neural Network-Driven Predictive Caching with Quantization and Pruning


Samiel Azmaien

Abstract

Background: High-bandwidth memory (HBM) systems face persistent data transfer bottlenecks, particularly when CPUs cannot supply data to GPUs at a sufficient rate. This limitation reduces overall computational efficiency and highlights the need for improved cache management strategies. Methods: Markov chains were used to represent transitions between frequently accessed memory blocks, enabling predictive sequencing of data needs. A neural network (NN) was then applied to model and optimise these Markov transitions, improving cache prefetching accuracy and further optimising data movement. Results & Conclusions: The combined use of Markov-based memory modelling, NN optimisation, and supplementary data transfer techniques demonstrates strong potential to mitigate CPU–GPU bandwidth limitations. Together, these methods offer more efficient cache utilisation and reduced bottlenecks in high-demand computational environments.
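
To make the predictive-caching idea concrete, the sketch below estimates a first-order Markov model from a memory-block access trace and uses its most probable successors as prefetch candidates. It is a minimal illustration only: the trace, the block IDs, and the predict_prefetch helper are assumptions made for this example, not the paper's implementation, and the neural-network refinement, quantization, and pruning stages named in the title are not shown.

```python
from collections import defaultdict

# Hypothetical memory-block access trace (block IDs). In the setting described
# in the abstract, such a trace would come from profiling GPU accesses to
# HBM-backed data rather than being hard-coded.
trace = [0, 1, 2, 0, 1, 3, 0, 1, 2, 4, 0, 1, 3, 0, 1, 2]

# Count first-order Markov transitions between consecutively accessed blocks.
counts = defaultdict(lambda: defaultdict(int))
for cur, nxt in zip(trace, trace[1:]):
    counts[cur][nxt] += 1

# Normalise the counts into transition probabilities P(next | current).
transitions = {
    cur: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
    for cur, nxts in counts.items()
}

def predict_prefetch(current_block, k=1):
    """Return the k most probable next blocks as prefetch candidates."""
    nxts = transitions.get(current_block, {})
    return sorted(nxts, key=nxts.get, reverse=True)[:k]

if __name__ == "__main__":
    print(transitions[0])          # e.g. {1: 1.0}: block 0 is always followed by block 1
    print(predict_prefetch(1, 2))  # two most likely successors of block 1
```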


Article Details

Section

Articles

How to Cite

[1]
Samiel Azmaien, “Enhancing GPU-HBM Data Transfer Efficiency Using Markov Chains and Neural Network-Driven Predictive Caching with Quantization and Pruning”, IJSCE, vol. 15, no. 6, pp. 12–16, Jan. 2026, doi: 10.35940/ijsce.F3700.15060126.

