Dynamic GEN AI-Powered Web Crawling on Azure Using Automation Account and GPT-3.5


Chandan Srinath
Sakshi Srivastava

Abstract

AI-powered automation marks a significant advance over traditional web crawling methods, which were often labor-intensive, inflexible, and prone to security risks. This paper presents a case study on implementing a dynamic web crawling solution with an Azure Automation Account and GPT-3.5 from Azure OpenAI Service. The approach supports parameterized execution through automation variables, so user-defined requirements guide the crawler's behavior. Unlike earlier static methods that required constant manual adjustment, the system uses GPT-3.5's natural language processing (NLP) capabilities to interpret complex instructions and adapt dynamically to varied web structures. After crawling, the data is scanned with ClamAV to verify its integrity before it is stored in Azure Blob Storage. SendGrid sends users alerts about the scan results and storage status. The system is scheduled to run at regular intervals, which fully automates the process while keeping the security scan in place. The paper includes a detailed comparison between traditional web crawling techniques and this AI-driven approach, demonstrating improvements in efficiency, security, and adaptability.
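The stages named in the abstract (parameterized input, language-model extraction, security scan, storage, notification) can be sketched as a small orchestration function. This is an illustrative sketch only, not the authors' implementation: `CrawlParams`, `CrawlResult`, `run_crawl`, and the five injected callables are hypothetical names, and the real services (Azure OpenAI, ClamAV, Azure Blob Storage, SendGrid) are represented by plain functions so the flow can be read and run without any cloud dependency. Skipping storage when the scan fails is also an assumption; the abstract only states that the scan happens before storage.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CrawlParams:
    """User-defined requirements (hypothetical; e.g. read from automation variables)."""
    url: str
    instruction: str
    recipient: str


@dataclass
class CrawlResult:
    url: str
    clean: bool
    stored: bool


def run_crawl(
    params: CrawlParams,
    fetch: Callable[[str], str],          # stands in for the HTTP fetch
    extract: Callable[[str, str], str],   # stands in for the GPT-3.5 call
    scan: Callable[[bytes], bool],        # stands in for the ClamAV scan
    store: Callable[[str, bytes], None],  # stands in for the Blob Storage upload
    notify: Callable[[str, str], None],   # stands in for the SendGrid alert
) -> CrawlResult:
    """Fetch a page, extract data per the instruction, scan, store, notify."""
    page = fetch(params.url)
    # Language-model step: interpret the user's instruction against the page.
    data = extract(params.instruction, page).encode("utf-8")
    # The scan runs before anything is written to storage.
    clean = scan(data)
    if clean:
        # Assumption: data that fails the scan is never stored.
        store(params.url, data)
    status = "clean, stored" if clean else "infected, not stored"
    notify(params.recipient, f"{params.url}: {status}")
    return CrawlResult(url=params.url, clean=clean, stored=clean)
```

Passing the services in as callables keeps the ordering of the stages explicit and lets each one be replaced by a stub when the flow is exercised outside Azure.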



Author Biographies

Chandan Srinath, Digital Aviation Solutions, Boeing India Pvt. Ltd., Bangalore, India.

Sakshi Srivastava, Digital Aviation Solutions, Boeing India Pvt. Ltd., Bangalore, India.

Sakshi Srivastava is a talented software engineer hailing from West Bengal, India. She pursued her passion for technology by completing her Bachelor of Engineering in Computer Science (CSE) from BMS College of Engineering, a prestigious institution in Bangalore. In August 2022, Sakshi began her professional journey as an Associate Software Engineer at Boeing, one of the world's leading aerospace companies. Starting as a fresher, she quickly demonstrated her skills and enthusiasm for innovation and learning. Her curiosity and commitment to personal and professional growth have made her an asset to her team.

How to Cite

[1]
Chandan Srinath and Sakshi Srivastava, “Dynamic GEN AI-Powered Web Crawling on Azure Using Automation Account and GPT-3.5”, IJEAT, vol. 14, no. 2, pp. 6–10, Feb. 2025, doi: 10.35940/ijeat.B4556.14021224.

