RiskBERT: A Pre-Trained Insurance-Based Language Model for Text Classification

Rida Ghafoor Hussain

Abstract

The rapid growth of insurance-related documents has increased the need for efficient and accurate text classification techniques. Advances in natural language processing (NLP) and deep learning have enabled the extraction of valuable insights from textual data, particularly in specialised domains such as insurance, legal, and scientific documents. While Bidirectional Encoder Representations from Transformers (BERT) models have demonstrated state-of-the-art performance across various NLP tasks, their application to domain-specific corpora often results in suboptimal accuracy due to linguistic and contextual differences. In this study, I propose RiskBERT, a domain-specific language representation model obtained by further pre-training LegalBERT on insurance corpora to enhance its understanding of insurance-related texts. RiskBERT was evaluated on downstream clause and provision classification tasks using two benchmark datasets, LEDGAR [1] and Unfair ToS [2], in a comparative analysis against BERT-Base and LegalBERT to assess the impact of domain-specific pre-training on classification performance. The experimental results show that RiskBERT significantly outperforms both baselines, achieving 96.8% accuracy on LEDGAR and 92.1% on Unfair ToS. These findings demonstrate that domain-adaptive pre-training substantially improves the analysis of complex insurance texts and underscore the importance of specialised language models for insurance document processing.
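
In broad terms, the approach comprises two stages: continued masked-language-model (MLM) pre-training of LegalBERT on insurance text, followed by supervised fine-tuning for clause and provision classification. The minimal Python sketch below illustrates one way such a pipeline could be implemented with the Hugging Face Transformers and Datasets libraries; the nlpaueb/legal-bert-base-uncased checkpoint, the insurance_corpus.txt file, the lex_glue LEDGAR loader, and all hyperparameters are illustrative assumptions, not the configuration actually used for RiskBERT.

    # Sketch of RiskBERT-style domain-adaptive pre-training and fine-tuning.
    # Assumptions: a public LegalBERT checkpoint, a local plain-text insurance
    # corpus, and the LexGLUE distribution of LEDGAR; settings are illustrative.
    from datasets import load_dataset
    from transformers import (
        AutoModelForMaskedLM,
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    BASE = "nlpaueb/legal-bert-base-uncased"  # LegalBERT starting point
    tokenizer = AutoTokenizer.from_pretrained(BASE)

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True, max_length=512)

    # Stage 1: continue MLM pre-training on insurance text.
    # "insurance_corpus.txt" is a placeholder for the insurance corpus.
    corpus = load_dataset("text", data_files={"train": "insurance_corpus.txt"})
    corpus = corpus.map(tokenize, batched=True, remove_columns=["text"])
    mlm_model = AutoModelForMaskedLM.from_pretrained(BASE)
    Trainer(
        model=mlm_model,
        args=TrainingArguments(output_dir="riskbert-mlm",
                               per_device_train_batch_size=8,
                               num_train_epochs=3),
        train_dataset=corpus["train"],
        # Randomly masks 15% of tokens, as in standard BERT pre-training.
        data_collator=DataCollatorForLanguageModeling(tokenizer,
                                                      mlm_probability=0.15),
    ).train()
    mlm_model.save_pretrained("riskbert")
    tokenizer.save_pretrained("riskbert")

    # Stage 2: fine-tune the adapted encoder for provision classification
    # on LEDGAR; a fresh classification head is initialised on top.
    ledgar = load_dataset("lex_glue", "ledgar").map(tokenize, batched=True)
    clf = AutoModelForSequenceClassification.from_pretrained(
        "riskbert",
        num_labels=ledgar["train"].features["label"].num_classes,
    )
    Trainer(
        model=clf,
        args=TrainingArguments(output_dir="riskbert-ledgar",
                               per_device_train_batch_size=16,
                               num_train_epochs=3),
        train_dataset=ledgar["train"],
        eval_dataset=ledgar["validation"],
        tokenizer=tokenizer,  # enables dynamic padding of each batch
    ).train()

Evaluating BERT-Base or LegalBERT as baselines amounts to repeating the second stage with the corresponding checkpoint in place of the adapted one.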

Article Details

How to Cite

[1] Rida Ghafoor Hussain, “RiskBERT: A Pre-Trained Insurance-Based Language Model for Text Classification”, IJITEE, vol. 14, no. 7, pp. 12–18, Jun. 2025, doi: 10.35940/ijitee.F1097.14070625.

References

Don Tuggener, Pius von Däniken, Thomas Peetz, and Mark Cieliebak. 2020. LEDGAR: A Large-Scale Multi-label Corpus for Text Classification of Legal Provisions in Contracts. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1235–1241, Marseille, France. European Language Resources Association. https://aclanthology.org/2020.lrec-1.155.pdf

Marco Lippi, Przemysław Pałka, Giuseppe Contissa, et al. 2019. CLAUDETTE: an automated detector of potentially unfair clauses in online terms of service. Artificial Intelligence and Law 27, 117–139. DOI: https://doi.org/10.1007/s10506-019-09243-2

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention Is All You Need. In 31st Annual Conference on Neural Information Processing Systems, Long Beach, CA, USA. DOI: https://doi.org/10.48550/arXiv.1706.03762

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. DOI: https://doi.org/10.48550/arXiv.1810.04805

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR. DOI: https://doi.org/10.48550/arXiv.1907.11692

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. 2019. XLNet: Generalized Autoregressive Pretraining for Language Understanding. CoRR. DOI: https://doi.org/10.48550/arXiv.1906.08237

Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. CoRR. DOI: https://doi.org/10.48550/arXiv.1909.11942

Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. 2019. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. CoRR. DOI: https://doi.org/10.48550/arXiv.1901.08746

Ilias Chalkidis, Manos Fergadiotis, Prodromos Malakasiotis, Nikolaos Aletras, and Ion Androutsopoulos. 2020. LEGAL-BERT: The Muppets straight out of Law School. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2898–2904, Online. Association for Computational Linguistics. https://aclanthology.org/2020.findings-emnlp.261.pdf

Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jin, Tristan Naumann, and Matthew McDermott. 2019. Publicly available clinical BERT embeddings. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 72–78, Minneapolis, Minnesota, USA. DOI: https://doi.org/10.48550/arXiv.1904.03323

Iz Beltagy, Kyle Lo, and Arman Cohan. 2019. SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3606–3611, Hong Kong, China. https://aclanthology.org/D19-1371.pdf

Chul Sung, Tejas Dhamecha, Swarnadeep Saha, Tengfei Ma, Vinay Reddy, and Rishi Arora. 2019. Pre-training BERT on domain resources for short answer grading. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 6071–6075, Hong Kong, China. Association for Computational Linguistics. https://aclanthology.org/D19-1628.pdf

Marie-Francine Moens, Erik Boiy, Raquel Mochales Palau, and Chris Reed. 2007. Automatic detection of arguments in legal texts. In Proceedings of the 11th International Conference on Artificial Intelligence and Law, ACM, pages 225–230. DOI: https://doi.org/10.1145/1276318.1276362

Marco Lippi and Paolo Torroni. 2016. Argumentation mining: State of the art and emerging trends. ACM Transactions on Internet Technology 16(2), 10:1–10:25. DOI: https://doi.org/10.1145/2850417

Olga Shulayeva, Advaith Siddharthan, and Adam Wyner. 2017. Recognizing cited facts and principles in legal judgements. Artificial Intelligence and Law 25(1), 107–126. https://link.springer.com/article/10.1007/s10506-017-9197-6

Nikolaos Aletras, Dimitrios Tsarapatsanis, Daniel Preoţiuc-Pietro, and Vasileios Lampos. 2016. Predicting judicial decisions of the European Court of Human Rights: a natural language processing perspective. PeerJ Computer Science 2:e93. https://peerj.com/articles/cs-93/

Benjamin Fabian, Tatiana Ermakova, and Tino Lentz. 2017. Large-scale readability analysis of privacy policies. In Proceedings of the International Conference on Web Intelligence, ACM, pages 18–25. DOI: https://doi.org/10.1145/3106426.3106427

Kevin D. Ashley and Vern R. Walker. 2013. Toward constructing evidence-based legal arguments using legal decision documents and machine learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Law, ACM, pages 176–180. https://sites.hofstra.edu/vern-walker/wp-content/uploads/sites/69/2019/12/Ashley_Walker-Toward_Evidence-Based_Legal_Arguments-ICAIL2013.pdf

Laila Rasmy, Yang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. 2021. Med-BERT: pre-trained contextualized embeddings on large-scale structured electronic health records for disease prediction. npj Digital Medicine. https://www.nature.com/articles/s41746-021-00455-y

Thomas Vakili, Anastasios Lamproudis, Aron Henriksson, Hercules Dalianis. Downstream Task Performance of BERT Models Pre-Trained Using Automatically De-Identified Clinical Data. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pages 4245–4252, Marseille, 20-25 June 2022. https://aclanthology.org/2022.lrec-1.451/

Iiro Rastas, Yann Ciarán Ryan, Iiro Tiihonen, Mohammadreza Qaraei, Liina Repo, Rohit Babbar, Eetu Mäkelä, Mikko Tolonen, and Filip Ginter. 2022. Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model. In Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change, pages 68–77, Dublin, Ireland. Association for Computational Linguistics. https://aclanthology.org/2022.lchange-1.7.pdf

Hang Li. 2022. Language Models: Past, Present, and Future. Communications of the ACM 65(7), pages 56–63. DOI: https://doi.org/10.1145/3490443

Rukhma Qasim, Waqas Haider Bangyal, Mohammed A. Alqarni, and Abdulwahab Ali Almazroi. 2022. A Fine-Tuned BERT-Based Transfer Learning Approach for Text Classification. Journal of Healthcare Engineering, vol. 2022, Article ID 3498123, 17 pages. DOI: https://doi.org/10.1155/2022/3498123

Mikhail Koroteev. 2021. BERT: A Review of Applications in Natural Language Processing and Understanding. CoRR. DOI: https://doi.org/10.48550/arXiv.2103.11943

Athar Hussein Mohammed and Ali H. Ali. 2021. J. Phys.: Conf. Ser. 1963, 012173. https://iopscience.iop.org/article/10.1088/1742-6596/1963/1/012173/pdf

Shih-Hsuan Chiu and Berlin Chen. 2021. Innovative BERT-Based Reranking Language Models for Speech Recognition. In 2021 IEEE Spoken Language Technology Workshop (SLT), pages 266–271. DOI: https://doi.org/10.1109/SLT48900.2021.9383557

Wang, Z., Li, Y., and Zhu, Z. 2021. BERT-based knowledge extraction method for unstructured domain text. CoRR. DOI: https://doi.org/10.48550/arXiv.2103.00728

DeepSee.ai. 2021. Meet FILBERT: Google’s BERT trained on finance, insurance, legal. https://deepsee.ai/knowledge-center/meet-filbert-googles-bert-trained-on-finance-insurance-legal/

Eling, M., Gemmo, I., Guxha, D., and Schmeiser, H. 2024. Big data, risk classification, and privacy in insurance markets. The Geneva Risk and Insurance Review. DOI: https://doi.org/10.1057/s10713-024-00098-5

Khan, Z., Olivia, D., and Shetty, S. 2025. A Machine Learning and Explainable Artificial Intelligence Approach for Insurance Fraud Classification. Inteligencia Artificial 28(75), 140–169. DOI: https://doi.org/10.4114/intartif.vol28iss75

Dhanekulla, P. (2024). Real-Time Risk Assessment in Insurance: A Deep Learning Approach to Predictive Modelling. International Journal For Multidisciplinary Research. https://www.ijfmr.com/papers/2024/6/31710.pdf

Hou, Y., Xia, X., and Gao, G. 2024. Combining structural and unstructured data: A topic-based finite mixture model for insurance claim prediction. arXiv preprint arXiv:2410.04684. DOI: https://doi.org/10.48550/arXiv.2410.04684

A. Dimri, S. Yerramilli, P. Lee, S. Afra, and A. Jakubowski. 2019. Enhancing Claims Handling Processes with Insurance Based Language Models. In 2019 18th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1750–1755. https://ieeexplore.ieee.org/document/8999288

Tang, S., Liu, Q., and Tan, Wa. 2019. Intention Classification Based on Transfer Learning: A Case Study on Insurance Data. In: Milošević, D., Tang, Y., Zu, Q. (eds) Human Centered Computing. HCC 2019. Lecture Notes in Computer Science, vol 11956. Springer, Cham. DOI: https://doi.org/10.1007/978-3-030-37429-7_36
