Research progress on big-data-driven analysis strategies for imbalanced data of rare events

Published on Aug. 29, 2025

Authors: ZHOU Jiangjie ¹, WANG Yutong ², FENG Tian ¹, MENG Xianglong ², LIANG Baosheng ¹, WANG Shengfeng ²

Affiliation: 1. Department of Biostatistics, School of Public Health, Peking University Health Science Center, Beijing 100191, China 2. Department of Epidemiology and Biostatistics, School of Public Health, Peking University Health Science Center, Beijing 100191, China

Keywords: Rare events; Imbalanced data; Data-driven; Deep learning

DOI: 10.12173/j.issn.1005-0698.202411080

Reference: ZHOU Jiangjie, WANG Yutong, FENG Tian, MENG Xianglong, LIANG Baosheng, WANG Shengfeng. Research progress on big-data-driven analysis strategies for imbalanced data of rare events[J]. Yaowu Liuxingbingxue Zazhi, 2025, 34(8): 952-961. DOI: 10.12173/j.issn.1005-0698.202411080. [Article in Chinese]

Abstract

Rare events arise across many disciplines; examples include rare adverse reactions to vaccines and drugs, rare clinical diseases, and low-probability clinical outcomes. Such events attract research interest because their occurrence often brings serious consequences that are difficult to quantify. In the era of big data, numerous methods have emerged for analyzing rare-event data, including sampling-based methods, class weighting, ensemble learning, and deep learning. This article systematically summarizes research progress on current rare-event data analysis methods and introduces their basic principles and applicable scenarios. By analyzing the advantages and disadvantages of existing methods, it identifies the main challenges in rare-event research and explores potential research directions in related fields, providing a reference for researchers.
