With the growing accumulation of real-world data and advances in computing technology, artificial intelligence (AI) is transforming the research paradigm of pharmacoepidemiology. Based on the Guide on Methodological Standards in Pharmacoepidemiology (2nd edition), this paper examines two core application domains of AI in pharmacoepidemiological research: targeted data extraction and the generation of data insights. For data extraction, it describes the role of natural language processing (NLP) and machine learning (ML) in converting unstructured medical text into structured data, and emphasizes the need to balance precision and recall according to the clinical application scenario and to report model performance metrics in a standardized manner. For data insights, it analyzes the methodological advantages and limitations of ML in controlling complex confounding bias, constructing high-dimensional clinical prediction models, and probabilistic phenotyping. It also reviews the two ML modeling pathways for identifying heterogeneity of treatment effect (HTE): the "risk-based" and "effect-based" approaches. Finally, to address the "black-box" nature of deep learning, it discusses the importance of explainable AI (XAI) in supporting the review of medical risks and mitigating ethical concerns, as well as the inherent limitations of XAI in elucidating the underlying decision-making mechanisms of models.
This paper aims to provide researchers with rigorous methodological support and practical guidance for the standardized application of AI technologies, with a view to enhancing the quality of pharmacoepidemiological research and the reliability of decision-making.
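The balance between precision and recall noted above for data extraction can be made concrete with a minimal sketch. The labels, scores, and thresholds below are hypothetical illustrations and are not drawn from the Guide or from any study. The sketch assumes a binary extraction task (for example, flagging whether a clinical note documents an adverse drug reaction) in which a model outputs a score and a threshold converts that score into a decision.

```python
from typing import List, Tuple


def precision_recall_f1(
    y_true: List[int], scores: List[float], threshold: float
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for binary labels at a score threshold."""
    # A record is predicted positive when its score reaches the threshold.
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical reference labels (1 = event documented) and model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.15, 0.10, 0.05]

# Lower thresholds favor recall; higher thresholds favor precision.
for threshold in (0.25, 0.50, 0.75):
    p, r, f = precision_recall_f1(y_true, scores, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```

In this toy example, lowering the threshold captures every true event at the cost of more false positives, which may suit safety signal screening, whereas raising it yields fewer but more reliable positives, which may suit outcome ascertainment. Reporting precision, recall, and F1 together with the chosen threshold is one way to present performance in a standardized manner.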