Using text mining techniques to extract predictive information for prostate cancer (Gleason score) from semi-structured narrative laboratory reports in Gauteng Province, South Africa



BMC Med Inform Decis Mak. November 25, 2021; 21(1): 330. doi: 10.1186/s12911-021-01697-2.


BACKGROUND: Prostate cancer (PCa) is the leading male neoplasm in South Africa with an age-standardized incidence rate of 68.0 per 100,000 population in 2018. The Gleason score (GS) is the strongest predictor for the treatment of PCa and is integrated into semi-structured narrative prostate biopsy reports. Manual extraction of the GS is laborious. The objective of our study was to explore the use of text mining techniques to automate the extraction of GS from irregularly reported text-intensive patient reports.

METHODS: We used the associated Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) morphology and topography codes to identify prostate biopsies with a PCa diagnosis in men aged > 30 years between 2006 and 2016 in Gauteng Province, South Africa. We developed a text mining algorithm to extract the GS from 1000 biopsy reports with a PCa diagnosis from the National Health Laboratory Service database and validated the algorithm using 1000 biopsies from the private sector. The logical steps of the algorithm were data acquisition, preprocessing, feature extraction, feature value representation, feature selection, information extraction, classification, and knowledge discovery. We evaluated the algorithm using precision, recall and F-score. The GS was manually coded by two experts for both datasets. The five most frequently reported GS are presented, with the remaining scores categorized as “Other” for both datasets. The percentage of biopsies with high-risk GS (≥ 8) is also reported.
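The extraction and evaluation steps can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the regular expression, the function names (extract_gleason, precision_recall_f1) and the sample report sentences are assumptions made purely for illustration. The sketch pulls a "primary + secondary" pattern out of free text, then scores the extractions against manually coded values using precision, recall and F-score.

```python
import re

# Hypothetical pattern (an assumption, not taken from the study):
# the word "Gleason", up to 40 characters of filler, then "primary + secondary".
GLEASON_RE = re.compile(r"gleason.{0,40}?([1-5])\s*\+\s*([1-5])", re.IGNORECASE)


def extract_gleason(report):
    """Return 'primary+secondary=total' for the first match, or None."""
    match = GLEASON_RE.search(report)
    if match is None:
        return None
    primary, secondary = int(match.group(1)), int(match.group(2))
    # The total is recomputed rather than read from the text.
    return f"{primary}+{secondary}={primary + secondary}"


def precision_recall_f1(predicted, expected):
    """Score extractions against manual coding.

    None means "no score". A wrong extraction counts as both a false
    positive and a false negative (one common convention, assumed here).
    """
    pairs = list(zip(predicted, expected))
    tp = sum(1 for p, e in pairs if p is not None and p == e)
    fp = sum(1 for p, e in pairs if p is not None and p != e)
    fn = sum(1 for p, e in pairs if e is not None and p != e)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented example sentences, not real patient data.
reports = [
    "Adenocarcinoma, Gleason score 4 + 3 = 7.",
    "Gleason grade: 3+3",
    "Benign prostatic tissue.",
]
manual = ["4+3=7", "3+3=6", None]
predicted = [extract_gleason(r) for r in reports]
print(predicted, precision_recall_f1(predicted, manual))
```

A single regular expression like this would miss many of the irregular phrasings the study describes, which is presumably why the authors' pipeline includes separate preprocessing, feature selection and classification steps.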

RESULTS: The first output of the algorithm reported an F-score of 0.99, which improved to 1.00 after the algorithm was amended (the GS reported in clinical history was ignored). For the validation dataset, an F-score of 0.99 was reported. The most frequently reported GS were 5 + 4 = 9 (17.6%), 3 + 3 = 6 (17.5%), 4 + 3 = 7 (16.4%), 3 + 4 = 7 (14.7%) and 4 + 4 = 8 (14.2%). For the validation dataset, the most frequently reported GS were: (i) 3 + 3 = 6 (37.7%), (ii) 3 + 4 = 7 (19.4%), (iii) 4 + 3 = 7 (14.9%), (iv) 4 + 4 = 8 (10.0%) and (v) 4 + 5 = 9 (7.4%). High-risk GS was reported for 31.8% of biopsies, versus 17.4% in the validation dataset.
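The amendment that ignores a GS mentioned in the clinical history could, under strong assumptions, look like the sketch below. The abstract does not describe the report layout, so the section headers (CLINICAL HISTORY, MICROSCOPY, DIAGNOSIS) and the function name drop_clinical_history are hypothetical. The idea is only to remove the history section before any score is extracted, so that a previous biopsy's score is not mistaken for the current one.

```python
import re

# Hypothetical section headers; the real report layout is not given in the source.
SECTION_RE = re.compile(
    r"^(CLINICAL HISTORY|MICROSCOPY|DIAGNOSIS)\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def drop_clinical_history(report):
    """Return the report text with any 'clinical history' section removed."""
    headers = list(SECTION_RE.finditer(report))
    if not headers:
        return report  # no recognizable sections: leave the text untouched
    kept = [report[: headers[0].start()]]  # text before the first header
    for i, header in enumerate(headers):
        # A section runs from its header to the next header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(report)
        if header.group(1).upper() != "CLINICAL HISTORY":
            kept.append(report[header.start():end])
    return "".join(kept)


# Invented example text, not real patient data.
sample = (
    "CLINICAL HISTORY: Previous biopsy Gleason 3+3=6.\n"
    "DIAGNOSIS: Adenocarcinoma, Gleason 4+5=9.\n"
)
print(drop_clinical_history(sample))
```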

CONCLUSIONS: We have demonstrated reliable extraction of GS information from narrative text-based patient reports using an in-house text mining algorithm. A secondary outcome was that late presentation could be assessed.

PMID: 34823522 | DOI: 10.1186/s12911-021-01697-2

