Drug-induced liver injury (DILI) is one of the most common reasons for the withdrawal of drug candidates. Among the cases of DILI, detecting unexpected (idiosyncratic) liver injury poses a particular challenge, since such injury is not directly tied to the (dose-dependent) toxicity of a drug or its metabolites. As such, literature search remains a major tool for sourcing DILI-related information, which often comes directly from clinical practice.
Here, we present dialogí, a text-mining tool that combines different Natural Language Processing (NLP) approaches with a linear classifier to differentiate between DILI-positive and DILI-negative PubMed abstracts. Often, multiple drugs are mentioned within the same DILI-positive paper, most of which are unrelated to DILI. We therefore extend our tool with a framework that attempts to identify and extract the key (DILI-positive) drugs on a paper-by-paper basis.
The classifier was trained on 11,200 PubMed abstracts (including titles), split equally between DILI-positive and DILI-negative, and was internally validated on the remaining 2,800 abstracts, yielding a precision of 94.8% and a recall of 93.5%. On external validation, the model achieved a precision of 93.3%, a recall of 94.9%, and an accuracy of 94.1%.
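For readers less familiar with the reported metrics, the following minimal sketch shows how precision, recall, and accuracy are conventionally computed for a binary DILI-positive/-negative task. It is an illustration only and not the evaluation code behind the figures above; the function name `binary_metrics`, the `Metrics` container, and the toy labels are hypothetical.

```python
from typing import NamedTuple, Sequence


class Metrics(NamedTuple):
    precision: float
    recall: float
    accuracy: float


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Metrics:
    """Compute precision, recall, and accuracy for binary labels.

    `positive` marks the DILI-positive class (an assumption of this sketch).
    """
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # correctly flagged DILI-positive abstract
        elif pred == positive:
            fp += 1  # DILI-negative abstract flagged as positive
        elif truth == positive:
            fn += 1  # DILI-positive abstract that was missed
        else:
            tn += 1  # correctly rejected DILI-negative abstract

    total = tp + fp + fn + tn
    # Guard against empty denominators so the sketch never divides by zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return Metrics(precision, recall, accuracy)


# Toy labels (fabricated for illustration, not data from the study).
toy_true = [1, 1, 1, 0, 0, 0]
toy_pred = [1, 1, 0, 0, 0, 1]
toy_metrics = binary_metrics(toy_true, toy_pred)
```

Precision is the fraction of abstracts flagged as DILI-positive that truly are, while recall is the fraction of truly DILI-positive abstracts that were flagged; accuracy counts correct calls over both classes.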