As the number of clinical reports in the peer-reviewed medical literature keeps growing, there is an increasing need for online search tools to find and analyze publications on patients with similar clinical characteristics. This problem is especially critical and challenging for rare diseases, where publications of large series are scarce. Through an applied example we illustrate how to semantically annotate the relevant literature about patient case reports to capture the phenotype of a rare disease.
Datasets
Our dataset comprised 515 abstracts retrieved from PubMed at the end of October 2013, corresponding to papers with the keyword "Cerebrotendinous Xanthomatosis" (CTX) in the title or abstract. The abstracts we used can be viewed here.
The subset of 223 abstracts limited to case reports of CTX can be viewed here.
Selection of case reports and extraction of snippets
A set of five linguistic patterns was designed and used to extract the relevant snippets. In total, 174 papers were classified as case reports by our automated method. Of the 223 papers tagged as 'case reports' in PubMed, our method identified only 124 relevant abstracts. In addition, a combined method identified 56 more snippets from papers in PubMed with no available abstract. In total, 230 papers were classified as case reports and their relevant snippets were extracted.
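The pattern-based selection described above can be sketched as follows. This is only an illustration: the five patterns used in this work are not listed on this page, so the three regular expressions below, the naive sentence splitter, and the function names are all assumptions.

```python
import re

# Hypothetical patterns for illustration only; the five linguistic
# patterns used in the original work are not reproduced here.
PATTERNS = [
    re.compile(r"\b\d+-year-old\b", re.IGNORECASE),
    re.compile(r"\bwe (?:report|describe|present)\b", re.IGNORECASE),
    re.compile(r"\bcase of\b", re.IGNORECASE),
]


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_snippets(abstract):
    """Return the sentences that match at least one pattern."""
    return [
        sentence
        for sentence in split_sentences(abstract)
        if any(p.search(sentence) for p in PATTERNS)
    ]


def is_case_report(abstract):
    """Treat an abstract as a case report if it yields at least one snippet."""
    return bool(extract_snippets(abstract))
```

Under these assumptions, an abstract with no matching sentence is simply not classified as a case report, which mirrors how the snippet extraction and the classification are coupled in the counts reported above.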
Annotation of relevant snippets
Three different tests were carried out using:
- The OBO annotator with the HPO ontology.
  - The OBO annotator was implemented in Java and is freely available for download here (with the User's Manual).
  - A version of the OBO annotator with no graphical user interface is also freely available for download here. To run the project from the command line, go to the dist folder and type the following: java -jar "OBOAnnotatorNoGUI.jar" inputFile.txt outputDirectory
  - The two sets of annotations generated by the OBO annotator are available in the following files: File 1 and File 2 (case reports with only a title in PubMed).
- The NCBO Annotator with HPO.
- The service provided by GoPubMed, which is based on GO and MeSH.
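For readers who want to run the command-line version of the OBO annotator over several input files, a minimal Python wrapper might look like the sketch below. It assumes that java is on the PATH and that the script is started from the dist folder containing OBOAnnotatorNoGUI.jar. The function names and the choice of one output subdirectory per input file are assumptions of this sketch, not documented behaviour of the tool.

```python
import subprocess
from pathlib import Path

JAR_NAME = "OBOAnnotatorNoGUI.jar"  # jar name as given in the command above


def build_command(input_file, output_dir, jar=JAR_NAME):
    """Build the argument list: java -jar <jar> <inputFile> <outputDirectory>."""
    return ["java", "-jar", str(jar), str(input_file), str(output_dir)]


def annotate_all(input_dir, output_root, jar=JAR_NAME):
    """Run the annotator once per .txt file found in input_dir.

    Assumption: each run gets its own output subdirectory so that
    results from different input files cannot overwrite each other.
    """
    for input_file in sorted(Path(input_dir).glob("*.txt")):
        output_dir = Path(output_root) / input_file.stem
        output_dir.mkdir(parents=True, exist_ok=True)
        # check=True raises CalledProcessError if the JVM exits with an error.
        subprocess.run(build_command(input_file, output_dir, jar), check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.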
The curated HPO annotations about CTX
- The curated HPO annotations about CTX provided by the HPO team and used in this work (accessible from the Phenomizer).
The HPO annotations about CTX automatically extracted from PubMed case reports
- The HPO annotations about CTX automatically extracted from PubMed (relevant snippets from abstracts), represented using the format provided by the HPO team. Each line in the annotation file represents a link between the CTX disease and one of the clinical features annotated in a case report (PubMed reference). The Qualifier indicates whether the feature is present or absent in the case report. The Onset modifier specifies the age of onset (childhood, adult, etc.), and the Frequency modifier gives the frequency of the feature across the set of abstracts in PubMed.
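To make the Frequency modifier concrete, the sketch below computes, for each feature, the fraction of case reports in which it is annotated as present. The record layout (reference, HPO identifier, qualifier), the use of "NOT" to mark an absent feature, and the function name are assumptions made for illustration; they are not the exact column layout of the annotation file, and the PubMed references in the usage example are made up.

```python
from collections import defaultdict


def feature_frequencies(annotations):
    """Compute per-feature frequency over a set of case reports.

    annotations: iterable of (reference, hpo_id, qualifier) tuples, where
    qualifier is assumed to be "NOT" when the feature is reported absent
    and an empty string otherwise.
    """
    reports = set()                 # every case report seen
    present_in = defaultdict(set)   # hpo_id -> reports where the feature is present
    for reference, hpo_id, qualifier in annotations:
        reports.add(reference)
        if qualifier != "NOT":
            present_in[hpo_id].add(reference)
    total = len(reports)
    return {hpo_id: len(refs) / total for hpo_id, refs in present_in.items()}
```

A small usage example with invented references:

```python
toy = [
    ("PMID:1", "HP:0000518", ""),     # Cataract, present
    ("PMID:2", "HP:0000518", ""),
    ("PMID:2", "HP:0001251", ""),     # Ataxia, present
    ("PMID:3", "HP:0001251", "NOT"),  # Ataxia, explicitly absent
    ("PMID:4", "HP:0000518", ""),
]
print(feature_frequencies(toy))
```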
The induced Ontologies
- The CTX ontology (in OBO format) induced by the automated annotations from the literature using the OBO annotator.
- The CTX ontology (in OBO format) induced by the automated annotations from the literature using the NCBO annotator.
- The CTX ontology (in OBO format) induced by the curated annotations.
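One common way to induce an ontology from a set of annotations is to keep the annotated terms together with all of their ancestors along the is_a hierarchy. The sketch below illustrates that idea on a toy hierarchy. It assumes this reading of "induced", which this page does not spell out, and the term names and function names are hypothetical.

```python
def induced_terms(annotated, parents):
    """Return the annotated terms plus all their ancestors.

    parents: dict mapping a term to the list of its direct is_a parents.
    """
    induced = set()
    stack = list(annotated)
    while stack:
        term = stack.pop()
        if term in induced:
            continue  # already visited through another path
        induced.add(term)
        stack.extend(parents.get(term, ()))
    return induced


def induced_edges(terms, parents):
    """Keep only the is_a edges whose two endpoints are both induced terms."""
    return {
        (child, parent)
        for child in terms
        for parent in parents.get(child, ())
        if parent in terms
    }
```

With a toy hierarchy in which D is_a B, B is_a A, E is_a C and C is_a A, annotating only D induces the terms D, B and A, and leaves out the unrelated branch through C and E.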