As the number of clinical reports in the peer-reviewed medical literature keeps growing, there is an increasing need for online search tools to find and analyze publications on patients with similar clinical characteristics. This problem is especially critical and challenging for rare diseases, where publications of large series are scarce. Through an applied example we illustrate how to semantically annotate the relevant literature about patient case reports to capture the phenotype of a rare disease.

Datasets

Our dataset comprised 515 abstracts retrieved from PubMed, corresponding to papers with the keyword 'Cerebrotendinous Xanthomatosis' (CTX) in the title or abstract (as of the end of October 2013). Here you can look over the abstracts we used.
The subset of 223 abstracts limited to case reports of CTX can be viewed here.
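
As an illustration of how such a selection could be reproduced, the sketch below builds a search URL for the NCBI E-utilities `esearch` endpoint. It is only a sketch under stated assumptions: the exact query and tooling used to obtain the 515 and 223 abstracts are not given here, so the function name `build_ctx_query`, the start date and the choice of publication date as the date filter are hypothetical. The code only constructs the URL and does not contact PubMed.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_ctx_query(case_reports_only=False, max_date="2013/10/31", retmax=1000):
    """Return an esearch URL for CTX papers (hypothetical helper, not the authors' code)."""
    # Keyword restricted to the title/abstract fields, as described in the text.
    term = '"Cerebrotendinous Xanthomatosis"[Title/Abstract]'
    if case_reports_only:
        # Assumption: the case-report subset corresponds to the PubMed publication type.
        term += ' AND "case reports"[Publication Type]'
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",       # assumption: filter on publication date
        "mindate": "1900/01/01",  # assumption: arbitrary early start date
        "maxdate": max_date,      # end of October 2013, as in the text
        "retmax": retmax,
        "retmode": "json",
    }
    return ESEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_ctx_query())
    print(build_ctx_query(case_reports_only=True))
```

The counts returned today by such a query would not necessarily match the figures above, since PubMed indexing changes over time.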

Selection of case reports and extraction of snippets

A set of five linguistic patterns was designed and used to extract the relevant snippets. Our automated method classified 174 papers as case reports. Of the 223 papers tagged as 'case reports' in PubMed, our method identified only 124 relevant abstracts. In addition, a combined method identified 56 more snippets from papers in PubMed with no available abstract. In total, 230 papers (174 + 56) were classified as case reports and their relevant snippets were extracted.
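
To make the idea of pattern-based snippet extraction concrete, here is a minimal sketch. The five patterns actually used are not listed on this page, so the three regular expressions below are hypothetical stand-ins, and the function names and the naive sentence splitter are assumptions chosen for illustration. An abstract is treated as a case report if at least one of its sentences matches a pattern, and the matching sentences are returned as snippets.

```python
import re

# Hypothetical patterns, for illustration only; the study's five patterns
# are not reproduced here.
PATTERNS = [
    # Patient introduced by age, e.g. "a 34-year-old woman"
    re.compile(r"\b(?:a|an|the)\s+\d{1,3}[- ]year[- ]old\b", re.IGNORECASE),
    # Reporting verbs, e.g. "we report", "we describe"
    re.compile(r"\bwe\s+(?:report|describe|present)\b", re.IGNORECASE),
    # Explicit mention of a case, e.g. "a case of", "case report"
    re.compile(r"\bcase\s+(?:report|of)\b", re.IGNORECASE),
]


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_snippets(abstract):
    """Return the sentences of an abstract that match at least one pattern."""
    return [
        sentence
        for sentence in split_sentences(abstract)
        if any(p.search(sentence) for p in PATTERNS)
    ]


def is_case_report(abstract):
    """Classify an abstract as a case report if any snippet is found."""
    return bool(extract_snippets(abstract))


if __name__ == "__main__":
    example = (
        "We report a 34-year-old woman with tendon xanthomas and cataracts. "
        "Cholestanol levels were elevated."
    )
    print(extract_snippets(example))
```

A sentence-level match keeps the snippet short enough to annotate while preserving the clinical description around the matched phrase.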

Annotation of relevant snippets

Three different tests were carried out using:

The curated HPO annotations about CTX

The HPO annotations about CTX automatically extracted from PubMed case reports

The induced ontologies