Title: Maximum Entropy Models for Text mining from the Life Sciences Literature
Other Titles: Addressing Data heterogeneity
Authors: Nikolov, Nikolay
Keywords: NLP; Chemistry; Text-mining; Maximum Entropy
Issue Date: 18-Sep-2009
Publisher: DAMTP and Department of Chemistry
Abstract: The life sciences are currently characterized by rapid growth. With the number of publications per year in the hundreds of thousands and growing, it is becoming increasingly difficult for researchers to stay abreast of the latest developments. Automated methods of analysing scientific information are therefore growing in importance. Text mining in the life sciences aims at extracting information from textual data (usually abstracts or full texts of scientific publications, but also non-publication sources such as clinical histories or patents). It normally involves some kind of machine learning technique that requires training data from the given thematic domain. Our case study concerns the automatic identification of chemical named entities (e.g. compounds, reaction names) in the life science literature. We investigate the impact of data heterogeneity on the performance of Maximum Entropy Markov models and explore possible solutions to this problem. This is, to the best of our knowledge, the first study to explore thematic heterogeneity in the chemistry-related life science literature and its impact on named entity recognition. It is therefore necessarily general: its role is to collect evidence, establish basic facts and explore possible solutions. In doing so, our study suggests that the genre structure is especially important for high-precision recognition. It also suggests that, for a system aiming at recall rather than precision, transferring training data from one domain to another is a useful strategy (especially for domains with smaller training datasets). Most importantly, this study provides motivation for a model that explicitly represents the thematic heterogeneity of the life science literature. It explores possible solutions and the practical issues of implementing such a model.
Description: This is supporting data and software for an MPhil project report submitted on 2009-08-18 by Nikolay Nikolov. The data should be used in conjunction with the OSCAR3 software, as described in the project report.
URI: http://www.dspace.cam.ac.uk/handle/1810/218855
Appears in Collections: Project documentation - Unilever Centre

Files in This Item:

File | Description and size | Format
models.zip | model files from OSCAR36.73 MB | ZIP
pr.zip | statistics from experiments with OSCAR, 1.57 MB | ZIP
r.zip | R code: (a) prototype for hierarchical model, (b) generating graphs; 4.04 kB | ZIP
Oscar3Dadapt.jar | OSCAR3 version used in the experiments, 40.41 MB | Executable (Java) JAR file

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.