| Title: | Maximum Entropy Models for Text mining from the Life Sciences Literature |
| Other Titles: | Addressing Data heterogeneity |
| Authors: | Nikolov, Nikolay |
| Keywords: | NLP Chemistry Text-mining Maximum Entropy |
| Issue Date: | 18-Sep-2009 |
| Publisher: | DAMTP and Department of Chemistry |
| Abstract: | The life sciences are currently characterized by rapid growth. Due to the huge number of publications per year (in the hundreds of thousands and growing), it is becoming increasingly difficult for researchers to stay abreast of the latest developments. Thus, automated methods of analysing scientific information are growing in importance. Text mining in the life sciences aims at extracting information from textual data (usually abstracts or full texts of scientific publications, but also non-publications such as clinical histories or patents). It normally involves some kind of machine learning technique that requires training data from the given thematic domain. Our case study concerns the automatic identification of chemical named entities (e.g. compounds, reaction names) in the life science literature. We investigate the impact of data heterogeneity on the performance of Maximum Entropy Markov models and explore possible solutions to this problem. This is, to the best of our knowledge, the first study to explore thematic heterogeneity in the chemistry-related life science literature and its impact on named entity recognition. It is therefore necessarily general: its role is to collect evidence, establish basic facts and explore possible solutions. In doing so, our study suggests that genre structure is especially important for high-precision recognition. It also suggests that, for a system aiming at recall rather than precision, transferring training data from one domain to another is a useful strategy (especially for domains with smaller training datasets). Most importantly, this study provides motivation for a model that explicitly represents the thematic heterogeneity of the life science literature, and it explores possible solutions and the practical issues of implementing such a model. |
| Description: | This is supporting data and software for an MPhil project report submitted on 2009-08-18 by Nikolay Nikolov. The data should be used in conjunction with the OSCAR3 software, as described in the project report. |
| URI: | http://www.dspace.cam.ac.uk/handle/1810/218855 |
| Appears in Collections: | Project documentation - Unilever Centre |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.

