Natural Language Processing and Information Extraction


Abstract

This web page is a set of notes on Information Extraction, a sub-area of Natural Language Processing. These notes were collected as I read through various papers and a few books on information extraction. This obviously does not make me an expert in the area (whenever I get a shallow introduction to a topic, I am reminded of the saying that a little knowledge is a dangerous thing).

Introduction

Web databases like CiteSeer (the Scientific Literature Digital Library placed on-line by NEC) have made scientific information far easier to obtain. News "wire" articles are posted and updated as they come in on a variety of web sites. Personal web pages like bearcave.com and "blogs" have created a global dialog.

In the past, large data archives were only available on magnetic tape. Large RAID (Redundant Array of Inexpensive Disks) systems have made information stores of almost arbitrary size available to computer networks (the most familiar example of this is Google, which mirrors a significant fraction of the world wide web). Large corporations and government agencies have been collecting information in computer readable form for many years. This information can now be made available to networked computer systems.

There are two common methods used to access large amounts of computer readable information: relational database systems and flat text search (flat text is also referred to as "unstructured text"). For massive data sets, both of these techniques can have scalability problems, although Google has shown that flat text search can be applied at that scale. As far as I know, there is no relational database of similar size.

Relational databases support powerful queries (e.g., SQL) which allow the data set to be accessed in a variety of fashions. Applications that access the relational database can "mine" the underlying data in a variety of ways and display it using different abstractions. Database systems work well for information that was initially entered into the database (for example, sales transactions, purchase orders and payroll information). Relational databases are of no use in processing flat text, unless the information content is distilled into a form that can be entered into the relational database.

Flat text search usually relies on matching a set of words or phrases. Flat text search algorithms do not require any preparation of the text database (e.g., addition of structure), but the query power is limited. While flat text search can return good results, these are usually mixed with large numbers of related documents that are of less interest to the searcher.
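The word-matching style of flat text search described above can be sketched with an inverted index. The documents and query terms below are invented for illustration; real search engines add ranking, stemming and much more.

```python
# Minimal sketch of flat text search via an inverted index.
# The three "documents" are made-up examples, not a real corpus.
from collections import defaultdict

documents = {
    1: "information extraction turns flat text into database records",
    2: "relational databases support powerful queries over structured data",
    3: "flat text search matches words and phrases in unstructured text",
}

# Build the inverted index: word -> set of document ids containing it.
index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.split():
        index[word].add(doc_id)

def search(*terms):
    """Return ids of documents containing all query terms (AND semantics)."""
    result = None
    for term in terms:
        ids = index.get(term, set())
        result = ids if result is None else result & ids
    return sorted(result or [])

print(search("flat", "text"))  # -> [1, 3]
```

Note that the query power is exactly what the text above describes: the index can say which documents contain the words, but nothing about which of them the searcher actually wanted.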

Work on Natural Language Processing has been going on for at least thirty years. In the last decade, computer processing power, memory and disk storage have reached capacities sufficient to support more powerful NLP applications. This seems to have created a flowering of NLP work. The field is moving rapidly, and much of the work leading to real applications has been done in the last ten years. NLP has started to become an applied, rather than a theoretical, science.

NLP holds out a number of promises, many of them only tantalizing and unrealized. NLP software can process flat text for entry into a relational database (information extraction). NLP techniques can also extract some degree of meaning from text, supporting more accurate search (information retrieval).

NLP draws from a number of areas, including linguistics, statistics and machine learning techniques like neural networks and support vector machines. As someone new to NLP, I have found that reading the literature can be difficult, since the field has built up a terminology that I have never seen used elsewhere. For example, although I am widely read, I had never encountered the word anaphoric before[1].

Information Extraction

Natural Language Processing is a large area, which includes topics like text understanding and machine learning. I have concentrated on a subset: Information Extraction, which processes a body of text so that it can be entered into a relational database or analyzed using data mining[2].
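As a toy illustration of that reduction from free text to a database record, here is a regular-expression "template" that pulls an acquisition event out of a sentence. The company names, the sentence and the pattern are all invented for demonstration; real extraction systems use far more robust machinery than a single regex.

```python
# Illustrative sketch: information extraction reduces free text to a
# rigid record format suitable for a database. The sample sentence
# and the pattern are invented for demonstration.
import re

sentence = "Acme Corp. acquired Widget Inc. for $25 million on March 3, 1998."

# A template capturing an "acquisition event": buyer, target, price.
pattern = re.compile(
    r"(?P<buyer>[A-Z][\w.]*(?: [A-Z][\w.]*)*) acquired "
    r"(?P<target>[A-Z][\w.]*(?: [A-Z][\w.]*)*) for "
    r"\$(?P<price>\d+ \w+)"
)

match = pattern.search(sentence)
record = match.groupdict() if match else None
print(record)
# {'buyer': 'Acme Corp.', 'target': 'Widget Inc.', 'price': '25 million'}
```

The dictionary that comes out is exactly the kind of closely defined, rigid record that can be inserted as a database row; everything else in the sentence (the date, for instance, if the template does not ask for it) is simply discarded.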

In Information Extraction, a body of text is the input. The output is a closely defined data format suitable for a database or data mining application. The rigid format of the final result means that only a fraction of the data is relevant. Understanding or meaning is useful only in a limited way, to disambiguate the input. Information Extraction systems may be used to process large bodies of information, so performance may be important. The common steps in information extraction are shown below in Figure 1.

[Figure 1: the Information Extraction pipeline]

In this diagram POS stands for Parts Of Speech. Here the words in a sentence are tagged to indicate whether they are verbs, nouns, etc. In later stages an attempt is made to match the parts of speech (tagged words) against a template (see the reference to the paper Information Extraction as a Stepping Stone Toward Story Understanding by Riloff, below).
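The template-matching step can be sketched as below. This is a hedged, simplified invention in the spirit of the Riloff paper: the tag set and the "subject VERB object" template are toy versions of what real systems use, and the tagged sentence is hand-built rather than produced by a tagger.

```python
# Hedged sketch of matching POS-tagged words against a template.
# The tags and the template are simplified inventions, not the output
# of a real tagger or a real extraction system.
tagged = [
    ("The", "DET"), ("company", "NOUN"), ("hired", "VERB"),
    ("ten", "NUM"), ("engineers", "NOUN"),
]

def match_template(tagged_words, trigger_verb):
    """Return (subject, object) if the sentence fits '<NOUN> trigger <NOUN>'."""
    subject = obj = None
    seen_trigger = False
    for word, tag in tagged_words:
        if tag == "VERB" and word == trigger_verb:
            seen_trigger = True
        elif tag == "NOUN":
            if not seen_trigger:
                subject = word   # last noun before the trigger verb
            elif obj is None:
                obj = word       # first noun after the trigger verb
    return (subject, obj) if seen_trigger else None

print(match_template(tagged, "hired"))  # -> ('company', 'engineers')
```

The point of the sketch is that the template only fires when the trigger verb appears with the right parts of speech around it; a sentence without the trigger yields nothing.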

The tokenizer, which forms the first step, is similar to a tokenizer or scanner in artificial language processing (e.g., compiling Java). However, the problem is more difficult with natural languages. For example, how are compound words, like massively-parallel, handled? Even simple constructs, like commas and periods, add complexity. For example, a tokenizer may have to recognize that the period in Mr. Pooky does not terminate the sentence.
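A toy tokenizer makes the Mr. Pooky problem concrete. The abbreviation list here is a small stand-in; real tokenizers use larger lexicons or statistical models to decide whether a period ends a sentence.

```python
# Toy tokenizer illustrating the abbreviation problem: the period in
# "Mr." must not be split off as a sentence terminator, while the
# period ending the sentence must be. The abbreviation list is a
# stand-in for a real lexicon.
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "e.g.", "i.e."}

def tokenize(text):
    # Split on whitespace, then peel a sentence-final period off any
    # token that is not a known abbreviation. Hyphenated compounds
    # like "massively-parallel" are kept as single tokens.
    tokens = []
    for raw in text.split():
        if raw.endswith(".") and raw not in ABBREVIATIONS:
            tokens.append(raw[:-1])
            tokens.append(".")
        else:
            tokens.append(raw)
    return tokens

print(tokenize("Mr. Pooky owns a massively-parallel machine."))
# ['Mr.', 'Pooky', 'owns', 'a', 'massively-parallel', 'machine', '.']
```

Even this crude version shows why tokenization is harder for natural language than for Java: the same character (the period) plays two grammatical roles, and only context distinguishes them.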

Accuracy of Information Extraction Techniques

The Message Understanding Conference (MUC) series sponsored by DARPA provided a forum not only for researchers to meet and learn about current work, but also to compare results. The last Message Understanding Conference was MUC-7. By MUC-6, information extraction software was achieving scores of 75% precision and 75% recall on the MUC data sets.
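For readers new to the metrics: precision is the fraction of extracted items that are correct, and recall is the fraction of the answer key's items that were extracted. The counts below are invented to illustrate the 75%/75% figure; the actual MUC scoring of template slots is considerably more involved.

```python
# Precision and recall as commonly defined in information extraction
# evaluation. The counts are invented for illustration.
def precision(correct, extracted_total):
    """Fraction of the system's extracted items that are correct."""
    return correct / extracted_total

def recall(correct, reference_total):
    """Fraction of the answer key's items the system extracted."""
    return correct / reference_total

# Suppose a system produces 100 template fills, 75 of them correct,
# against an answer key that also contains 100 fills.
p = precision(75, 100)
r = recall(75, 100)
f1 = 2 * p * r / (p + r)   # harmonic mean, the F-measure
print(p, r, f1)            # 0.75 0.75 0.75
```

The two measures trade off against each other: a system can raise recall by extracting more aggressively, usually at the cost of precision, which is why the MUC results quote both numbers.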

The MUC data sets represent a closed universe, which is supposed to represent news articles. Whether these data sets really generalize to, say, a Bloomberg or Associated Press data feed is another question.

From what I can tell, the information extraction results are based on information templates extracted from the underlying text data set. There is a metric beyond this, however. If the objective is to use information extraction to feed a database, what is the accuracy and usability of the database input? And how would this compare to a data set processed by a human?

Footnotes:

  1. Of course, esoteric terminology is found in many specialized areas. Before reading about digital signal processing and wavelets, I had not encountered the term basis function.

  2. Data mining is an almost content-free term, since it applies to such a diverse set of algorithms and objectives. One common feature may be that some structure must be added to a data set before it can be analyzed by a data mining algorithm.

Definitions

A few notes on how words and phrases seem to be used in the NLP and Information Extraction literature (some of these definitions are taken from the NIST web page on Information Extraction). Some of these terms are not terribly well defined, which is sort of ironic since the area of application is computational linguistics.

Books

Annotated Web Accessible Papers

Web Sites

Ian Kaplan September 2003
Revised: November, 2009