This web page is a set of notes on Information Extraction, a sub-area of Natural Language Processing. These notes were collected as I read through various papers and a few books on information extraction. This obviously does not make me an expert in the area (whenever I have only a shallow introduction to a topic I am reminded of the saying that a little knowledge is a dangerous thing).
Web databases like CiteSeer (the Scientific Literature Digital Library placed on-line by NEC) have made scientific information far easier to obtain. News "wire" articles are posted and updated as they come in on a variety of web sites. Personal web pages like bearcave.com and "blogs" have created a global dialog.
In the past, large data archives were only available on magnetic tape. Large RAID (Redundant Array of Inexpensive Disks) systems have made information stores of almost arbitrary size available to computer networks (the most familiar example of this is Google, which mirrors a significant fraction of the world wide web). Large corporations and government agencies have been collecting information in computer readable form for many years. This information can now be made available to networked computer systems.
There are two common methods used to access large amounts of computer readable information: relational database systems and flat text search (flat text is also referred to as "unstructured text"). Both techniques can have scalability problems on massive data sets, although Google has shown that flat text search can be applied at that scale. As far as I know, there is no relational database of similar size.
Relational databases support powerful queries (e.g., SQL) which allow the data set to be accessed in a variety of fashions. Applications that access the relational database can "mine" the underlying data in a variety of ways and display it using different abstractions. Database systems work well for information that was initially entered into the database (for example, sales transactions, purchase orders and payroll information). Relational databases are of no use in processing flat text, unless the information content is distilled into a form that can be entered into the relational database.
Flat text search usually relies on matching a set of words or phrases. Flat text search algorithms do not require any preparation of the text database (e.g., addition of structure), but the query power is limited. While flat text search can return good results, these are usually mixed with large numbers of related documents that are of less interest to the searcher.
Work on Natural Language Processing has been going on for at least thirty years. In the last decade processor speed, computer memory and disk storage have reached the capacity needed to support more powerful NLP applications. This seems to have created a flowering of NLP work. The field is moving rapidly and much of the work leading to real applications has been done in the last ten years. NLP has started to become an applied, rather than a theoretical science.
NLP holds out a number of promises, many of them only tantalizing and unrealized. NLP software can process flat text for entry into a relational database (information extraction). NLP techniques can also extract some degree of meaning from text, supporting more accurate search (information retrieval).
NLP draws from a number of areas including linguistics, statistics and machine learning techniques like neural networks and support vector machines. As someone new to NLP, I have found that reading the literature can be difficult, since the field has built up a terminology that I have never seen used elsewhere. For example, although I am widely read, I had never encountered the word anaphoric before.
Natural Language Processing is a large area, which includes topics like text understanding and machine learning. I have concentrated on a subset: Information Extraction, which processes a body of text so that it can be entered into a relational database or analyzed using data mining.
In Information Extraction a body of texts is input. The output is a closely defined data format that is suitable for a database or data mining application. The rigid format for the final result means that only a fraction of the data is relevant. Understanding or meaning are useful in only a limited way to disambiguate the input. Information Extraction systems may be used to process large bodies of information, so performance may be important. The common steps in information extraction are shown below in Figure 1.
Figure 1: the common steps in information extraction (tokenization, POS tagging and template matching)
In this diagram POS stands for Parts Of Speech. Here the words in a sentence are tagged to indicate whether they are verbs, nouns, etc. In later stages an attempt is made to match the parts of speech (tagged words) against a template (see the notes on the paper Information Extraction as a Stepping Stone Toward Story Understanding by Riloff, below).
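To make the tagging step concrete, here is a toy sketch of my own (real POS taggers are statistical models trained on annotated text; a hand built lexicon like the one below is only an illustration):

    # A toy part-of-speech tagger (illustration only; real systems use trained
    # statistical taggers rather than a hand-built lexicon like this one).
    LEXICON = {
        "the": "DET", "a": "DET",
        "mayor": "NOUN", "guerrillas": "NOUN", "city": "NOUN",
        "was": "AUX", "were": "AUX",
        "kidnapped": "VERB", "attacked": "VERB",
    }

    def pos_tag(tokens):
        # Unknown words fall back to NOUN, a common default heuristic.
        return [(tok, LEXICON.get(tok.lower(), "NOUN")) for tok in tokens]

    print(pos_tag(["The", "mayor", "was", "kidnapped"]))
    # [('The', 'DET'), ('mayor', 'NOUN'), ('was', 'AUX'), ('kidnapped', 'VERB')]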
The tokenizer, which forms the first step, is similar to a tokenizer or scanner in artificial language processing (e.g., compiling Java). However, the problem is more difficult with natural languages. For example, how are compound words, like massively-parallel, handled? Even simple constructs, like commas and periods, add complexity. For example, a tokenizer may have to recognize that the period in Mr. Pooky does not terminate the sentence.
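As a rough sketch of my own (not taken from any of the systems discussed on this page), a tokenizer that handles these two cases might look something like this:

    import re

    # A minimal, illustrative tokenizer. It keeps a few known abbreviations
    # ("Mr.", "Dr.", ...) as single tokens, treats hyphenated compounds like
    # "massively-parallel" as one token, and splits off other punctuation.
    ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "Prof."}

    TOKEN_PATTERN = re.compile(
        r"[A-Za-z]+\.(?=\s+[A-Z])"   # candidate abbreviation followed by a capitalized word
        r"|\w+(?:-\w+)*"             # words, including hyphenated compounds
        r"|[.,;?!]"                  # punctuation as separate tokens
    )

    def tokenize(text):
        tokens = []
        for match in TOKEN_PATTERN.finditer(text):
            token = match.group()
            # Split off a trailing period unless this really is a known abbreviation.
            if token.endswith(".") and len(token) > 1 and token not in ABBREVIATIONS:
                tokens.extend([token[:-1], "."])
            else:
                tokens.append(token)
        return tokens

    print(tokenize("Mr. Pooky built a massively-parallel machine."))
    # ['Mr.', 'Pooky', 'built', 'a', 'massively-parallel', 'machine', '.']

A real tokenizer needs a much larger abbreviation list and rules for numbers, dates and quotation marks, but the example shows why even this first stage is not trivial.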
The Message Understanding Conference sponsored by DARPA provided a forum not only for researchers to meet and learn about current work, but also to compare results. The last Message Understanding Conference was MUC-7. By MUC-6 information extraction software was achieving scores of 75% precision and 75% recall on the MUC data sets.
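For reference, precision is the fraction of the extracted items that are correct, and recall is the fraction of the items in the answer key that were extracted. The MUC scorers do more elaborate slot-by-slot accounting, but the basic idea is:

    # Standard definitions of precision and recall.
    def precision(true_positives, false_positives):
        return true_positives / (true_positives + false_positives)

    def recall(true_positives, false_negatives):
        return true_positives / (true_positives + false_negatives)

    # For example, 75 correct slot fills out of 100 proposed, with 100 slots
    # in the answer key:
    print(precision(75, 25))  # 0.75
    print(recall(75, 25))     # 0.75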
The MUC data sets form a closed universe that is supposed to be representative of news articles. Whether these sets really generalize to, say, a Bloomberg or Associated Press data feed is another question.
From what I can tell, the information extraction results are based on information templates extracted from the underlying text data set. There is a metric beyond this, however. If the objective is to use information extraction to feed a database, what is the accuracy and usability of the database input? And how would this compare to a data set processed by a human?
Of course esoteric terminology is found in many specialized areas. Before reading about digital signal processing and wavelets I had not encountered the term basis function.
Data mining is an almost content free term, since it applies to such a diverse set of algorithms and objectives. One common feature may be that some structure must be added to a data set before it can be analyzed by a data mining algorithm.
A few notes on how words and phrases seem to be used in the NLP and Information Extraction literature (some of these definitions are taken from the NIST web page on Information Extraction). Some of these terms are not terribly well defined, which is sort of ironic since the area of application is computational linguistics.
Attribute A property of an entity, such as the entity name, an alias or the entity type.
For example, an entity might be a person. Attributes are the details of the person and their relations to other people.
Dictionary A set of case frames.
Entity An object of interest such as a person or an organization. In object oriented programming terminology an entity would be an object, like a person. The person object would include attributes (address, phone number, hair color).
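As a sketch of the object oriented analogy (my own illustration, not a definition from the literature):

    from dataclasses import dataclass, field

    # An "entity" as an object whose fields are its attributes. Relations to
    # other entities could be represented as references to other entity objects.
    @dataclass
    class Person:
        name: str
        aliases: list = field(default_factory=list)
        address: str = ""
        phone_number: str = ""

    p = Person(name="Pooky", aliases=["Mr. Pooky"])
    print(p.name, p.aliases)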
Natural Language Processing for Online Applications by Peter Jackson and Isabelle Moulinier, John Benjamins Publishing Company, 2002
This book provides an excellent overview of Natural Language Processing techniques, especially those aimed at large computer hosted data sets (e.g., the Web). This book concentrates on real applications, rather than research.
The chapter in this book on information extraction has a detailed discussion of FASTUS (see FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text below). One of the points the authors make is that building templates by hand can be very labor intensive and at some point reaches diminishing returns (e.g., more templates do not improve information extraction). At the end of the chapter on information extraction there is a discussion of some automatic (or machine learning) techniques for generating templates. But the details of systems like AutoSlog (see below) are not discussed.
While this book does concentrate on applications, it does not concentrate on implementation. For example, the authors discuss "push down automata", the kind of parser generated by tools like YACC or Bison. But they do not give examples of grammars that might be used. They also discuss non-deterministic grammars, but they don't give examples of how a parser might backtrack.
Foundations of Statistical Natural Language Processing by Christopher D. Manning and Hinrich Schütze, MIT Press, 1999
Foundations of Statistical Natural Language Processing is one of the best text books that I've encountered. The margins have annotations for the topics that are introduced in each paragraph. The book is clearly written and the authors provide examples to illustrate their points.
The book is written for a range of readers, from people with a computer science background to those with a linguistics background. The early chapters provide an overview of statistics and linguistics for those who are not familiar with these areas. The authors clearly introduce statistical topics like the t-test and the Chi-Square test.

As noted in Figure 1, above, the first stages in Information Extraction involve scanning and parsing the text. This usually results in sentences that have been tagged with parts of speech. The book has very good material on these steps. However, it is almost entirely devoid of material on information extraction or even text summarization.
Since the motivation for my reading in NLP involves information extraction, I found it disappointing that this topic was not covered in this excellent book. I have yet to find a source covering information extraction that is as well written as this text.
Information Extraction as a Stepping Stone Toward Story Understanding (PDF) by Ellen Riloff, 1999 (this is apparently a chapter from the book Understanding Language Understanding edited by Ram and Moorman, MIT Press).
This well written paper starts out with a clear description of information extraction techniques using "case frames". Case frames are templates against which partially parsed text is matched in information extraction. Most of the examples in the paper are taken from a database of news articles and deal with the terrorism domain. An example given in the paper of a case frame in this domain would be
<X> was kidnapped
Where X, the victim, is a slot that is filled by the subject of the passive verb "kidnapped". The database of case frames is referred to as a "dictionary".
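A crude way to picture a single slot case frame (my own illustration; systems like AutoSlog match against parsed and tagged text, not raw strings) is a regular expression with a named slot:

    import re

    # The case frame "<X> was kidnapped", with X filled by the subject of the
    # passive verb. Illustration only: a real system works on parsed text.
    KIDNAP_FRAME = re.compile(r"(?P<victim>(?:[A-Z]\w+\s?)+)\s+was kidnapped")

    def apply_frame(sentence):
        match = KIDNAP_FRAME.search(sentence)
        if match:
            return {"event": "kidnapping", "victim": match.group("victim").strip()}
        return None

    print(apply_frame("Witnesses said Mayor Pooky was kidnapped yesterday."))
    # {'event': 'kidnapping', 'victim': 'Mayor Pooky'}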
This article explained some of the terminology that I had seen in the computational linguistics literature but did not understand. Prof. Riloff describes the work that she and her students did on the AutoSlog (gotta love the name) and AutoSlog-TS software. Prof. Riloff explains that generating case frames by hand is time consuming and error prone. AutoSlog-TS presents an automatically generated, ordered set of case frames to the user. Those of interest can be selected and used in knowledge extraction.
One thing that seems to be missing from AutoSlog is case frame merging. In case frame or template merging, information from different templates is merged into a single template. For example, one template may define a partial set of information. Another template may define an overlapping set, that contains additional information. Merging produces a single data set with all the information.
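A sketch of what template merging might look like (again my own illustration, not part of AutoSlog):

    # Merge two partially filled templates describing the same event.
    def merge_templates(a, b):
        merged = dict(a)
        for slot, value in b.items():
            if slot not in merged or not merged[slot]:
                merged[slot] = value
            elif merged[slot] != value:
                # A real system would need coreference resolution or some other
                # disambiguation here; this sketch just keeps both values.
                merged[slot] = [merged[slot], value]
        return merged

    t1 = {"event": "kidnapping", "victim": "Mayor Pooky"}
    t2 = {"event": "kidnapping", "location": "Bogota", "perpetrator": "guerrillas"}
    print(merge_templates(t1, t2))
    # {'event': 'kidnapping', 'victim': 'Mayor Pooky',
    #  'location': 'Bogota', 'perpetrator': 'guerrillas'}

The hard part, of course, is deciding that two templates describe the same event in the first place.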
FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text (PDF) by Jerry R. Hobbs, Douglas Appelt, John Bear, David Israel, Megumi Kameyama, Mark Stickel and Mabry Tyson, Artificial Intelligence Center, SRI International, 1996 (published in Finite State Devices for Natural Language Processing, MIT Press).
As the title suggests, the FASTUS system consists of a pipeline which reads text and produces extracted information. The beginning steps parse the text (what some people seem to refer to as a partial parse). Grammar rules are then used to extract information. The FASTUS system is supposed to be fast and accurate, although this paper does not provide statistics on its accuracy. It would be interesting to combine a system like FASTUS with automatic dictionary creation (as in AutoSlog-TS).
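The cascade idea can be pictured as a series of simple pattern passes, each working on the output of the previous one. FASTUS itself uses several cascaded stages over parsed phrases; the toy version below, which is my own gross simplification, has only two:

    import re

    # Stage 1: bracket simple noun groups (a determiner followed by a word).
    def stage_basic_phrases(text):
        return re.sub(r"\b(?:the|a)\s+(\w+)", r"[NG \1]", text, flags=re.IGNORECASE)

    # Stage 2: recognize a toy "acquisition" event over the bracketed phrases.
    def stage_events(text):
        match = re.search(r"\[NG (\w+)\] acquired \[NG (\w+)\]", text)
        if match:
            return {"event": "acquisition", "buyer": match.group(1),
                    "target": match.group(2)}
        return None

    marked = stage_basic_phrases("Yesterday the Buyer acquired the Target for cash.")
    print(marked)               # Yesterday [NG Buyer] acquired [NG Target] for cash.
    print(stage_events(marked)) # {'event': 'acquisition', 'buyer': 'Buyer', 'target': 'Target'}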
FASTUS and finite state techniques are discussed in Natural Language Processing for Online Applications (see above).
Extraction Patterns for Information Extraction Tasks: A Survey by Ion Muslea, 1999
What Ellen Riloff refers to as "case frames" this paper refers to as "extraction patterns" or "extraction rules". Some papers refer to extraction patterns as templates. This paper provides a survey of machine learning algorithms that generate extraction patterns. AutoSlog is one of the systems discussed (apparently the paper was written before the AutoSlog-TS research work was published).
All of these systems share a common preprocessing step:
Such extraction rules are based on syntactic and semantic constraints that help identify the relevant information within a document. Consequently, in order to apply the extraction patterns below, one has to pre-process the original text with a syntactic analyzer and a semantic tagger.
This paper provides a few paragraphs surveying a variety of academic information extraction systems. This gives the paper the feel of an annotated bibliography and it is difficult to draw conclusions about the strengths and weaknesses of the various systems, beyond the characteristics of the extraction patterns (e.g., some are single slot, others are multi-slot).
Financial information extraction using pre-defined and user-definable templates in the LOLITA System (PDF) by Marco Costantino, Richard G. Morgan and Russell J. Collingham, September 24, 1996
The scope of the LOLITA system is broader than information extraction. A major part of LOLITA is a semantic network (a graph which represents information and its associations - knowledge).
The semantic network consists of a hierarchy of nodes (concepts) connected with arcs. The nodes represent entities (the company), events (The company made losses) and relations (A company IS A business). The knowledge is stored in the network by using control variables. Control variables are essential information stored at each node; there are about 50 different control variables. Knowledge is represented in the Semantic Network according to the connectivity between the nodes and arcs.
Concepts in the graph are linked by labeled arcs. There are specialization and generalization links, which allow concepts like set and subset to be represented. For example, the concept of a "company" is a specialization of the concept of a "business". Presumably this helps with document level information extraction, where one concept is linked to another.
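A minimal sketch of the specialization idea (my own, not LOLITA's actual representation, which is far richer and is implemented in Haskell):

    # A node in a toy semantic network with "is a" (specialization) links.
    class Node:
        def __init__(self, name):
            self.name = name
            self.is_a = []          # generalization arcs

        def add_is_a(self, parent):
            self.is_a.append(parent)

    def is_specialization_of(node, concept):
        # Walk the generalization links upward.
        if node is concept:
            return True
        return any(is_specialization_of(parent, concept) for parent in node.is_a)

    business = Node("business")
    company = Node("company")
    company.add_is_a(business)

    print(is_specialization_of(company, business))  # True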
The LOLITA based knowledge extraction system stores document information in LOLITA and then extracts information from the LOLITA semantic graph using templates (referred to by other authors as patterns). The process of storing the document information in LOLITA is itself a multi-stage information extraction task. The templates are semantic, however, not simply syntactic. For example, an event template can look for corporate takeovers.
The user-definable templates mentioned in the paper's title are these high-level, concept-based templates. The financial information extraction systems based on LOLITA also supported a basic question answering interface. For example, questions like What was the cost of the takeover? could be asked. This would generate a template that looked for takeovers and corporate purchase prices. Questions were not as effective at constructing templates as noun phrases (the cost of the takeover) or statements (a company acquires another company).
While this paper provides an overview of the LOLITA system and of a financial information extraction system built from LOLITA, a number of important issues are omitted.
Performance. To be useful a financial information extraction system must be able to handle large amounts of data. Given the complexity of processing a document so that it can be entered in a semantic graph, LOLITA may be slow. LOLITA is also implemented in the functional language Haskell. While Haskell seems to provide a great deal of power, it is likely that a Haskell based system would be much slower than a system implemented in C++ or even Java.
Useful results. A financial information extraction system could be used in a number of applications. The web based stock discussion boards could be mined for predictive information. News articles could be mined for trading events or market predictors. Although there are a variety of tantalizing applications for a financial information extraction engine, this article says nothing about whether the results were useful, even in toy applications.
Related links:
IE-Expert: Integrating Natural Language Processing and Expert Systems Techniques for Real-Time Equity Derivatives Trading by Marco Costantino, December 20, 1998, published in The Journal of Computational Intelligence in Finance, Vol.7, No.2, pp.34-52, March 1999.
The IE in the paper title stands for Information Extraction.
The system described in this paper uses the financial information extraction system built from LOLITA as a front end for a rule based system. The motivation for such a system in derivatives trading is discussed along with the overall system architecture. The accuracy of the information extraction is also presented.
Despite the "real-time" in the title, the author never discusses whether the system meets the real time goals needed for actual trading. He does mention that the expert system rule engine is implemented in C. However, LOLITA is still processing the text as it flows in from a source like Bloomberg. My suspicion remains that LOLITA is slow.
The final benchmark for such a system is not discussed at all: whether this system can help traders make more money. Admittedly this is the most difficult and time consuming benchmark, especially since there is a human component (the trader) which is difficult to quantify. In fact, such a question might only be answered by building a derivatives trading model, which uses predictors based on information extracted from the news stream and the expert system rules.
In many cases investment banks and hedge funds (the major traders in derivatives) want results that make money now. Even large investment banks do not tend to invest in long term projects like this, which may not pan out. So outside of the academic community, obtaining sponsorship to build such a system is difficult. The academic world, however, has a problem with information access. Information may want to be free, but it is most certainly not free when it comes to financial data. Investment banks have access to the on-line data and trading networks which would allow an information extraction trading model to be tested. In theory they also have the money to fund such a project. But if the money for the project comes, in part, out of the bonus for the trading group managers, funding is unlikely.
BADGER Information Extraction Software from the University of Massachusetts
The Natural Language Processing Laboratory web page at the University of Massachusetts states that they are no longer active in NLP. Their last publication is dated 1999. It is not clear what was responsible for what seems to be an NLP die-off at the University of Massachusetts.
UMass still distributes the source code for the BADGER information extraction system. This seems to be a push-down automaton based natural language processor (it uses the Bison parser generator). The BADGER system includes CRYSTAL, which attempts to automatically build dictionaries of case frames (extraction rules).
Cornell Natural Language Processing Group
Unlike the UMass group, the Cornell Natural Language Processing Group is still very active. They have a sub-group which specializes in Information Extraction and Text Summarization.
One of the challenges in Information Extraction is to reduce the human workload in developing extraction rules (a.k.a., case frames or templates). On their web page the Cornell group writes: Our research emphasizes the development of techniques for drastically reducing the amount of human-annotated data required to train an IE system.
Proteus Project at New York University
The Proteus Project at NYU is involved in various facets of Natural Language Processing, including information extraction and text summarization.
The Stanford Natural Language Processing Group
The Stanford NLP group seems to be a relatively new group centered around Prof. Chris Manning, co-author of Foundations of Statistical Natural Language Processing.
The Message Understanding Conference and Information Extraction
DARPA sponsored the Message Understanding Conference (MUC) for a number of years. Apparently SAIC was the conduit through which the DARPA support flowed. The US National Institute of Standards and Technology (NIST) now supports the MUC conference web site. This includes the data sets that were used to compare information extraction software, papers and links to other sites. As far as I can tell, the last Message Understanding Conference was MUC-7, which was held in 1998.
Document Understanding Conferences
Apparently the Document Understanding Conferences have picked up where the Message Understanding Conferences left off (the first one was held in 2001). Like the Message Understanding Conferences, the Document Understanding Conferences are sponsored by DARPA. The DUC papers seem to be heavily concerned with document summarization (which is related to information extraction).
Ian Kaplan September 2003
Revised: November, 2009