Information extraction (IE) software is an important part of any knowledge management system. Working in conjunction with information retrieval and organization tools, machine-driven extraction is a powerful means of finding content on the Web. Information extraction software pulls information from texts in heterogeneous formats, such as PDF files, emails, and Web pages, and converts it to a single homogeneous form. In functional terms, this converts the Web into a database that end-users can search or organize into taxonomies. The precision and efficiency of information access improve when digital content is organized into tables within a relational database. The two main methods of information extraction technology, natural language processing and wrapper induction, offer a number of important benefits:
They help end-users wade through and cope with the overwhelming amount of digital information
They access the "hidden Web," pages that are generated "on-the-fly" from relational databases as a result of a user query
They are part of a larger trend toward breaking up the Web into smaller, more manageable pieces
DEFINING INFORMATION EXTRACTION
Information extraction (IE) software identifies and extracts relevant information from texts, pulling information from a variety of sources and aggregating it to create a single view. IE translates content into a homogeneous form through technologies like XML (eXtensible Markup Language). The goal of IE software is to transform texts composed of everyday language into a structured, database format [1]. In this way, heterogeneous documents are summarized and presented in a uniform manner.
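To make the idea concrete, here is a minimal sketch in Python of what converting free text into a database-style record can look like. The sample sentence, the field names, and the crude regular-expression patterns are invented for illustration; a production IE system would rely on far richer linguistic or structural analysis.

```python
import re

# An invented press-release sentence standing in for one heterogeneous source.
press_release = ("Dr. Ada Lovelace of State University will present her "
                 "forthcoming paper on information retrieval at the annual meeting.")

# Crude surface patterns stand in for the analysis a real extractor would perform;
# the field names are illustrative, not a standard schema.
patterns = {
    "person": r"Dr\.\s+([A-Z][a-z]+ [A-Z][a-z]+)",
    "affiliation": r"of ([A-Z][A-Za-z ]+?) will",
    "topic": r"paper on ([a-z ]+) at",
}

record = {}
for field, pattern in patterns.items():
    match = re.search(pattern, press_release)
    record[field] = match.group(1) if match else None

print(record)
# {'person': 'Ada Lovelace', 'affiliation': 'State University',
#  'topic': 'information retrieval'}
```

Once many such records exist, they can be loaded into a relational table and searched by attribute, which is the "Web as database" idea described above.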
To improve accuracy and ease development, IE software is usually domain or topic specific. An IE system designed to monitor technical articles about Information Science, for example, could pull out the names of professors, research studies, topics of interest, conferences, and forthcoming publications from press releases, news stories, or emails and encode this information in a database. End-users can then search across this database by textual attribute or feature. A typical search could be for all forthcoming publications about information retrieval, or to locate all conference presentations on a specific information science topic. In addition, the structured information contained within a database could be ordered into a taxonomy.
DIFFERENCES BETWEEN IE AND IR
Information retrieval (IR) recovers a subset of documents that match an end-user's query, while IE recovers individual facts from documents. The difference between IR and IE is one of granularity regarding information access. IR is document retrieval and IE is fact retrieval [2].
Information extraction software requires that end-users specify in advance the categories of information they want to capture from a text. For instance, a system devoted to scanning financial news stories could extract all company names, interest rate changes, SEC announcements, or stock market quotes from texts. Because the parameters that define a particular topic are determined a priori, IE systems are fully customizable. IR and IE are different, but complementary. Together they create powerful new tools for accessing and organizing information stored on Web servers.
DIFFICULTIES OF INFORMATION RETRIEVAL AND EXTRACTION
Both IR and IE are difficult because they must overcome the ambiguities inherent in language. The complexities of representation make information access tricky. Indexing specific keywords can produce poor results because individual terms don't always line up with concepts. For example, the most popular type of IR available on the Web, and in traditional online IR, is keyword searching. The problem of information retrieval via keywords centers on two issues: synonymy and homonymy. Indexing the keyword "ambulance" won't retrieve documents that only use the synonymous "emergency vehicle," but it will retrieve homonymous yet irrelevant documents concerning, for example, a disreputable lawyer who is an "ambulance chaser."
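The following toy example, with invented documents and an invented synonym table, illustrates both failure modes: literal keyword matching misses the synonymous document and still retrieves the homonymous one.

```python
# Toy illustration of synonymy and homonymy in keyword retrieval.
docs = {
    1: "The emergency vehicle arrived within minutes.",
    2: "The lawyer was accused of being an ambulance chaser.",
    3: "An ambulance transported the patient to the hospital.",
}

def keyword_search(term, documents):
    """Return ids of documents containing the literal keyword."""
    return [doc_id for doc_id, text in documents.items()
            if term.lower() in text.lower()]

# Literal matching misses doc 1 (synonym) and still hits doc 2 (homonymous use).
print(keyword_search("ambulance", docs))  # [2, 3]

# Expanding the query with synonyms recovers doc 1, but doc 2 is still retrieved
# because keywords carry no sense of context.
synonyms = {"ambulance": ["emergency vehicle"]}
expanded = ["ambulance"] + synonyms["ambulance"]
print(sorted({d for term in expanded for d in keyword_search(term, docs)}))  # [1, 2, 3]
```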
In addition to synonymy and homonymy, IE must also contend with co-reference recognition. Co-reference recognition determines when an expression, such as a pronoun like "he," "she," or "it" or a noun phrase like "the company," refers to the same thing in a sentence. For IE to work correctly, the various entities within documents (people, places, events) must be identified within a block of text. Information extraction involves discourse analysis, and co-reference recognition links an expression back to an entity introduced earlier in the discourse [3]. In the following example, co-reference recognition would disambiguate the personal pronouns in the third sentence: "Barbara is a professional chef. Lael likes to make desserts. She made a birthday cake for her." The Web is filled with such terse and abbreviated communications.
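A minimal sketch of why this is hard: the snippet below simply lists every previously mentioned proper name as a candidate antecedent for each pronoun. That is enough to show that "She" and "her" are genuinely ambiguous between Barbara and Lael; choosing between them requires discourse-level evidence that a real resolver would have to supply. The heuristic here is invented for illustration, not a working co-reference algorithm.

```python
import re

text = ("Barbara is a professional chef. Lael likes to make desserts. "
        "She made a birthday cake for her.")

tokens = re.findall(r"[A-Za-z]+", text)
pronouns = {"she", "her", "he", "him", "it"}

names_seen = []
for token in tokens:
    if token.lower() in pronouns:
        # Both Barbara and Lael are grammatically possible antecedents.
        print(f"{token!r} could refer to: {names_seen}")
    elif token[0].isupper():
        names_seen.append(token)

# 'She' could refer to: ['Barbara', 'Lael']
# 'her' could refer to: ['Barbara', 'Lael']
```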
NATURAL LANGUAGE PROCESSING
Information extraction research in the United States has a fascinating history. It is a product of the Cold War. In the late 1980s, a number of academic and industrial research sites were working on extracting information from naval messages in projects sponsored by the U.S. Navy. To compare the performance of these software systems, the Message Understanding Conferences (MUC) were started. These conferences were the first large-scale effort to evaluate natural language processing (NLP) systems and they continue to this day. The MUCs are important because they established a mechanism that systematically evaluates and compares IE technology.
All participants in the MUCs develop software systems that extract information from texts composed of everyday speech. Participants develop software to perform a pre-determined IE task and then convene to compare notes. Conference organizers determine the topic of study; past MUCs have analyzed news releases about terrorist activities in Latin America, corporate joint ventures, company management changes, and microelectronics.
Formally evaluating NLP information extraction software requires compiling a corpus of texts and manually creating an answer key. Texts are run through an IE software system and a template of answers is produced. This template is then measured against an answer key that specifies what information should be extracted from a text [4]. Each software system fills empty template slots with appropriate values derived from the test documents.
Natural language processing is a complex task that involves many steps. The text is divided into sentences, and each word is tagged according to its part of speech (verb, adverb, noun). This syntactic structure is matched against pre-existing linguistic patterns and relevant content is identified. Semantic content (a text's meaning) is determined via these syntactic patterns. Then information is extracted and a summary is produced.
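As a rough illustration of these steps, the sketch below uses the NLTK toolkit to split a sample text into sentences, tag parts of speech, and apply one simple syntactic pattern (a run of proper nouns) as a stand-in for the extraction patterns a real system would use. The sample text and the chunk grammar are assumptions, and the required NLTK model downloads vary by version.

```python
import nltk

# One-time model downloads may be needed on first run, e.g.:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

text = "Dr. Lovelace will speak at the ASIST conference. She studies retrieval."

# A single illustrative pattern: one or more consecutive proper nouns (NNP)
# is treated as a candidate named entity.
chunker = nltk.RegexpParser("ENTITY: {<NNP>+}")

for sentence in nltk.sent_tokenize(text):          # step 1: sentence splitting
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))  # step 2: POS tagging
    tree = chunker.parse(tagged)                    # step 3: match syntactic pattern
    entities = [" ".join(word for word, tag in subtree.leaves())
                for subtree in tree.subtrees(filter=lambda t: t.label() == "ENTITY")]
    print(entities)                                 # step 4: extracted candidates
```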
EVALUATION METRICS
Evaluation metrics for information extraction have been refined with each MUC. Information extraction standards of measurement grew out of IR metrics, but the definitions of these measurements were altered. In brief, recall measures how much information was extracted, while precision measures how much information extracted was correct information and over-generation measures how much superfluous information was extracted.
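The sketch below shows how these three measures can be computed for a single filled template compared against its answer key. The slot names and values are invented, and real MUC scoring is considerably more elaborate (partial credit, multiple templates per document, and so on); this is only the core idea.

```python
# A system-filled template is compared slot by slot against a hand-built answer key.
answer_key = {"company": "Acme Corp", "event": "joint venture", "partner": "Globex"}
system_out = {"company": "Acme Corp", "event": "merger",
              "partner": "Globex", "location": "Tokyo"}

correct = sum(1 for slot, value in system_out.items()
              if answer_key.get(slot) == value)
spurious = sum(1 for slot in system_out if slot not in answer_key)

recall = correct / len(answer_key)            # share of key facts recovered
precision = correct / len(system_out)         # share of extracted facts that are right
over_generation = spurious / len(system_out)  # share of extracted facts not in the key

print(f"recall={recall:.2f} precision={precision:.2f} over-generation={over_generation:.2f}")
# recall=0.67 precision=0.50 over-generation=0.25
```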
When evaluating test results for NLP systems, it's instructive to consider human performance as a point of comparison. When humans perform information extraction tasks, they tend to fall short of perfect performance for a number of reasons: boredom, time constraints, and an inadequate subject background can all contribute to less than stellar precision and recall. In studies, human workers tested on 120 documents performed at 79% recall and 82% precision [5]. NLP systems often score in the mid-50s on both precision and recall [6]. While humans can outperform most machine-driven extraction, they cannot beat software in terms of speed and scalability.
Furthermore, it's important not to equate these metrics with standard grading scores where 90-100% is excellent, 80-90% is good, and 70-80% is acceptable [7]. Using this measurement, highly-trained human labor can manage no more than "acceptable" recall and "good" precision when measured against an answer key.
WRAPPER INDUCTION
Wrapper induction, the other tradition in information extraction, evolved independently of NLP. A wrapper is a procedure designed to extract content from a particular Web resource using predefined templates [8]. In contrast to NLP, wrapper induction operates independently of specific domain knowledge. Instead of analyzing the meaning of discourse at the sentence level, this software identifies relevant content based on the textual qualities that surround desired data. Wrappers operate on surface features (often document structure) that characterize training examples. A number of vendors, such as Jango (purchased by Excite), Junglee (purchased by Amazon), and Mohomine, employ wrapper induction technology.
While the MUCs encouraged the development of IE within the natural language processing community, the explosive growth of the Web is responsible for the increasing popularity of wrappers. The need for tools that could extract and integrate data from multiple Web sources led to the development of the wrapper generation field. Wrappers are less dependent on full grammatical sentences than NLP techniques, which matters when extracting content from resources like emails and press releases. Many information resources on the Web do not exhibit the rich grammatical structure that NLP was designed to exploit. Furthermore, linguistic approaches tend to have long processing times, while wrappers are fast to create and test.
Furthermore, wrappers demonstrate that extensive linguistic knowledge is not necessary for successful IE; shallow pattern-matching techniques can be very effective. Information can be extracted from texts based on document format rather than on what the sentences "actually mean." This type of IE analysis is ideally suited to the Web because online information is a combination of text and document structure. Almost all documents located on Web servers offer clues to their meaning in the form of textual formatting. For example, research on a collection of email conference announcements shows that speakers' names are often prefixed by "Who:" and that many names begin with the title "Dr." Document structure can easily be exploited to decide where relevant content is located. Such announcements often follow a discernible pattern, which means that relevant information, such as location, affiliation, or job title, can be located within a text based on formatting alone [9].
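Here is a minimal sketch of that kind of surface-level extraction. The announcement text and its field labels are invented (loosely modeled on the seminar-announcement example just described); the point is that each fact is located by the label that precedes it, with no analysis of sentence meaning.

```python
import re

announcement = """\
Who: Dr. Dayne Freitag
When: October 16, 2000, 3:00 PM
Where: Room 4140, Computer Science Building
Topic: Two Approaches to Learning for Information Extraction
"""

# Each field is found purely by the formatting cue (its label), not by linguistics.
fields = {}
for label in ("Who", "When", "Where", "Topic"):
    match = re.search(rf"^{label}:\s*(.+)$", announcement, re.MULTILINE)
    if match:
        fields[label.lower()] = match.group(1).strip()

print(fields["who"])    # Dr. Dayne Freitag
print(fields["topic"])  # Two Approaches to Learning for Information Extraction
```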
THE MACHINE LEARNING SUCCESS STORY
Both schools of IE (NLP and wrapper induction) depend on a set of extraction patterns to retrieve relevant information from each document. These patterns are established through machine learning. A machine learning program is software that improves its performance based on experience. Machine learning involves the use of manually indexed documents to "teach" the software what attributes make up desired content. Machine learning identifies patterns in sample documents and makes predictions about as yet unprocessed texts [10]. Instead of isolating keywords, this technology looks for patterns that exist in documents and uses this information to determine the meaning of a text. In summary, training documents build models that describe relevant information, and texts are run through these models to extract information.
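A minimal sketch of this learning idea, under the simplifying assumption that the target value is always surrounded by the same literal delimiters: from two hand-labelled training pages, the code infers the left and right context of the value and then applies the learned pattern to an unseen page. The markup and prices are invented for the example; real wrapper-induction algorithms handle far messier variation.

```python
import re

# Hand-labelled examples: (page text, the value a person marked as relevant).
training = [
    ("<li>Price: <b>$12.99</b></li>", "$12.99"),
    ("<li>Price: <b>$8.50</b></li>", "$8.50"),
]

def learn_delimiters(examples):
    """Infer the longest left/right context shared by all labelled values."""
    lefts, rights = [], []
    for page, value in examples:
        start = page.index(value)
        lefts.append(page[:start])
        rights.append(page[start + len(value):])
    # Longest common suffix of the left contexts...
    left = lefts[0]
    for other in lefts[1:]:
        while not other.endswith(left):
            left = left[1:]
    # ...and longest common prefix of the right contexts.
    right = rights[0]
    for other in rights[1:]:
        while not other.startswith(right):
            right = right[:-1]
    return left, right

left, right = learn_delimiters(training)
print(repr(left), repr(right))  # '<li>Price: <b>' '</b></li>'

# Apply the learned pattern to a page the system has never seen.
unseen = "<li>Price: <b>$20.00</b></li>"
print(re.search(re.escape(left) + r"(.*?)" + re.escape(right), unseen).group(1))  # $20.00
```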
At the most fundamental level, machine learning rests on the assumption that mathematical concepts capture relevant properties of reality, and that you can translate back and forth between reality and mathematics. In short, concepts are translated into mathematical patterns.
Machine learning is important because content management on the Web must be automated as much as possible. Automation is key because human labor cannot scale to an ever-increasing number of documents and users. Automation significantly reduces the amount of money and number of hours required to manage content effectively.
CHALLENGES FOR IE TECHNOLOGY
Scalability and portability are the challenges that face both NLP and wrapper induction technologies. NLP establishes patterns that are valid for a specific domain and for a particular task only. As soon as the topic changes, entirely new patterns need to be established. For instance, the verb "to place" within the domain of terrorist activities is always linked with bombs [11]. Making this assumption outside this topic would lead to trouble. To place the ball on the ground, the vase on the table, or the person in the job is unrelated to terrorism. No NLP software system can claim to tackle general language in an open-ended task. This is the chief problem of all practical natural language processing systems. NLP is effective only in a narrowly restricted domain. Unrestricted natural language processing is still a long way from being solved, yet IE methods are effective because they rely on topical restrictions.
On the wrapper side of IE, these systems require large amounts of training data, and collecting these examples can be time-consuming. To avoid gathering a lot of training data, the domain in which the wrapper is expected to be effective must be limited. In addition, Web sites are occasionally remodeled, and when the user interface changes, a site's wrapper is broken. Because wrappers rely on low-level formatting details, they are brittle [12].
XML AND IE
XML is an important step towards offering efficient resource discovery on the Web, although it does not completely solve the problem. XML is important because it facilitates increased access to and description of the content contained within documents. The technology separates the intellectual content of a text from its surrounding structure, meaning that information can be converted into a uniform structure.
XML makes it easier for developers to take the pieces of a document apart and reassemble them, yet the information extraction itself still needs to be accomplished. XML is an enabling technology: like a building permit at a construction site, it permits information aggregation and synthesis without performing the work itself.
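As a small illustration of the uniform structure XML provides, the sketch below re-expresses a handful of extracted facts as an XML element. The element names are assumptions chosen for the example; any consistent schema would serve the same purpose.

```python
import xml.etree.ElementTree as ET

# Facts already pulled out of a free-text source (values reused from the earlier sketch).
extracted = {"person": "Dr. Ada Lovelace",
             "affiliation": "State University",
             "topic": "information retrieval"}

# Re-express the record in a uniform, machine-readable structure.
publication = ET.Element("publication")
for field, value in extracted.items():
    ET.SubElement(publication, field).text = value

# Prints a single <publication> element with one child per extracted field.
print(ET.tostring(publication, encoding="unicode"))
```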
SOLUTION TO THE "INFOGLUT" PROBLEM
In the current "infoglut" context, content management is a key component of basic literacy. A Fall 2000 study numbered the Web at 2.5 billion documents with 7.3 million unique pages being added everyday [13]. The study's authors concluded, "We are all drowning in a sea of informationthe challenge is to learn how to swim" [14]. The ability to access, organize, and think critically about information is becoming both more important and more difficult. Because IE technology integrates content scattered across Web servers, it is an important first step in taming the Internet wilderness.
Information access would improve if information in many different formats could be extracted and integrated into a structured form. Because IE is an attempt to convert information from various text documents into database entries, it plays a key role in improving online knowledge discovery. Information extraction software has the potential to convert the Web into a structured database. This is an exciting vision for reordering how end-users retrieve and organize digital information. Once information is encoded in a database, it could be organized into a taxonomy or searched over by textual attribute or feature. This stands as a vast improvement over the usual search protocol: index content and query full-text documents by keyword.
THE HIDDEN WEB
The "hidden Web" refers to Web pages that are dynamically generated from databases. Web technology is shifting away from putting content into static pages and towards placing information in relational databases. These databases are flexible structures that assemble content "on-the-fly" and deliver it to end-users. Such Web sites are organized as a series of templates. Based on a user query, content is pulled from databases and placed in a template. Conventional search engines cannot index the "hidden Web." When spiders come across such databases, they are locked out. They can record the database's address, but cannot index its documents.
IE can access the useful information hidden away in relational databases. In fact, wrapper induction technology is especially well-suited to this problem because the wrapper only has to learn a Web site's template. Templates are easy for wrappers to train on. The "hidden Web" is another example of how document structure can be exploited by IE technology.
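The sketch below suggests why: the invented result page follows one fixed template, so a single pattern keyed on the template's markup recovers the underlying database fields from any page the site generates.

```python
import re

# An invented, database-backed result page; every result on this hypothetical
# site shares the same table layout.
result_page = """\
<table>
  <tr><td class="title">Empirical Methods in Information Extraction</td></tr>
  <tr><td class="author">Claire Cardie</td></tr>
  <tr><td class="year">1997</td></tr>
</table>
"""

# One pattern per template slot, keyed on the class attribute the site uses
# for that column of its database.
pattern = re.compile(r'<td class="(\w+)">(.*?)</td>')
record = dict(pattern.findall(result_page))
print(record)
# {'title': 'Empirical Methods in Information Extraction',
#  'author': 'Claire Cardie', 'year': '1997'}
```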
WEB CONTENT FALLS APART
Information extraction technology is part of a promising new trend that breaks up content and reassembles it into smaller chunks [15]. As content "goes to pieces" on the Web, information extraction technology grows in importance. The Web's current navigation model of browsing from site to site does not facilitate retrieving and integrating data from multiple sites. IE tears down the barriers that separate information residing on different Web sites. Because this technology aggregates and synthesizes content from various sources on the Web, it introduces greater efficiency and granularity to the task of finding digital information.
Call it what you will: integral to knowledge management initiatives, an alternative to traditional search functions, the catalyst for knowledge mining, an access point to the hidden Web, or the solution to information overload. Data extraction technologies foreshadow exciting new developments for information professionals.
Acknowledgements: Many thanks to Chris Harris for his thoughtful comments. I appreciate Neil Senturia's continuing support.
REFERENCES
[1] Gaizauskas, Robert and Yorick Wilks. "Information Extraction: Beyond Document Retrieval." Journal of Documentation. 54, no. 1 (January 1998): pp. 70-105.
[2] Ibid.
[3] Cardie, Claire. "Empirical Methods in Information Extraction." AI Magazine. 18, no. 4 (Winter 1997): pp. 68-80.
[5] Lehnert, Wendy. "Information Extraction." Etext from the Natural Language Processing Laboratory, University of Massachusetts: http://www-nlp.cs.umass.edu/nlpie.html.
[6] Ibid.
[7] Lehnert, Wendy. "Cognition, Computers and Car Bombs: How Yale Prepared Me for the '90s" in Beliefs, Reasoning, and Decision Making: Psycho-logic in Honor of Bob Abelson (eds: Schank & Langer), Lawrence Erlbaum Associates, Hillsdale, NJ. (1994): pp. 143-173. Shorter version of the essay without graphics: http://ciir.cs.umass.edu/pubfiles/cognition3.pdf.
[8] Eikvil, Line. "Information Extraction from World Wide Web: a Survey." Report No. 495, Norwegian Computing Center (July 1999).
[9] Freitag, Dayne. "Two Approaches to Learning for Information Extraction." Talk at UC San Diego (San Diego, CA) on October 16, 2000.
[10] Kushmerick, Nicholas. "Wrapping up the Web." Synergy: Newsletter of the EC Computational Intelligence and Learning Cluster Issue 2 (Spring 2000): http://www.dcs.napier.ac.uk/coil/news/feature46.html.
[13] Lyman, Peter & Hal R. Varian. "How Much Information?" http://www.sims.Berkeley.edu/how-much-info.
[14] "Online Content: How Much Information can you Handle?" Internet Content (Oct. 23, 2000): http://www.internetcontent.net.
[15] Luh, James. "Content Goes to Pieces." Internet World (July 1, 2000): http://www.internetworld.com/070100/7.01Cover1.asp.
DEFINITIONS
CORPUS A set of documents. For example, the Monster.com resume database, the works of Shakespeare, and the Web itself.
EXTRACTION PATTERN A pattern that represents a pre-determined entity or event (corporate names, conferences, and workshops, etc.) in a natural language text.
NATURAL LANGUAGE PROCESSING (NLP) Using software to "understand" the meaning contained within texts. Everyday speech is broken down into patterns. Typically, these systems employ syntactic analysis to infer the semantic meaning embedded in documents. NLP identifies patterns in sample texts and makes predictions about unseen texts. Also called computational linguistics.
SEMANTIC The part of language concerned with meaning. For example, the phrases "my mother's brother" and "my uncle" are two ways of saying the same thing and, therefore, have the same semantic value.
SEMI-STRUCTURED TEXTS Most information contained on the Web is embedded in semi-structured texts. This includes email, news stories, resumes, magazine articles, press releases, etc. The information contained within these documents is not as rigidly ordered as database entries, but it does contain some reliable formatting.
STRUCTURED TEXTS The various types of documents available on the Internet are often (erroneously) characterized as structured or unstructured. Structured documents refer to database entries and information in tabular form. For example, MARC records in an OPAC database, search results on Yahoo!, Ebay product postings, etc. See semi-structured texts.
SYNTACTIC The part of language concerned with syntax, or sentence structure. For example, the phrases "my mother's brother" and "my uncle" express the same relationship, but the way in which the information is expressed differs.
UNSTRUCTURED TEXTS Sometimes called "natural language" texts, these are documents that exist with minimal formatting. The difference between unstructured and structured texts is a matter of document formatting. See semi-structured texts.
Katherine C. Adams (kadams@monohime.com) is an information architect for Mohomine.