University of Pittsburgh

March 5, 2009

Electronic information explosion poses retrieval-effort challenge

The explosion of new information technologies has brought with it an explosion of information, in print or audio form, that must be stored and retrieved in novel ways.

Jason R. Baron, director of litigation at the National Archives and Records Administration, discussed his role in improving the retrieval of relevant electronically stored information (ESI) in a Feb. 20 talk sponsored by the School of Information Sciences Digital Libraries and Cyberscholarship Colloquium.

Baron’s talk, “What Do I Do With a Billion Emails? The TREC Legal Track and the Future of Information Retrieval in E-Discovery,” focused on his work with the Text Retrieval Conference (TREC), a project that is evaluating more efficient methods of wading through the ever-increasing numbers of electronic records in response to litigation-related requests.

“The world we live in at the National Archives is the world of email,” Baron said. “It’s not just the Constitution or the Declaration of Independence and a bunch of old dusty records. It’s email and more email and lots of email and then lots of other electronic records that are eventually going to be coming to us through Web 2.0 stuff.”

Baron said lawyers increasingly have had to request ESI in addition to paper records as they prepare cases. “Asking for documents didn’t cut it because there’s all this other stuff in electronic form,” he said.

ESI has no fixed definition; the category is open-ended. A common example is email, but the list of other ESI keeps growing. “Today it’s email, tomorrow it’s wikis, the next day it’s the next thing: RSS feeds and beyond to Web 2.0 and whatever else,” he said. “All of it is evidence in litigation and all of it needs to be searched.”

The problem of retrieving relevant information from an ever-growing variety of electronic sources isn’t limited to the legal profession, he said, noting that archivists and historians face similarly daunting searches and that the fields of information science, law, business and engineering share a common set of issues with regard to the problem.

Baron got involved in searching electronic records during his 13 years as a Justice Department trial lawyer and later moved on to his position at the National Archives. He was involved in the case against tobacco giant Philip Morris, in which the National Archives was ordered to produce all the relevant documents it held.

“I got a request to produce that was 1,726 paragraphs long,” he said, adding that the request meant that the archives had to search all its paper records on tobacco policy in the presidential libraries dating back to the Eisenhower administration, plus some 32 million electronic federal records dating back to the Clinton administration. “It was a tremendous problem meeting our litigation obligations,” he said.

His solution was to devise a set of relevant keywords to retrieve the information, much as anyone searching online would. “It was a simple task in one sense,” he said. However, the database search for TI, an abbreviation for Tobacco Institute, sometimes turned up information related to “ti” as in the tone on the musical scale. PMI, short for Philip Morris Institute, yielded results about presidential management interns.

Using keywords including “tobacco,” “smoking” and “tar,” his search yielded 200,000 results, or 1 percent of some 20 million presidential email records. “Of those, I spent six months with 25 people, lawyers and archivists, culling that down to 100,000 relevant emails,” he said. Of those relevant emails, 80,000 were turned over to the opposing party.

The rising numbers of electronic records are rendering manual searching impossible. “I can do 1 percent of 20 million in six months with 25 people, but I can’t do 1 percent of a billion,” he said.

To illustrate, he said it would take 100 people working 10 hours a day 54 years to go through a billion documents manually. “Even 10 million would take 28 weeks,” he said. “I don’t have time as a National Archives lawyer to spend that amount of time.”
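The arithmetic behind those figures can be reconstructed with a short sketch. The article gives the team size and the hours per day but not the reading speed, so the rate of 50 documents per reviewer per hour used here is an assumption, chosen because it reproduces the quoted numbers; the function name `review_days` is likewise only illustrative.

```python
# Assumption (not stated in the article): each reviewer reads 50 documents per hour.
DOCS_PER_HOUR = 50


def review_days(num_docs, reviewers=100, hours_per_day=10, docs_per_hour=DOCS_PER_HOUR):
    """Days needed for the team to read every document once, working every day."""
    docs_per_day = reviewers * hours_per_day * docs_per_hour
    return num_docs / docs_per_day


# One billion documents, expressed in years of 365 working days.
billion_years = review_days(1_000_000_000) / 365
# Ten million documents (1 percent of a billion), expressed in 7-day weeks.
ten_million_weeks = review_days(10_000_000) / 7

print(f"1 billion documents: about {billion_years:.1f} years")
print(f"10 million documents: about {ten_million_weeks:.1f} weeks")
```

Under that assumed rate, the team clears 50,000 documents a day, which works out to between 54 and 55 years for a billion documents and between 28 and 29 weeks for 10 million, in line with the figures Baron cited.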

Those numbers aren’t far off. “There will be a billion emails at the National Archives at the end of the Obama administration if there are two terms,” Baron noted.

Lawyers now need to rethink everything they learned in the 20th century about discovery and how to do it, he said. “If you have a task to go through a whole collection to look for relevant documents, you can’t do it manually.”

Automated database searches have their own problems. Despite the common belief that a well-crafted keyword search will yield most relevant documents, the method has plenty of inefficiencies.

The optical character recognition (OCR) scanning process that translates printed words into computer-editable text is prone to letter recognition errors. “When you have 250 ways to spell tobacco, that’s a problem,” Baron noted. In addition, “every word that you can think of that’s material to a major litigation has multiple uses for the term,” he said, citing “strike,” which can refer to baseball or labor, as one example.
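A small sketch shows why OCR errors defeat an exact keyword match and how approximate matching can recover some of the variants. The misspellings and the 0.8 similarity cutoff are made up for illustration; the article does not say which technique, if any, was used to handle such errors. The comparison relies on `difflib.SequenceMatcher` from the Python standard library.

```python
from difflib import SequenceMatcher


def fuzzy_hits(term, tokens, cutoff=0.8):
    """Return the tokens whose similarity to `term` is at least `cutoff`."""
    term = term.lower()
    return [
        token
        for token in tokens
        if SequenceMatcher(None, term, token.lower()).ratio() >= cutoff
    ]


# Hypothetical tokens from scanned text; two are OCR-style misspellings.
tokens = ["tobacco", "tobaceo", "t0bacco", "tomato", "smoking"]

exact = [token for token in tokens if token == "tobacco"]
fuzzy = fuzzy_hits("tobacco", tokens)

print("exact match:", exact)
print("fuzzy match:", fuzzy)
```

The exact search finds only the correctly spelled token, while the fuzzy search also picks up the two misspellings and still rejects “tomato.” A looser cutoff would catch more variants at the cost of more false hits, and approximate matching does nothing for the second problem Baron describes, a correctly spelled word with more than one meaning.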

Even searching on a person’s name isn’t foolproof. In the National Archives, for instance, a search on George Bush raises the question of which one: the nation’s 41st president or the 43rd?

“It’s all contextual,” he said.

In addition, there are complexities caused by synonyms. “None of us are smart enough to think of a bag of words completely on our own,” Baron noted.

Changes in word usage also complicate matters. Baron noted that to him, POS is a business term meaning “point of sale,” while to his teenage daughter it’s a chat code for “parent over shoulder.”

“The search issue is multiplied,” he said.

The goal in a good search is to maximize pertinent results and minimize the “noise” of irrelevant ones, Baron said. “That task is a harder one than the Google task of finding a good restaurant in Pittsburgh tonight.”

Hungry searchers don’t care if their search yields 10,000 results. “You look at the first few pages and call it a day,” Baron said. “But for my task as a lawyer, I need those 10,000 hits. I need to find all relevant documents related to a case.”

Attorneys want those relevant hits to be found, and the non-relevant ones to be omitted from the results.
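Information retrieval researchers usually express these two goals as precision (the share of retrieved documents that are relevant) and recall (the share of relevant documents that are retrieved). The article does not use those terms, so the sketch below is an illustration of the idea rather than Baron's own measure, and the document IDs are invented.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: five relevant documents, four documents retrieved.
relevant = {1, 2, 3, 4, 5}
retrieved = {1, 2, 6, 7}

precision, recall = precision_recall(retrieved, relevant)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```

In this made-up case half of what the search returned is noise (precision 0.50) and three of the five relevant documents were never found (recall 0.40). A restaurant search can live with low recall; a lawyer responding to a document request cannot.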

Certain searches, such as patent searches that contain very specific technical terms, are more suitable for a keyword search.

Others, such as for antitrust evidence, are tougher. “People don’t say in email they have committed fraud, or ‘We violated the Sherman Act today,’” Baron said. “You need to come up with proxies that are much more difficult in terms of the squishiness of human language than a very focused exercise.”

Raising the efficiency of searches so more relevant documents are found in less time is crucial. “What keeps me up at night,” Baron said, “is false negatives: the ‘smoking gun’ document that I don’t find.”

Attorneys may not be aware that there are other methods beyond basic Boolean keyword searches that can be employed.

Research is needed to answer “whether I’m doing well enough using keywords versus some other set of search methodologies that might be out there when I have a big litigation like the Philip Morris case,” Baron said.

To aid the legal community in more efficiently generating searches that yield relevant results, the TREC Legal Track, sponsored by the National Institute of Standards and Technology, researches the efficiency of various information retrieval methods.

The pro bono project, now in its fourth year, creates imaginary legal complaints, complete with fictional document requests. Searches for the pertinent documents are run against a publicly available database of 7 million documents as a means of comparing the success of Boolean keyword searches against alternative methods.

In the project’s first year, 53 percent of relevant documents were found by Boolean searches; 47 percent by some other technique, said Baron. In the second year, 78 percent of relevant documents were found using other methods versus 22 percent found by Boolean searches.

“That’s a potentially startling figure for the legal profession,” said Baron. “It means that 78 percent of relevant documents in a large universe of information are left on the table and are not found by keyword searches of the type that lawyers use.”

The paradox is that no single method beats Boolean searching; the comparison pits Boolean results against the combined results of all the other methods. “Nobody finds everything,” he said.

“TREC is telling us you need to use a bunch of different methods,” Baron said. “No automated method is going to be perfect.”
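The paradox can be made concrete with a toy example. The document IDs and method names below are invented for illustration and are not TREC data; the point is only that several methods, each weaker than the Boolean search on its own, can together find many documents the Boolean search misses.

```python
# Hypothetical document IDs, invented for illustration; not TREC data.
relevant = set(range(1, 11))  # ten relevant documents in the collection

boolean_hits = {1, 2, 3, 4}
other_methods = {
    "method_a": {3, 5, 6},
    "method_b": {4, 7, 8},
    "method_c": {1, 9, 10},
}

# Best result achieved by any single alternative method.
best_single = max(len(hits & relevant) for hits in other_methods.values())

# Everything the alternative methods find when their results are pooled.
combined = set().union(*other_methods.values()) & relevant

# Relevant documents found by the pool but not by the Boolean search.
missed_by_boolean = combined - boolean_hits
share_missed = len(missed_by_boolean) / len(relevant)

print("Boolean search finds:", len(boolean_hits & relevant))
print("Best single alternative finds:", best_single)
print("Alternatives combined find:", len(combined))
print(f"Share of relevant documents Boolean missed: {share_missed:.0%}")
```

Here the Boolean search finds four relevant documents and no alternative finds more than three, yet the pooled alternatives find nine and account for six documents the Boolean search never surfaced. No method finds everything: document 2 turns up only in the Boolean results.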

The premium is on finding methods that locate “the really good stuff, the highly relevant, hot document,” Baron said, noting that an added aspect of the TREC Legal Track this year is to have volunteers evaluate search results to rate them in terms of being “highly relevant” rather than “merely relevant.”

“It is of great interest to find the methods to find the smoking gun,” he said. “It’s the elusive grail quest.”

—Kimberly K. Barlow

