The Cole Digest, May 31, 1995

Gentle Reader,

We're with Dan Woods (formerly of the Raleigh News & Observer; now with Time Inc.) and he's discussing how the paper set up a freeware WAIS (Wide Area Information Server) as an electronic library:

The strength of WAIS is its effortless connectivity. WAIS is based on the TCP/IP networking protocol, the basis of the Internet, which is available on every major computing platform. From the day we created our first index, we have been pleasantly surprised by new interfaces. We started searching using a Gopher client, and then discovered that the United States Geological Survey had distributed a WAIS client for Windows. Then we discovered a program called MacWais, which allowed searching from a Macintosh. Other programmers have created gateways to WAIS from Mosaic, a World-Wide Web client, and Lynx, a text-based Web client.

Now for the bad news: The search engine is the weakest part of WAIS. It allows queries such as this:

    crime and murder and unsolved

This query will find every story that has these three words in it. The stories are returned in relevance order -- a sequence based on how frequently and how early each word occurs in each story. A relevance score is assigned to each story that satisfies the query.

Fielded searching allows a portion of a document to be searched. Fields are described to the index program and a separate index is created for each field. A sample query is as follows:

    banking and failure and byline=woods

This will find all of the stories that have the words "banking" and "failure" in them and have "woods" in the byline portion of the story.

This query illustrates a couple of other problems. Most advanced searchers assume that plurals are automatically handled because almost every modern search engine does so. But here, the search term "failure" will find only "failure," not "failures."
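As a rough illustration of the query behavior described above, the following Python sketch evaluates AND queries with an optional field restriction and ranks the hits. It is not the WAIS code: the `tokenize`, `relevance` and `search` functions, the dictionary layout of a story, and the scoring formula (occurrence count plus a bonus for an early first occurrence) are all assumptions chosen to mirror the description, and the sample stories are invented.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into words (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def relevance(words, terms):
    """Toy relevance score; returns None if any term is missing (AND semantics).

    Assumed formula, not the WAIS one: each term contributes its number of
    occurrences plus a bonus that is larger the earlier it first appears.
    """
    score = 0.0
    for term in terms:
        positions = [i for i, word in enumerate(words) if word == term]
        if not positions:
            return None
        score += len(positions) + 1.0 / (1 + positions[0])
    return score


def search(stories, query):
    """Evaluate a query such as 'banking and failure and byline=woods'."""
    terms, fields = [], {}
    for part in re.split(r"\s+and\s+", query.strip().lower()):
        if "=" in part:
            name, value = part.split("=", 1)  # fielded term, e.g. byline=woods
            fields[name] = value
        else:
            terms.append(part)

    hits = []
    for story in stories:
        # Every fielded term must appear as a word in the named field.
        if any(value not in tokenize(story.get(name, ""))
               for name, value in fields.items()):
            continue
        score = relevance(tokenize(story["text"]), terms)
        if score is not None:
            hits.append((score, story))

    # Highest score first, as in relevance order.
    hits.sort(key=lambda hit: hit[0], reverse=True)
    return hits


# Invented sample stories, for illustration only.
stories = [
    {"byline": "Dan Woods",
     "text": "Banking failure hits town. The failure followed lax banking rules."},
    {"byline": "Dan Woods",
     "text": "Regulators review several banking failures."},
    {"byline": "Jane Smith",
     "text": "A banking failure was reported."},
]

hits = search(stories, "banking and failure and byline=woods")
```

With these invented stories, the fielded query matches only the first one. The third is excluded by the byline restriction, and the second is skipped because exact matching treats "failures" as a different word from "failure."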
This can be overcome by entering "failure" followed by an asterisk ("failure*"), but users don't always remember such techniques.

Another problem is that the implicit connector is the Boolean operator OR. Most of our users would prefer that AND were the default connector so that "banking failure byline=woods" would be implicitly connected by AND.

Searching for a phrase is difficult. WAIS allows it, but the implementation is so bad that it never finishes searching when common words appear as the first word of the phrase. This is surmountable, but it would be better if it didn't exist.

Finally, one of the most troubling problems is index overflow. Words that appear too frequently get dropped out of the index, and searches using those words return no matching stories. In our database, only about 15 words fall in this category, but it does create a problem. For example, the word "Raleigh" overflows the index, so that we can't do searches that use it. Because this word appears so frequently in our database, it is not much of a search term, but its absence is an annoyance. Frequently, common words are important in combination with other words, like "Raleigh City Council."

Next week: development and alternatives.

Onward.

\dmc

[THE COLE DIGEST is written by consultant David M. Cole, editor and publisher of the industry newsletter THE COLE PAPERS. The DIGEST is made available to PressLink subscribers every Wednesday at no extra charge. Send comments by e-mail to cole@plink.geis.com. The COLE DIGEST is the property of The Cole Group, a California sole proprietorship. Reproduction in whole or in part without the written permission of The Cole Group is prohibited. Copyright (C) 1995, The Cole Group. Opinions expressed are those of The Cole Group, unless otherwise noted.]

[THE COLE PAPERS is a monthly newsletter devoting itself to technology, journalism and publishing. Subscriptions are $117 for 12 issues ($135 outside the U.S.).
MasterCard, Visa and American Express cards are accepted. For more information, e-mail COLE, call (415) 673-2424, fax (415) 673-2449 or write The Cole Group, 2590 Greenwich St., Ste. 9, San Francisco USA 94123-3333.]
Copyright © 1990-2008, The Cole Group. All Rights Reserved. Modified date: 06/08/1995, 09:51:59 AM. URL: http://www.colegroup.com/miscellanea/TCD/Cole130.HTML