The Cole Digest, May 24, 1995

Gentle Reader,

As newspapers move more rapidly toward digital delivery, it is becoming imperative for them to have a robust database of all the stories and pictures they've published. For some, that means going out and buying a system from a supplier; for others, it means doing the work in-house. At the Raleigh News & Observer, they took the latter tack.

I had heard about the work done in Raleigh and asked Dan Woods, formerly the paper's database editor (and now with Time Inc.'s Pathfinder project), to set down some of the steps the paper took. For the next couple of weeks, we'll hear from Woods:

The Raleigh News & Observer's News Research department looked far and wide in 1994 for an archive solution. Paralog, Cascade, Personal Library and Basis presented their wares, which either were too expensive or lacked the desired functionality.

Then we looked to WAIS, an acronym for Wide Area Information Server that's pronounced "ways." WAIS, a program ubiquitous on the Internet, creates an inverted index that allows documents to be searched and retrieved. While the searching capability has some surmountable flaws, the connectivity is unparalleled and the cost is unbeatable -- free. We decided that putting our archive up on WAIS was worth a try. UNIX knowledge is the commodity needed to unlock the usefulness of WAIS, and fortunately, because of our long-term commitment to UNIX, we had plenty of that.

The result of our experiment is an archive system that allows our reporters to search the text of stories from almost every Internet client. Its field indexing capabilities allow reporters to search for information in fields like date, headline, page number, byline and dateline. Individual search terms can be connected with the Boolean operators AND, OR and NOT. It's not perfect, but it does comfortably exceed our minimum requirements, and, best of all from our perspective, we can improve any part we want with a little programming.
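For readers who have not met an inverted index, here is a minimal sketch of the idea in Python. It is not WAIS's code, nor the paper's Perl. The sample stories, the helper names (`tokenize`, `build_index`, `lookup`) and the choice to key the index on (field, word) pairs are all assumptions made for illustration. Only the field names and the AND, OR and NOT operators come from the description above.

```python
import re
from collections import defaultdict

# Hypothetical sample stories. The field names follow the column;
# the contents are invented for illustration.
STORIES = {
    1: {"headline": "Council approves budget", "byline": "Pat Smith",
        "text": "The city council approved the budget on Tuesday."},
    2: {"headline": "Schools face budget cuts", "byline": "Lee Jones",
        "text": "County schools may lose funding next year."},
    3: {"headline": "Council delays vote", "byline": "Pat Smith",
        "text": "A vote on zoning was delayed by the council."},
}


def tokenize(value):
    """Split a field value into lowercase words."""
    return re.findall(r"[a-z0-9]+", value.lower())


def build_index(stories):
    """Map each (field, word) pair to the set of story ids containing it."""
    index = defaultdict(set)
    for story_id, fields in stories.items():
        for field, value in fields.items():
            for word in tokenize(value):
                index[(field, word)].add(story_id)
    return index


def lookup(index, field, word):
    """Return the ids of stories whose given field contains the word."""
    return set(index.get((field, word.lower()), set()))


INDEX = build_index(STORIES)

council = lookup(INDEX, "headline", "council")  # {1, 3}
budget = lookup(INDEX, "headline", "budget")    # {1, 2}
smith = lookup(INDEX, "byline", "smith")        # {1, 3}

both = council & budget            # AND -> {1}
either = council | budget          # OR  -> {1, 2, 3}
smith_not_budget = smith - budget  # NOT -> {3}
```

Because each lookup returns a set of story ids, the three Boolean operators reduce to set intersection, union and difference, and a fielded search is simply a lookup restricted to one field's entries.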
Our first task was to build several software bridges to transport data from our System Integrators System/55 to our UNIX machines. Stories start their trip into the database when they are sent to the typesetter. Copies go to a special basket where they are enhanced by the research staff, and from there are transferred to one of our UNIX machines as ASCII text files. At this stage, the stories are massaged by programs written in Perl -- another masterpiece of UNIX freeware -- which then move them into a Sybase database, where librarians have another chance to work on them. Another Perl application moves the stories back out into text files, where WAIS reads each file and creates an index that may be searched from the many WAIS clients that run on a variety of platforms.

We provide six databases of news stories to our reporters, one for each of the years 1990 to 1994 and one comprehensive database that covers 1990 to the present. The index allows searches of story text as well as these fields: date, byline, headline, section, page, subject, type, column, source, edition.

(Two asides: As most readers will know, the Raleigh paper is in the process of being acquired by McClatchy Newspapers, and many industry analysts have pointed to Raleigh's commitment to technology as part of the reason for the transaction's high price. Also, WAIS Inc., the company that provides the WAIS software, agreed to be acquired by America Online on Monday. AOL cited the superior quality of the WAIS software as the reason for the acquisition.)

Next week: connectivity and searching.

Onward.

\dmc

[THE COLE DIGEST is written by consultant David M. Cole, editor and publisher of the industry newsletter THE COLE PAPERS. The DIGEST is made available to PressLink subscribers every Wednesday at no extra charge. Send comments by e-mail to cole@plink.geis.com. The COLE DIGEST is the property of The Cole Group, a California sole proprietorship.
Reproduction in whole or in part without the written permission of The Cole Group is prohibited. Copyright (C) 1995, The Cole Group. Opinions expressed are those of The Cole Group, unless otherwise noted.]

[THE COLE PAPERS is a monthly newsletter devoted to technology, journalism and publishing. Subscriptions are $117 for 12 issues ($135 outside the U.S.). MasterCard, Visa and American Express cards are accepted. For more information, e-mail COLE, call (415) 673-2424, fax (415) 673-2449 or write The Cole Group, 2590 Greenwich St., Ste. 9, San Francisco USA 94123-3333.]
Copyright © 1990-2008, The Cole Group. All Rights Reserved. Modified date: 05/27/1995, 01:41:33 PM. URL: http://www.colegroup.com/miscellanea/TCD/Cole129.HTML