(Semi-)Intelligent Web Searching

Brian Randell    Chris Hurford*

Department of Computing Science
University of Newcastle upon Tyne

26 Oct 1996

Abstract

We describe a new means of "information discovery" for the Web, intended for searching a pre-defined set of pages that are known to hold data stored in a variety of differing syntactic formats. The basic idea behind the project is that of providing means of associating, with each particular Web page or group of pages of interest, a "semantic description" indicating how particular semantic entities are represented and laid out in such pages. This is done indirectly, by indicating how such semantic entities are to be recognized by a search engine that has been set up to provide a search facility for some particular domain. The actual setting up of such "semantic descriptions" is not automated; rather, it needs human intelligence (on the part of one or more administrators) - hence the term "semi-intelligent" web searching. A brief account is given of a prototype of such a system, and of its use to test this approach to coping with the variety of data formats used on the Web, through sample searches of a heterogeneous collection of pages taken from the UK & Ireland Genealogy Information Service, a large virtual library that is distributed geographically over a number of Web servers.

Keywords: World Wide Web, Search Engine, Information Discovery

1 Introduction

There are many search engines on the Web at the moment, the most powerful of which, such as AltaVista, provide means of locating text strings within pages, rather than just of locating pages with titles or descriptions that match given criteria. There are in addition systems such as Harvest [Bowman, 1995 #8; Bowman, 1994 #3] and its associated Glimpse [Manber, 1994 #4] sub-system, which can be used to create specialized search engines (or "information discovery and access systems", if you will). These various systems locate, and in most cases produce indexes to, potentially relevant Web pages or parts of pages. Then, in response to particular search requests, they create and display references that provide means of obtaining what has been located, having first ordered the references so as to give precedence to the most accurate matches. These engines vary in the sophistication of the within-page search requests that they support (two of the more sophisticated being AltaVista and Glimpse), but typically all of them leave their users to cope with the great variety of ways in which basically the same information might be represented (e.g. a person's name, or a date) within a Web page. For example, the name "Thomas Jones" might be represented in some Web pages as "JONES, Thomas" at the start of a line, or as "Tom Jones" after a line number, and a date as "25 Apr 1996" or as "April 25, 1996". In a particular page or set of pages emanating from the same source there may be a quite standard form and/or placement of such information, which would be readily recognized by a human reader; the trouble is that there are lots of such "standards", many of them essentially personal and ad hoc. This is particularly evident in the UK & Ireland Genealogy Information Service (GENUKI, at http://midas.ac.uk/genuki/), a large virtual library distributed geographically over a number of Web servers [Austen, 1995 #14].
GENUKI holds an ever-growing number of files, such as transcriptions of and indexes to parish registers, census returns, town directories, etc., that have been contributed by many different individuals and societies, and which as a result come in a wide variety of formats, some in HTML, some in ASCII. Though a search for entries corresponding to an individual with an unusual surname can be quite effective using a standard search engine, if the surname is a very common one then one really needs to search on more than just the surname, and such searches are bedevilled by the variety of data formats.

In principle, if all information on the Web were adequately (and consistently) encoded using HTML, then one might be able to rely on HTML tags to indicate the intended semantics of various bits of text on web pages (at least for those types of entity for which a tag had been defined in the HTML specification), and so greatly help identify relevant information. In practice, however, HTML tagging is used mainly for formatting and linking, rather than for identifying semantic content, and of course much information on the Web does not even use HTML. Thus, in contrast to (say) typical online library catalogue searches, which can make use of defined data formats distinguishing titles, authors, publishers, etc., or Web-form front ends to corporate databases, web searches are essentially just syntax-based, and at best allow users the functionality (and complexity) of a regular expression. However, what is usually really needed is an "information discovery" facility, rather than a mere syntactic data search. (There are systems which try to automate the identification of page formats, e.g. ASCII versus HTML versus PostScript, but our interest concerns the textual information within a page, and in making use of whatever semantic guidance can be obtained from any syntactic conventions used in a particular page.)

In considering how to make progress towards such a facility, it is important to try to ensure that it is capable of searching existing Web pages, and not to require that pages be converted or altered in any way. (A significant cause of the Web's rapid take-off and growth was that from the start it encompassed ftp and gopher, for example, as well as the then new HTML scheme.) Thus it is in keeping with the history of the Web to try to find some way of dealing with the many and varied data formats that one already finds on various Web pages.

This project was an initial exploration of a means of attempting to do just that, motivated by the very practical need for a better search facility for GENUKI than any of the available search engines were able to provide. However, the project was not designed to be specific to GENUKI, or to the sorts of information contained in GENUKI. The basic idea behind the project is that of providing means of associating, with each particular Web page or group of pages of interest, a "semantic description" indicating how particular semantic entities are represented and laid out in such pages. However, this is not done directly, by some sort of metadata that attempts to describe the information content of the page. (For a survey of such schemes see [Dempsey, 1996 #16].) Rather it is done indirectly, by indicating how particular semantic entities are to be recognized by a search engine that has been set up to provide a domain-specific search facility over a pre-defined search space.
Thus our semantic descriptions are better thought of as programs than as (meta)data. (We are not aware of any previous such scheme; Harvest, for example, facilitates the generation of domain-specific Web searches, but not the application of page-specific search criteria.) Such a semantic description could also provide domain-specific keywords that characterize the Web page as a whole, and hence provide means of limiting each search to an appropriate subset of the overall search space. (This aspect of our semantic description does act like a conventional metadata description.) The actual setting up of particular "semantic descriptions" is not automated. Instead, it requires human intelligence (on the part of one or more administrators) - hence the term "semi-intelligent" web searching.

2 Common Types of Syntactic Variation

Information consisting of more than one term can be represented in several different forms or syntaxes. These differences are easy for humans to grasp, but they may cause difficulties during a computerized search.

The first of these differences is the word order of the information. It is very common to reverse name information, for example, so that "Dwight Eisenhower" is transformed to "Eisenhower, Dwight". This overlaps with a second difference, the punctuation which surrounds the query terms and thus affects their positioning on the page.

Thirdly, depending on the standard used to store information, the length of terms may be changed, most commonly by abbreviation of such things as Christian names or dates. Thus "William" might be shortened to "Wm.", "September" to "Sept.", or "1996" to "'96".

Fourthly, records prepared for humans to peruse may be made more legible by manipulation of the case of the characters, either to differentiate terms or to draw attention to the most important information. It is very common in genealogical records, for example, for a surname to be capitalised. However, unless special precautions are taken, the computer will not treat upper and lower case versions of the same letter as being effectively the same character.

Fifthly, dropping of information is another common way for humans to abbreviate information. Removal of surplus information is unlikely to affect a human's success in matching the central terms being searched. A computer, however, may not show a match if the information in question arises between two terms being searched on. For example, the query "John Kennedy" will not necessarily match the string "John F. Kennedy", and the use of "Ditto", say, will defeat any simplistic computerized string-matching strategy.

Sixthly, use of synonyms is also a common human trait, and one that can become quite cryptic. It takes experience and intelligence for a human to learn that, semantically, 'Bill' is usually taken as equivalent to 'William', 'Tina' to 'Christina', and 'Betty' to 'Elizabeth'.

Finally, lamentably, misspelling of information is a very common human weakness. Again, however, a human will normally be able to recognise misspelled words or take into account differences in international spelling.

One can envisage attempting to cope with all these types of variation; the first prototype Semi-Intelligent Web Searcher that we have so far implemented is intended to help with just the first five (see below).
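To make these first five types of variation concrete, the following fragment (a minimal illustrative sketch of our own in Perl, the language of the prototype described in Section 6, and not code taken from it) shows how all five can be absorbed into a single regular expression for one query:

   use strict;
   use warnings;

   my ($given, $surname) = ("John", "Kennedy");

   # Abbreviation: accept the initial letter of the given name followed
   # by any further letters and an optional full stop ("J.", "Jn.", ...).
   my $g = substr($given, 0, 1) . "[a-z]*\\.?";

   # Dropped information: tolerate an optional middle term (e.g. "F.")
   # between the two names.
   my $middle = "(?:[A-Za-z]+\\.?\\s+)?";

   # Word order and punctuation: "Given Surname", or "Surname, Given"
   # with flexible spacing around the comma.
   my $pattern = "(?:$g\\s+$middle$surname|$surname\\s*,\\s*$g)";

   foreach my $line ("John F. Kennedy", "KENNEDY, John", "Jn. Kennedy") {
       # Case variation is handled by the /i modifier.
       print "matched: $line\n" if $line =~ /$pattern/i;
   }

All three sample lines match. Even this small example shows why users cannot reasonably be expected to construct such expressions by hand for every page format they encounter.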
However, in an as yet separate project we have developed a significantly improved version of the Soundex encoding scheme ([Knuth, #13], pp. 391-392), called Phonex [Lait, 1995 #12], for dealing - at least in the case of people's surnames - with the last type of variation.

3 Semantic Descriptions

Our so-called semantic descriptions are little special-purpose programs that might be termed "parameterized search strategies". (Typically these are just regular expressions, of the sort used in the grep utility, although some form of phonetic encoding might also be involved, as we have indicated for spelling variations, as well as means for dealing with various formatting and abbreviation ("dittoing") conventions.) The formal parameters in the search strategy correspond to fields in a search form; what a user types into the form in order to request a search will provide the corresponding actual parameters. The task of each administrator is to associate an appropriate strategy with each page in the chosen search space.

To give a simplistic example, assume that a Name search form is to have two fields, and that these have been labelled "Given Name" and "Surname". Character strings typed into such fields will be the actual parameters corresponding to the formal parameters FP1 and FP2, respectively, in the various search strategies. Thus the regular expression-based search strategies:

   FP2, FP1
and
   ^[0-9]*: FP1 FP2

could be used respectively for pages with entries typified by:

   Bonaparte, Napoleon
and (if, as here, the entry starts a new line):
   1805: Horatio Nelson

Such search strategies evidently have to be specified (by the administrator(s) who set up a particular search facility) for each page or set of pages to be searched, though some sort of default search strategy could be used where no special strategy is specified.

Each administrator also has to provide a set of "query rules" (typically just simple regular expressions) that define the permissible form of each actual parameter. In the case of the above example, both the Surname and the Given Name might be limited to the expression [-'a-zA-Z]*, for example - thus making it possible to search for names such as "Jean-Paul O'Donnell".

Combining these search strategies and query rules, one gets what is, in fact, a definition of the permitted searches associated with a given page or set of pages. For the two types of page discussed in the above example one gets:

   {2[-'a-zA-Z]*}, {1[-'a-zA-Z]*}
and
   ^[0-9]*: {1[-'a-zA-Z]*} {2[-'a-zA-Z]*}

(where "{N" and "}" are used to delimit instances of the query rule corresponding to the N-th parameter). Such a search expression is used to search a page for items corresponding to the parameters that (1) individually fit the query rules, and (2) collectively, by virtue of the sequence in which they appear and the surrounding punctuation, match the search expression.

Between them, the set of query rules for the parameters and the search strategies for the pages can be claimed to act as a semantic definition of the chosen set of Web pages, albeit one that is specialized to the needs of a particular search service. (Evidently, several different such search services might be provided covering the same set of pages, via the provision of differing sets of search strategies.)
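The following short Perl sketch (again illustrative only; the data layout and names are ours) shows how the two strategies and the query rule of this example combine, an actual search then being a matter of applying the instantiated expression to each page:

   use strict;
   use warnings;

   my $rule = "[-'a-zA-Z]*";   # the query rule for both form fields

   # Page-specific strategies, with FP1 (Given Name) and FP2 (Surname)
   # as formal parameters, and sample text in the corresponding format.
   my @pages = (
       { strategy => 'FP2, FP1',         text => 'Bonaparte, Napoleon' },
       { strategy => '^[0-9]*: FP1 FP2', text => '1805: Horatio Nelson' },
   );

   foreach my $query (['Napoleon', 'Bonaparte'], ['Horatio', 'Nelson']) {
       my ($fp1, $fp2) = @$query;
       # Reject actual parameters that do not fit the query rule.
       next unless $fp1 =~ /^$rule$/ && $fp2 =~ /^$rule$/;
       foreach my $page (@pages) {
           # Substitute actual for formal parameters to obtain the
           # concrete search expression for this page format.
           (my $re = $page->{strategy}) =~ s/FP1/$fp1/;
           $re =~ s/FP2/$fp2/;
           print "$fp1 $fp2 found via '$page->{strategy}'\n"
               if $page->{text} =~ /$re/;
       }
   }

Each query is found only via the strategy associated with the page whose format it matches, which is precisely the behaviour required.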
In fact it is useful to think of the search strategies, the query rules, and indeed the semantic definition formed from their combination as programs, or better still as "methods" associated with the various Web page "objects", to be invoked as needed in order to perform indexing, search request checking, and actual searching.

One final point regarding semantic descriptions: these provide a very convenient place in which an administrator can give one or more keywords characterizing the contents of a given Web page as a whole. Such keywords can be chosen from a domain-specific set that has been specially created by the administrator, and thus provide a very effective means by which a user could limit a search to a particular subset of the search space, quite immune to the vagaries of the vocabulary used in particular Web pages and their titles. (For example, in GENUKI, these could provide a very effective means of indicating that the names in given pages all related to a particular geographical area.)

4 The Search Space

Associated with a given search form is an indication of the pages to be searched. This will have been set up previously by the administrator(s), and could either be fixed, e.g. to the whole GENUKI archive, or variable (the search form providing some means, such as a pop-up menu, of limiting the set of pages to be searched, e.g. to particular countries or even counties within GENUKI). The basic idea is to have a predefined overall search space for any given search facility.

Clearly the identification of this search space, if it is large and scattered across many servers, could be a tedious and ongoing task, especially if the servers are not under the control of the administrator(s). However one can imagine the provision of means for helping the administrator(s) to identify additional pages and sets of pages worthy of being added to the search space, and of quickly testing what sort of search strategy should be associated with such pages. Means for helping to keep the list of search strategies up to date, as page locations and formats are changed, would also be very useful - this could be done using facilities similar to those now becoming available for managing large Web servers.

What is not envisaged is using the whole Web as a search space (not so much because of its size, but because of the need with our approach to associate predefined search strategies with specific pages). Thus our system is aimed at helping users "discover" information in pages that have themselves previously been "discovered" by the system administrator(s). However, one simple facility would be to provide the user with an option to pass the most recently processed search request, in the form of a suitably formatted version of the default search strategy, on to a general Web page search engine such as AltaVista, so that at least a crude (i.e. conventional) search of the whole Web could be made on his/her behalf.

5 Searching and Indexing

The search strategies and query rules can be used directly to control the performance of a search - whether such a search is performed locally using a Web crawler (e.g. in the manner of Fish Search [De Bra, 1994 #10]), or by a powerful centrally-located search engine (e.g. AltaVista, running on a bank of Alpha computers) to which the specified set of pages are brought in succession and where they are indexed and perhaps cached, or instead is performed cooperatively by a set of search engines, each dealing only with local pages.
However the semantic definitions formed from them can also be used to control the generation of a specialized and potentially highly efficient inverted index (centralized or distributed) of the pages to be searched. (A full-text inverted index is normally very large, being complete apart from the exclusion of common words such as "the", "a", etc.) Such a specialized condensed index supporting a given semi-intelligent search form need retain only those terms that correspond to legal actual parameters that are associated together in the specified syntax for the particular page containing these items, i.e. it need create and maintain references only to information that could form a valid answer to a viable search request. (The task of creating an accurate semantic definition could be aided by providing administrators with means of viewing each page via its semantic definition, so to speak, e.g. showing the text of the page with all the words that have been selected for the index given in boldface characters.)

This index could be constructed either centrally, by means of some form of Web crawling, or at each separate site containing pages that are subject to being searched (in the manner of the WAIS system [Kahle, 1991 #5]). In this latter case a master index could also be employed (e.g. as with ALIWEB [Koster, 1994 #11]). However, given that the index would be limited to just those items which could be legal matches, it could be of relatively modest size - especially compared to the sort of inverted index created by WAIS, which typically is as big as the body of text from which it is constructed.

These various approaches differ in the respective loads they place on the network and the servers involved, as well as in the potential speed of response to search requests. It is likely that differing approaches will be most effective for different search facilities operating over different search spaces. Thus it would be very convenient if the creators and administrator(s) of a particular search facility could exercise direct control over the strategies employed in their particular case. However it would be important to provide means of limiting the consequences of poor choices of strategy, and of attempted excessive usage, given the impact that these might have on other users of the servers and network links involved. (In Section 8 below we discuss the possible use of Harvest as a starting point for investigating these issues, so as to create a general, and potentially very powerful, Semi-Intelligent Web Searching System.)

6 The Prototype Semi-Intelligent Web Searcher

The prototype system (developed in Perl [Wall, 1990 #6] by Chris Hurford, for his three-month summer MSc dissertation project) provides a form-filling interface by means of which an administrator can set up regular expression-based search strategies and associate them with individual URLs in order to provide a single search facility. (It does not support the idea of query rules.) The administrator does not have to make direct use of any regular expressions. Rather, he/she simply makes menu choices indicating the order of search terms, and, for each search term, whether it can be abbreviated, what punctuation characters precede or follow it, whether it is case sensitive, etc. The search space is given simply by a list of URLs. (Only rudimentary facilities are presently provided to administrators for building up this list.)
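As a rough indication of what lies behind such menus (this is only our reconstruction, with invented option names, rather than the prototype's actual code), the administrator's choices for a page might be compiled into a search strategy along the following lines:

   use strict;
   use warnings;

   # One menu-derived description per search term, for a page whose
   # entries look like "JONES, Thomas".
   my @choices = (
       { field => 'Surname',    order => 1, case_sensitive => 0,
         abbreviable => 0, preceding => '^', following => ',\s*' },
       { field => 'Given Name', order => 2, case_sensitive => 0,
         abbreviable => 1, preceding => '',  following => '' },
   );

   sub compile_strategy {
       my %actual = @_;    # field name => what the user typed
       my $re = '';
       foreach my $c (sort { $a->{order} <=> $b->{order} } @choices) {
           my $raw  = $actual{ $c->{field} };
           # An abbreviable term may be cut short after its first
           # letter, with an optional trailing full stop.
           my $term = $c->{abbreviable}
               ? quotemeta(substr($raw, 0, 1)) . '[a-z]*\.?'
               : quotemeta($raw);
           $term = "(?i:$term)" unless $c->{case_sensitive};
           $re .= $c->{preceding} . $term . $c->{following};
       }
       return $re;
   }

   my $re = compile_strategy('Given Name' => 'Thomas', 'Surname' => 'Jones');
   print "match\n" if "JONES, Tho." =~ /$re/;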
Based on the administrator's input, the system generates a simple user form containing the required number of appropriately named fields for the specified search parameters. It then allows searches to be made thereafter of the given search space. It returns any matching items that it finds, embedded in a small amount of context, with a link to the page at which they were found, in much the manner of systems such as AltaVista. (No attempt is made to order this list according to closeness of match with the search criteria.)

In principle the prototype system is capable of searching a large set of pages held at a variety of geographically distributed sites. Currently, however, since it does its work simply by fetching one page at a time in its entirety before searching it, rather than delegating the search to the system holding the page, or making use of any pre-computed index, it is in fact only really suitable for searching a small local search space. However, the tests that have been possible with this prototype have proved adequate to give an idea of the potential effectiveness of the general technique, and of what would be needed in a more general system.

7 Test Results

The system has been tested and demonstrated using a highly disparate collection of pages from GENUKI, for each of which an appropriate search strategy was identified. These files, some of which used HTML, the others being raw ASCII, had been produced by a variety of different people over the years, in various formats. The results of test searches on this set of files demonstrated that the prototype system indeed provided a very simple, reasonably effective, but slow, means of fulfilling search requests submitted using a First Name/Surname search form. Figure 1 below shows how the system found entries (of which only one is shown) listing James Hall as the husband in various 19th century Northumberland and Durham marriage indexes, and also an entry amongst a list of Gloucester Jail prisoners during 1850/1851, although in this list his name was in fact given as HALL/James.

   USER RESULTS - SEARCH DONE!
   Here are the results of your search for James Hall

   http://www.cs.ncl.ac.uk/genuki/Transcriptions/NBL/LBN1800.html :
   1) 1836.11.14 Lancelot Liddle = Ann Todd
      1836.12.19 George Robinson = Jane Sharp
      1837.01.03 Matthew Robinson = Eleanor Liddle
      1837.01.29 James Hall = Mary Swindle
      1837.02.08 William Mills = Jane Turnbull
      1837.02.18 John Forster = Thomasing Colling
      . . . . .

   http://www.cs.ncl.ac.uk/genuki/GLS/Jail1850.txt :
   1) TEALE/Alfred/18/Trespass in search of game/Dec. 24, 1849/1 Cal month hard labour or pay 2l. each/R Waller, clerk, F E Witts, clerk/
      HALL/James/16/Trespass in search of game/Dec. 26, 1849/6 weeks hard labour or pay 1l. each/W Croome, esq, W H Hinton, esq/
      . . . . .

   SEARCH DATA COLLECTED, YOU HAVE 10 MATCHES
   Back to the user form

Figure 1: An Example Search Request Result

The most controlled test of this service involved a set of pages taken from just the GENUKI server at Manchester, since this server has a Harvest-based search facility (albeit one that searches far more than just the GENUKI archive) with which our prototype system could be compared. A set of queries was submitted to the semi-intelligent searcher to match known answers represented in pages at Manchester that used a variety of syntactic formats. The same queries were submitted, in the same form, to the Harvest system and its responses compared.
Even though the Harvest system is one of the most sophisticated and efficient Web indexers available (including support for spelling mistakes, case-insensitive matches, boolean searches and regular expressions), our tests proved very favourable to the semi-intelligent searcher. As expected, the Harvest system returned more results overall, since it was indexing a much greater range of pages (including many which had nothing to do with GENUKI at all), not just the fifteen sample pages. However, the semi-intelligent search engine returned a large number of useful matches that the Harvest system missed in the fifteen sample pages. In fact, out of the fifteen known queries to be searched for, the Harvest system only correctly identified one match (see Figure 2 below). This match (on "Bert Bailey") was in the same word order as the original query, with no differences in punctuation, but with the case of one of the search terms changed, a situation with which the Harvest system can readily cope. The other matches were not made because they were located in pages that used syntactic formats radically different from that of the original stipulated query. (One could of course have obtained many more matches from Harvest at the expense of submitting a much more complicated search request making use of its Boolean expression facilities. However this would be tedious and error prone, and likely to produce a lot more false returns, since the more complicated search request would be applied regardless to the various differently formatted files.)

Figure 2: Comparison with Harvest System (omitted from ASCII version)

Clearly such tests are merely indicative. More extensive ones, especially ones that would facilitate comparisons of the precision and recall achieved by our semi-intelligent search system relative to that achieved by Harvest, would require a better-engineered prototype and, ideally, some means of constraining both systems to the same search space.

8 Possible Use of Harvest

A fully engineered large-scale Semi-Intelligent Web Search System would almost certainly be best designed as a set of cooperating indexing and searching systems. Each of these systems would deal only with locally-held pages. Collectively, communicating perhaps via agents, they would provide each user with the impression that he/she was interacting with a single centralised system - an approach which would require at least the passive, if not the active, cooperation of the owners of the various sites.

A possible basis for creating such a system would be the Harvest system, which "provides an integrated set of customizable tools for gathering information from diverse repositories, building topic-specific content indexes, flexibly searching the indexes, widely replicating them, and caching objects as they are retrieved across the Internet" [Bowman, 1995 #8; Bowman, 1994 #3]. One important type of component in Harvest is a "Gatherer". A Gatherer collects information from specified sets of pages from one or more Web servers, though ideally from just one server with which it is co-located. This information consists of a set of object summaries (created using the Essence subsystem [Hardy, 1994 #15]). It delivers these summaries in compressed form to one or more "Brokers", each of which uses them to provide a specially-tailored indexing and searching service.
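By way of illustration of the kind of Essence replacement suggested below, the following speculative Perl sketch (the record layout is loosely modelled on the attribute-value summaries that Harvest Gatherers deliver, but the attribute names, and indeed the whole fragment, are invented for this paper) applies a page-specific semantic definition and delivers every match in one standard format:

   use strict;
   use warnings;

   # Page-specific semantic definitions: URL => recognition pattern.
   # The Jail1850 page lists prisoners as "SURNAME/Given/...".
   my %definition = (
       'http://www.cs.ncl.ac.uk/genuki/GLS/Jail1850.txt'
           => '^([-\'a-zA-Z]+)/([-\'a-zA-Z]+)/',
   );

   sub summarize {
       my ($url, $text) = @_;
       my $re = $definition{$url} or return;
       foreach my $line (split /\n/, $text) {
           next unless $line =~ /$re/;
           # Whatever the syntactic conventions of the source page,
           # every object summary uses the same attribute names.
           print "\@ENTRY { $url\n",
                 "surname:    $1\n",
                 "given-name: $2\n",
                 "}\n";
       }
   }

   summarize('http://www.cs.ncl.ac.uk/genuki/GLS/Jail1850.txt',
             "TEALE/Alfred/18/...\nHALL/James/16/...\n");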
Each Broker employs one of various indexing and searching sub-systems, such as Glimpse [Manber, 1994 #4]; other subsystems, allied to either a Broker or a Gatherer, provide caching and replication services - the whole scheme being aimed at achieving fast searching whilst avoiding excessive network or server load.

It would appear that the incorporation of Semi-Intelligent Searching into a Harvest system would mainly involve the provision of a replacement for, or the extensive modification of, Essence, so that the object summaries that it delivers are just those that fit the various page-specific semantic definitions, and are all in a standard format despite the syntactic vagaries of the pages from which they were extracted. It would appear that the Glimpse subsystem, and its query interface provisions, would require relatively little modification, other than the provision of means of requesting Soundex or Phonex searches, and of parameter extension and contraction. (Parameter extension allows, say, an initial in a query form to match a full first name given in a Web page; contraction allows the opposite.) Whether this use of Harvest is feasible depends on how stable and well documented the various relevant interfaces and data formats are, and how readily one would be able to retain all the rest of the Harvest facilities. However, at least in principle, this would appear to be an attractive means of implementing a full-scale Semi-Intelligent Web Search facility that was capable of providing very fast and efficient searches of considerable bodies of information.

9 Concluding Remarks

Evidently, the scheme we have introduced here of indirectly identifying the semantic content of Web pages, by associating a search strategy with each page, is best suited to the searching of pages whose syntax and format provide useful hints as to the semantics of the data that they contain. The Semi-Intelligent Web Search system could of course be used for searching free-format text pages for particular words and phrases, but unless these words and phrases have a variety of equivalent representations (something that is in fact particularly the case with people's names and dates), and a given page is consistent as to the particular representations used, the result is unlikely to have significant advantages over the use of one of the better general Web page search systems, such as AltaVista. Where such a Semi-Intelligent Web Search system should have most advantage is in the provision of coherent searches over a variety of heterogeneously formatted files of specialised information. The need for such a system arose with the GENUKI virtual genealogical library, but our speculation is that there are many other similar needs, for example in providing integrated searches of various types of online catalogue, of various sources of financial information, etc.

Although one of the most important next steps in producing a full-scale Semi-Intelligent Web Search system would be an implementation that provided fast searches without incurring excessive network and server loads, much could also be done to provide an improved administrator's interface. The aim would be to reduce the task of specifying search strategies, e.g. by making it easy to reuse existing strategies, to identify sets of pages as all being of a particular format, to test the effectiveness of new or modified search strategies, and to aid maintenance of the set of search strategies as the search space changes.
How far one could go towards increasing the "intelligence" of the system by automating the provision of the search strategies is not clear at this stage. One can, however, draw some analogies to recent work on automating the identification of synonyms (part of the "vocabulary problem" [Li, 1995 #2]) for the purposes of automated document retrieval. However, before attempting a fully engineered large-scale system, it might be useful to produce what might be regarded as a second, albeit much more complete, prototype, adequate to be made available for use with GENUKI, and permitting a wider range of accurate search strategies than is possible using just ordinary regular expressions. (Given the variety of data formats, the number of different servers (and Webmasters!) involved, the overall size of the archive, and the likely level of search traffic, this would be quite a severe enough test of the general idea, at least as applied to name searching.)

10 References

1. M. Austen, V. Dunstan, B. Randell, A. Stanier, P. Stringer and J. Woodgate, "An Information Service for United Kingdom & Ireland Genealogy Based on the Internet's World Wide Web," Computers in Genealogy, vol. 5, no. 7, pp. 294-307, 1995. Available at http://www.cs.ncl.ac.uk/genuki/CiGpaper/

2. C.M. Bowman, P.B. Danzig, D.R. Hardy, U. Manber and M.F. Schwartz. Harvest: A Scalable, Customizable Discovery and Access System, Department of Computer Science, U. of Colorado, Boulder, July 1994. Available at http://harvest.cs.colorado.edu/ and ftp://ftp.cs.colorado.edu/pub/techreports/schwartz/Harvest.Jour.ps.Z

3. C.M. Bowman, P.B. Danzig, D.R. Hardy, U. Manber and M.F. Schwartz. "The Harvest Information Discovery and Access System," in Proceedings of the Second International World Wide Web Conference, pp. 763-771, Chicago, Illinois, Oct. 1994. Available from ftp://ftp.cs.colorado.edu/pub/cs/techreports/schwartz/Harvest.Conf.ps.Z

4. P.M.E. De Bra and R. Post. "Information Retrieval in the WWW: Making Client-Based Searching Feasible," in Proceedings of the First International World Wide Web Conference, Geneva, May 1994. Available from http://www.win.tue.nl/win/cs/is-/reinpost/www94/www94.html

5. B. Kahle and A. Medlar, "An Information System for Corporate Users: Wide Area Information Servers," ConneXions - The Interoperability Report (Interop, Inc.), vol. 5, no. 11, pp. 2-9, Nov. 1991. Available from ftp://think.com/wais/wais-corporate-paper.text

6. D.E. Knuth. The Art of Computer Programming, Vol. 3: Sorting and Searching, Addison-Wesley, 1973.

7. M. Koster. "ALIWEB - Archie-Like Indexing in the Web," in Proceedings of the First International World-Wide Web Conference, pp. 91-100, Geneva, May 1994. Available at http://web.nexor.co.uk/aliweb/doc/aliweb.html (see also introduction.html)

8. A.J. Lait and B. Randell. An Assessment of Name Matching Algorithms, U. of Newcastle upon Tyne, Sept. 1995. Available at http://www.cs.ncl.ac.uk/~brian.randell/home.informal/Genealogy/NameMatching.ps

9. S.H. Li and P. Danzig. "Vocabulary Problem in Internet Resource Discovery," in Proceedings of the Second International Workshop on Next Generation Information Technologies and Systems, pp. 139-145, Naharia, Israel, June 1995. Available from ftp://catarina.usc.edu/shli/ngits.ps.gz

10. U. Manber and S. Wu. "GLIMPSE: A Tool to Search Through Entire File Systems," in Proceedings of the USENIX Winter Conference, pp. 23-32, San Francisco, California, Jan. 1994.

11. O.A. McBryan.
"GENVL and WWWW: Tools for Taming the Web," in Proceedings of the First International World Wide Web Conference, pp.79-90, Geneva, May 1994. Available from http:www.cs.colorado.edu/home/mcbrayan/mypapers/www94.ps * Present Affiliation: IBM Bedfont Lakes, London