Phind is a phrase browsing interface that lets readers browse comfortably through the phrases that occur in large document collections. Although purely lexically based, it creates and presents a plausible, easily understood, hierarchical, topic-oriented structure for the documents in the collection: a structure that conventional keyword queries could never reveal.
The Phind phrase browsing interface has been integrated into the Greenstone digital library software, and can be added to any Greenstone collection with a single line in its configuration file. The Phind browser shown below is based on the FAO on the Internet (1998) collection. The user has searched for "drought", and several related phrases have been returned. The first results (in green) are related phrases from the AGROVOC thesaurus; the other phrases (in black) come from the phrase hierarchy extracted by Phind. The user has selected "drought conditions" from the top pane, and the lower pane shows the first ten expansions of this phrase and the first two documents that contain it.
Manually constructed subject thesauri perform a classification function similar to Phind's, though on a sounder basis: they identify semantically meaningful topics and describe the relationships between them. When thesaurus phrases are added to the automatically extracted hierarchy, the combined structure benefits from both the editorial control behind the thesaurus and the empirical validation provided by the document collection.
Digital libraries frequently provide support for browsing document metadata such as titles or authors. Metadata provides a compact surrogate for the documents themselves. As collections grow, however, even the metadata rapidly becomes too large to scan efficiently. Effective browsing requires mechanisms that prevent the user from drowning in information. The standard solution is to provide a hierarchical, topic-oriented arrangement, such as a standard library classification scheme, that permits users to navigate from broad groups of items down to more manageable subsets. But manual classification is expensive and often infeasible to produce.
One solution to this problem is to automatically create a subject browser that resembles a thesaurus but lacks its editorial supervision. Phind allows the user to examine the terminology actually present in the collection, viewing individual terms in the context of the phrases in which they appear, and viewing phrases in their larger context. Context is particularly effective in distinguishing between relevant and irrelevant usage of a given term; for example, the word "bank" sometimes refers to a financial institution ("World Bank") and sometimes to a geographic entity ("river bank").
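The idea of viewing a term in the context of the phrases that contain it can be illustrated with a small Python sketch. This is not Phind's actual extraction algorithm, which is not described here; it is a naive n-gram counter, and the function names (`tokenize`, `count_phrases`, `expansions`), the sample documents, and the `min_count` threshold are all assumptions made for illustration.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase a text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_phrases(documents, max_len=3):
    """Count every word n-gram of length 1..max_len across all documents."""
    counts = Counter()
    for text in documents:
        words = tokenize(text)
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts


def expansions(phrase, counts, min_count=2):
    """Return repeated phrases one word longer than `phrase` that contain it."""
    target = tuple(tokenize(phrase))
    n = len(target)
    found = []
    for candidate, freq in counts.items():
        if len(candidate) != n + 1 or freq < min_count:
            continue
        # The shorter phrase must appear at the start or end of the candidate.
        if candidate[:n] == target or candidate[1:] == target:
            found.append((" ".join(candidate), freq))
    # Most frequent first; ties broken alphabetically for stable output.
    return sorted(found, key=lambda item: (-item[1], item[0]))


# Hypothetical two-document collection, invented for this example.
documents = [
    "The World Bank funded the project. The World Bank approved a loan.",
    "Erosion of the river bank continued. The river bank was reinforced.",
]
counts = count_phrases(documents)
for phrase, freq in expansions("bank", counts):
    print(f"{phrase} ({freq})")
```

On this toy collection the expansions of "bank" are "river bank" and "world bank", which separates the two senses of the word. Calling `expansions` again on one of those results descends one more level, which is the kind of step-by-step narrowing a phrase hierarchy supports.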