Chapter 32

How Intranet Search Tools and Spiders Work



Corporate intranets can contain an almost unimaginable amount of information. Departments, divisions, and individuals create a wide variety of Web pages, both for internal and external consumption. Human resource information, personnel handbooks, procedures manuals, and newsletters are all posted internally. Databases are available as well, both those hosted directly on the intranet and "legacy" databases on non-TCP/IP systems. Add to that all the information that can be found on the Internet using the World Wide Web, and you have a serious case of information overload.

There are several ways to help intranet users find the information they need. One way is to create subject directories of intranet data, which present a highly structured way to find information. They let you browse through information by categories and subcategories, such as marketing, personnel, sales, research and development, budget, competitors, and so on. In a Web browser, you click on a category and are presented with a series of subcategories, such as East Coast Sales, South Sales, Midwest Sales, and West Sales. Depending on the size of the subject directory, there may be several such layers of subcategories. When you reach the subcategory you're interested in, you're presented with a list of relevant documents, which you retrieve by clicking on links to them. On the Internet, Yahoo is the best-known, largest, and most popular subject directory.

Another popular way of finding information, and in the long run probably a more useful one for intranets, is to use search engines, also called search tools. Search engines operate differently from subject directories. They are essentially massive databases that index all the information found on the intranet, and they can include information found on the Internet as well. Search engines don't present information in a hierarchical fashion. Instead, you search through them as you would a database, by typing keywords that describe the information you want.

Intranet search engines are usually built out of three components: an agent, spider, or crawler that travels across the intranet gathering information; a database, which contains all the information the spiders gather; and a search tool, which people use as an interface for searching the database. The technology is similar to that of Internet search engines such as AltaVista.
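
To make the three-part architecture concrete, here is a minimal sketch in Python. It is a toy, not any vendor's product: the pages are canned strings, the "database" is an in-memory dictionary, and the names (crawl, build_index, search) are invented for illustration.

    # Component 1: the crawler gathers documents. Here it just reads from a
    # canned dict; a real spider would issue HTTP requests and follow links.
    PAGES = {
        "http://intranet/hr/handbook": "vacation policy and personnel handbook",
        "http://intranet/sales/east": "east coast sales figures and budget",
    }

    def crawl():
        for url, text in PAGES.items():
            yield url, text

    # Component 2: the database, here an inverted index mapping each word to
    # the URLs of the documents that contain it.
    def build_index(documents):
        index = {}
        for url, text in documents:
            for word in set(text.lower().split()):
                index.setdefault(word, set()).add(url)
        return index

    # Component 3: the search tool, the interface people use to query the index.
    def search(index, query):
        words = query.lower().split()
        hits = [index.get(w, set()) for w in words]
        return sorted(set.intersection(*hits)) if hits else []

    index = build_index(crawl())
    print(search(index, "sales budget"))  # -> ['http://intranet/sales/east']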

Intranet search tools differ somewhat from their Internet equivalents. The database of information they search can be built not just by agents and spiders crawling Web pages. Agents can be written to go into existing corporate databases, extract data from them, and add it to the database of searchable information. People on an intranet can also fill out forms and submit their own information to the database. Additionally, because these tools are built for a specific corporation and its data, both the information they gather and the way they are searched can be customized.
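
The point about legacy databases can be illustrated with a hedged sketch of such an extraction agent. Here sqlite3 stands in for the legacy system, and the table, columns, and db:// naming scheme are all invented; the idea is simply to flatten each record into a text "document" that the indexer can treat like any crawled Web page.

    import sqlite3

    # Create an in-memory stand-in for a legacy corporate database.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE personnel (id INTEGER, name TEXT, department TEXT)")
    conn.execute("INSERT INTO personnel VALUES (1, 'A. Smith', 'Marketing')")

    # The agent flattens each row into a plain-text "document" that the
    # search engine's indexer can handle exactly like a Web page.
    def extract_records(connection):
        for record_id, name, dept in connection.execute(
                "SELECT id, name, department FROM personnel"):
            yield f"db://personnel/{record_id}", f"{name} {dept}"

    for doc_id, text in extract_records(conn):
        print(doc_id, "->", text)  # db://personnel/1 -> A. Smith Marketing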

How Intranet Search Tools Work

Searching and cataloging tools, sometimes called search engines, can be used to help people find the information they need. Intranet search tools, such as agents, spiders, crawlers, and robots, are used to gather information about the documents available on an intranet. These search tools are programs that search Web pages, extract the hypertext links on those pages, and automatically index the information they find to build a database. Each search engine has its own set of rules guiding how documents are gathered. Some follow every link on every page they find, and then in turn examine every link on each of those new pages, and so on. Some ignore links that lead to graphics files, sound files, and animation files; some ignore links to certain resources such as WAIS databases; and some are instructed to look primarily for the most popular home pages.
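
The link-gathering step can be sketched with Python's standard HTML parser. The page below is a canned string so the example runs offline, and the filtering rule (skip graphics and sound files) is just one of the policies described above; a real spider would fetch each URL over HTTP, queue the surviving links, and repeat.

    from html.parser import HTMLParser

    # Extensions to skip, per the rules above: graphics, sound, animation.
    SKIP_EXTENSIONS = (".gif", ".jpg", ".wav", ".avi")

    class LinkExtractor(HTMLParser):
        """Collect the href of every hypertext link on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href":
                        self.links.append(value)

    page = """<html><body>
    <a href="http://intranet/hr/handbook.html">Handbook</a>
    <a href="http://intranet/logo.gif">Logo</a>
    <a href="http://intranet/sales/east.html">East Coast Sales</a>
    </body></html>"""

    parser = LinkExtractor()
    parser.feed(page)
    to_crawl = [u for u in parser.links if not u.lower().endswith(SKIP_EXTENSIONS)]
    print(to_crawl)  # the .gif link is ignored, per the rules above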

  1. Agents are the "smartest" of the tools. They can do more than just search out records: They can perform transactions on your behalf, such as, eventually, finding and ordering the lowest-fare airline ticket for your vacation. Right now they can search sites for particular recordings and return a list of five sites, sorted by lowest price first. Agents can take the context of the content into account, and they can find and index other kinds of intranet resources, not just Web pages. They can also be programmed to extract records from legacy databases. Whatever information the agents index, they send back to the search engine's database.
  2. General searchers are commonly known as spiders. Spiders report the content they find: they index the information and extract summary information from it. They look at headers and at some of the links, and they send an index of the information to the search engine's database. There is some overlap between the tools; spiders can be robots, for example.
  3. Crawlers look at headers and report only first-layer links. Crawlers can be spiders.
  4. Robots can be programmed to go to various link depths, compile the index, and even test the links. Because of their nature, they can get stuck in loops, and they consume considerable Web resources as they work through a site. There are methods available to prevent robots from searching your site; a sketch of the usual exclusion file appears after this list.
  5. Agents extract and index different kinds of information. Some, for example, index every single word in each document, while others index only the 100 most important words in each; some index the size of the document and the number of words in it; some index the title, headings, subheadings, and so on. The kind of index built determines what kind of searching can be done with the search engine and how the information will be displayed. (A sketch of a simple word index also appears after this list.)
  6. Agents can also go out to the Internet and find information there to put in the search engine's database. Intranet administrators can decide which sites or kinds of sites the agents should visit and index, for example, competitors to the corporation or news sources. The information is indexed and sent to the search engine's database in the same way as information found on the intranet.
  7. Individuals can also add information to the index by filling out a form describing the data they want included. That data is then placed in the database.
  8. When someone wants to find information available on the intranet, they visit a Web page and fill out a form detailing the information they're looking for. Keywords, dates, and other criteria can all be used. The criteria in the search form must match the criteria the agents used when indexing the information they found while crawling the intranet.
  9. The database is searched, based on the criteria specified in the form, and a list of matching documents is prepared. The database then applies a ranking algorithm to determine the order in which the documents will be displayed; ideally, the documents most relevant to the user's query are placed highest on the list. Different search engines use different ranking algorithms. The database tags the ranked list of documents with HTML and returns it to the person requesting it. Search engines also choose different ways of displaying the ranked list: some provide only URLs; some show the URL along with the first several sentences of the document; and some show the title of the document as well as the URL. (A sketch of a simple search-and-ranking routine appears after this list.)
  10. When you click on a link to one of the documents you're interested in, that document is retrieved from the server where it resides. The document itself is not stored in the database or on the search engine site.
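
Step 4 mentions that there are ways to keep robots out. The usual convention is a robots.txt file at the top of a server, which well-behaved robots read before crawling; Python's standard library includes a parser for it. The rules and URLs below are invented for illustration.

    from urllib import robotparser

    # A hypothetical exclusion file: all robots are asked to stay out of
    # the /private/ and /cgi-bin/ areas of this server.
    ROBOTS_TXT = """\
    User-agent: *
    Disallow: /private/
    Disallow: /cgi-bin/
    """

    rp = robotparser.RobotFileParser()
    rp.parse(ROBOTS_TXT.splitlines())

    print(rp.can_fetch("*", "http://intranet/private/salaries.html"))  # False
    print(rp.can_fetch("*", "http://intranet/hr/handbook.html"))       # True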
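
Step 5's point, that the indexing policy determines what can later be searched, shows up clearly in a small hypothetical indexer. This one records the title, the length in words, and a count of every word; the comment notes where a "100 most important words" policy would differ. The document and field names are invented.

    from collections import Counter

    def index_document(url, title, body):
        words = body.lower().split()
        return {
            "url": url,
            "title": title,
            "word_count": len(words),
            # A full-text policy keeps every term; a leaner policy might keep
            # only Counter(words).most_common(100) instead.
            "terms": Counter(words),
        }

    entry = index_document(
        "http://intranet/hr/handbook.html",
        "Personnel Handbook",
        "Vacation policy: vacation days accrue monthly",
    )
    print(entry["word_count"], entry["terms"]["vacation"])  # 6 2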
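
Finally, steps 8 and 9 can be compressed into a few lines: match the keywords from the form against the database, rank the matches, and tag the result list with HTML. Plain term frequency stands in for a real ranking algorithm here, and the documents are invented for illustration.

    # A toy document store: URL mapped to the indexed text of the page.
    DOCS = {
        "http://intranet/sales/east.html":
            "east coast sales sales figures budget",
        "http://intranet/hr/handbook.html":
            "personnel handbook vacation policy",
    }

    def rank(query):
        """Score each document by how often the query words occur in it."""
        words = query.lower().split()
        scores = {}
        for url, text in DOCS.items():
            terms = text.split()
            score = sum(terms.count(w) for w in words)
            if score:
                scores[url] = score
        # Highest-scoring (most relevant, we hope) documents first.
        return sorted(scores, key=scores.get, reverse=True)

    def as_html(urls):
        """Tag the ranked list with HTML, as the database does in step 9."""
        items = "".join(f"<li><a href='{u}'>{u}</a></li>" for u in urls)
        return f"<ol>{items}</ol>"

    print(as_html(rank("sales budget")))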