Internet Search Engines and Spiders, Exclusion Technology

Posted by arlene

When the owner of a Web site wishes to advertise that site with a search engine, they access a registration page at the search engine's site. This page usually contains a form that asks for details of the site, such as its home page, the name and e-mail address of the person submitting the site, and a brief description of the site's purpose. Some time later the search engine uses a spider to look at the site and index it. This is a multi-step process:

  • Some robots will then extract the links in the home page and identify which of them belong to the site being accessed; they will then visit these pages and process the HTML which was used to build them.
  • The keywords which have been extracted from the visited pages are then stored in the databases used by the search engine and associated with the URL of the site.
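As a rough illustration of the steps above, here is a minimal Python sketch of what a spider might do with a single page: pull out the links, keep the ones that belong to the same site, and record the page's words against its URL. The names `PageParser`, `process_page` and `index` are made up for this example, the "keywords" are simply every lowercase word on the page, and the HTML is passed in as a string rather than fetched over the network. A real search engine spider is far more selective about what it treats as a keyword.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageParser(HTMLParser):
    """Collects link targets and text from one HTML page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Record the href of every anchor tag.
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        # Simplification: all text is kept, including any inside <script> or <style>.
        self.text_parts.append(data)


def process_page(page_url, html, index):
    """Index one page and return the links that stay on the same site."""
    parser = PageParser()
    parser.feed(html)

    # Keep only links whose host matches the host of the page being visited.
    site = urlparse(page_url).netloc
    internal = []
    for href in parser.links:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == site:
            internal.append(absolute)

    # Associate each word found on the page with the page's URL.
    words = re.findall(r"[a-z]+", " ".join(parser.text_parts).lower())
    for word in set(words):
        index.setdefault(word, set()).add(page_url)

    return internal
```

The returned list is what the spider would visit next; calling `process_page` on each of those pages in turn fills the same `index` dictionary, which stands in here for the search engine's database.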

Once the spider has done its work, the site can be found by queries issued by users of the search engine.

Other uses for spiders

Spiders are not just used for search engine indexing. There are also spiders which gather Web statistics: for example, they can determine which sites are popular by examining links to them, or count the number of Web pages in selected sites in order to calculate growth rates of the World Wide Web. Other spiders can be used to check Web pages for links which reference resources that no longer exist.
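To make the popularity idea concrete, the sketch below counts how many different sites link to each site. It assumes a spider has already collected, for each page it visited, the list of URLs that page links to; the function name `inbound_link_counts` and the shape of the `pages` dictionary are inventions for this example, not part of any particular tool.

```python
from collections import Counter
from urllib.parse import urlparse


def inbound_link_counts(pages):
    """Count, for each host, how many crawled pages on other hosts link to it.

    `pages` maps a page URL to the list of URLs that page links to.
    """
    counts = Counter()
    for source_url, targets in pages.items():
        source_host = urlparse(source_url).netloc
        # Several links from one page to the same site count only once.
        linked_hosts = {urlparse(target).netloc for target in targets}
        for host in linked_hosts:
            # Ignore relative links (empty host) and links within the same site.
            if host and host != source_host:
                counts[host] += 1
    return counts
```

Calling `inbound_link_counts(pages).most_common(10)` would then list the ten most linked-to sites in the sample.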

One problem worth outlining here, and which is at its worst with search engine spiders, is that of bandwidth overload. In the early days of the Internet, when communication media were slow and computers were primitive, there was an acrimonious debate about whether spiders were a ‘good thing’. The early spiders were developed in such a way that they slowed down the early Web servers by, for example, visiting them for too long or revisiting them at very frequent intervals. At one point those who thought spiders were a nuisance seemed to be in the ascendancy; however, as lessons were learned and the Web got larger, those who regard them as an important technology have come to the fore.

Exclusion technology

One of the problems that I alluded to above is the fact that robots can consume bandwidth. In 1994 a group of users concerned with this problem developed a standard known as the Robots Exclusion Standard, which gives visiting robots guidance about the site. The standard defines a very simple language which states which pages a robot may and may not access. An example of this language is shown below:

# Simple example for Book
User-agent: *
Disallow: /main/temp/dayfiles

The first line is a comment. The second line states that the rules which follow apply to any robot (the asterisk is used as a wild card), and the final line specifies a directory that those robots should not visit.
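For readers who want to see how a well-behaved robot might apply these rules, here is a small sketch using Python's standard `urllib.robotparser` module. The rules are parsed from a string rather than downloaded, and the robot name `ExampleBot`, the host `www.example.com` and the helper `allowed` are made up for the illustration.

```python
from urllib.robotparser import RobotFileParser

# The same three lines as the example above.
rules = """\
# Simple example for Book
User-agent: *
Disallow: /main/temp/dayfiles
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(url, agent="ExampleBot"):
    """Return True if the exclusion rules let `agent` fetch `url`."""
    return parser.can_fetch(agent, url)


# A page under the disallowed directory is refused; other pages are permitted.
print(allowed("http://www.example.com/main/temp/dayfiles/log.txt"))
print(allowed("http://www.example.com/index.html"))
```

A spider would call something like `allowed` before each request and simply skip any URL for which it returns `False`.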

Text such as that shown above is stored in a file, by convention named robots.txt and placed at the root of the site, which is consulted by any robot that visits the site.

It is worth pointing out that the onus to consult the robot exclusion text is on the developer of the robot or spider; he or she can develop the robot in such a way that it ignores the exclusion text.
