Internet Search Engines and Spiders, Exclusion Technology
When the owner of a Web site wishes to advertise that site with a search engine, they access a registration page at the search engine site. This page usually contains a form which asks for details of the site, such as its home page, the name and e-mail address of the person submitting the site and a brief description of the site's purpose. Some time later the search engine will use a spider (also called a robot) to look at the site and index it. This is a multi-step process:
- The robot establishes a connection with the Web server and retrieves the home page.
- The home page is textually processed. How this is done depends on the search engine: some search engines just look at the meta tags in the HTML of the Web page, others look at the words which are displayed in the home page and others do both.
- Some robots will then extract the links in the home page and identify which of them belong to the site being accessed; they will then visit these pages and process the HTML which was used to build them.
- The keywords which have been extracted from the visited pages are then stored in the databases used by the search engine and associated with the URL of the site.
Once the spider has done its work, the site can be found by queries issued by users of the search engine.
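The processing steps listed above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular search engine works: the class name `PageIndexer`, the helper `store_keywords`, the in-memory dictionary standing in for the search engine's database and the host `www.example.com` are all assumptions made for the example. Retrieving the page over the network is left out, so the HTML is supplied as a string.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PageIndexer(HTMLParser):
    """Collects meta keywords, displayed words and links from one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.meta_keywords = []
        self.words = []
        self.links = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "meta" and (attrs.get("name") or "").lower() == "keywords":
            content = attrs.get("content") or ""
            self.meta_keywords += [
                k.strip().lower() for k in content.split(",") if k.strip()
            ]
        elif tag == "a" and attrs.get("href"):
            # Resolve relative links against the page's own URL.
            self.links.append(urljoin(self.base_url, attrs["href"]))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words += re.findall(r"[a-z0-9]+", data.lower())

    def same_site_links(self):
        """Links whose host matches the host of the page being processed."""
        host = urlparse(self.base_url).netloc
        return [u for u in self.links if urlparse(u).netloc == host]


def store_keywords(index, url, keywords):
    """Associate each keyword with the URL (index is a plain dict here)."""
    for keyword in keywords:
        index.setdefault(keyword, set()).add(url)


page = """<html><head><meta name="keywords" content="Widgets, Gadgets"></head>
<body><p>We sell widgets.</p>
<a href="/products.html">Products</a>
<a href="http://other.example.org/">Partner</a></body></html>"""

indexer = PageIndexer("http://www.example.com/")
indexer.feed(page)
indexer.close()

index = {}
store_keywords(index, indexer.base_url, indexer.meta_keywords + indexer.words)
print(sorted(index))              # keywords now associated with the page
print(indexer.same_site_links())  # pages the robot might visit next
```

Here the sketch reads both the meta tags and the displayed words; a robot that uses only one of the two would simply ignore the other list.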
Other uses for spiders
Spiders are not just used for search engine indexing. There are also spiders which gather Web statistics: for example, they can determine which sites are popular by examining links to them, or count the number of Web pages in selected sites in order to calculate growth rates of the World Wide Web. Other spiders can be used to check for links in Web pages which reference resources that no longer exist.
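As a rough illustration of the popularity measure mentioned above, the sketch below counts how many different sites link to each site. The function name `inbound_link_counts`, the input format (a dictionary mapping each visited page to the links found on it) and the sample hosts are assumptions made for the example; a real statistics spider would gather the links by crawling.

```python
from urllib.parse import urlparse


def inbound_link_counts(links_by_page):
    """For each host, count the distinct other hosts that link to it."""
    linking_hosts = {}
    for page_url, targets in links_by_page.items():
        source = urlparse(page_url).netloc
        for target in targets:
            dest = urlparse(target).netloc
            # Links within the same site say nothing about popularity.
            if dest and dest != source:
                linking_hosts.setdefault(dest, set()).add(source)
    return {host: len(sources) for host, sources in linking_hosts.items()}


crawled = {
    "http://a.example/x": ["http://b.example/", "http://a.example/y"],
    "http://c.example/": ["http://b.example/p", "http://a.example/"],
}
print(inbound_link_counts(crawled))
```

Counting distinct linking sites rather than individual links is a design choice for this sketch; it stops one site with many repeated links from dominating the figures.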
One problem worth outlining here, which is at its worst with search engine spiders, is bandwidth overload. In the early days of the Internet, when communication media were slow and computers were primitive, there was an acrimonious debate about whether spiders were a ‘good thing’, since the early spiders were written in such a way that they slowed down the early Web servers by, for example, visiting them for too long or revisiting them at very frequent intervals. At one point those who thought spiders were a nuisance seemed to be in the ascendancy; however, as lessons were learned and the Web grew, those who regard them as an important technology have come to the fore.
Exclusion technology
One of the problems I alluded to above is that robots can consume bandwidth. In 1994 a group of users concerned with this problem developed a standard, known as the Robots Exclusion Standard, which gives visiting robots guidance about the site. The standard defines a very simple language which states which pages a robot may access. An example of this language is shown below:
# Simple example for Book
User-agent: *
Disallow: /main/temp/dayfiles
The first line is a comment, the second line states that the rules which follow apply to any robot (the asterisk is used as a wild card) and the final line specifies a directory that the robot should not visit.
Text such as that shown above is stored in a file, conventionally named robots.txt and placed at the root of the site, which a visiting robot is expected to consult. It is used mainly to:
- Restrict access to pages which have been dynamically generated using technologies such as JavaServer Pages.
- Restrict access to pages which are not yet complete and which are under construction.
- Restrict robots to the core pages of a site, which contain the main information. This is done to ensure that a robot does not consume too much bandwidth by reading pages with no useful information on them.
It is worth pointing out that the decision whether to consult the robot exclusion text rests with the developer of the robot or spider; he or she can write the robot in such a way that it ignores the exclusion text.
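A robot that does choose to honour the exclusion text might check it as in the following sketch, which uses `urllib.robotparser` from the Python standard library. The robot name `ExampleBot` and the host `www.example.com` are assumptions made for the example, and the rules are supplied as a string rather than fetched from a site.

```python
from urllib.robotparser import RobotFileParser

# The exclusion text from the example earlier in this article.
EXAMPLE_RULES = """\
# Simple example for Book
User-agent: *
Disallow: /main/temp/dayfiles
"""


def build_parser(text):
    """Parse exclusion text that has already been retrieved."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def allowed(url, agent="ExampleBot", rules=EXAMPLE_RULES):
    """True if the rules permit the named robot to fetch the URL."""
    return build_parser(rules).can_fetch(agent, url)


print(allowed("http://www.example.com/main/temp/dayfiles/log1.html"))  # False
print(allowed("http://www.example.com/index.html"))                    # True
```

Nothing forces a robot to call a check such as `allowed` before fetching a page; the check happens only because the robot's code asks for it.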