gthelp.com

 Robots: Control and Prevention

Old June 23rd, 2004, 09:54 PM
joevelez

Robots are programs that are given objectives to traverse the Web and extract information. The information a robot retrieves can be general or specific.

The most common robots:

Search Engine Spiders


Search Engine Spiders are used by search engines to index web sites for their databases.

E-Mail Collectors


E-Mail Collectors are user agents used to harvest e-mail addresses from web pages, usually to flood those addresses with unsolicited commercial e-mail later.

Code Validators


Code Validators are very versatile. They are deployed in services such as dead link checking, user-friendliness tests, HTML validation, banner exchange code integrity checks, and many more.

You can prevent or control a robot's visit to your server by using a robots.txt file as described below, or by using a robots META tag in the <head> section of your HTML documents.

In a nutshell, robots.txt is a method that allows a Web site administrator to prevent or control how a robot visits a site. This is done by creating a robots.txt file and placing it in the root of your domain (e.g. http://www.yoursite.com/robots.txt).
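
Since the file always sits at the root of the domain, a robot can work out where to look from any page URL. Here is a minimal Python sketch of that step. The function name robots_txt_url is made up for illustration, and it assumes the robot simply keeps the scheme and host of the page it wants to visit:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; robots.txt lives at the root path.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("http://www.yoursite.com/forum/secret.php"))
# prints http://www.yoursite.com/robots.txt
```

A robots.txt placed in a subdirectory would never be found this way, which is why the root location matters.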

The /robots.txt file usually contains a record that looks like this:

Code:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joevelez/


The above record will prevent all robots from "visiting" your cgi-bin, tmp, and ~joevelez directories.
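
If you want to check a record before uploading it, Python's standard urllib.robotparser module can evaluate it locally. This is only a sketch: the robot name AnyBot and the file names form.pl, cache.html and index.html are made up for the test, and a real robot may interpret the file differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser

# The record from this post, one field per line.
RECORD = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joevelez/
"""


def allowed(robots_txt: str, agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


for path in ("/cgi-bin/form.pl", "/tmp/cache.html", "/~joevelez/", "/index.html"):
    print(path, allowed(RECORD, "AnyBot", "http://www.yoursite.com" + path))
# /cgi-bin/form.pl False
# /tmp/cache.html False
# /~joevelez/ False
# /index.html True
```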

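NOTE: A "Disallow" line is needed for every URL prefix you want to exclude. Blank lines separate records, so only use one when starting a new record, never inside a record. In the User-agent field, * means "any robot"; the original standard does not support wildcards or regular expressions in the Disallow paths.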
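
To show why the blank line matters, here is a rough Python sketch of how a robot might group the lines of a robots.txt into records. The function split_records and the sample file are made up for illustration. It is not how any particular robot is implemented, and it only looks at the User-agent and Disallow fields:

```python
def split_records(robots_txt: str) -> list:
    """Group robots.txt lines into records; a blank line ends a record."""
    records = []
    current = None
    for raw in robots_txt.splitlines():
        if not raw.strip():
            current = None  # a blank line closes the current record
            continue
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue  # comment-only line: not a record boundary
        field, _, value = line.partition(":")
        field = field.strip().lower()
        if current is None:
            current = {"user-agent": [], "disallow": []}
            records.append(current)
        if field in current:  # other fields are ignored in this sketch
            current[field].append(value.strip())
    return records


SAMPLE = """\
User-agent: WebCrawler
Disallow: /

User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
"""

for record in split_records(SAMPLE):
    print(record)
```

A stray blank line between two Disallow lines would start a new record with no User-agent, so the lines after it would no longer belong to the robot you meant.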

You can prevent robots from indexing specific pages by using the following record:

Code:
User-agent: *
Disallow: /images/nudes.htm
Disallow: /forum/secret.php
Disallow: /XXX/members.php


You can prevent a specific robot from indexing your site with the following record:

Code:
User-agent: WebCrawler
Disallow: /
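
As a quick check, the sketch below feeds that record to Python's standard urllib.robotparser and asks whether two robots may fetch a page. OtherBot and index.html are made-up names, and the result only reflects how Python's parser reads the record:

```python
from urllib.robotparser import RobotFileParser

parser = RobotFileParser()
parser.parse([
    "User-agent: WebCrawler",
    "Disallow: /",
])

home = "http://www.yoursite.com/index.html"
webcrawler_ok = parser.can_fetch("WebCrawler", home)
otherbot_ok = parser.can_fetch("OtherBot", home)  # no record names this robot
print(webcrawler_ok, otherbot_ok)
# prints False True
```

Only the named robot is shut out. Any robot not matched by a record is allowed by default.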


You can allow a specific robot to index your site with the following record:

Code:
User-agent: WebCrawler
Disallow:

If you want to allow all robots, simply create an empty /robots.txt file.
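
Both cases (an empty Disallow value and an empty file) can be tried out with Python's standard urllib.robotparser. This is just a sketch: the helper can_visit and the robot name OtherBot are made up, and a real robot may behave differently from Python's parser.

```python
from urllib.robotparser import RobotFileParser


def can_visit(robots_lines, agent, url):
    """Return True if the robots.txt lines let agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(agent, url)


page = "http://www.yoursite.com/forum/secret.php"

# An empty Disallow value excludes nothing for the named robot.
allow_webcrawler = ["User-agent: WebCrawler", "Disallow:"]
# An empty robots.txt contains no records at all.
empty_file = []

print(can_visit(allow_webcrawler, "WebCrawler", page))  # True
print(can_visit(empty_file, "OtherBot", page))          # True
```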

Prevent Google From Indexing Image


This will prevent Google from indexing http://www.yoursite.com/image/specialimage.jpg and adding it to its image directory. Insert the following record into the /robots.txt file:

Code:
User-agent: Googlebot-Image
Disallow: /image/specialimage.jpg


If you want to prevent Google from indexing all of your images, you can use the following record:

Code:
User-agent: Googlebot-Image
Disallow: /
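
The difference between the two Googlebot-Image records can be seen with Python's standard urllib.robotparser. Treat this as a sketch: it shows how Python's parser reads the records, not how Google's own crawler behaves, and other.jpg and index.html are made-up file names.

```python
from urllib.robotparser import RobotFileParser


def build(lines):
    """Parse robots.txt lines and return the ready-to-query parser."""
    parser = RobotFileParser()
    parser.parse(lines)
    return parser


one_image = build(["User-agent: Googlebot-Image",
                   "Disallow: /image/specialimage.jpg"])
all_images = build(["User-agent: Googlebot-Image",
                    "Disallow: /"])

base = "http://www.yoursite.com"
print(one_image.can_fetch("Googlebot-Image", base + "/image/specialimage.jpg"))  # False
print(one_image.can_fetch("Googlebot-Image", base + "/image/other.jpg"))         # True
print(all_images.can_fetch("Googlebot-Image", base + "/image/other.jpg"))        # False
print(all_images.can_fetch("Googlebot", base + "/index.html"))                   # True
```

The last line shows that the record only names the image robot, so the regular page robot is not affected by it.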


Remove A URL Located At Google


If you want to remove a URL (of a site or an image) from Google, use Google's automatic URL removal system.

Robots META TAG


No server administrator action is required to implement the robots META tag. You simply insert it in the <head> section of your HTML page. NOTE: Not all robots honor this tag. For example, <meta name="robots" content="noindex,nofollow"> tells the robot NOT to index the page and NOT to follow links on the page. You can also tell a robot to index a page but NOT follow its links, and vice versa. Here are the tags you can use:

HTML Code:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
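
For the curious, here is a rough Python sketch of how a well-behaved robot might read the robots META tag from a page, using the standard html.parser module. The class RobotsMetaParser, the function robots_directives and the sample page are all made up for illustration. Real robots may parse pages differently.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() != "robots":
            return
        content = attr.get("content") or ""
        # Directives are comma-separated, e.g. "noindex,follow".
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE = ('<html><head><meta name="robots" content="noindex,follow">'
        '</head><body></body></html>')

found = robots_directives(PAGE)
may_index = "noindex" not in found
may_follow = "nofollow" not in found
print(may_index, may_follow)
# prints False True
```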


This tutorial was made with help from the following sites. Please visit them for further reading.
SpiderSpotting
Spider Hunter
Search Engine Spider IP Address
The WEB Robots Pages
Spambot Beware
robots.txt syntax checker
User Agent Info


Last edited by joevelez : September 5th, 2004 at 11:54 AM.
 
