Robots: Control and Prevention
Robots are programs that are given objectives to travel the Web and extract information. The "information" a robot retrieves can be general or specific.

The most common robots:

Search Engine Spiders
Search engine spiders are used by search engines to index web sites for their databases.

E-Mail Collectors
E-mail collectors are user agents used to harvest e-mail addresses from web pages, usually to flood those addresses with unsolicited commercial e-mail later.

Code Validators
Code validators are very versatile. They are deployed in services such as dead link checking, user-friendliness tests, HTML validation, banner exchange code integrity checks, and many more.

You can prevent or control a robot's visit to your server by using a robots.txt file as described below, or by using a robots META tag in the <head> of your HTML documents.

In a nutshell, robots.txt is a method that allows a web site administrator to prevent or control how a robot visits a site. This is done by creating a robots.txt file and placing it in the root of your domain (e.g. http://www.yoursite.com/robots.txt). The /robots.txt file usually contains one or more records, each made up of a User-agent line followed by one or more Disallow lines.
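As a rough sketch of such a record, the Python snippet below keeps a robots.txt record in a string and checks it with the standard library's urllib.robotparser. The record blocks all robots from the /cgi-bin/, /tmp/, and /~joevelez/ directories. The helper name build_parser, the robot name AnyBot, and the file names under each directory are made up for illustration, and a real crawler may interpret the file differently than Python's parser does.

```python
from urllib.robotparser import RobotFileParser

# One record: applies to every robot, one Disallow line per directory.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joevelez/
"""


def build_parser(robots_txt):
    """Parse robots.txt text held in a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
base = "http://www.yoursite.com"

# "AnyBot" is a placeholder robot name; the file names are placeholders too.
for path in ("/cgi-bin/form.cgi", "/tmp/cache.html",
             "/~joevelez/notes.html", "/index.html"):
    print(path, parser.can_fetch("AnyBot", base + path))
```

Pages outside the three listed directories stay open to robots, because the standard has no "deny by default" behaviour.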
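A record with the line User-agent: * followed by the lines Disallow: /cgi-bin/, Disallow: /tmp/, and Disallow: /~joevelez/ will prevent all robots from "visiting" your cgi-bin, tmp, and ~joevelez directories.

NOTE: A "Disallow" line is needed for every URL prefix you want to exclude. Blank lines separate records, so only use a blank line when starting a new record. Under the original robots.txt standard, the * wildcard can only be used in the User-agent line, where it means "any robot".

You can prevent robots from indexing a specific page with a record whose Disallow line gives the path of that page.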
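A single-page exclusion could look like the sketch below. The path /private/page.html and the robot name AnyBot are assumptions chosen for the example, and the check again relies on Python's urllib.robotparser rather than on any particular search engine's behaviour.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical page path; replace it with the page you want to exclude.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/page.html
"""

page_parser = RobotFileParser()
page_parser.parse(ROBOTS_TXT.splitlines())

site = "http://www.yoursite.com"
page_allowed = page_parser.can_fetch("AnyBot", site + "/private/page.html")
sibling_allowed = page_parser.can_fetch("AnyBot", site + "/private/other.html")

print("excluded page allowed:", page_allowed)
print("sibling page allowed:", sibling_allowed)
```

Only the listed page is excluded; other pages in the same directory are still open to robots.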
You can prevent a specific robot from indexing your site with a record that names that robot in its User-agent line and contains the single line Disallow: / (the whole site).
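Here is a minimal sketch of shutting out one robot, assuming a robot that identifies itself as BadBot (a made-up name, as is OtherBot). Keep in mind that robots.txt is only a request, so this works only for robots that choose to honour it.

```python
from urllib.robotparser import RobotFileParser

# "BadBot" is a placeholder for the robot you want to keep out.
ROBOTS_TXT = """\
User-agent: BadBot
Disallow: /
"""

bot_parser = RobotFileParser()
bot_parser.parse(ROBOTS_TXT.splitlines())

home = "http://www.yoursite.com/index.html"
badbot_allowed = bot_parser.can_fetch("BadBot", home)
otherbot_allowed = bot_parser.can_fetch("OtherBot", home)

print("BadBot allowed:", badbot_allowed)
print("OtherBot allowed:", otherbot_allowed)
```

Robots that are not named in any record, and for which there is no User-agent: * record, are allowed everywhere.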
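You can allow a specific robot to index your site with two records: one that names that robot and leaves its Disallow value empty, and one for all other robots (User-agent: *) with the line Disallow: /.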
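The two-record form can be sketched as follows. GoodBot and OtherBot are hypothetical robot names, and the blank line between the records is what separates them.

```python
from urllib.robotparser import RobotFileParser

# First record: the one robot that may index everything (empty Disallow).
# Second record: every other robot is kept out of the whole site.
ROBOTS_TXT = """\
User-agent: GoodBot
Disallow:

User-agent: *
Disallow: /
"""

allow_parser = RobotFileParser()
allow_parser.parse(ROBOTS_TXT.splitlines())

home = "http://www.yoursite.com/index.html"
goodbot_allowed = allow_parser.can_fetch("GoodBot", home)
otherbot_allowed = allow_parser.can_fetch("OtherBot", home)

print("GoodBot allowed:", goodbot_allowed)
print("OtherBot allowed:", otherbot_allowed)
```

A robot uses the record that names it and falls back to the * record only when no named record matches.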
If you want to allow all robots, simply create an empty /robots.txt file.

Prevent Google From Indexing an Image
This will prevent Google from indexing http://www.yoursite.com/image/specialimage.jpg and adding it to their image directory. Insert a record into the /robots.txt file that names the Googlebot-Image user agent and contains the line Disallow: /image/specialimage.jpg.
If you want to prevent Google from indexing all of your images, use a record for Googlebot-Image with the line Disallow: / instead.
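Both image records can be compared side by side in a short sketch. The image path comes from the example URL above; the second file name (logo.jpg) and the helper name parse_record are assumptions, and the result shows how Python's urllib.robotparser reads the records, not a guarantee of what Google's crawler will do.

```python
from urllib.robotparser import RobotFileParser

# Record that hides one image from Google's image robot.
SINGLE_IMAGE = """\
User-agent: Googlebot-Image
Disallow: /image/specialimage.jpg
"""

# Record that hides the whole site from Google's image robot.
ALL_IMAGES = """\
User-agent: Googlebot-Image
Disallow: /
"""


def parse_record(robots_txt):
    """Parse robots.txt text held in a string (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


single = parse_record(SINGLE_IMAGE)
everything = parse_record(ALL_IMAGES)

special = "http://www.yoursite.com/image/specialimage.jpg"
other = "http://www.yoursite.com/image/logo.jpg"  # hypothetical second image

print(single.can_fetch("Googlebot-Image", special))
print(single.can_fetch("Googlebot-Image", other))
print(everything.can_fetch("Googlebot-Image", other))
```

Neither record mentions other robots, so they are not affected by it.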
Remove A URL Located At Google
If you want to remove a URL (the URL of a site or an image) from Google, use Google's automatic URL removal system.

Robots META Tag
No server administrator action is required to implement the robots META tag. It is simply inserted in the <head> section of your HTML page, for example <meta name="robots" content="noindex,nofollow">.

NOTE: This is not implemented in all robots.

That META tag tells the robot NOT to index the page and NOT to follow links on the page. You can also tell a robot to index a page and NOT to follow the links on the page, and vice versa. Here are the tags you can use:

HTML Code:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">

This tutorial was made with help from the following sites. Please visit them for further reading.

SpiderSpotting
Spider Hunter
Search Engine Spider IP Address
The WEB Robots Pages
Spambot Beware
robots.txt syntax checker
User Agent Info

Last edited by joevelez : September 5th, 2004 at 11:54 AM.