
Controlling how search engines access and index your website

Posted in Search Engines

I'm often asked about how Google and search engines work. One key question is: how does Google know what parts of a website the site owner wants to have show up in search results? Can publishers specify that some parts of the site should be private and non-searchable? The good news is that those who publish on the web have a lot of control over which pages should appear in search results.

The key is a simple file called robots.txt that has been an industry standard for many years. It lets a site owner control how search engines access their website. With robots.txt you can control access at multiple levels: the entire site, individual directories, pages of a specific type, or individual pages. Effective use of robots.txt gives you a lot of control over how your site is searched, but it's not always obvious how to achieve exactly what you want. This is the first of a series of posts on how to use robots.txt to control access to your content.

What does robots.txt do?

The web is big. Really big. You just won't believe how vastly hugely mind-bogglingly big it is. I mean, you might think it's a lot of work maintaining your website, but that's just peanuts to the whole web. (With profound apologies to Douglas Adams.)

Search engines like Google read through all this information and create an index of it. The index allows a search engine to take a query from users and show all the pages on the web that match it.

To do this, Google has a set of computers that continually crawl the web. They keep a list of all the websites that Google knows about and read the pages on each of those sites. Together, these machines are known as Googlebot. In general, you want Googlebot to access your site so your web pages can be found by people searching on Google.

However, you may have a few pages on your site that you don't want in Google's index. For example, you might have a directory that contains internal logs, or you may have news articles that require payment to access. You can exclude pages from Google's crawler by creating a text file called robots.txt and placing it in the root directory of your site. The robots.txt file contains a list of the pages that search engines shouldn't access. Creating a robots.txt file is straightforward, and it gives you a sophisticated level of control over how search engines access your website.
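Because the file always sits at the root of the site, a crawler can work out where to look for it from any page address. Here is a minimal sketch in Python using the standard library's urllib.parse. The function name robots_txt_url and the www.example.com address are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the address where robots.txt would be found for this page's site."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host; drop the page path, query and fragment,
    # because robots.txt lives at the top level of the site.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# www.example.com is a placeholder site, not a real example from this post.
print(robots_txt_url("https://www.example.com/news/2007/article.html?id=3"))
# prints "https://www.example.com/robots.txt"
```

A robots.txt file placed in a sub-directory would not be found this way, which is why the root directory matters.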

Fine-grained control
In addition to the robots.txt file -- which allows you to concisely specify instructions for a large number of files on your website -- you can use the robots META tag for fine-grained control over individual pages on your site. To implement this, simply add specific META tags to HTML pages to control how each individual page is indexed. Together, robots.txt and META tags give you the flexibility to express complex access policies relatively easily.

A simple example
Here is a simple example of a robots.txt file.

User-Agent: Googlebot
Disallow: /logs/

The User-Agent line specifies that the next section is a set of instructions just for Googlebot. All the major search engines read and obey the instructions you put in robots.txt, and you can specify different rules for different search engines if you want to. The Disallow line tells Googlebot not to access files in the logs sub-directory of your site. The contents of the pages you put into the logs directory will not show up in Google search results.
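To see how a crawler might read these two lines, here is a small sketch using Python's standard urllib.robotparser module. This is Python's own interpretation of the robots.txt convention, not Googlebot's actual code, so treat it as an approximation. The www.example.com addresses, the helper name build_parser and the "OtherBot" name are placeholders invented for this illustration.

```python
from urllib.robotparser import RobotFileParser

# The two-line robots.txt file from the example above.
ROBOTS_TXT = """\
User-Agent: Googlebot
Disallow: /logs/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot is kept out of the logs directory...
print(parser.can_fetch("Googlebot", "https://www.example.com/logs/access.log"))  # False
# ...but can still read the rest of the site.
print(parser.can_fetch("Googlebot", "https://www.example.com/index.html"))  # True
# The section names only Googlebot, so a different crawler is not restricted by it.
print(parser.can_fetch("OtherBot", "https://www.example.com/logs/access.log"))  # True
```

The last check shows why the User-Agent line matters: a rule written only for Googlebot says nothing about other crawlers. The robots.txt convention lets you write `User-Agent: *` to address all of them at once.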

Preventing access to a file
If you have a news article on your site that is only accessible to registered users, you'll want it excluded from Google's results. To do this, simply add a META tag to the head of the HTML file, so it starts something like:

<meta name="googlebot" content="noindex">

This stops Google from indexing this file. META tags are particularly useful if you have permission to edit the individual files but not the site-wide robots.txt. They also allow you to specify complex access-control policies on a page-by-page basis.
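As a rough illustration of how a crawler could honour this tag, the sketch below uses Python's standard html.parser module to look for a noindex directive. It is a simplified assumption about the logic, not Google's implementation. The names RobotsMetaParser and may_index and the sample page are invented for this example, and it checks only the noindex directive.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.directives = {}  # lower-cased meta name -> set of directives

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        name = (attr.get("name") or "").lower()
        content = attr.get("content") or ""
        if name:
            # A content value may hold several comma-separated directives.
            tokens = {t.strip().lower() for t in content.split(",") if t.strip()}
            self.directives.setdefault(name, set()).update(tokens)


def may_index(html: str, bot: str = "googlebot") -> bool:
    """Return False if the page asks this bot (or all robots) not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html)
    # A "robots" tag addresses every crawler; a tag named after a bot addresses only it.
    found = parser.directives.get("robots", set()) | parser.directives.get(bot.lower(), set())
    return "noindex" not in found


# A made-up page that starts with the META tag shown above.
PAGE = '<html><head><meta name="googlebot" content="noindex"></head><body>Members only</body></html>'

print(may_index(PAGE))                   # False
print(may_index(PAGE, bot="otherbot"))   # True
```

Because the tag is named googlebot, only Googlebot is told not to index the page. A tag named robots would apply to every crawler that honours it.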

Learn more
You can find out more about robots.txt at http://www.robotstxt.org and at Google's Webmaster help center, which contains lots of helpful information.

We've also done several posts in our webmaster blog about robots.txt that you may find useful.

There is also a useful list of the bots used by the major search engines: http://www.robotstxt.org/wc/active/html/index.html

Next time...
Coming soon: a post detailing the use of robots and META tags, and another on specific examples for common cases.

Update: Added a sentence to paragraph 9 on access-control policies.


Source: https://googleblog.blogspot.ca
