Robots.txt File Simplified: How to Create a Crawlable Custom Robots.txt File for SEO

The process of getting Google to index your web pages starts with proper crawling of your website. If search engines can’t crawl your site, you won’t be able to rank for your primary keywords.

Robots.txt is the text file you can use to instruct search engine crawlers (such as Googlebot) on how to crawl the pages of your website.

In this article, I will show you what a robots.txt file is, how to create one or edit an existing one, and the best practices recommended by search engines.


What is a Robots.txt file?

The robots.txt file is part of the Robots Exclusion Protocol (REP), which regulates how robots crawl, access and index content and then serve that content to users. REP also includes robots meta tags and the X-Robots-Tag HTTP header, which control how search engines should react to links found on the pages of your site.

Basically, the robots.txt file tells whether Googlebot should or should not crawl a particular page, image or other media file on your site.

Here’s a basic format:

User-agent: [user-agent name]

Disallow: [URL string not to be crawled]

You can also:

  • specify multiple user agents, 
  • disallow multiple pages, 
  • set crawl delays

and a few other things, which you’ll see in a moment.

Robots.txt in Action

Assume we’re working with the robots.txt file served at this URL: 

www.yourdomain.com/robots.txt


Blocking all web crawlers from all content

User-agent: * 

Disallow: /

This syntax instructs all web crawlers (such as Googlebot or Bingbot) not to crawl any pages on www.yourdomain.com, including the homepage.

Allowing all web crawlers access to all content

User-agent: * 

Disallow: 

This syntax will instruct web crawlers to crawl all pages on www.yourdomain.com, including the homepage.
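If you want to sanity-check rules like these, Python’s standard-library `urllib.robotparser` can evaluate a robots.txt file offline. This is just a sketch, and www.yourdomain.com is a placeholder domain:

```python
from urllib.robotparser import RobotFileParser

# Parse a "block everything" robots.txt without fetching anything.
block_all = RobotFileParser()
block_all.parse([
    "User-agent: *",
    "Disallow: /",
])

# An empty Disallow value allows everything.
allow_all = RobotFileParser()
allow_all.parse([
    "User-agent: *",
    "Disallow:",
])

print(block_all.can_fetch("*", "https://www.yourdomain.com/"))  # False
print(allow_all.can_fetch("*", "https://www.yourdomain.com/"))  # True
```

`can_fetch()` answers the same question a crawler asks itself: given this user agent, may I fetch this URL?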

Blocking a specific web crawler from a specific folder

Using a robots.txt file to block search engines from crawling part of a site.
Credit: Yoast SEO

User-agent: Googlebot 

Disallow: /sample-subfolder/

Using this syntax in your robots.txt file tells Google’s crawler (Googlebot) not to crawl any pages that contain the URL string www.yourdomain.com/sample-subfolder/.

Blocking a specific web crawler from a specific web page

User-agent: Bingbot 

Disallow: /sample-subfolder/blocked-page.html

This tells only Bingbot (the user agent of Microsoft’s Bing) to avoid crawling the specific page at www.yourdomain.com/sample-subfolder/blocked-page.html.

Blocking only a specific web page

For example,

If you have a page that you don’t want search engines to crawl and show users in their SERPs, you can disallow just that page while still letting your other pages appear in search engines.

User-agent: *

Disallow: /do-not-crawl-page.html

Setting crawlers to wait a certain time between requests

User-agent: *

Disallow: 

Crawl-delay: 300

This tells all search engine bots to wait 300 seconds (5 minutes) between requests while crawling the pages on the site.

Block all images on your site from Google Images:

User-agent: Googlebot-Image

Disallow: /

By specifying Googlebot-Image as the user agent, together with a disallow rule, you’re directly instructing Google’s image crawler not to crawl any of your images.

Block a specific image from Google Images:

User-agent: Googlebot-Image

Disallow: /images/dogs.jpg

This syntax will only block this specific image from being crawled.

Two pattern-matching characters you need to know

You may have noticed the asterisk (*) in the examples above. The other special character is the dollar sign ($). These are wildcard characters honoured by Google that you can use to specify the user agent and to identify pages or subfolders you want to exclude from crawlers.

  1. * is a wildcard that represents any sequence of characters.
  2. $ matches the end of the URL

Example:

User-agent: *

Disallow: /category/*.xml$

For all the possible pattern-matching syntax, check Google’s robots.txt documentation and examples.
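Note that Python’s built-in `urllib.robotparser` does not honour these wildcards, so here is a minimal sketch of how Google-style pattern matching can be approximated by translating a robots.txt path pattern into a regular expression (the function name is my own, not a standard API):

```python
import re

def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a Google-style robots.txt path pattern to a regex.

    '*' matches any sequence of characters; a trailing '$' anchors
    the end of the URL. Patterns match from the start of the path,
    which re.match() already guarantees.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape regex metacharacters, then restore the '*' wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))

rule = robots_pattern_to_regex("/category/*.xml$")
print(bool(rule.match("/category/feed.xml")))         # True
print(bool(rule.match("/category/feed.xml?page=2")))  # False
```

The `$` anchor is what rejects the second URL: without it, the extra query string would not matter.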

Understanding Robots.txt File Syntax

Robots.txt syntax can be thought of as the “language” of robots.txt files. There are five common terms you’re likely to come across in a robots.txt file. They include:

1. User-agent

The user-agent is the specific web crawler you’re giving instructions to (usually a search engine crawler such as Googlebot). 

You can set custom instructions for each user agent. Each search engine has its own user agent. Below is a list of major user agents every SEO must be familiar with:

  • Google: Googlebot
  • Google Images: Googlebot-Image
  • Bing: Bingbot
  • Yahoo: Slurp
  • Baidu: Baiduspider
  • DuckDuckGo: DuckDuckBot

You can check this comprehensive list of user agents to learn more.

2. “Disallow” Directive

The “disallow” directive tells search engine bots (the user-agent) not to crawl a given URL. You can have multiple disallow lines in a robots.txt file, but you can only place a single URL in each disallow line. However, you can use:

Disallow: /category/*.xml$

to disallow a group of URLs matching the stated pattern in the category folder.

3. “Allow” Directive

The “Allow” directive tells a crawler it can access a given URL in a folder even when you have disallowed the entire folder. Google supports it, as do some other engines such as Bing.

4. Crawl-delay

Not all search engines follow this rule. Google, for example, simply ignores any crawl delay you set in the robots.txt file, but it provides its own crawl-rate setting in Google Search Console. What crawl-delay does is tell crawlers to slow down so they don’t overload the web server. 

Some websites experience high traffic and want to slow search engine spiders down to free up server resources. The delay value is in seconds and tells crawlers to wait that amount of time between pages while crawling.

User-agent: Bingbot

Crawl-delay: 180
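Python’s `urllib.robotparser` exposes this directive through `crawl_delay()`, which is handy if you’re writing a polite custom crawler. A sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: Bingbot",
    "Crawl-delay: 180",
])

# crawl_delay() returns the delay in seconds for a given user agent,
# or None when no Crawl-delay rule applies to it.
print(rp.crawl_delay("Bingbot"))    # 180
print(rp.crawl_delay("Googlebot"))  # None
```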

5. Sitemap Directive

Including an XML sitemap in a robots.txt file is optional, but it’s supported by the major search engines and is still good SEO practice.

A sitemap is a list of URLs on your site that you want search engines to crawl and index.

You can include a sitemap line in the robots.txt file to tell search engines where it’s located.

Here’s a robots.txt file using the sitemap directive as an example:

User-agent: *

Disallow: /blog/

Allow: /blog/post-title/

Sitemap: https://www.yourdomain.com/sitemap.xml
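Python’s `urllib.robotparser` (3.8+) can read the sitemap directive back out with `site_maps()`, which is a quick way to confirm the file parses the way you intended:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /blog/",
    "Allow: /blog/post-title/",
    "Sitemap: https://www.yourdomain.com/sitemap.xml",
])

# site_maps() returns every Sitemap URL listed in the file, or None.
print(rp.site_maps())  # ['https://www.yourdomain.com/sitemap.xml']
```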


Other Robots.txt Syntax

A blank line – a blank line must separate each user-agent/disallow group, but no blank lines should exist within a group (that is, between the user-agent line and the last disallow line).

The hash symbol (#) – the hash symbol “#” can be used to add comments to the robots.txt file. Whatever text appears after the # is ignored by crawlers.

You can use it for whole-line comments or at the end of a line.

Directories and filenames are case-sensitive: “public”, “Public”, and “PUBLIC” are all different to search engines.

How the Robots.txt File Works

Knowing how the robots.txt file works is crucial to your SEO success, because for Google or other search engines to index and rank your content, they must first crawl your site. 

Every search engine has two main goals:

  1. To discover content by crawling the web.
  2. To index that content and serve it to users who are looking for information on the web.

Search engines crawl URLs: they follow links from one page or site to another, and in the process crawl through millions of websites.

Before crawling a site, a crawler (or spider, or bot) will first look for a robots.txt file. If it finds one, it reads it and follows the directives set in the file; whatever you disallow, it skips.

Many webmasters struggle to get their content indexed by Google because of errors in their robots.txt file.

Does my Website need a Robots.txt File?

Having no robots.txt file in your site directory doesn’t hurt. But there are many good reasons why you should create one. 

Plus, it’s pretty easy and simple to create.

Setting up a robots.txt file gives you much more control over where and how search engines crawl your site.

Here’s what Google says:

When Googlebot visits a website, we first ask for permission to crawl by attempting to retrieve the robots.txt file. A website without a robots.txt file, robots meta tags or X-Robots-Tag HTTP headers will generally be crawled and indexed normally.

And here are some reasons you might want to consider creating a robots.txt file:

  • It prevents the crawling of duplicate content;
  • It keeps some sections of your site private (e.g., your staging site, admin pages, etc.);
  • It prevents your site’s internal search results pages from being crawled and from showing up in public SERPs;
  • A crawl delay set in the robots.txt file helps prevent server overload;
  • It prevents Google from wasting “crawl budget”;
  • It prevents images, videos, and resource files from appearing in search engine results pages;
  • It specifies the location of your sitemaps.

Note that while Google doesn’t typically crawl and index pages that are blocked in robots.txt, this is not a guarantee of exclusion from search results.

If blocked (disallowed) content has a link pointing to it from somewhere else on the web, it may still show up in SERPs. To prevent Google from indexing a certain page, use the noindex robots meta tag or the X-Robots-Tag HTTP response header, or password-protect the file on your server.

How to know if you have a robots.txt file on your site

If you want to check to see if you already have a robots.txt file on your site, head over to your browser, and type:

yourdomain.com/robots.txt

But if you can’t find one… relax!

How to Create a Robots.txt File

If you followed the format above and the browser couldn’t return a robots.txt file, don’t panic!

Open a blank .txt document and type the directive you want by following the patterns we discussed earlier.

For example, if you want search engines not to crawl your WordPress admin page, then it’ll look something like this:

User-agent: *

Disallow: /wp-admin/

There are SEO plugins you can use to generate a robots.txt file. If you’re using WordPress, the Yoast SEO plugin will generate one, as will Rank Math and some other popular WordPress SEO plugins.

You can as well use this tool to generate one and upload it to the root directory of your website.

A robots.txt file generator like this helps you minimize syntax errors, as even a little mistake could hurt your SEO efforts.

But tools are built for the general case; they don’t give you the customization you might want.

If you want to customize beyond what the tools are giving you, just follow the best practices which we’re going to discuss next and the patterns discussed earlier.

Where to Put Your Robots.txt File

When a crawler arrives at your site, there’s one specific place it looks to find out whether you have a robots.txt file and what commands it contains. 

It looks in your root directory. If the file is placed outside the root directory, it will be inaccessible to crawlers, which will then assume there are no robots.txt instructions for the site and crawl the entire site in ways you might not want.

To avoid this, always place your robots.txt file directly in the root directory of your website (whether it’s WordPress, another CMS, or a custom site) so that it’s accessible at:

www.yourdomain.com/robots.txt

If you have a subdomain, such as blog.example.com, and you’d like to control how bots crawl it, place a robots.txt file in the root directory of that subdomain, so that it’s accessible at:

blog.example.com/robots.txt

Robots.txt File Best Practices

These best practices will guide you in properly creating and uploading a robots.txt file to your server without making mistakes.

1. For each directive, use a new line

Bad practice:

User-agent: * Disallow: /directory/ Disallow: /another-directory/

Best practice:

User-agent: * 

Disallow: /directory/ 

Disallow: /another-directory/

2. Simplify instructions with wildcards (*)

Use wildcards as URL pattern-matching to reduce the number of disallow lines you need.

You can also use a wildcard to target all user agents at once, instead of listing them one after the other.

For example, you can use pattern-matching to tell search engines not to crawl parameterized product category URLs on your website, like this:

Bad practice:

User-agent: * 

Disallow: /products/t-shirts?

Disallow: /products/hoodies?

Disallow: /products/jackets?

Good practice:

User-agent: * 

Disallow: /products/*?

Imagine you have hundreds of products in the category to deal with; see how the wildcard (*) can save you lots of time.

This instructs all search engines not to crawl any URL under the /products/ subfolder that contains a question mark. Simply put, parameterized product category URLs should not be crawled.

3. Use “$” to specify the end of a URL

Use the dollar sign “$” to mark the end of a URL. For example, if you want to block crawlers from accessing all .xml files on your site, here’s the best practice:

User-agent: * 

Disallow: /*.xml$

This way, you prevent search engines from accessing any URLs ending with .xml. 

4. Declare the same user-agent only once

Google doesn’t mind if you declare the same user-agent multiple times, like this:

User-agent: Googlebot

Disallow: /admin/

User-agent: Googlebot 

Disallow: /category-products/

Googlebot will still obey and crawl none of those subfolders.

But it looks simpler and keeps things neater if you combine the rules like this:

User-agent: Googlebot

Disallow: /admin/

Disallow: /category-products/

5. Be specific with directives to avoid unintentional errors

What do I mean…

Say you have a multilingual site, and the German (de) or French (fr) version is still in development and you don’t want it crawled yet. You could use a directive like this: 

User-agent: *

Disallow: /de

Here search engines will ignore everything under the subfolder with no problem.

But there’s a problem with this directive: it isn’t specific enough, and it will also prevent search engines from crawling any page or file on your site whose path begins with “/de”.

For example:

/designer-dresses/

/delivery-information.html

/definitely-not-for-public-viewing.pdf

Good practice

Include a trailing slash “/” at the end and that solves the problem:

User-agent: *

Disallow: /de/
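The difference between /de and /de/ is easy to demonstrate with `urllib.robotparser`, which (like crawlers) treats a disallow value as a path prefix. A sketch with the hypothetical URLs above:

```python
from urllib.robotparser import RobotFileParser

# Too loose: "Disallow: /de" is a prefix match, so it also blocks
# unrelated paths such as /designer-dresses/.
loose = RobotFileParser()
loose.parse(["User-agent: *", "Disallow: /de"])

# Specific: "Disallow: /de/" blocks only the /de/ subfolder.
strict = RobotFileParser()
strict.parse(["User-agent: *", "Disallow: /de/"])

print(loose.can_fetch("*", "https://www.yourdomain.com/designer-dresses/"))   # False
print(strict.can_fetch("*", "https://www.yourdomain.com/designer-dresses/"))  # True
print(strict.can_fetch("*", "https://www.yourdomain.com/de/index.html"))      # False
```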

6. Use comments to make your robots.txt file easier for people to read

Comments help people, including developers, better understand your robots.txt file.

A comment can be anything, such as the reason you’re disallowing a page:

# This instructs Google not to crawl our site.

User-agent: Googlebot

Disallow: /

# I want this specific folder disallowed because it contains information that’s private to me.

Crawlers ignore everything that appears after a hash (#) on a line.

So feel free to use comments whenever you want.

7. Use a separate robots.txt file for each subdomain

A robots.txt file only applies to the host (domain or subdomain) it is served from. You can’t assume the main domain’s robots.txt applies to a subdomain, and likewise, a robots.txt file on one subdomain doesn’t work for other subdomains of your website.

So you’ll have to create and upload a separate robots.txt file for the domain (in the domain’s root directory) and for each subdomain (in the subdomain’s root directory) of your website.

8. Robots.txt is case sensitive

Always check for errors like this:

Wrong

Robots.txt, robots.TXT, or any other variation

Right

robots.txt

SEO best practices for robots.txt file

  • Avoid blocking any content or sections of your site you want crawled.
  • Any URL blocked by robots.txt will not be crawled or indexed. But note that if such a page is linked from other pages on the web, it may still appear on search engine results pages.
  • Do not use the robots.txt file to keep sensitive data, such as private user data, out of search engine results pages; if other pages on the web link to it directly, it may still show up in SERPs.
  • The best practice is to use a noindex meta tag or to password-protect the file on your server.
  • Once you block a page URL in the robots.txt file, no link equity will be passed through it. 
  • Search engines usually cache robots.txt content, but they update the cache at least once a day. If you make changes and want them picked up sooner, you can submit your robots.txt URL to Google.

Difference Between Robots.txt, Meta robots And X-Robots

Robots.txt controls crawling behaviour at the site or directory level. It only manages the accessibility of your content to crawlers; it doesn’t tell them whether or not to index the content. 

Although many webmasters and SEOs used to include a noindex directive in robots.txt, it was never officially supported by Google and some other search engines, and it was officially deprecated in July 2019.

The X-Robots-Tag and robots meta tags tell search engines how to crawl and index a particular page, at the page level. The robots meta tag is a snippet of HTML code that you place in the <head> section of your web pages:

<meta name="robots" content="noindex" />

Now, if you set robots.txt and robots meta tag instructions for a page and they conflict with each other, Googlebot will follow the more restrictive rule.

For example,

If you block a page with robots.txt but set it to “index” with a meta tag or X-Robots-Tag, Google will never crawl the page, and so it won’t be able to read any robots meta tags you set on it. The result is that such a page will never make it into Google’s index.

Likewise, if you allow a page in robots.txt but prevent it from being indexed using a robots meta tag, Googlebot will access and crawl the page, but will not index it. 

Further Reading: Robots Meta Tag, X‑Robots-Tag and Data-nosnippet: Everything You Need to Know in 2021

Wrapping Robots.txt File up with FAQ…

Here are a few frequently asked questions that I think will help clear up some doubts. Let me know in the comments below if anything is still unclear, and I will add it to the FAQ after answering your question, so others with the same question can benefit. 

How can I slow down Google’s crawling of my website?

You can set the crawl-rate setting in your Google Search Console account. Google will ignore any crawl delay directive you set in the robots.txt file.

Can I put the robots.txt file in the subdirectory of my site?

No. You can only place it in the top-level directory of your website, which is the root directory. Any other location will make it inaccessible to search engines.

Do I need to include an allow directive to allow crawling?

You do not need to specify an allow directive before search engines can crawl a page. The allow directive only applies when you want to override a disallow directive in the same robots.txt file.

For instance, you can disallow a folder with the disallow directive and, at the same time, single out a page within that folder and set it to “allow”.
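You can check this override behaviour with `urllib.robotparser` too. One caveat worth noting: the stdlib parser applies the first matching rule in file order (Google instead prefers the most specific rule), so the allow line comes first in this sketch:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse([
    "User-agent: *",
    "Allow: /blog/post-title/",  # first matching rule wins in urllib.robotparser
    "Disallow: /blog/",
])

print(rp.can_fetch("*", "https://www.yourdomain.com/blog/post-title/"))  # True
print(rp.can_fetch("*", "https://www.yourdomain.com/blog/other-post/"))  # False
```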

If I block Google from crawling a page using a robots.txt disallow directive, will that prevent it from showing up on SERP?

Usually, a disallowed page won’t show up in SERPs because search engines can’t crawl it. But a disallow directive in the robots.txt file is not a total guarantee that it won’t appear.

A link to that page from somewhere else on the web can make it appear in SERPs. 

If you don’t want a page to show up in SERPs at all, use the X-Robots-Tag HTTP header or the robots meta tag. Even then, you should still allow crawlers access to the page so they can see the noindex instruction.

How can I temporarily suspend all crawling of my website?

The best way to do this is to make all URLs, including the robots.txt file, return a 503 status code. Google will periodically retry the robots.txt file until it can access it again.

What program should I use to create a robots.txt file?

You can use simple tools like Notepad or TextEdit; anything that can create a valid plain-text file will do.

There are also many tools online you can use to generate robots.txt files.

How to find robots.txt in WordPress?

Head over to your browser, and enter www.yourdomain.com/robots.txt.

How do I edit robots.txt in WordPress?

You can either create one manually or use one of the many WordPress SEO plugins like Yoast and RankMath that let you edit robots.txt from the WordPress backend.

Oladoyin Falana
https://semoladigital.com