Robots.txt, also known as the robots exclusion standard, is key to preventing search engine robots from crawling restricted areas of your site.
In this article, I’ll go over the basics of how to block URLs in robots.txt.
What We’ll Cover:
- What a robots.txt file is
- When you should use it
- Getting started
- How to create a robots.txt file
- How to disallow a file
- How to save your robots.txt
- How to test your results
What is a Robots.txt File?
Robots.txt is a text file that webmasters create to tell robots how to crawl website pages; it lets crawlers know whether or not to access a given file.
You may want to block URLs in robots.txt to keep Google from crawling private photos, expired special offers, or other pages that you’re not ready for users to access. Using it to block a URL can help with SEO efforts.
It can solve issues with duplicate content (however, there may be better ways to do this, which we will discuss later). When a robot begins crawling, it first checks to see if a robots.txt file is in place that would prevent it from viewing certain pages.
When should I use a Robots.txt file?
You’ll need to use one if you don’t want search engines to index certain pages or content. If you want search engines (like Google, Bing, and Yahoo) to access and index your entire site, you don’t need a robots.txt file. That said, some site owners do use the file simply to point crawlers to a sitemap.
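For example, a minimal file whose only real job is to point crawlers at a sitemap might look something like this (the sitemap URL is a placeholder, and the blank Disallow line means nothing is blocked):
User-agent: *
Disallow:

Sitemap: https://www.examplesite.com/sitemap.xml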
Keep in mind, though, that if other sites link to pages you’ve blocked, search engines may still index those URLs, and as a result, they may still show up in the search results. To prevent this from happening, use an x-robots-tag or noindex meta tag, or add a rel=canonical pointing to the appropriate page.
A robots.txt file helps websites do the following:
- Keep parts of a site private—think admin pages or your development team’s sandbox.
- Prevent duplicate content from appearing in the search results.
- Avoid indexation problems.
- Block individual URLs.
- Prevent search engines from indexing specific files, like images or PDFs (see the example after this list).
- Manage crawl traffic and prevent media files from appearing in the SERPs.
- Give robots special instructions when you’re running paid ads or links that require them.
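On the file-type point, major search engines such as Google and Bing also recognize the * and $ wildcards in robots.txt rules, so a rough sketch for keeping PDFs and a media folder out of the crawl might look like this (the paths are placeholders):
User-agent: *
Disallow: /*.pdf$
Disallow: /media/
The $ anchors the rule to the end of the URL, so only addresses that actually end in .pdf are affected.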
That said, if you don’t have any areas on your site that you need to control, you don’t need one. Google’s guidelines also mention that you should not rely on robots.txt to keep web pages out of the search results.
The reason: if other pages link to yours with descriptive text, your page could still be indexed by virtue of showing up through that third-party link. Noindex directives or password-protected pages are a better bet here.
Getting Started With Robots.txt
Before you start putting together the file, you’ll want to make sure that you don’t already have one in place. To find it, just add “/robots.txt” to the end of your domain name, like www.examplesite.com/robots.txt. If you have one, you’ll see a file that contains a list of instructions. Otherwise, you’ll see a blank page.
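If your site runs on a common CMS, the file you find there will often look something like this (a typical WordPress-style default, shown purely as an illustration):
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php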
Next, Check if Any Important Files Are Being Blocked
Head over to your Google Search Console to see whether your file is blocking any important files. The robots.txt Tester will reveal whether your file is preventing Google’s crawlers from reaching certain parts of your website.
It’s also worth noting that you might not need a robots.txt file at all. If you have a relatively simple website and don’t need to block off specific pages for testing or to protect sensitive information, you’re fine without one, and the tutorial stops here.
Setting Up Your Robots.txt File
These files can be used in a variety of ways. However, their main benefit is that marketers can allow or disallow several pages at a time without having to access the code of each page manually.
All robots.txt files will result in one of the following outcomes:
- Full allow: all content can be crawled.
- Full disallow: no content can be crawled. This means that you’re fully blocking Google’s crawlers from reaching any part of your website. (The sketch after this list shows what both of these look like in practice.)
- Conditional allow: the rules outlined in the file determine which content is open for crawling and which is blocked. If you’re wondering how to disallow a URL without blocking crawlers off from the whole site, this is it.
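To make the first two outcomes concrete, a “full allow” is usually written as a wildcard user-agent with a blank disallow line, roughly like this:
User-agent: *
Disallow:
while a “full disallow” blocks everything with a single forward slash:
User-agent: *
Disallow: /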
If you would like to set up a file, the process is actually quite simple and involves two elements: the “user-agent,” which is the robot the following URL block applies to, and “disallow,” which is the URL you want to block. These two lines are seen as one single entry in the file, meaning that you can have several entries in one file.
How to Block URLs in Robots.txt:
For the user-agent line, you can list a specific bot (such as Googlebot) or apply the block to all bots by using an asterisk. The following is an example of a user-agent line that targets all bots.
User-agent: *
The second line in the entry, disallow, lists the specific pages you want to block. To block the entire site, use a forward slash. For all other entries, use a forward slash first and then list the page, directory, image, or file type:
- Disallow: / blocks the entire site.
- Disallow: /bad-directory/ blocks both the directory and all of its contents.
- Disallow: /secret.html blocks a single page.
After making your user-agent and disallow selections, one of your entries may look like this:
User-agent: *
Disallow: /bad-directory/
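Since a file can hold several entries, you can also give different bots different rules. For instance (the directory and folder names here are just placeholders):
User-agent: *
Disallow: /bad-directory/

User-agent: Googlebot-Image
Disallow: /bad-directory/
Disallow: /private-photos/
One thing to watch: a crawler generally follows only the most specific group that matches it and ignores the rest, which is why the /bad-directory/ rule is repeated under Googlebot-Image instead of being left to the wildcard group.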
View other example entries from Google Search Console.
How to Save Your File
- Save your file by copying it into a plain text file or Notepad and saving it as “robots.txt”.
- Be sure to save the file to the highest-level directory of your site and ensure that it is in the root domain with a name exactly matching “robots.txt”.
- Add your file to the top-level directory of your website’s code for simple crawling and indexing.
- Make sure that your file follows the correct structure: User-agent → Disallow → Allow → Host → Sitemap (see the complete example after this list). This keeps your directives organized and easy for crawlers, and humans, to read.
- Make sure all URLs you want to “Allow:” or “Disallow:” are placed on their own lines. If several URLs appear on a single line, crawlers will have difficulty separating them and you may run into trouble.
- Always use lowercase to save your file, as file names are case sensitive, and don’t include special characters.
- Create separate files for different subdomains. For example, “example.com” and “blog.example.com” should each have their own file with its own set of directives.
- If you must leave comments, start a new line and preface the comment with the # character. The # tells crawlers to skip that line when reading your directives.
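Pulling those rules together, a complete file that follows the structure above might look something like this (the directory names and sitemap URL are placeholders, and the optional Host line is left out):
# Directives for all crawlers
User-agent: *
Disallow: /admin/
Allow: /admin/public-page.html

Sitemap: https://www.examplesite.com/sitemap.xml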
How to Test Your Results
Test your results in your Google Search Console account to make sure that the bots are crawling the parts of the site you want and blocking the URLs you don’t want searchers to see.
- First, open the tester tool and take a look over your file to scan for any warnings or errors.
- Then, enter the URL of a page on your website into the box found at the bottom of the page.
- Then, select the user-agent you’d like to simulate from the dropdown menu.
- Click TEST.
- The TEST button should read either ACCEPTED or BLOCKED, which will indicate whether the URL you entered is blocked from crawlers or not.
- Edit the file, if needed, and test again.
- Remember, any changes you make inside GSC’s tester tool will not be saved to your website (it’s a simulation).
- If you’d like to save your changes, copy the updated code into the robots.txt file on your site.
Keep in mind that this will only test the Googlebot and other Google-related user-agents. That said, using the tester is huge when it comes to SEO. See, if you do decide to use the file, it’s imperative that you set it up correctly. If there are any errors in your code, the Googlebot might not index your page—or you might inadvertently block important pages from the SERPs.
Finally, make sure you don’t use it as a substitute for real security measures. Passwords, firewalls, and encrypted data are better options when it comes to protecting your site from hackers, fraudsters, and prying eyes.
Wrapping Up
Ready to get started with robots.txt? Great!
If you have any questions or need help getting started, let us know!