Why Google Indexes Blocked Web Pages
Post Author: Harry James
Post Date: 7 September 2024
In the intricate world of SEO, peculiarities like Google indexing blocked web pages can confuse even experienced professionals.
Understanding why this happens and how it impacts your site can demystify some of these perplexing reports in Google Search Console.
The Mystery Behind Google’s Indexing
Google sometimes indexes pages that you’ve blocked via robots.txt. Because the page is disallowed, Googlebot never fetches it and therefore cannot see any noindex directive embedded in its code. As a result, those pages may still be indexed even though crawling is disallowed. This discrepancy can be puzzling, particularly when reviewing Search Console reports.
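To see the mechanics, here is a minimal sketch using Python’s standard-library urllib.robotparser to check whether Googlebot is allowed to fetch a URL; the example.com addresses are placeholders, not from the article. If the fetch is disallowed, any noindex directive on the page is simply never read.

```python
from urllib.robotparser import RobotFileParser

# Placeholder site and page; substitute your own URLs to test.
ROBOTS_URL = "https://example.com/robots.txt"
BLOCKED_PAGE = "https://example.com/private/report.html"

parser = RobotFileParser()
parser.set_url(ROBOTS_URL)
parser.read()  # fetch and parse the live robots.txt

if not parser.can_fetch("Googlebot", BLOCKED_PAGE):
    # Googlebot never requests the page, so a noindex meta tag or
    # X-Robots-Tag header on it is never seen. The URL can still be
    # indexed if other pages link to it.
    print("Blocked by robots.txt: an on-page noindex cannot be read.")
else:
    print("Crawlable: an on-page noindex directive would be honoured.")
```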
Limitations of the site: Search Operator
The site: search operator has significant limitations. It is not connected to the regular search index in the way ordinary queries are, which makes it a poor diagnostic tool. Google’s advanced search operators, including site:, are unreliable for understanding how content is actually ranked or indexed.
Using Noindex Tags Effectively
Pages that Googlebot can crawl but that carry a noindex tag generate a ‘crawled/not indexed’ entry in Search Console. This indicates that Google crawled the page and deliberately left it out of the index. Such entries don’t negatively affect the rest of your website.
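If you want to audit which signal a page is actually sending, a small script along these lines can help. This is an illustrative sketch, not from the article: it assumes the requests library and a hypothetical has_noindex helper, checks both the X-Robots-Tag response header and the robots meta tag, and uses a regex purely for brevity where a real audit would use an HTML parser.

```python
import re
import requests

def has_noindex(url: str) -> bool:
    """Hypothetical helper: return True if the URL advertises a noindex directive."""
    response = requests.get(url, timeout=10)

    # 1. HTTP header form: X-Robots-Tag: noindex
    if "noindex" in response.headers.get("X-Robots-Tag", "").lower():
        return True

    # 2. Meta tag form: <meta name="robots" content="noindex, follow">
    #    Regex kept simple for the sketch; assumes name appears before content.
    meta = re.search(
        r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']*)["\']',
        response.text,
        re.IGNORECASE,
    )
    return bool(meta and "noindex" in meta.group(1).lower())

# Example usage with a placeholder URL:
# print(has_noindex("https://example.com/thin-page"))
```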
Why Noindex Needs Crawl Access
Google’s documentation advises that for a noindex rule to be effective, the crawler must be able to access the page. Blocking the page with robots.txt prevents that access, so the directive is never read and the URL can still end up indexed.
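The interaction of the two signals can be summarised in a small, illustrative decision helper. The outcome strings below paraphrase Search Console’s reporting and the article’s terminology; they are an assumption for illustration, not an official mapping.

```python
def likely_outcome(blocked_by_robots: bool, has_noindex_tag: bool) -> str:
    """Rough, illustrative mapping of the two signals to the likely result."""
    if blocked_by_robots:
        # The noindex tag is unreadable either way; the URL may still be
        # indexed from links alone ("Indexed, though blocked by robots.txt").
        return "May be indexed without content; noindex has no effect"
    if has_noindex_tag:
        # Crawlable and explicitly excluded: reported as crawled/not indexed.
        return "Crawled, then excluded from the index by noindex"
    return "Eligible for normal crawling and indexing"
```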
Interpreting Crawled/Not Indexed Reports
These reports are helpful alerts: they flag pages that may be blocked or excluded unintentionally so publishers can investigate. If the restriction is deliberate, no action is needed.
Google’s Advice and Best Practices
Avoid unnecessary restrictions: a page must remain crawlable for a noindex directive to take effect.
Why Disallowed Pages Are Indexed
Disallowed pages can be indexed because Google discovers their URLs through internal or external links, even though robots.txt blocks crawling. In that situation Googlebot knows the URL exists but cannot read any noindex tag on the page, so the URL may be indexed on the strength of link signals alone.
Understanding why Google indexes blocked pages helps clarify these Search Console anomalies. Applying noindex correctly, and knowing how Googlebot treats robots.txt, prevents most unintended indexing.
Source: Search Engine Journal