Ever wondered why you're losing traffic? As if something were preventing Google from fully accessing your website? Google doesn't index every page it finds on a website, and it has been transparent about this. Google also helps website owners and webmasters see which of their pages are indexed via Google Search Console. Any page missing from Search Console is either not indexed or has a problem that needs to be addressed.
GSC gives you vital information about the specific issue each page is facing. This includes server errors, 404s, and some common content-related issues. Still, we need to dig a little deeper to find exactly what prevents your pages from showing up in Google Search.
Just imagine: would it be possible to sell anything without actually putting it on the store shelf, or anywhere a likely customer can find it? In the same way, before Google can show your pages to potential users, it has to find and index them.
If you want your pages to show up in search, they have to be properly indexed. At the same time, Google analyzes your content to decide which queries it may be relevant for.
If you want to get organic traffic from Google, you need to have your pages indexed. The more pages are indexed, the more often they can appear in search results, and the more traffic you can expect.
That's why you need to know if Google can index your content.
How To Identify Indexing Issues
Optimizing websites from a technical standpoint makes them more visible on Google. Of course, not every page has the same value, and your goal is not to have them all indexed. Every site has old, outdated pages, taxonomies, tag pages, and, on eCommerce sites, filter parameter URLs, for example.
Web admins have multiple ways of telling Google to ignore such pages, including the robots.txt file and the noindex tag.
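To make this concrete, here is a minimal robots.txt sketch; the /tag/ and /filter/ paths are hypothetical examples, not paths from any particular site:

```txt
# robots.txt, served at the site root (https://example.com/robots.txt)
# Asks all crawlers not to crawl the listed paths
User-agent: *
Disallow: /tag/
Disallow: /filter/
```

For pages that should stay crawlable but not indexed, the noindex tag goes in the page's head instead: `<meta name="robots" content="noindex">`.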
Letting Google index such pages would negatively affect the website's overall SEO, so it's better to keep a tidy list of all pages blocked by robots.txt or marked as noindex, and to keep track of all redirects, 404s, and any other status codes other than 200.
It also helps to keep your sitemap updated so it includes all relevant and valid URLs. A good, up-to-date sitemap is the most straightforward representation of the valuable URLs on a website: no random junk URLs, just pages with real value.
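For illustration, a minimal sitemap fragment might look like this, with example.com standing in for your own domain:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One <url> entry per valuable, indexable page -->
  <url>
    <loc>https://www.example.com/products/blue-widget</loc>
    <lastmod>2022-01-15</lastmod>
  </url>
</urlset>
```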
The main indexing issues vary depending on the size of a website. There are small websites of up to about 15,000 pages, medium sites with up to 100,000 pages, and large websites with over a million pages. Common practice for one group may not apply to another at all.
This is mostly because a single issue that one large website faces can outweigh a whole set of problems a smaller website deals with. Each website has its own pattern of indexing issues it struggles with, but there is a way to categorize them.
Top Indexing Issues
As we previously mentioned, every website faces specific problems while trying to rank on Google. The top issues preventing sites from being thoroughly indexed are:
- Discovered - currently not indexed
- Crawled - currently not indexed
- Duplicate content
- Crawl issue
- Soft 404s
But there are other things to consider as well. One of the most common problems websites face is content quality: your pages may have thin content, content copied from another website, or content that is offensive or misleading in some way.
If you cannot provide unique, valuable content that Google wants to show to users, you may face significant indexing problems.
For example, Google may flag some of your pages as duplicate content, even if the content you provided is unique and not copied.
The canonical tag is meant to prevent duplicate-content problems when the same content appears on multiple URLs. But if you mistakenly set canonical tags pointing to the wrong pages, the original page can end up not indexed.
If you have duplicate content, use the canonical tag or a 301 redirect to make sure the same pages on your site aren't competing against each other for views, clicks, and links.
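As a quick sketch, the canonical tag is a single line in the head of each duplicate URL, pointing at the version you want indexed (the URL below is a placeholder):

```html
<!-- On every duplicate variant, point to the preferred URL -->
<link rel="canonical" href="https://www.example.com/preferred-page/">
```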
A crawl budget is the number of pages Googlebot is willing to crawl on a given website, and it is limited. This is why optimization is vital: you don't want to waste your crawl budget on irrelevant pages.
We also listed 404 errors among the indexing issues. A 404 means you submitted a deleted or non-existent page for indexing. Soft 404s display "not found" content but don't return the HTTP 404 status code. Redirecting non-existent pages to irrelevant ones is another common mistake. Chains of multiple redirects may also show up as soft 404 errors, and they can result in Google not indexing the final destination page. So, eliminate redirect chains wherever you can.
There are many crawl issues, but an important one is a problem with robots.txt. If Googlebot finds a robots.txt file for your site but can't access it (for example, because it returns a server error), it will not crawl the site at all.
From what we've seen so far, almost all big websites face the same issues, and it's hard to maintain quality across large websites with over 100k pages.
What we know so far is that:
- Sites with 10,000-100,000 pages may have an insufficient crawl budget and could face indexing issues
- Crawl budget and page quality become more important the bigger the website gets
- The scale of the duplicate content problem varies from website to website
- Orphan pages are an often-neglected issue causing indexing problems
Orphan pages have no internal links leading to them. As a result, Googlebot has no clear path to such a page, which prevents it from being indexed.
The solution is to add links from related pages to the orphan page, or to add it to your sitemap so Google can at least discover it. Either way, having an intelligent content strategy and continually optimizing your website may save you from a lot of indexing trouble. And bring you much-valued traffic!
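The idea behind finding orphan pages can be sketched with a tiny link graph: any known page that no other page links to (apart from designated entry points like the homepage) is an orphan. The URLs below are made up for illustration:

```python
def find_orphans(pages, links, roots=("/",)):
    """Return pages with no internal links pointing to them.

    pages: all known URLs (e.g. from the sitemap or a crawl)
    links: iterable of (source, target) internal links
    roots: entry points (like the homepage) that need no inbound link
    """
    linked = {target for _, target in links}
    return sorted(p for p in pages if p not in linked and p not in roots)

pages = ["/", "/blog", "/blog/post-1", "/old-landing-page"]
links = [("/", "/blog"), ("/blog", "/blog/post-1")]
print(find_orphans(pages, links))  # ['/old-landing-page']
```

In practice, the page list would come from your sitemap and the link pairs from a site crawl; any URL the function returns is a candidate for new internal links.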