Executive Summary:
In this article, you'll learn what "crawlability" means, how to make sure your site is crawlable (both by search engines like Google and Bing, and by the AI bots that feed answer engines like ChatGPT), how to test your site's crawlability, and finally, best practices for getting your content indexed.
What Does "Crawlability" Mean?
At a high level, crawlability is the ability of a bot to start at a website's home page and discover every page on the site simply by following links: links from the home page, then links from the pages it discovered there, and so on. This is a recursive process. The bot might start at the home page and find links to pages A, B, and C. It then crawls pages A, B, and C, adding any links it finds on those pages to a "crawl queue." The bot keeps track of the pages it has already crawled, and each time it finds a link to a page it has never seen before, it adds that page to the queue. The bot is "done" crawling the website once the queue is empty--that is, once there are no discovered pages left that it hasn't yet crawled.
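The crawl-queue process described above can be sketched in a few lines of Python. This is a simplified model that uses a hypothetical in-memory link graph in place of real HTTP fetches, but the queue-and-seen-set logic is the same one a real crawler uses:

```python
from collections import deque

def crawl(start_page, get_links):
    """Breadth-first crawl: discover every page reachable from start_page."""
    seen = {start_page}          # pages already discovered
    queue = deque([start_page])  # the "crawl queue"
    order = []                   # pages in the order they were crawled
    while queue:
        page = queue.popleft()
        order.append(page)
        for link in get_links(page):
            if link not in seen:  # never queue a page twice
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical site: home links to A, B, and C; A links to a detail page
site = {
    "/": ["/a", "/b", "/c"],
    "/a": ["/a/detail", "/"],   # link back to home is skipped (already seen)
    "/b": ["/c"],               # link to C is skipped (already seen)
    "/c": [],
    "/a/detail": [],
}
print(crawl("/", lambda page: site.get(page, [])))
# → ['/', '/a', '/b', '/c', '/a/detail']
```

Note that a page like /a/detail is only discoverable because something links to it; remove that link from /a and the crawler never finds it, no matter how good the page is.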
What Makes for Good Crawlability?
The short answer: good site architecture. The long answer: it's all about internal linking. Your most important pages should be linked to from the main menu. Each of those pages should link to the secondary pages that together make up a "hub" of information about that important page's topic. Much less important pages might only have 1 or 2 links, from deep inside those secondary pages, or maybe blog posts.
When it comes to site architecture for your blog, your category archives pages should be linked to from every blog post that belongs to that category. Many sites will actually have a list of links to all of their category archives pages--usually in a sidebar or perhaps the footer. This approach ensures that (a) the category archives pages get passed a lot of "link juice" from other pages on the site, and (b) every blog post has at least one link to it: from its category archive page. In addition, you might also have date archives, tag archives, author archives, etc. that have additional links to those blog posts.
Another excellent tactic for both usability and crawlability is to use breadcrumb links at the top of the page. This not only makes it easy for humans to understand where the content fits in your website overall (and get to related content easily), it sends a clear signal to the bots as to how the pages relate to one another: what's a main topic page vs. a sub-topic page vs. a detail page.
A positive side effect of good crawlability: it tends to pass more "link juice" to your most important pages while still passing some to the lesser pages. It's important to realize that any page linked to from the main menu gets link juice passed to it from every page on the site--a nice, clear signal to Google and the AI bots that this is one of your more important pages. And while Google hasn't published PageRank scores for many years, PageRank is still part of the Google ranking algorithm.
To summarize:
- Main menu links to your most important pages
- Links from most important pages to the next tier pages
- Internal links from within less important pages and from within blog posts to the least important pages
- Use links to blog archives pages to ensure all blog posts can be discovered by a crawler
- Use breadcrumb links to make your site hierarchy clear to users and to the bots
What Makes for Bad Crawlability?
- Overly minimalistic main menu
- Main menu that is NOT implemented via regular <a href="xyz"> links (e.g. Javascript links)
- Main menu that isn't visible in the initially-downloaded HTML (i.e. before client-side rendering occurs)
- Not having links to blog archives pages within the blog posts themselves
- A user interface built primarily around a search form (Googlebot does NOT submit forms!)
- Links contained within content that is lazy-loaded
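To see why JavaScript links and lazy-loaded menus cause trouble, consider what a simple crawler actually does with the initially-downloaded HTML: it extracts <a href> links and nothing else. A minimal sketch in Python (the HTML snippet is hypothetical):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, the way a basic crawler would."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical initial HTML: one real link, one JS "link", one lazy-loaded menu
html = """
<a href="/pricing">Pricing</a>
<span onclick="navigate('/features')">Features</span>
<div id="menu"><!-- menu injected by client-side JS after load --></div>
"""
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # → ['/pricing']
```

Only the real <a href> link survives. The JavaScript onclick "link" and the JS-injected menu are invisible to any bot that doesn't execute scripts.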
Why Do I Need My Site to be Crawlable, Anyway?
You might think having an XML sitemap is enough: with that, you're essentially giving Googlebot a list of all of the URLs on your website.
That's a good thing, of course...but, XML sitemaps don't pass link juice, i.e. PageRank.
You're also sending Google a very negative signal about any page that's in the XML sitemap but cannot be found via a link on the site itself. You're telling Google that the page is VERY unimportant--so unimportant that you don't want users to actually be able to find it, except maybe via a site search.
How Do You Test Your Website's Crawlability?
It's simple: you use a crawler like Screaming Frog SEO Spider to emulate what a crawler like Googlebot does: start it at the home page and let it find every page it can, via links on pages.
A crawler like Screaming Frog will give you a list of all of the pages it found on the website. Compare that to the list of pages in your XML sitemaps, and you'll see which pages aren't findable by a crawler.
Of course, the above presumes your XML sitemap is up to date and accurate--if it's automatically generated, it should be up to date and accurate. If it's not...stop reading this right now and fix that!
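If you want to script this comparison yourself, the idea is just a set difference: URLs in the sitemap minus URLs the crawler found. A minimal sketch (the sitemap and the crawl results here are hypothetical stand-ins for your real sitemap and your Screaming Frog export):

```python
import xml.etree.ElementTree as ET

# Hypothetical XML sitemap with one page no internal link points to
sitemap_xml = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
  <url><loc>https://example.com/orphan-page</loc></url>
</urlset>"""

# Hypothetical list of URLs a crawler actually found by following links
crawled = {"https://example.com/", "https://example.com/pricing"}

ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
in_sitemap = {loc.text
              for loc in ET.fromstring(sitemap_xml).findall("sm:url/sm:loc", ns)}

orphans = in_sitemap - crawled  # in the sitemap, but unreachable by crawling
print(sorted(orphans))  # → ['https://example.com/orphan-page']
```

Every URL that shows up in that difference is an "orphan" page that needs at least one internal link pointing to it.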
Crawlability vs. Fetchability
Just because a bot can crawl your page doesn't mean it can "see" the content on your page. Any number of things can get in the way and prevent the crawler from seeing what a human sees on the page:
- Client-side rendering: where the content gets formed on the page by client-side Javascript, executed after the page is fetched
- Content blocked by robots.txt: generally this will be stylesheets or images that are stored in folders blocked by robots.txt
- Delays and timeouts: a bot won't wait forever to get the content. Googlebot will give up after about 5 seconds; the AI bots will often give up MUCH sooner
- Late-loading of content: this often happens on long blog posts, or e-commerce product pages with long lists of products
- Background images: rendering images via CSS background styles is a convenient way to make an image responsive (resizing appropriately for different-sized viewports/devices), but when you do this, you're essentially telling Google that the image is decoration, NOT CONTENT. Images that you want Google and the AI bots to treat as content about the page's topic must not be implemented via CSS. Use <img src="xyz"> tags instead.
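For the robots.txt case above, you can check whether a given asset is blocked using Python's standard-library robots.txt parser. A small sketch, using a hypothetical robots.txt that blocks an asset folder:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks a folder holding stylesheets/images
robots_txt = """User-agent: *
Disallow: /assets/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A blocked stylesheet means bots can't render the page the way users see it
print(rp.can_fetch("Googlebot", "https://example.com/assets/site.css"))  # → False
print(rp.can_fetch("Googlebot", "https://example.com/blog/post"))        # → True
```

If the CSS and image URLs your pages depend on come back False, the bots can fetch your HTML but can't "see" the fully-rendered page.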
The acid test for fetchability is to do a Live URL Inspection on the page in Google Search Console. You can also use Max Prin's excellent fetch-and-render tool to test how the page will be seen by various bots. It actually does a much better job than Search Console: it offers more options, and it fetches the WHOLE page, not just the first part as Google Search Console does.
Best Practices for Getting Your Content Indexed
- Internal linking: the more pages on your site that link to a page, the stronger a signal you're sending to Google that that page is important.
- External linking: if other websites link directly to a page on your site, that's a strong signal to Google that OTHER people think your page is important. Social media links count here, too, by the way--even though they're generally nofollowed or tagged with rel="UGC".
- Quality signals: get the basics right! That means: no duplicate or missing page titles, no duplicate or missing meta descriptions, no spelling/grammar mistakes, good readability.
- Originality signals: stock photos and AI-generated imagery are just "fluff". Google treats original images as a signal that the author actually had first-hand experience with whatever the page is about.
- Server-side rendering: if your page uses one of the popular JS frameworks, make sure you've got server-side rendering in place.
- Page load times: Googlebot and the AI bots will give up if your page loads too slowly. Test with tools like GT Metrix, and make sure your pages are passing all of the Core Web Vitals tests (especially for mobile). As well, make sure the AI bots can find a summary of what the page is about very early on the page--otherwise you're not likely to get cited in AI answers.
- Schema markup: tell Google and the AI bots what entities are on the page, so they know it's about a product, or course, or case study, or a review, etc.
- Word count: while not really a ranking factor, having a decent amount of information is definitely important. A page with 50 to 100 words on it isn't likely to have a lot of value to anybody. Having said that, a page that is too long can struggle to rank as well. A 20,000 word page might have a lot of information in it, but Google is well aware that very few people are willing to sift through that much content.
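On the schema markup point above: schema markup is usually embedded as JSON-LD inside a <script type="application/ld+json"> tag in the page. Here's a minimal, hypothetical Product entity, built in Python for clarity (the product details are made up; see schema.org for the full vocabulary):

```python
import json

# A minimal, hypothetical Product entity using the schema.org vocabulary
product = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Acme Anvil",
    "image": "https://example.com/img/anvil.jpg",
    "description": "A 50 lb drop-forged steel anvil.",
    "brand": {"@type": "Brand", "name": "Acme"},
}

# Embed this JSON inside a <script type="application/ld+json"> tag in the page
print(json.dumps(product, indent=2))
```

The @type value is what tells Google and the AI bots which kind of entity the page is about--swap in Course, Review, Article, etc. as appropriate.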