There's nothing more basic to SEO than getting Google to index your content - and for that, your pages must be crawlable and indexable.
JavaScript's robust nature provides many features that have turned the web into a powerful machine, adding:
- Intuitive features
- A better user experience
And lots more…
As AJAX-based applications replace the popular static HTML pages, users can now enjoy a better and richer experience faster than they used to.
But this has come at a huge cost for the web as crawlers are unable to access and see content that's dynamically created.
Processing AJAX applications has been difficult for most search engines, because while browsers can produce your content dynamically with ease, that content remains invisible to crawlers.
However, Google has a way of crawling and indexing your content:
"If you're running an AJAX application with content that you'd like to appear in search results, we have a new process that, when implemented, can help Google (and potentially other search engines) crawl and index your content."
But you need to regularly carry out manual maintenance to keep your content up-to-date.
And you'll see this shortly in the troubleshooting section of this article.
Google has a rendering system - the Web Rendering Service (WRS) - which is based on an evergreen version of Chromium.
And this process is not as easy as with HTML-based websites.
1. Crawling Phase
Before crawling begins, Googlebot first checks whether you have allowed it to crawl your web pages by reading the robots.txt file.
And then proceed (if allowed) to make a GET request to the URLs waiting in the crawl queue.
But if you've disallowed it from crawling, the crawler will skip the URL(s) and not make any HTTP request to it.
The crawler then parses the response for other URLs, and all links wrapped in the href attribute of the HTML are added to the crawl queue.
I'll explain this further in the processing phase.
Pro tip: Google uses either mobile or desktop crawlers to crawl web pages. Each crawler presents itself to your site with a user agent for that device type.
If you want to know which type of crawler visits your site, you can use the URL Inspection tool in Search Console.
Here's an example, notice the “Crawl as”...
This also tells you whether you're on mobile-first indexing or desktop indexing.
If your site is new, you'll probably be on the primary crawler which is the mobile crawler.
But sometimes, Google may send a secondary crawler (desktop) to your website.
There are two main problems you may encounter that can impact your SEO:
1. Blocking a specific country or treating particular IPs differently - the majority of the requests Google makes come from Mountain View, CA, but it sometimes makes requests from locations outside the USA. If you block a specific country or handle particular IPs differently, Googlebot may not be able to see your content.
2. Using user-agent detection to show content to a specific crawler - this usually results in Google seeing different content from what users see.
Google's testing tools (covered later in this article) tell you whether Google can see the content on your page - maybe you're blocking the bots.
If you notice that spammers are accessing your site in the name of Googlebot, you can verify the user agent by running a DNS lookup.
Note: Google doesn't publish a list of IP addresses that you can easily whitelist, and the addresses change often.
So you may want to verify by following these three steps:
- Go to your logs and run a DNS lookup on the IP address by using the host command.
- Check to see that the domain name is either googlebot.com or google.com.
- Run a forward DNS lookup on the domain name retrieved in step 1 using the host command, and verify that it resolves to the same IP address that originally accessed your site in your logs.
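The steps above can be sketched in code. The DNS lookups themselves need network access, so they appear as comments; `isVerifiedGooglebotHost` is a hypothetical helper that performs the string check from step 2:

```javascript
// Step 2 as code: a genuine reverse-DNS hostname for Googlebot
// ends in googlebot.com or google.com.
function isVerifiedGooglebotHost(hostname) {
  return /\.(googlebot|google)\.com$/.test(hostname);
}

// In a shell, the reverse and forward lookups (steps 1 and 3) would be:
//   host 66.249.66.1                         -> e.g. crawl-66-249-66-1.googlebot.com
//   host crawl-66-249-66-1.googlebot.com     -> should resolve back to 66.249.66.1
```

A spoofed crawler can set any user-agent string, but it can't control the reverse DNS of its IP, which is why this check works.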
For more details, see https://support.google.com/webmasters/answer/80553
When crawling ends, the page's HTML - together with the JS files, CSS files, and XHR requests - is stored and ready for processing.
2. The Processing phase
A lot of things are involved in the processing stage but I've simplified it below:
If you allow Googlebot to crawl your webpage and follow its links, it discovers the links in the HTML page that point to other pages (external or internal) and adds them all to the crawl queue, which is then used to schedule crawling.
This is where you can control and instruct Google to either follow or nofollow a certain link.
"For certain links on your site, you might want to tell Google your relationship with the linked page. In order to do that, you can use one of the rel attribute values in the <a> tag.
For regular links that you expect Google to follow without any qualifications, you don't need to add a rel attribute."
If you don't want Google to follow a link on your page, simply add a rel="nofollow" attribute and Google will skip it - and not add it to the crawl queue. Meaning... you're cutting the crawl path.
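For example, a nofollow link looks like this (the URL is just a placeholder):

```html
<!-- Google won't follow this link or add it to the crawl queue -->
<a href="https://example.com/untrusted-page" rel="nofollow">untrusted page</a>
```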
Although this is done relatively fast and shouldn't be a problem, it becomes a concern when too many unnecessary scripts are loaded before the page gets rendered.
Along with links, Google also pulls out the resources required for the page such as CSS and JS files from the <link> tags.
Note that external and internal links are pulled from <a> tags with an href attribute, which is why you must specify the correct attributes for your links.
<a href="/page" onclick="goTo('page')">still okay</a>
Or simply:
<a href="/page">simple is good</a>
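By contrast, links that rely on JavaScript alone, without an href on an <a> tag, can't be followed (goTo is a placeholder function):

```html
<!-- Not recommended: no crawlable href, so Googlebot can't follow these -->
<a onclick="goTo('page')">no href at all</a>
<span onclick="goTo('page')">not even a link tag</span>
```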
Google will ignore your cache timings and fetch a new copy when they want to.
I’ll talk a bit more about this and why it’s important in the Renderer.
When Google crawls web pages, it makes raw HTML copies of the files from your server.
And before pages are sent to the renderer, all the files involved in the build-up of the page are downloaded and cached.
Caching simply allows search engines to fetch pages quickly, so Google doesn't need to re-download every resource for each page request that matches a user's query.
The cached version of your web pages is indexed.
To make Google download and use the updated version of your web pages for rendering, use file versioning or content fingerprinting to generate new file names whenever there's a significant change.
Pro tip: by "HTML" I mean all the resources - the HTML, JS, and CSS code - that make up the page.
Google caches them all, ignoring your cache timings and fetching new copies from your server whenever it likes.
Google's aim is to index and show pages with distinct information.
Before a page is rendered and sent to the index, Google tries to eliminate any duplicate content from the already downloaded HTML just after crawling.
App shell model is used on sites with relatively unchanging navigation, but dynamic content.
How this causes duplicate content
Because Google can't yet see the actual rendered content that should make each page distinct, what it has after crawling may be a small amount of content and code that looks similar across many pages on the site.
In such a case, the web pages appear as duplicates and may not immediately go to rendering, although this usually resolves within a few seconds.
To avoid this, server-side rendering or pre-rendering is still a great option, because it also makes your website load faster for users and easier for Googlebot to crawl.
Another thing that happens during the processing stage is how Google responds to a given directive.
For instance, if there is a conflict between the raw HTML and the rendered version of a page, Google will obey whichever directive is more restrictive.
noindex will override index, and a noindex in the raw HTML will cause Google to skip rendering altogether.
3. Rendering process
This is the stage where Google is able to see the content on the webpage.
Google uses the Web Rendering Service (WRS), which includes things like denying permissions, being stateless, flattening light DOM and shadow DOM, and more.
It then parses the rendered HTML for links again and queues up the URLs it finds for crawling.
The rendered HTML is also used to index the page.
Note: Google adds pages to the queue in both the crawling and rendering processes, and it can be difficult to know when a given page is waiting for crawling or for rendering.
How Google Sees Content In The Rendered DOM
Does Googlebot See Content The Same Way Users Do?
The Document Object Model (DOM) is an API for HTML and XML documents.
It defines the logical structure of documents and the way a document can be accessed and manipulated.
When you browse through a web page you don't get to see what has happened in the background.
So as long as your content is loaded in the DOM, Google will see it.
But if the content can't be loaded into the DOM, your content will not be seen.
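As a minimal sketch (a hypothetical page), content injected by JavaScript becomes part of the DOM, so Google can see it once the page is rendered:

```html
<!-- The div is empty in the raw HTML; the script fills it at load time.
     After rendering, the injected text is in the DOM and visible to Google. -->
<div id="app"></div>
<script>
  document.getElementById('app').textContent =
    'This content is visible to Google once the page is rendered.';
</script>
```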
You might be wondering how Googlebot sees the entire content on a long page...
Well, unlike humans, Googlebot doesn't need to scroll through a long piece of content to know what's on it; it has its own way of accessing all the content on a page.
Google usually resizes the loaded page into a longer size and doesn't need to scroll.
For example, a mobile device with a screen size of 411x731 is resized to 411x12140 pixels.
For desktop, the length is also increased to the full length of the page.
E.g., 1024x768 pixels is resized to 1024x9307 pixels.
Also, when Google renders the page, it doesn't paint the pixels.
All it needs to know is the structure and the layout, and it can determine this without actually painting the pixels.
Here's what Martin Splitt from Google said:
"In Google search, we don’t really care about the pixels because we don’t really want to show it to someone. We want to process the information and the semantic information so we need something in the intermediate state. We don’t have to actually paint the pixels."
1. Pages Cache
When Google crawls the web, it takes a "snapshot" of the webpage as a backup.
The cached pages are extremely useful, especially when a site is down or times out.
What Googlebot sees is the initial HTML - and sometimes, the rendered HTML.
2. View Source vs. Inspect Element
If you've always thought these two are the same, here's the difference.
When you right-click and view the page source, it shows you the same thing a GET request would: the raw HTML of the page.
Inspect, on the other hand, shows you the content after the DOM has been processed and changed.
So the page source may not include all the content you want the crawler to see; "View Page Source" is exactly what the crawler gets before rendering, while Inspect is closer to what the renderer sees.
3. Google Search
Copy some text from your content and search for it on Google. If the page that contains the content is returned on the SERP, the content was seen. If not, check whether the content is hidden by default, because hidden content may not show up on the SERP.
4. Google Testing Tools
Tools like the Mobile-Friendly Test, the Rich Results Test, and the URL Inspection tool in Google Search Console let you see the DOM-loaded content, blocked resources, and error messages, which is helpful while debugging.
However, while these tools will often show you the HTML rendered in the DOM, you can still search Google for a snippet of text to confirm it was actually loaded by default.
5. Ahrefs Tools
Rendering takes the content of your web pages and displays it to the user. There are two main types of rendering:
- Server-side rendering
- Client-side rendering
But there are other rendering options in between you can choose from to make your website search-friendly.
Client-side rendering is not a problem for Google but you must also consider other search engines.
There are some slight differences from the "normal" SEO which I'm going to show you.
Allow Googlebot to access and download the resources on your site so that it can properly render and index your content. Your robots.txt file should look like this:
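A minimal sketch (the directory names are placeholders for wherever your JS and CSS live):

```
User-agent: Googlebot
Allow: /

# Make sure you are NOT disallowing the directories that hold
# the resources your pages need to render, e.g. avoid lines like:
# Disallow: /js/
# Disallow: /css/
```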
Use ‘History’ Mode Instead Of The Traditional ‘Hash (Fragments)’ Mode For URLs
To allow Googlebot to find links on your pages, use the History API. Googlebot looks for links on your page and only considers links in the href attribute of HTML <a> tags.
Anything after #/ is ignored and Googlebot won't crawl it, so avoid using fragments to load different page content (especially in single-page applications) and use the History API instead.
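As a sketch of the migration (`toHistoryPath` is a hypothetical helper, not part of any library), hash-style routes can be mapped to crawlable paths and then pushed via the History API:

```javascript
// Convert a "#/" fragment URL into a plain, crawlable path.
function toHistoryPath(url) {
  const hash = url.indexOf('#/');
  if (hash === -1) return url;        // already a plain path
  const base = url.slice(0, hash);
  const route = url.slice(hash + 1);  // "/products" from "#/products"
  return (base.endsWith('/') ? base.slice(0, -1) : base) + route;
}

// In the browser (not runnable in Node), update the address bar
// without a full reload:
// window.history.pushState({}, '', toHistoryPath(location.href));
```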
Use meaningful HTTP status codes
One thing Googlebot uses to detect if something went wrong during crawling is the HTTP status code.
Use HTTP status code to tell Googlebot which page it should crawl and index or which ones it shouldn't.
You can use a 401 (Unauthorized) status code for pages behind a login, for example, and you can tell Googlebot when a page has moved to a new URL so it updates the index on the next crawl.
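The idea can be sketched as a routing table (`statusFor` and the paths are hypothetical examples; your server would use these codes in its responses):

```javascript
// Map each URL to a meaningful HTTP status code for Googlebot.
function statusFor(url) {
  if (url === '/old-page') return { status: 301, location: '/new-page' }; // moved permanently
  if (url === '/members') return { status: 401 };  // behind a login
  if (url === '/') return { status: 200 };         // page exists
  return { status: 404 };                          // unknown URL: tell Googlebot it doesn't exist
}
```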
Avoid using window.location.href when you want to redirect visitors immediately to another URL, even though many would advise you to use it.
The problem with that implementation is that the current URL is added to the visitor's navigation history, which can cause the visitor to get stuck in back-button loops.
window.location.replace() does not have this problem: set up that way, it would send visitors to https://www.semoladigital.com/ upon page load without polluting their history.
Use meta robots tags correctly
Sometimes you might have some pages (or content) that you want to prevent search engines from indexing.
For pages like thin content or upcoming promotions, you can use the meta robots tag to prevent Googlebot from indexing the page or following links on it.
For example, the code below on a page will block Googlebot from indexing the page:
<!-- Googlebot won't index this page or follow links on this page -->
<meta name="robots" content="noindex, nofollow">
Handle duplicate content with canonical tags as you would for every other type of website.
Canonical tags allow you to choose one version and hint to search engines that it's the version to index.
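For example (example.com stands in for your own domain):

```html
<!-- Hints that this URL is the preferred version to index -->
<link rel="canonical" href="https://example.com/preferred-page/">
```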
SEO “plugin” type options
Use structured data
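Structured data helps Google understand what a page is about and can make it eligible for rich results. A minimal JSON-LD sketch (all values are placeholders):

```html
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Article",
  "headline": "How Google Renders JavaScript",
  "author": { "@type": "Person", "name": "Jane Doe" }
}
</script>
```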
For example, if your site uses React Router, the react-router-sitemap package can generate and save your sitemap (a sketch - your router file path and hostname will differ):
const router = require('./router').default;
const Sitemap = require('react-router-sitemap').default;
new Sitemap(router).build('https://www.example.com').save('./sitemap.xml');
Whatever framework you use, searching Google for the framework name plus "sitemap" will return links explaining where and how to implement a sitemap on your site.
You can also add a noindex tag to routes that fail, along with a message like "404 - Page Not Found". In a single-page app, the server will still return a status code of 200 for these soft-404 pages, so the noindex tag is what keeps them out of the index.