The Canonical Race Test
The test itself was relatively simple.
I created four pages: two unique pages (A and B) and one duplicate of each unique page (C and D).
I then added a canonical HTML tag to duplicate page C and pointed it to unique page A. For duplicate page D, I configured my server to add a rel="canonical" HTTP header and pointed the link to unique page B.
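For illustration, the two implementations might look something like this. The example.com URLs and the filename are placeholders, and the header example assumes an Apache server with mod_headers enabled (other servers have their own equivalents):

```
<!-- Duplicate page C: canonical HTML tag in the <head> of the page -->
<link rel="canonical" href="https://example.com/page-a/" />

# Duplicate page D: rel="canonical" HTTP response header, added here
# via an Apache .htaccess rule (placeholder filename and URL)
<Files "page-d.html">
  Header set Link "<https://example.com/page-b/>; rel=\"canonical\""
</Files>
```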
I then submitted all four pages (A, B, C, and D) to Google using the Fetch and Render Tool (it was still available at the time).
The hypothesis for the test was fairly simple: Google processes both canonical methods at the same speed.
Canonical Race Results
When checking the results a few days later, I noticed that both duplicate pages C and D had been crawled on the same date.
A few days after that, I was surprised to find that Google had analysed and processed duplicate page D, which had the rel="canonical" HTTP header, and had updated the Index Coverage report for the URL, labelling it as “Alternative page with proper canonical tag”.
Google picked up the HTML canonical tag on page C three days later and also displayed it as an “Alternative page with proper canonical tag”. Interestingly, at the time, the Index Coverage report labelled the duplicate URL as indexed until the canonical HTML tag was picked up.
Based on my test, the results suggest that Google’s indexing systems pick up canonical link signals in HTTP response headers faster than those in HTML tags.
Why was the HTTP canonical link picked up faster?
I suspect this is because HTTP response headers are processed before the HTML in Google’s crawling and indexing pipeline.
For example, if you run a curl -i command against a live page on a website, you can see that the HTTP headers are downloaded along with the HTML.
I imagine Googlebot does something similar, as Gary mentioned that it is built using a custom implementation similar to cURL.
We know Googlebot analyses and extracts attributes from HTTP response headers, as the HTTP protocol is a critical part of how servers and clients communicate with each other.
I suspect that the analysis of HTTP response header information happens very early in the crawling/indexing pipeline. This would be needed to filter out any URLs which don’t need to be sent to the HTML parsing or rendering services, to improve efficiency (this is only a theory).
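As a rough sketch of that idea, here is how a crawler could pull a canonical signal straight out of the response headers without ever touching an HTML parser. The raw response and URLs below are made up for illustration, not a real Google fetch:

```python
# Sketch of the "headers first" idea: a crawler can read a canonical
# signal straight from the HTTP response headers without ever parsing
# the HTML body. The raw response below is made up for illustration.
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    'Link: <https://example.com/page-b/>; rel="canonical"\r\n'
    "\r\n"
    "<html><head></head><body>Duplicate page D</body></html>"
)

# Headers end at the first blank line; everything after is the body.
headers_part, body = raw_response.split("\r\n\r\n", 1)

headers = {}
for line in headers_part.split("\r\n")[1:]:  # skip the status line
    name, _, value = line.partition(": ")
    headers[name.lower()] = value

canonical = None
link = headers.get("link", "")
if 'rel="canonical"' in link:
    # The target URL sits between the angle brackets.
    canonical = link[link.index("<") + 1 : link.index(">")]

print(canonical)  # https://example.com/page-b/
```

Note that the canonical URL is available before the body is even looked at, which is the efficiency argument in a nutshell.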
This makes (sort of) sense, as supported HTTP response header attributes are an efficient way of analysing the state of a URL to help the indexing pipeline make a decision about a page.
Compare this to a canonical HTML tag, which needs to go through the HTML parsing and rendering queue before the metadata is picked up and the indexing pipeline can make a decision about the page.
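By contrast, here is a minimal sketch of extracting the same signal from the markup using Python's built-in html.parser: the whole document has to be downloaded and run through a parser before the canonical shows up. The page markup is again made up for illustration:

```python
from html.parser import HTMLParser

# Contrast with the header approach: to find the same signal in the
# markup, the whole HTML document has to be parsed first.
class CanonicalFinder(HTMLParser):
    """Collects the href of the first <link rel="canonical"> tag."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "canonical":
            if self.canonical is None:
                self.canonical = attrs.get("href")

# Hypothetical markup for duplicate page C
html_page = (
    "<html><head>"
    '<link rel="canonical" href="https://example.com/page-a/" />'
    "</head><body>Duplicate page C</body></html>"
)

finder = CanonicalFinder()
finder.feed(html_page)
print(finder.canonical)  # https://example.com/page-a/
```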
So my theory is that any data in the HTTP header is picked up slightly faster than data in the HTML, simply because it is more efficient to process.
Faster approach to de-indexing duplicate content
Do I think SEOs and developers should start using just HTTP canonical links?
It’s important to make sure you use the solution that helps you get the job done. If that is classic HTML tags, then go for it (I will); if you use HTTP header links, then keep at it.
I personally would want to combine this technique with Oliver Mason’s XML Sitemap hack to see if these two techniques can help get a large amount of duplicate content crawled and analysed faster.
If a canonical implementation can be picked up more quickly by Google, and low-value duplicate URLs are removed from the index sooner, this can have a positive impact on a website’s overall SEO performance.