Re: Fixing Google Search on the docs (redux) - Mailing list pgsql-www

From Dave Page
Subject Re: Fixing Google Search on the docs (redux)
Date
Msg-id CA+OCxoxet+zWmWB5b2sLjvUHwNngPnyu8STQ4frYoMtY--MNMA@mail.gmail.com
In response to Re: Fixing Google Search on the docs (redux)  (Magnus Hagander <magnus@hagander.net>)
Responses Re: Fixing Google Search on the docs (redux)  (Greg Stark <stark@mit.edu>)
List pgsql-www


On Thu, Nov 19, 2020 at 9:58 AM Magnus Hagander <magnus@hagander.net> wrote:

> The issue for me is that the current situation sucks for the vast majority of users, as evidenced by our analytics. If we blocked indexing of all but the current version of the docs, it would suck in the same way only for those that specifically want to look at an older version, and those that search for one of the very few things that have been removed from the latest version. In short, I think the current situation is worse.

> Or we need a somewhat in between level. Like, right now I bet most
> people would actually want version 11 or 12, not 13. So do we need to
> define a "most likely wants to search for this" version as well, which
> would then trail the actual latest-release version, and point the
> search engines to that?

Perhaps an interesting datapoint is this, from the Google docs:

====
If you have a single page accessible by multiple URLs, or different pages with similar content (for example, a page with both a mobile and a desktop version), Google sees these as duplicate versions of the same page. Google will choose one URL as the canonical version and crawl that, and all other URLs will be considered duplicate URLs and crawled less often.

If you don't explicitly tell Google which URL is canonical, Google will make the choice for you, or might consider them both of equal weight, which might lead to unwanted behavior, as explained below in Why should I choose a canonical URL?
==== 


I think this is interesting because it makes the point that non-canonical URLs will still be indexed, just less often. I wonder if we can do something like the following, but still retain the ability to do a search like "postgresql 12 create trigger":

- Remove (by default) all doc URLs from the sitemap that aren't under /current/ (note that evidence indicates Google will still index pages that are not in the sitemap if it finds them, even when a sitemap is present).
- Include a canonical URL in all doc pages that points to the /current/ version
- Where a page has been removed entirely, mark the most recent version that still contains it as the canonical one instead of the /current/ version (a rough sketch of this selection logic follows below).

If the Google docs are correct, it'll still index the older versions (and presumably use them in results if it needs to, e.g. because the user included a version number), but it'll prefer the canonical one.
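
To make that concrete, here's a rough sketch of the per-page selection logic in Python. It's purely illustrative: SUPPORTED_VERSIONS, CURRENT, page_exists() and the rest are made-up placeholders, not the real pgweb/doc-loading code, and the version list is just what it happened to be around now.

    # Illustrative sketch only -- names and version list are placeholders,
    # not the real pgweb code.
    SUPPORTED_VERSIONS = ["13", "12", "11", "10", "9.6", "9.5"]  # newest first
    CURRENT = "13"

    def page_exists(version, pagename):
        """Placeholder hook: does this page exist in that version's docs?"""
        raise NotImplementedError

    def canonical_url(pagename):
        """Prefer /docs/current/; if the page is gone from the latest
        version, fall back to the newest version that still has it."""
        if page_exists(CURRENT, pagename):
            return "https://www.postgresql.org/docs/current/" + pagename
        for v in SUPPORTED_VERSIONS:
            if page_exists(v, pagename):
                return "https://www.postgresql.org/docs/" + v + "/" + pagename
        return None  # not found anywhere; emit no canonical tag

    def canonical_link_tag(pagename):
        """Render the <link rel="canonical"> element for a doc page."""
        url = canonical_url(pagename)
        if not url:
            return ""
        return '<link rel="canonical" href="%s" />' % url

    def sitemap_doc_urls(all_doc_urls):
        """Default sitemap: only list the /docs/current/ pages."""
        return [u for u in all_doc_urls if "/docs/current/" in u]

The point is just that the canonical target is chosen per page rather than being a blanket pointer to /current/, and the same logic can drive both the link tag and the sitemap filtering.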


> That said, I also agree with the suggestion to start by at least
> blocking those that are unsupported. However, we should monitor the
> results carefully so that it doesn't end up with Google just zapping
> *everything* -- we need them to realize the newer versions are there.
> Doing the canonical-URL setup that Jonathan suggested would make
> Google update it; the question is what happens if they just "go away".
> Do we *lose* all the existing "Google power" of those links? If so,
> it might be a very costly experiment...

I think there's a risk here whatever we do. I'm not sure that's a good enough reason to do nothing though.
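
For comparison, blocking just the unsupported versions would only need a handful of robots.txt rules, something along these lines (the EOL list here is illustrative, and how/whether pgweb would actually generate robots.txt this way is an open question):

    # Sketch only; EOL_VERSIONS is illustrative, not an authoritative list.
    EOL_VERSIONS = ["9.4", "9.3", "9.2", "9.1", "9.0", "8.4"]

    def robots_txt_eol_block():
        lines = ["User-agent: *"]
        lines += ["Disallow: /docs/%s/" % v for v in EOL_VERSIONS]
        return "\n".join(lines)

    # Produces:
    #   User-agent: *
    #   Disallow: /docs/9.4/
    #   Disallow: /docs/9.3/
    #   ...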
 
--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com
