Re: Fixing Google Search on the docs (redux) - Mailing list pgsql-www

From Dave Page
Subject Re: Fixing Google Search on the docs (redux)
Msg-id CA+OCxoy9=wJxWtEkH6j0B6pg+H35TJxx+MZoLiZm9Edd9PsNeg@mail.gmail.com
In response to Re: Fixing Google Search on the docs (redux)  (Magnus Hagander <magnus@hagander.net>)
Responses Re: Fixing Google Search on the docs (redux)  (Magnus Hagander <magnus@hagander.net>)
List pgsql-www


On Wed, Nov 18, 2020 at 5:29 PM Magnus Hagander <magnus@hagander.net> wrote:
On Wed, Nov 18, 2020 at 5:44 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
>
> On 11/18/20 11:20 AM, Dave Page wrote:
> > I was looking at our analytics data, and saw that the vast majority of
> > inbound traffic to the docs hits the 9.1 version. We've known this has
> > been an issue for years and have tried various remedies, clearly none of
> > which are working.
> >
> > Should we try an experiment for a couple of months, in which we simply
> > block anything that matches \/docs\/((\d+)|(\d\.\d))\/ in robots.txt?
> > It's a much more drastic option, but at least it might force Google into
> > indexing the latest doc version with the highest priority.
>
> If we're going down this road, I would suggest borrowing a concept from
> the Django Project documentation, which has a similar issue to ours. In
> their codebase, they use a <link> tag with rel="canonical" to point to
> the latest version of the docs on each page[1].
>
> So for example, given 3.1 is their latest release, you will find
> something similar to this:
>
> <link rel="canonical"
> href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">
>
> From a quick test of searching various Django concepts, it seems that
> the 3.1 pages tend to turn up first.
>
> Our equivalent would be "current".
>
> Jonathan
>
> [1]
> https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
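
To make the canonical-link idea concrete, here is a minimal sketch of how a versioned docs path could be mapped to its "current" equivalent. The URL layout (/docs/<version>/<page>), the base URL, and the function name canonical_link are assumptions for illustration only, not how the site is actually built.

```python
import re
from html import escape

# Matches a versioned docs path such as /docs/13/... or /docs/9.1/...
# The dot is escaped so that only a literal "." may separate the
# major and minor parts of the version.
VERSIONED_DOCS = re.compile(r"^/docs/(\d+\.\d+|\d+)/(.*)$")


def canonical_link(path, base="https://www.postgresql.org"):
    """Return a <link rel="canonical"> tag pointing at the "current" docs.

    Returns None when the path is not a versioned docs page, for example
    when it already points at /docs/current/. The base URL is an assumed
    default used only for this illustration.
    """
    match = VERSIONED_DOCS.match(path)
    if match is None:
        return None
    href = f"{base}/docs/current/{match.group(2)}"
    return f'<link rel="canonical" href="{escape(href, quote=True)}">'


if __name__ == "__main__":
    print(canonical_link("/docs/9.1/sql-select.html"))
```

Note that this sketch assumes every page exists under the same name in "current", which is exactly the assumption questioned later in this thread.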

We've discussed this many times before, and I think so far they've all
bogged down at "google suck" :) The problem is that they don't even
consider a case like ours, where the pages *aren't* identical but are
still related.

Sure, but we need to do something, regardless of whether Google suck in this case. The current situation is ridiculous; I don't remember the last time I searched on something and didn't have to click an alternate version link if I chose a result from our docs.
 

The problem it usually comes down to is that if we do that, then you
will no longer be able to, say, search for something in the old docs *at
all*. A good example right now might be that recovery.conf stuff goes
away. Even if you explicitly search for "postgresql recovery.conf 11".
And I'd guess the majority of people are actually looking for things
in versions that are NOT the latest (though an even bigger majority of
people will be looking for things in versions that are not 9.1).

The irony is that that example would be far less of an issue if we hadn't removed all the release notes for older versions (see https://www.enterprisedb.com/edb-docs/s?q=recovery.conf&c=&p=19&v=272 as an example). The older release notes would give users a hint as to where to look.

 
FWIW, I find the django example absolutely terrible -- in fact, it's a
great example of how the canonical URL handling sucks. There is AFAICT
no way to actually search for information about old versions. You have
to search for it in the new version and then hope that the same info
happens to be on the same page in an earlier version, and then
manually browse your way back to that version (also through very
annoying js popover stuff, but that's a different thing)

That is true; however, the *vast* majority of cases will be present in older versions.
 

I don't know of any way to actually tell google to prioritise the new
versions. You used to be able to do this using the sitemap.xml stuff,
which is why we do that, but at some point they just stopped caring
about those, even in the cases where we're *lowering* our own
priority, under the argument of not letting us increase our priority.
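
For reference, the sitemap approach described above can be sketched roughly as follows. This is only an illustration of the <priority> element from the sitemap protocol; the page list, version list, priority values, and base URL are hypothetical, and as noted above there is no indication that search engines still honour the value.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages, versions, current, base="https://www.postgresql.org"):
    """Build a sitemap that ranks the current docs above older versions.

    The priority values (1.0 for the current version, 0.1 for the rest)
    are arbitrary choices made for this sketch.
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for version in versions:
        priority = "1.0" if version == current else "0.1"
        for page in pages:
            url = ET.SubElement(urlset, "url")
            ET.SubElement(url, "loc").text = f"{base}/docs/{version}/{page}"
            ET.SubElement(url, "priority").text = priority
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap(["sql-select.html"], ["13", "9.1"], current="13"))
```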

It's not that what we have now for this is especially great. It might
be that going down that route is still the least bad. But we have to
make that decision while knowing this means that *nobody* will be able
to search for things in our older documentation even if they
explicitly ask for it. At all.

On public search engines. They will still be able to do so using our own site search.
 
Their only chance is to search for
something else that might hit our docs, then in that click over to the
correct version they actually asked for, and then search *again* using
our site-search and hope that it shows up there. I'm willing to bet
very few users will figure that part out...

The issue for me is that the current situation sucks for the vast majority of users, as evidenced by our analytics. If we blocked indexing of all but the current version of the docs, it would suck in the same way only for those that specifically want to look at an older version, and those that search for one of the very few things that have been removed from the latest version. In short, I think the current situation is worse.
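
As a rough sketch of what blocking everything but the current version could look like: robots.txt rules are path prefixes rather than regular expressions, so in practice this would likely mean one Disallow line per old version. The version list below is hypothetical, and the check uses Python's standard-library parser, which may not behave identically to any particular search engine's crawler.

```python
from urllib.robotparser import RobotFileParser


def build_robots_txt(old_versions):
    """Return robots.txt content that disallows each old docs version."""
    lines = ["User-agent: *"]
    lines += [f"Disallow: /docs/{version}/" for version in old_versions]
    return "\n".join(lines) + "\n"


def is_allowed(robots_txt, path):
    """Check a path against the rules using the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("*", path)


if __name__ == "__main__":
    # Hypothetical list of versions to hide from crawlers.
    rules = build_robots_txt(["9.1", "11", "12"])
    print(rules)
    print(is_allowed(rules, "/docs/9.1/sql-select.html"))
    print(is_allowed(rules, "/docs/current/sql-select.html"))
```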

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EDB: http://www.enterprisedb.com
