Re: Fixing Google Search on the docs (redux) - Mailing list pgsql-www

From Magnus Hagander
Subject Re: Fixing Google Search on the docs (redux)
Date
Msg-id CABUevEzmy02nUWdisHNCoM9-19vWE1xATinWfFgW6n-iFd3qUQ@mail.gmail.com
In response to Re: Fixing Google Search on the docs (redux)  (Dave Page <dpage@pgadmin.org>)
Responses Re: Fixing Google Search on the docs (redux)  (Dave Page <dpage@pgadmin.org>)
List pgsql-www
On Thu, Nov 19, 2020 at 10:40 AM Dave Page <dpage@pgadmin.org> wrote:
>
>
>
> On Wed, Nov 18, 2020 at 5:29 PM Magnus Hagander <magnus@hagander.net> wrote:
>>
>> On Wed, Nov 18, 2020 at 5:44 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
>> >
>> > On 11/18/20 11:20 AM, Dave Page wrote:
>> > > I was looking at our analytics data, and saw that the vast majority of
>> > > inbound traffic to the docs hits the 9.1 version. We've known this has
>> > > been an issue for years and have tried various remedies, clearly none of
>> > > which are working.
>> > >
>> > > Should we try an experiment for a couple of months, in which we simply
>> > > block anything that matches \/docs\/((\d+)|(\d.\d))\/ in robots.txt?
>> > > It's a much more drastic option, but at least it might force Google into
>> > > indexing the latest doc version with the highest priority.
>> >
>> > If we're going down this road, I would suggest borrowing a concept from
>> > the Django Project documentation, which has a similar issue to ours. In
>> > their codebase, they use a <link> tag with rel="canonical" to point to
>> > the latest version of each docs page [1].
>> >
>> > So for example, given 3.1 is their latest release, you will find
>> > something similar to this:
>> >
>> > <link rel="canonical"
>> > href="https://docs.djangoproject.com/en/3.1/ref/templates/builtins/">
>> >
>> > From a quick test of searching various Django concepts, it seems that
>> > the 3.1 pages tend to turn up first.
>> >
>> > Our equivalent would be "current".
>> >
>> > Jonathan
>> >
>> > [1]
>> > https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
>>
>> We've discussed this many times before, and I think so far those
>> discussions have all bogged down at "google suck" :) The problem is
>> that they don't even consider a case like ours, where the pages
>> *aren't* identical but are still related.
>
>
> Sure, but we need to do something, regardless of whether Google suck in this case. The current situation is
> ridiculous; I don't remember the last time I searched on something and didn't have to click an alternate version
> link if I chose a result from our docs.
>
>>
>>
>> The problem it usually comes down to is that if we do that, then you
>> will no longer be able to, say, search for something in the old docs *at
>> all*. A good example right now might be that recovery.conf stuff goes
>> away. Even if you explicitly search for "postgresql recovery.conf 11".
>> And I'd guess the majority of people are actually looking for things
>> in versions that are NOT the latest (though an even bigger majority of
>> people will be looking for things in versions that are not 9.1).
>
>
> The irony is that that example would be far less of an issue if we hadn't removed all the release notes for older
> versions (see https://www.enterprisedb.com/edb-docs/s?q=recovery.conf&c=&p=19&v=272 as an example). The older
> release notes would give users a hint as to where to look.

The release notes themselves are still available under, for example,
https://www.postgresql.org/docs/release/12.0/, so we should be able to
keep *that* searchable. So for this particular case it would at least
tell people "yeah, you're right, it used to be called recovery.conf"
when they're searching for documentation about 11 and earlier... They
still won't get to the actual documentation for it though -- but
neither does your example from EDB :)


>> FWIW, I find the django example absolutely terrible -- in fact, it's a
>> great example of how the canonical URL handling sucks. There is AFAICT
>> no way to actually search for information about old versions. You have
>> to search for it in the new version and then hope that the same info
>> happens to be on the same page in an earlier version, and then
>> manually browse your way back to that version (also through very
>> annoying js popover stuff, but that's a different thing)
>
>
> That is true; however, the *vast* majority of cases will be present in older versions.

Yes, but one could also argue that specifically the things people
search for might be the ones less likely to be present in those older
versions...


>> I don't know of any way to actually tell google to prioritise the new
>> versions. You used to be able to do this using the sitemap.xml
>> priority values, which is why we generate those, but at some point
>> they just stopped honouring them, even in cases where we're *lowering*
>> our own priority, under the argument of not letting sites inflate
>> their priority.
>>
>> It's not that what we have now for this is especially great. It might
>> be that going down that route is still the least bad. But we have to
>> make that decision while knowing this means that *nobody* will be able
>> to search for things in our older documentation even if they
>> explicitly ask for it. At all.
>
>
> On public search engines. They will still be able to using our own site search.

Yes, of course.


>> Their only chance is to search for
>> something else that might hit our docs, then from there click over to the
>> correct version they actually asked for, and then search *again* using
>> our site-search and hope that it shows up there. I'm willing to bet
>> very few users will figure that part out...
>
>
> The issue for me is that the current situation sucks for the vast majority of users, as evidenced by our analytics.
> If we blocked indexing of all but the current version of the docs, it would suck in the same way only for those that
> specifically want to look at an older version, and those that search for one of the very few things that have been
> removed from the latest version. In short, I think the current situation is worse.

Or maybe we need a level somewhere in between. Right now I bet most
people would actually want version 11 or 12, not 13. So do we need to
define a "most likely to be searched for" version as well, which would
trail the actual latest release, and point the search engines at that?
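
If we went down that route, I assume it would essentially be Jonathan's
canonical tag pointing at whatever version we pick instead of current,
roughly like this (just a sketch, using version 12 and an arbitrary
page as the example):

    <link rel="canonical"
          href="https://www.postgresql.org/docs/12/runtime-config-wal.html">

placed on the matching page in every other version.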

That said, I also agree with the suggestion to start by at least
blocking the versions that are unsupported. However, we should monitor
the results carefully so that it doesn't end up with google just
zapping *everything* -- we need them to realize the newer versions are
there. Doing the canonical-URL setup that Jonathan suggested would make
google update its index; the question is what happens if the old pages
just "go away". Do we *lose* all the existing "google power" of those
links? If so, it might be a very costly experiment...
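
To make the "block the unsupported versions" part concrete, I'm
picturing robots.txt entries roughly like this -- a sketch only, using
plain prefix rules, and the exact list of branches to block would need
checking against what's actually EOL:

    User-agent: *
    Disallow: /docs/7.
    Disallow: /docs/8.
    Disallow: /docs/9.0/
    Disallow: /docs/9.1/
    Disallow: /docs/9.2/
    Disallow: /docs/9.3/
    Disallow: /docs/9.4/

leaving /docs/current/, the supported branches and /docs/release/
untouched.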

--
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/


