Thread: once more: documentation search indexing

once more: documentation search indexing

From
Andres Freund
Date:
Hi,

in a recent twitter discussion [1] $subject again has been brought
up. Unsurprisingly - it's still awful.

It's been brought up many times before:
- https://www.postgresql.org/message-id/CA%2BOCxoyVwmmZkWUJCez2hCqa89iGv%3Dvq58NF1yQkTg9gtpkn%3Dg%40mail.gmail.com
- https://www.postgresql.org/message-id/CAHyXU0wu7w%3DOpeHtvpei4J9SAr7TTmdRJOyCWF6MRXpQcFNHGw%40mail.gmail.com
- https://www.postgresql.org/message-id/CANNMO%2B%2BkxJmaaB7X6hq_8SqcEruySZrF%3DUkcPm-EG1JCKVascw%40mail.gmail.com
- https://www.postgresql.org/message-id/38c68b83-30ae-c039-acd0-9e853997edc4@2ndquadrant.com
- https://www.postgresql.org/message-id/560614CA.1080304@mail.com
- ...

One issue around the topic is that we seem to get bogged down in finding
a perfect solution to how to present "versioned document" to google,
preventing us from making small incremental adjustments. Since it seems
unlikely that we'll get a perfect solution anytime soon (we'd have found
it already), I'd like to try to see if we can find a way to agree on
some incremental steps.

Suggested small steps:

- add a docs/current link to https://www.postgresql.org/docs/. Often
  enough that's what a user wants anyway, and it's not useful to add
  additional steps for users and search engines to navigate to
  docs/current/.

  I can see us either making it a separate row in the versioned table,
  or to split the most recent released version's link into a /current/
  and $major link.


- put version in page titles where it makes sense. E.g. change
  "PostgreSQL: Documentation: 10: 6.1. Inserting Data" to
  "PostgreSQL 10 Documentation: 6.1. Inserting Data"

  The current ordering doesn't seem like it has much going for it, and
  it can't help search engines to have the version number people might
  search for removed from the product name.

  Right now this seems to contribute to less than helpful titles in
  search engine results. Searching anonymously for "postgres alter
  table" I get the less than helpful "Documentation: 12: ALTER TABLE -
  PostgreSQL" on google.

  It might also be worth going a bit further and putting the documentation
  version *after* the page title, given that it's most likely already
  clear to the reader that this is about postgres. I.e. something like
  "ALTER TABLE - Documentation for PostgreSQL 14"


- Consider removing chapter numbers from page titles. I'd argue that the
  particular chapter number for content isn't interesting as the title. E.g.
  https://www.postgresql.org/docs/12/plpgsql-declarations.html#PLPGSQL-DECLARATION-PARAMETERS
  has a title of "PostgreSQL: Documentation: 12: 42.3. Declarations"

  (see also previous item). The 42.3 piece seems pointless in a title of
  a website - although the actual chapter name could be helpful, because
  it's not immediately obvious that the page refers to plpgsql.


- Add a meta description - even just including what we have for the
  og:description thing seems like it would often be better than what google
  is kind of forced to make up?
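
A minimal sketch of the title suggestions above (subject first, product and version kept together, section number dropped). The helper names `strip_section_number` and `build_title` are hypothetical and not part of pgweb, and the regular expression only handles purely numeric prefixes such as "6.1." or "42.3.":

```python
import re

# Hypothetical helper, not part of pgweb: matches a leading numeric
# section prefix such as "6.1. " or "42.3. ".
_SECTION_NUMBER = re.compile(r"^\d+(?:\.\d+)*\.\s+")


def strip_section_number(page_title: str) -> str:
    """Remove a leading numeric section prefix, if there is one."""
    return _SECTION_NUMBER.sub("", page_title)


def build_title(page_title: str, major_version: str) -> str:
    """Put the page subject first and keep "PostgreSQL <version>" together."""
    subject = strip_section_number(page_title)
    return f"{subject} - Documentation for PostgreSQL {major_version}"


print(build_title("42.3. Declarations", "12"))
print(build_title("ALTER TABLE", "14"))
```

This does not address the ambiguity of a bare "Declarations" title; adding the chapter name would need data that is not shown here.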


Greetings,

Andres Freund

[1] https://twitter.com/samokhvalov/status/1403410028334256128



Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
Hi,

On 6/12/21 4:29 PM, Andres Freund wrote:
> Hi,
>
> in a recent twitter discussion [1] $subject again has been brought
> up. Unsurprisingly - it's still awful.
>
> It's been brought up many times before:
> - https://www.postgresql.org/message-id/CA%2BOCxoyVwmmZkWUJCez2hCqa89iGv%3Dvq58NF1yQkTg9gtpkn%3Dg%40mail.gmail.com
> - https://www.postgresql.org/message-id/CAHyXU0wu7w%3DOpeHtvpei4J9SAr7TTmdRJOyCWF6MRXpQcFNHGw%40mail.gmail.com
> - https://www.postgresql.org/message-id/CANNMO%2B%2BkxJmaaB7X6hq_8SqcEruySZrF%3DUkcPm-EG1JCKVascw%40mail.gmail.com
> - https://www.postgresql.org/message-id/38c68b83-30ae-c039-acd0-9e853997edc4@2ndquadrant.com
> - https://www.postgresql.org/message-id/560614CA.1080304@mail.com
> - ...
>
> One issue around the topic is that we seem to get bogged down in finding
> a perfect solution to how to present "versioned document" to google,
> preventing us from making small incremental adjustments. Since it seems
> unlikely that we'll get a perfect solution anytime soon (we'd have found
> it already), I'd like to try to see if we can find a way to agree on
> some incremental steps.

Thank you for bringing this up; I applaud the suggested approach. I
will note that we have taken incremental steps over the years to
improve this, some of which I will cite below.

However, part of the issue with the "incremental steps" is ensuring that
we're following things according to the best upstream guidance.

The one element that I know that is cited as "this should work" is
setting canonical references, which I'll touch on below.

> Suggested small steps:
>
> - add a docs/current link to https://www.postgresql.org/docs/. Often
>   enough that's what a user wants anyway, and it's not useful to add
>   additional steps for users and search engines to navigate to
>   docs/current/.

We do that at the very top: that is the first link in the main body.
This was done back in Nov 2020[1]

>   I can see us either making it a separate row in the versioned table,
>   or to split the most recent released version's link into a /current/
>   and $major link.

I'm not sure if that's any different than the above right now; if there
is something you could cite around that, I'm happy to be convinced
otherwise.

However, I'm also not opposed to putting a (Current) link next to the
current version in the table. I think that'd at least be helpful from a
user perspective, if they don't click the big button up top.

> - put version in page titles where it makes sense. E.g. change
>   "PostgreSQL: Documentation: 10: 6.1. Inserting Data" to
>   "PostgreSQL 10 Documentation: 6.1. Inserting Data"
>
>   The current ordering doesn't seem like it has much going for it, and
>   it can't help search engines to have the version number people might
>   search for removed from the product name.
>
>   Right now this seem to contribute to less than helpful titles in
>   search engine results. Searching anonymously for "postgres alter
>   table" I get the less than helpful "Documentation: 12: ALTER TABLE -
>   PostgreSQL" on google.
>
>   It might also be worth to go a bit further and put the documentation
>   version *after* the page title, given that it's most likely already
>   clear to the reader that this is about postgres. I.e. something like
>   "ALTER TABLE - Documentation for PostgreSQL 14"

I think having "PostgreSQL $MAJOR_VERSION" together would help both for
some of the indexing issues + readability in the search engine. The
question is around how the content is ordered in the title.

Doing "PostgreSQL $MAJOR_VERSION: Documentation: $page_title" might be
the way to go. The other thing I see done for SEO is what you suggest, but
hyphenated, i.e. "ALTER TABLE - Documentation - PostgreSQL 14"

Anyway, I'm generally in favor of combining at least "PostgreSQL
$MAJOR_VERSION".

> - Consider removing chapter numbers from page titles. I'd argue that the
>   particular chapter number for content isn't interesting as the title. E.g.
>   https://www.postgresql.org/docs/12/plpgsql-declarations.html#PLPGSQL-DECLARATION-PARAMETERS
>   has a title of "PostgreSQL: Documentation: 12: 42.3. Declarations"

We'd likely need to do something to indicate what chapter you're in
(e.g. PL/pgSQL) given there could be multiple "Declarations" sections in
the docs.

>   (see also previous item). The 42.3 piece seems pointless in a title of
>   a website - although the actual chapter name could be helpful, because
>   it's not immediately obvious that the page refers to plpgsql.
>
>
> - Add a meta description - even just including what we have for the
>   og:description thing seems like it would often be better what google
>   is kind of forced to make up?

Not necessarily opposed, but I'd like to freshen up on some of the
modern SEO practices. I think that may help with search engine display,
but not move the needle on the indexing.

That all said, as stated and cited in some of those previous threads, I
think the biggest lift is around making our documentation URLs
canonical. After discussing with Magnus a bit, there are a few things
that we need to consider in it:

1. Whether or not the documentation page is in "current"
2. If it's not in "current", which is the last version the page is a
part of? We make that the canonical

I've attached a patch that does this. The one part I'm not sure I like
is how we treat something that is solely in "devel" -- knowing that
eventually something in devel could end up in current. Perhaps if
something is only in "devel", we exclude it from being part of the
canonical tree?

Thanks,

Jonathan

[1] https://git.postgresql.org/gitweb/?p=pgweb.git;a=commitdiff;h=24a48d2037

Attachment

Re: once more: documentation search indexing

From
Andres Freund
Date:
Hi,

On 2021-06-12 17:05:22 -0400, Jonathan S. Katz wrote:
> Thank you for bringing this up I applaud the suggestion of approach.

Glad to hear it.


> > Suggested small steps:
> > 
> > - add a docs/current link to https://www.postgresql.org/docs/. Often
> >   enough that's what a user wants anyway, and it's not useful to add
> >   additional steps for users and search engines to navigate to
> >   docs/current/.
> 
> We do that at the very top: that is the first link in the main body.
> This was done back in Nov 2020[1]

Oh - I had not realized that at all. I think the similarity to the news
bar made me completely blend the "view the manual" element out.


> >   I can see us either making it a separate row in the versioned table,
> >   or to split the most recent released version's link into a /current/
> >   and $major link.
> 
> I'm not sure if that's any different than the above right now; if there
> is something you could cite around that, I'm happy to be convinced
> otherwise.

I don't think the existing link is particularly helpful - it's just
visually too different from the other links. And doesn't indicate which
version it is for etc.


> However, I'm also not opposed to putting a (Current) link next to the
> current version in the table. I think that'd at least be helpful from a
> user perspective, if they don't click the big button up top.

Yea, I think that'd be good.


> > - put version in page titles where it makes sense. E.g. change
> >   "PostgreSQL: Documentation: 10: 6.1. Inserting Data" to
> >   "PostgreSQL 10 Documentation: 6.1. Inserting Data"
> > 
> >   The current ordering doesn't seem like it has much going for it, and
> >   it can't help search engines to have the version number people might
> >   search for removed from the product name.
> > 
> >   Right now this seem to contribute to less than helpful titles in
> >   search engine results. Searching anonymously for "postgres alter
> >   table" I get the less than helpful "Documentation: 12: ALTER TABLE -
> >   PostgreSQL" on google.
> > 
> >   It might also be worth to go a bit further and put the documentation
> >   version *after* the page title, given that it's most likely already
> >   clear to the reader that this is about postgres. I.e. something like
> >   "ALTER TABLE - Documentation for PostgreSQL 14"
> 
> I think having "PostgreSQL $MAJOR_VERSION" together would help both for
> some of the indexing issues + readability in the search engine. The
> question is around how the content is ordered. in the title.
> 
> Doing "PostgreSQL $MAJOR_VERSION: Documentation: $page_title" might be
> the way to go. The other thing I see done for SEO what you suggest, but
> just hyphenated i.e. "ALTER TABLE - Documentation - PostgreSQL 14"
> 
> Anyway, I'm generally in favor for combining at least "PostgreSQL
> $MAJOR_VERSION."

Yea, let's do that separately then.

WRT ordering, I do think I prefer the versions with the actual subject
of the page first - to distinguish between different PG doc pages
"PostgreSQL 14 Documentation" is really not helpful. I often have
multiple doc pages open in different tabs, and there's right now no way
to distinguish them, because there's never enough space for even just
"PostgreSQL 13: Documentation:", not to speak of an actual title.


> That all said, as stated and cited in some of those previous threads, I
> think the biggest lift is around making our documentation URLs
> canonical. After discussing with Magnus a bit, there are a few things
> that we need to consider in it:
> 
> 1. Whether or not the documentation page is in "current"
> 2. If it's not in "current", which is the last version the page is a
> part of? We make that the canonical

Yea, I know that's a potentially significant improvement. I just didn't
feel it's useful to wade into the topic because it's been discussed for
about a decade by now. And that there's things we could make easier
progress on...


> I've attached a patch that does this. The one part I'm not sure I like
> is how we treat something that is solely in "devel" -- knowing that
> eventually something in devel could end up in current. Perhaps if
> something is only in "devel", we exclude it from being part of the
> canonical tree?

Right now all of docs/devel is prevented from being indexed via
robots.txt:
Disallow: /docs/devel/

So it won't really matter for SEO purposes.
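
A quick way to check the effect of that rule is Python's standard `urllib.robotparser`. This is only a sketch: the `Disallow` line is the one quoted above, while the `User-agent: *` line and the example page name are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Only the Disallow line comes from this thread; the "User-agent: *"
# line is an assumption so that the fragment parses as a complete rule.
ROBOTS_TXT = """\
User-agent: *
Disallow: /docs/devel/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

base = "https://www.postgresql.org/docs"
devel_allowed = parser.can_fetch("*", f"{base}/devel/sql-altertable.html")
current_allowed = parser.can_fetch("*", f"{base}/current/sql-altertable.html")

print("devel crawlable:", devel_allowed)
print("current crawlable:", current_allowed)
```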

Greetings,

Andres Freund



Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
On 6/12/21 5:37 PM, Andres Freund wrote:

>>> - add a docs/current link to https://www.postgresql.org/docs/. Often
>>>   enough that's what a user wants anyway, and it's not useful to add
>>>   additional steps for users and search engines to navigate to
>>>   docs/current/.
>>
>> We do that at the very top: that is the first link in the main body.
>> This was done back in Nov 2020[1]
>
> Oh - I had not realized that at all. I think the similarity to the news
> bar made me completely blend the "view the manual" element out.

Yeah, there may be a UX tweak we can do there. However, given the length
of time we've had a "current" doc link at the top of the content bar,
I'm skeptical that it's helped the SEO problem.

>> However, I'm also not opposed to putting a (Current) link next to the
>> current version in the table. I think that'd at least be helpful from a
>> user perspective, if they don't click the big button up top.
>
> Yea, I think that'd be good.

Cool. This is a trivial add.

>>> - put version in page titles where it makes sense. E.g. change
>>>   "PostgreSQL: Documentation: 10: 6.1. Inserting Data" to
>>>   "PostgreSQL 10 Documentation: 6.1. Inserting Data"
>>>
>>>   The current ordering doesn't seem like it has much going for it, and
>>>   it can't help search engines to have the version number people might
>>>   search for removed from the product name.
>>>
>>>   Right now this seem to contribute to less than helpful titles in
>>>   search engine results. Searching anonymously for "postgres alter
>>>   table" I get the less than helpful "Documentation: 12: ALTER TABLE -
>>>   PostgreSQL" on google.
>>>
>>>   It might also be worth to go a bit further and put the documentation
>>>   version *after* the page title, given that it's most likely already
>>>   clear to the reader that this is about postgres. I.e. something like
>>>   "ALTER TABLE - Documentation for PostgreSQL 14"
>>
>> I think having "PostgreSQL $MAJOR_VERSION" together would help both for
>> some of the indexing issues + readability in the search engine. The
>> question is around how the content is ordered. in the title.
>>
>> Doing "PostgreSQL $MAJOR_VERSION: Documentation: $page_title" might be
>> the way to go. The other thing I see done for SEO what you suggest, but
>> just hyphenated i.e. "ALTER TABLE - Documentation - PostgreSQL 14"
>>
>> Anyway, I'm generally in favor for combining at least "PostgreSQL
>> $MAJOR_VERSION."
>
> Yea, let's do that separately then.
>
> WRT ordering, I do think I prefer the versions with the actual subject
> of the page first - to distinguish between different PG doc pages
> "PostgreSQL 14 Documentation" is really not helpful. I often have
> multiple doc pages open in different tabs, and there's right now no way
> to distinguish them, because there's never enough space for even just
> "PostgreSQL 13: Documentation:", not to speak of an actual title.

...but that's why you can hover your mouse over the tab, and the full
title appears! ;)

That said, there are SEO ramifications to the ordering of the content in
the <title> tag :) This one is a tricky balance.

I'd suggest starting with

    "PostgreSQL $MAJOR_VERSION: Documentation: $page_title"

and see how we do with that.

>> That all said, as stated and cited in some of those previous threads, I
>> think the biggest lift is around making our documentation URLs
>> canonical. After discussing with Magnus a bit, there are a few things
>> that we need to consider in it:
>>
>> 1. Whether or not the documentation page is in "current"
>> 2. If it's not in "current", which is the last version the page is a
>> part of? We make that the canonical
>
> Yea, I know that's a potentially significant improvement. I just didn't
> feel it's useful to wade into the topic because it's been discussed for
> about a decade by now. And that there's things we could make easier
> progress on...

I think we're at the point where "all else has failed, so let's do this."

>> I've attached a patch that does this. The one part I'm not sure I like
>> is how we treat something that is solely in "devel" -- knowing that
>> eventually something in devel could end up in current. Perhaps if
>> something is only in "devel", we exclude it from being part of the
>> canonical tree?
>
> Right now all of docs/devel is prevented from being indexed via
> robots.txt:
> Disallow: /docs/devel/
>
> So it won't really matter for SEO purposes.

Thanks for pointing that out, I hadn't checked robots.txt in a while.

So this simplifies the patch a bit, i.e. we will not show a canonical
URL on the devel pages.

Updated patch to account for that. I also included a change to the docs
index page to show which one is "current" in the table.
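
The selection rule discussed in this thread can be sketched roughly as follows. This is not the attached patch: `DocVersion` and `canonical_version` are hypothetical names, and representing "devel" as tree 0 is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DocVersion:
    """Hypothetical stand-in for one version in which a doc page exists."""
    tree: float            # major version, e.g. 13.0; 0 stands in for "devel"
    current: bool = False  # True for the current released version


def canonical_version(versions: Sequence[DocVersion]) -> Optional[str]:
    """Return the URL component to mark as canonical, or None for no tag."""
    # "devel" is disallowed in robots.txt, so it never becomes canonical.
    released = [v for v in versions if v.tree > 0]
    if not released:
        return None
    # Rule 1: the page exists in the current version.
    if any(v.current for v in released):
        return "current"
    # Rule 2: otherwise use the last version the page was part of.
    newest = max(released, key=lambda v: v.tree)
    return f"{newest.tree:g}"


print(canonical_version([DocVersion(12.0), DocVersion(13.0, current=True)]))
print(canonical_version([DocVersion(9.5), DocVersion(9.6)]))
print(canonical_version([DocVersion(0)]))
```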

Jonathan

Attachment

Re: once more: documentation search indexing

From
Magnus Hagander
Date:
On Sun, Jun 13, 2021 at 2:41 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
>
> On 6/12/21 5:37 PM, Andres Freund wrote:
>
> >>> - add a docs/current link to https://www.postgresql.org/docs/. Often
> >>>   enough that's what a user wants anyway, and it's not useful to add
> >>>   additional steps for users and search engines to navigate to
> >>>   docs/current/.
> >>
> >> We do that at the very top: that is the first link in the main body.
> >> This was done back in Nov 2020[1]
> >
> > Oh - I had not realized that at all. I think the similarity to the news
> > bar made me completely blend the "view the manual" element out.
>
> Yeah, there may be a UX tweak we can do there. However, given the length
> of time we've had a "current" doc link at the top of the content bar,

I agree with the UX tweak being needed -- I've missed that button at
least a couple of times myself, and I am the one who put it there :)


> I'm skeptical that it's helped the SEO problem.

Agreed, on this particular one (obviously not on the general SEO
problem, that part is pretty obvious)



> >> However, I'm also not opposed to putting a (Current) link next to the
> >> current version in the table. I think that'd at least be helpful from a
> >> user perspective, if they don't click the big button up top.
> >
> > Yea, I think that'd be good.
>
> Cool. This is a trivial add.

WFM. +1.


> >>> - put version in page titles where it makes sense. E.g. change
> >>>   "PostgreSQL: Documentation: 10: 6.1. Inserting Data" to
> >>>   "PostgreSQL 10 Documentation: 6.1. Inserting Data"
> >>>
> >>>   The current ordering doesn't seem like it has much going for it, and
> >>>   it can't help search engines to have the version number people might
> >>>   search for removed from the product name.
> >>>
> >>>   Right now this seem to contribute to less than helpful titles in
> >>>   search engine results. Searching anonymously for "postgres alter
> >>>   table" I get the less than helpful "Documentation: 12: ALTER TABLE -
> >>>   PostgreSQL" on google.
> >>>
> >>>   It might also be worth to go a bit further and put the documentation
> >>>   version *after* the page title, given that it's most likely already
> >>>   clear to the reader that this is about postgres. I.e. something like
> >>>   "ALTER TABLE - Documentation for PostgreSQL 14"
> >>
> >> I think having "PostgreSQL $MAJOR_VERSION" together would help both for
> >> some of the indexing issues + readability in the search engine. The
> >> question is around how the content is ordered. in the title.
> >>
> >> Doing "PostgreSQL $MAJOR_VERSION: Documentation: $page_title" might be
> >> the way to go. The other thing I see done for SEO what you suggest, but
> >> just hyphenated i.e. "ALTER TABLE - Documentation - PostgreSQL 14"
> >>
> >> Anyway, I'm generally in favor for combining at least "PostgreSQL
> >> $MAJOR_VERSION."
> >
> > Yea, let's do that separately then.
> >
> > WRT ordering, I do think I prefer the versions with the actual subject
> > of the page first - to distinguish between different PG doc pages
> > "PostgreSQL 14 Documentation" is really not helpful. I often have
> > multiple doc pages open in different tabs, and there's right now no way
> > to distinguish them, because there's never enough space for even just
> > "PostgreSQL 13: Documentation:", not to speak of an actual title.

So we actually have a ticket in that ticket system that nobody uses,
that's been there since 2012, so it even carried over from the old
layout :)

That one deals with the generic case of our pages being "PostgreSQL:
<title>", which leads to every tab you open basically saying
"PostgreSQL" and nothing more. The docs are even more affected by
that, but we should consider this across the whole site, and not just
the docs.

As for the suggested titles above, I'd vote for "ALTER TABLE -
Documentation - PostgreSQL 14" or "ALTER TABLE - PostgreSQL 14
Documentation". In a way "documentation" feels redundant there while
you're on the site, but it is what goes in the links on search hits
and I think it's definitely valuable to see specifically that it's
documentation there.


> ...but that's why you can hover your mouse of the tab, and the full
> title appears! ;)

Yeah, but that's a workaround of course :)


> That said, there are SEO ramifications to the ordering of the content in
> the <title> tag :) This one is a tricky balance.

This one I cannot comment on, I know nothing about it. Other than that
it's important :)


> I'd suggest starting with
>
>     "PostgreSQL $MAJOR_VERSION: Documentation: $page_title"
>
> and see how we do with that.
>
> >> That all said, as stated and cited in some of those previous threads, I
> >> think the biggest lift is around making our documentation URLs
> >> canonical. After discussing with Magnus a bit, there are a few things
> >> that we need to consider in it:
> >>
> >> 1. Whether or not the documentation page is in "current"
> >> 2. If it's not in "current", which is the last version the page is a
> >> part of? We make that the canonical
> >
> > Yea, I know that's a potentially significant improvement. I just didn't
> > feel it's useful to wade into the topic because it's been discussed for
> > about a decade by now. And that there's things we could make easier
> > progress on...
>
> I think we're at the point where "all else has failed, so let's do this."


Yes.

It can get worse. But not all that much worse...


> >> I've attached a patch that does this. The one part I'm not sure I like
> >> is how we treat something that is solely in "devel" -- knowing that
> >> eventually something in devel could end up in current. Perhaps if
> >> something is only in "devel", we exclude it from being part of the
> >> canonical tree?
> >
> > Right now all of docs/devel is prevented from being indexed via
> > robots.txt:
> > Disallow: /docs/devel/
> >
> > So it won't really matter for SEO purposes.
>
> Thanks for pointing that out, I hadn't checked robots.txt in awhile.
>
> So this simplifies the patch a bit, i.e. we will not show a canonical
> URL on the devel pages.
>
> Updated patch to account for that. I also included a change to the docs
> index page to show which one is "current" in the table.

Absolute nitpick, but isn't:
+    if len(list(filter(lambda v: v.version.current, versions))):
cleaner written as:
if any(filter(lambda v: v.version.current, versions)):
?

And the loop thereafter I think can just be:
version_max = max(versions, key=lambda v: v.tree)
?
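
A small self-contained check of the two suggestions, using `SimpleNamespace` objects as hypothetical stand-ins for the objects in the patch (only the attribute names `version.current` and `tree` are taken from the lines above). Note that `any(filter(...))` relies on the filtered objects themselves being truthy; a generator expression avoids that assumption.

```python
from types import SimpleNamespace


def make(tree, current):
    """Build a stand-in object exposing .tree and .version.current."""
    return SimpleNamespace(tree=tree, version=SimpleNamespace(current=current))


versions = [make(12, False), make(13, True), make(11, False)]

# Form used in the patch: materialize the filtered list and test its length.
original = bool(len(list(filter(lambda v: v.version.current, versions))))

# Suggested form: stops at the first match.
suggested = any(filter(lambda v: v.version.current, versions))

# Alternative that tests the flag itself rather than the object.
generator = any(v.version.current for v in versions)

# Replacement for the explicit loop that finds the newest version.
version_max = max(versions, key=lambda v: v.tree)

print(original, suggested, generator, version_max.tree)
```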


I do think the bigger question is if we want the actual /current/ URL
to be the canonical one, rather than the /<version number of current
version>/?

I would've guessed that's better? But again I don't really know, that's
a guess, so if it was considered and rejected for good reasons then
ignore that comment :)

-- 
 Magnus Hagander
 Me: https://www.hagander.net/
 Work: https://www.redpill-linpro.com/



Re: once more: documentation search indexing

From
Michael Christofides
Date:
Sorry to resurrect an old thread, this seemed so close to going through.

> I agree with the UX tweak being needed -- I've missed that button at
> least a couple of times myself, and I am the one who put it there :)

+1 for changing this, I also never saw the button somehow.

> I do think the bigger question is if we want the actual /current/ URL
> to be the canonical one, rather than the /<version number of current
> version>/?
>
> I would've guessed that's better? But again I don't really know, that's
> a guess, so if it was considered and rejected for good reasons then
> ignore that comment :)

I think this is important, but that either would be a big improvement over the status quo. I think there are a couple of advantages of going with /current/ as the canonical URL though:

1. Search engines factor in being told what is canonical, but it is only one factor they consider[0], so I think there'll be benefits of the URL we mark as canonical not changing every year (although links that are only one major version old would be a lot better than the status quo).

2. It would make it more common for people to link back to the /current/ URLs on Stack Overflow, in blog posts, and similar. In the vast majority of cases this will improve the experience for folks following those links in future, and it will also help search engines be confident that the /current/ version is the canonical one.

Having said that, I'd favour pushing the proposed patch over doing nothing, as it will still be a big improvement! Thanks for all the work on this so far. 


Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
On 3/16/22 6:32 AM, Michael Christofides wrote:
> Sorry to resurrect an old thread, this seemed so close to going through.
> 
>     I agree with the UX tweak being needed -- I've missed that button a t
>     least a couple of times myself, and I am the one who put it there :)
> 
> +1 for changing this, I also never saw the button somehow.
> 
>     I do think the bigger question is if we want the actual /current/ URL
>     to be the canonical one, rather than the /<version number of current
>     version>/?
> 
>     I would've guessed that's better? But again I don't realy know, that's
>     a guess, so if it was considered and rejected for good reasons then
>     ignore that comment :)
> 
> 
> I think this is important, but that either would be a big improvement 
> over the status quo. I think there are a couple of advantages of going 
> with /current/ as the canonical URL though:
> 
> 1. Search engines factor in being told what is canonical, but it is only 
> one factor they consider[0], so I think there'll be benefits of the URL 
> we mark as canonical not changing every year (although links that are 
> only one major version old would be a lot better than the status quo).
> 
> 2. It would make it more common for people to link back to the /current/ 
> URLs on Stack Overflow, in blog posts, and similar. In the vast majority 
> of cases this will improve the experience for folks following those 
> links in future, and it will also help search engines be confident that 
> the /current/ version is the canonical one.
> 
> Having said that, I'd favour pushing the proposed patch over doing 
> nothing, as it will still be a big improvement! Thanks for all the work 
> on this so far.

If there is consensus on this approach, it's been ready for a while and 
collecting dust[1]. I'm OK with pushing it -- I've used this before in a 
few different situations and have pushed for this method for a few years 
-- but I want to ensure the other folks on the web team are comfortable 
or at least willing to try it out and see.

Thanks,

Jonathan

[1] 
https://www.postgresql.org/message-id/6cfa15b4-77a5-0961-5168-7d191989ff73%40postgresql.org

Attachment

Re: once more: documentation search indexing

From
Peter Geoghegan
Date:
On Wed, Mar 16, 2022 at 7:19 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> If there is consensus on this approach, it's been ready for awhile and
> collecting dust[1]. I'm OK with pushing it -- I've used this before in a
> few different situations and have pushed for this method for a few years
> -- but I want to ensure the other folks on the web team are comfortable
> or at least willing to try it out and see.

A thread that I came across on Twitter recently:

https://twitter.com/laurencerowe/status/1484322796863836160

Simon Willison is a co-creator of Django.

The thread links to:

https://til.simonwillison.net/readthedocs/documentation-seo-canonical

See also:

https://docs.readthedocs.io/en/latest/custom_domains.html#canonical-urls

I'm not an expert on Webdev by any means, but I will say this: if
there is a de facto Django-ecosystem solution for this exact problem
with documentation SEO (which this arguably is), then why wouldn't we
use it?

-- 
Peter Geoghegan



Re: once more: documentation search indexing

From
Daniel Gustafsson
Date:
> On 16 Mar 2022, at 15:19, Jonathan S. Katz <jkatz@postgresql.org> wrote:

>  -- but I want to ensure the other folks on the web team are comfortable or at
>  least willing to try it out and see.


I won't oppose trying it.

--
Daniel Gustafsson        https://vmware.com/




Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
On 3/17/22 9:42 AM, Daniel Gustafsson wrote:
>> On 16 Mar 2022, at 15:19, Jonathan S. Katz <jkatz@postgresql.org> wrote:
> 
>>   -- but I want to ensure the other folks on the web team are comfortable or at
>>   least willing to try it out and see.
> 
> 
> I won't oppose to trying it.

So let's timebox this. If there are no objections, by Mon, Mar 21, I 
will apply this version of the patch (attached), pending any additional 
feedback or review.

Thanks,

Jonathan

Attachment

Re: once more: documentation search indexing

From
Peter Geoghegan
Date:
On Thu, Mar 17, 2022 at 11:42 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> So let's timebox this. If there are no objections, by Mon, Mar 21, I
> will apply this version of the patch (attached), pending any additional
> feedback or review.

Great news! I think that this will noticeably improve the situation.

Thanks
-- 
Peter Geoghegan



Re: once more: documentation search indexing

From
Andres Freund
Date:
Hi,

On 2022-03-17 14:41:47 -0400, Jonathan S. Katz wrote:
> On 3/17/22 9:42 AM, Daniel Gustafsson wrote:
> > > On 16 Mar 2022, at 15:19, Jonathan S. Katz <jkatz@postgresql.org> wrote:
> > 
> > >   -- but I want to ensure the other folks on the web team are comfortable or at
> > >   least willing to try it out and see.
> > 
> > 
> > I won't oppose to trying it.
> 
> So let's timebox this. If there are no objections, by Mon, Mar 21, I will
> apply this version of the patch (attached), pending any additional feedback
> or review.

Cool! I think a shorter waiting period would also be fine, we've been stuck on
this forever and we can change course if we find out the rel=canon approach
doesn't work.

Don't make much out of the review comments below, this is not my area of
expertise...

> diff --git a/pgweb/docs/views.py b/pgweb/docs/views.py
> index c2d00c8..162776f 100644
> --- a/pgweb/docs/views.py
> +++ b/pgweb/docs/views.py
> @@ -120,11 +120,30 @@ def docpage(request, version, filename):
>      else:
>          contentpreview = ''
>  
> +    # determine the canonical version of the page
> +    # if the doc page is in the current version, then we set it to current
> +    # otherwise, check the supported and unsupported versions and find the
> +    # last version that the page appeared
> +    # we exclude "devel" as development docs are disallowed in robots.txt

Not related to this, but I think we should change this at some point. It's
nice to be able to find a documentation page for a new tool.



> diff --git a/templates/docs/docspage.html b/templates/docs/docspage.html
> index f5f3e3b..7a4e2fc 100644
> --- a/templates/docs/docspage.html
> +++ b/templates/docs/docspage.html
> @@ -27,6 +27,9 @@
>    {%endif%}
>      <link rel="stylesheet" type="text/css" href="/dyncss/base.css?{{gitrev}}">
>    {%block extrahead%}{%endblock%}
> +  {% if canonical_version %}
> +    <link rel="canonical" href="https://www.postgresql.org/docs/{{ canonical_version }}/{{ ver.file }}" />
> +  {% endif %}
>    </head>

What's the reason to put this after extrahead, rather than before?


> diff --git a/templates/docs/index.html b/templates/docs/index.html
> index cfcc2f8..63e4559 100644
> --- a/templates/docs/index.html
> +++ b/templates/docs/index.html
> @@ -27,6 +27,9 @@
>      <tr>
>       <td>
>        <a href="/docs/{{v.numtree}}/index.html">{{v.treestring}}</a>
> +    {% if v.current %}
> +      (<a href="/docs/current/index.html">Current</a>)
> +    {% endif %}
>       </td>
>       <td>
>        {%if v.a4pdf or v.uspdf%}

So this is just going to a separate link for the html docs, not the pdf
docs. Which do not seem to be available under a 'current' style link anyway? I
guess that's good enough for now...

Perhaps some non-link visual separation between e.g. "14" and "current" would
make sense? Even just a " / " might help. Otherwise it might not be obvious
that they're different link targets.

Greetings,

Andres Freund



Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
On 3/17/22 3:15 PM, Andres Freund wrote:

>> diff --git a/pgweb/docs/views.py b/pgweb/docs/views.py
>> index c2d00c8..162776f 100644
>> --- a/pgweb/docs/views.py
>> +++ b/pgweb/docs/views.py
>> @@ -120,11 +120,30 @@ def docpage(request, version, filename):
>>       else:
>>           contentpreview = ''
>>   
>> +    # determine the canonical version of the page
>> +    # if the doc page is in the current version, then we set it to current
>> +    # otherwise, check the supported and unsupported versions and find the
>> +    # last version that the page appeared
>> +    # we exclude "devel" as development docs are disallowed in robots.txt
> 
> Not related to this, but I think we should change this at some point. It's
> nice to be able to find a documentation page for a new tool.

This is indeed a whole separate thread ;)

>> diff --git a/templates/docs/docspage.html b/templates/docs/docspage.html
>> index f5f3e3b..7a4e2fc 100644
>> --- a/templates/docs/docspage.html
>> +++ b/templates/docs/docspage.html
>> @@ -27,6 +27,9 @@
>>     {%endif%}
>>       <link rel="stylesheet" type="text/css" href="/dyncss/base.css?{{gitrev}}">
>>     {%block extrahead%}{%endblock%}
>> +  {% if canonical_version %}
>> +    <link rel="canonical" href="https://www.postgresql.org/docs/{{ canonical_version }}/{{ ver.file }}" />
>> +  {% endif %}
>>     </head>
> 
> What's the reason to put this after extrahead, rather than before?

No reason, but according to this article from 2013[1], it should be 
further up in <head>. I'll adjust.
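
As a rough illustration of that placement concern, here is a minimal sketch (not part of the patch) that scans a page for rel="canonical" links and records whether each one sits inside <head>. The class and function names are hypothetical, and it uses only Python's standard html.parser module.

```python
from html.parser import HTMLParser


class CanonicalFinder(HTMLParser):
    """Collect rel="canonical" hrefs and note whether each is inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.found = []  # list of (href, inside_head) tuples

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link":
            attr = dict(attrs)
            # rel may hold several space-separated tokens
            rels = (attr.get("rel") or "").lower().split()
            if "canonical" in rels:
                self.found.append((attr.get("href"), self.in_head))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_canonical(html):
    """Return every canonical link in `html` as (href, inside_head)."""
    finder = CanonicalFinder()
    finder.feed(html)
    return finder.found
```

A page that passes this check has exactly one entry, with the second element True; a canonical link found in the body is the mistake the linked article warns about.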

When testing, I also noticed I had a wrong reference there for 
generating the page title; this version fixes it.

>> diff --git a/templates/docs/index.html b/templates/docs/index.html
>> index cfcc2f8..63e4559 100644
>> --- a/templates/docs/index.html
>> +++ b/templates/docs/index.html
>> @@ -27,6 +27,9 @@
>>       <tr>
>>        <td>
>>         <a href="/docs/{{v.numtree}}/index.html">{{v.treestring}}</a>
>> +    {% if v.current %}
>> +      (<a href="/docs/current/index.html">Current</a>)
>> +    {% endif %}
>>        </td>
>>        <td>
>>         {%if v.a4pdf or v.uspdf%}
> 
> So this is just going to a separate link for the html docs, not the pdf
> docs. Which do not seem to be available under a 'current' style link anyway? I
> guess that's good enough for now...

Ugh this patch had some dust. I think that may have been part of the 
proposal to either a) have more links to "current" on the docs index 
page and/or b) make it clear which version is "current" in the list.

> Perhaps some non-link visual separation between e.g. "14" and "current" would
> make sense? Even just a " / " might help. Otherwise it might not be obvious
> that they're different link targets.

...I'd even be OK with removing it, but it's also one of those things 
that's easy enough to change, so trying that out.

Attached another version.

Jonathan

[1] https://developers.google.com/search/blog/2013/04/5-common-mistakes-with-relcanonical#mistake-5:-rel=canonical-in-the-body


Attachment

Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
On 3/17/22 10:20 PM, Jonathan S. Katz wrote:

> Attached another version.

I double-checked the docs on "rel=canonical"[1], looked over the patch, 
and tested this out on 30+ pages locally. I verified that a removed page 
uses the last version in which it appeared as its canonical URL[2].
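
The selection rule described in the patch comments can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the patch: the function name, the dict keys ("tree", "current", "files"), and the idea of passing in a list of versions are all hypothetical, and "devel" is assumed to have been filtered out beforehand.

```python
def canonical_version(filename, versions):
    """Pick the docs version to use in the canonical URL for `filename`.

    `versions` is a list of dicts, one per released version, with the
    hypothetical keys "tree" (e.g. "14" or "9.6"), "current" (bool) and
    "files" (set of page filenames present in that version).
    """
    def sort_key(v):
        # "9.6" -> (9, 6), "14" -> (14,), so newer versions sort higher
        return tuple(int(part) for part in v["tree"].split("."))

    having_page = [v for v in versions if filename in v["files"]]
    if not having_page:
        return None  # page unknown everywhere: emit no canonical link
    if any(v["current"] for v in having_page):
        return "current"
    # page was removed at some point: use the last version that had it
    return max(having_page, key=sort_key)["tree"]
```

With data shaped like this, a page that still exists in the current version maps to "current", while a page such as tsearch2.html, which last appeared in 9.6, maps to "9.6".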

I believe this is ready to push. I'll do so Monday morning EDT.

Thanks,

Jonathan

[1] https://developers.google.com/search/docs/advanced/crawling/consolidate-duplicate-urls
[2] https://www.postgresql.org/docs/9.6/tsearch2.html

Attachment

Re: once more: documentation search indexing

From
Michael Christofides
Date:
 
> I believe this is ready to push. I'll do so Monday morning EDT.

Amazing, thank you!

Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
On 3/21/22 8:48 AM, Michael Christofides wrote:
>     I believe this is ready to push. I'll do so Monday morning EDT.
> 
> 
>   Amazing, thank you!

I've pushed this and done some cache flushing, so the changes should be 
out there.

The Google docs don't say how long it will take for it to recognize the 
changes, so we should check on how this is doing closely at first, and 
then periodically.

I would still like to tackle page titles / SEO, but I recall that, even 
with agreeing on an approach, there is some trickiness to that. Perhaps 
it's worth looking at it again after we see how this experiment plays out.

Thanks,

Jonathan


Re: once more: documentation search indexing

From
Peter Geoghegan
Date:
On Mon, Mar 21, 2022 at 8:37 AM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> On 3/21/22 8:48 AM, Michael Christofides wrote:
> >     I believe this is ready to push. I'll do so Monday morning EDT.
> >
> >
> >   Amazing, thank you!

+1

> I've pushed this and done some cache flushing, so the changes should be
> out there.

Looks like it's started to work already, at least with Google, at
least for me. I notice that Google now links to the version 14 docs
for certain likely-common search terms, like "create table
postgresql". It still won't provide search results that link to the
latest doc version for less common terms, such as "alter table
postgresql", but I suspect that it's only a matter of time (plus all
the usual caveats apply).

I'll be keeping an eye on it. Apparently it can take days or even
weeks for this kind of change to be reflected in their search results.

-- 
Peter Geoghegan



Re: once more: documentation search indexing

From
Tom Lane
Date:
Peter Geoghegan <pg@bowt.ie> writes:
> I'll be keeping an eye on it. Apparently it can take days or even
> weeks for this kind of change to be reflected in their search results.

I'm surprised that you noticed any difference yet at all.  They can't
be crawling the whole web every day, or even every week.

            regards, tom lane



Re: once more: documentation search indexing

From
Peter Geoghegan
Date:
On Mon, Mar 21, 2022 at 7:02 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I'm surprised that you noticed any difference yet at all.  They can't
> be crawling the whole web every day, or even every week.

I don't think it's all that surprising. Apparently they emphasize
recency these days, particularly for anything connected with an
unfolding news story. (I'm not sure that that's entirely a good thing,
but that's another conversation.)

-- 
Peter Geoghegan



Re: once more: documentation search indexing

From
Michael Christofides
Date:
 
> Looks like it's started to work already, at least with Google, at
> least for me. I notice that Google now links to the version 14 docs
> for certain likely-common search terms, like "create table
> postgresql".

I am also now seeing the version 14 documentation for many similar searches on Google, and specifically the /current/ URL, which makes me think it is a direct result of this change. For example: "pg_stat_statements", "auto_explain", "postgres create index", and "postgres explain analyze" all now link to the /current/ URL too.

There are some that still link to old versions (again, for me, on Google), including "postgres create view" (to version 9.2), and "postgres gin index" (to version 9.1). I'm pleasantly surprised how many have already flipped though.

Thanks again to all involved!

Michael

Re: once more: documentation search indexing

From
Magnus Hagander
Date:


On Tue, Mar 29, 2022 at 12:33 PM Michael Christofides <michael@pgmustard.com> wrote:
 
> > Looks like it's started to work already, at least with Google, at
> > least for me. I notice that Google now links to the version 14 docs
> > for certain likely-common search terms, like "create table
> > postgresql".
>
> I am also now seeing the version 14 documentation for many similar searches on Google, and specifically the /current/ URL, which makes me think it is a direct result of this change. For example: "pg_stat_statements", "auto_explain", "postgres create index", and "postgres explain analyze" all now link to the /current/ URL too.
>
> There are some that still link to old versions (again, for me, on Google), including "postgres create view" (to version 9.2), and "postgres gin index" (to version 9.1). I'm pleasantly surprised how many have already flipped though.
>
> Thanks again to all involved!



Sadly, we do seem to have lost the ability to find old versions. Searching for example for "pg_stat_statements 9.6" for me still gives the 14 version, and the 9.6 version only in Russian off the pgpro site.
 
It is of course another sign that it is working -- because our 9.6 docs do say "I don't exist, go look at 14 instead", and that's what Google does.

It would've been nice if there was a way around that, but I think it's just the price we have to pay. Those that want the 9.6 docs will have to start from current and add an extra click.

--

Re: once more: documentation search indexing

From
Daniel Gustafsson
Date:
> On 29 Mar 2022, at 12:57, Magnus Hagander <magnus@hagander.net> wrote:

> Sadly, we do seem to have lost the ability to find old versions. Searching for example for "pg_stat_statements 9.6" for me still gives the 14 version, and the 9.6 version only in Russian off the pgpro site.

FWIW, searching for "pg_stat_statements 9.6" on DuckDuckGo correctly links to
the 9.6 docs but searching for just "pg_stat_statements" links to the 9.4 docs.

--
Daniel Gustafsson        https://vmware.com/




Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
On 3/29/22 7:30 AM, Daniel Gustafsson wrote:
>> On 29 Mar 2022, at 12:57, Magnus Hagander <magnus@hagander.net> wrote:
> 
>> Sadly, we do seem to have lost the ability to find old versions. Searching for example for "pg_stat_statements 9.6" for me still gives the 14 version, and the 9.6 version only in Russian off the pgpro site.
> 
> FWIW, searching for "pg_stat_statements 9.6" on DuckDuckGo correctly links to
> the 9.6 docs but searching for just "pg_stat_statements" links to the 9.4 docs.

Some aggregate stats since launch (Mar 21).

I looked over a period from Mar 21 - Apr 5 vs. Mar 5 - Mar 20. There was 
a marked shift in the data starting Mar 28, which is when it looks like 
the first big reindexing took place.

Here are the changes in page views for each doc version:

current: + 184.05%
14:      +   7.84%
13:      -  12.46%
12:      -   4.92%
11:      -   0.46%
10:      -  11.38%
9.6:     -   2.60%
9.5:     -  30.32%
9.4:     -  15.02%
9.3:     -  22.93%
9.2:     -  21.60%
9.1:     -  33.10%
9.0:     -  27.80%
8.4:     -  10.24%

So it appears that there is a traffic shift to the most recent docs 
based on the rel=canonical change.
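
For anyone repeating this comparison, the figures above are presumably relative changes in page views between the two periods. A minimal sketch of that arithmetic follows; the function name and the counts in the usage note are hypothetical, since the raw page-view numbers were not posted.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, expressed in percent."""
    if before == 0:
        raise ValueError("no baseline traffic to compare against")
    return (after - before) / before * 100.0
```

For example, a version whose page views went from a made-up 2000 in the earlier period to 1500 in the later one would show as -25.0.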

The next analysis will be to see how much people are clicking into an 
older version of the docs after they land on a particular page, and if 
so what versions.

Thanks,

Jonathan


Re: once more: documentation search indexing

From
Peter Geoghegan
Date:
On Tue, Apr 5, 2022 at 6:33 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> I looked over a period from Mar 21 - Apr 5 vs. Mar 5 - Mar 20. There was
> a marked shift in the data starting Mar 28, which is when it looks like
> the first big reindexing took place.

Unsurprising, given the big changes to the search results. While this
change is not without its downsides, it's a far better overall
trade-off IMV. Thanks for working on this.

-- 
Peter Geoghegan



Re: once more: documentation search indexing

From
Michael Christofides
Date:
 
> Sadly, we do seem to have lost the ability to find old versions. Searching for example for "pg_stat_statements 9.6" for me still gives the 14 version, and the 9.6 version only in Russian off the pgpro site.
 
I remember this concern, and it's a shame there isn't a good solution to both.

Another concern that was raised previously was whether we'd lose the reputation that had been built up, but I'm still seeing the docs in the top search position almost all of the time. It would be good to check whether overall traffic, or average position, are down at all though.

> So it appears that there is a traffic shift to the most recent docs
> based on the rel=canonical change.

Great to see the numbers, thanks again! 

Michael Christofides

Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
On 4/13/22 1:47 PM, Michael Christofides wrote:

> Another concern that was raised previously was whether we'd lose the 
> reputation that had been built up, but I'm still seeing the docs in the 
> top search position almost all of the time. It would be good to check 
> whether overall traffic, or average position, are down at all though.

I can look at overall traffic to docs. From a period from Mar 21 - Apr 
13, traffic to docs is up 6.78% vs. Feb 25 - Mar 20. So it seems OK at a 
high level.

Jonathan


Re: once more: documentation search indexing

From
Peter Geoghegan
Date:
On Wed, Apr 13, 2022 at 8:19 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
> I can look at overall traffic to docs. From a period from Mar 21 - Apr
> 13, traffic to docs is up 6.78% vs. Feb 25 - Mar 20. So it seems OK at a
> high level.

Have you thought about using robots.txt to forbid Google from indexing
versions of Postgres that are now out of support?

Perhaps that is an overly aggressive approach, but it seems worth considering.

-- 
Peter Geoghegan



Re: once more: documentation search indexing

From
Daniel Gustafsson
Date:
> On 14 Apr 2022, at 05:35, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Wed, Apr 13, 2022 at 8:19 PM Jonathan S. Katz <jkatz@postgresql.org> wrote:
>> I can look at overall traffic to docs. From a period from Mar 21 - Apr
>> 13, traffic to docs is up 6.78% vs. Feb 25 - Mar 20. So it seems OK at a
>> high level.
>
> Have you thought about using robots.txt to forbid Google from indexing
> versions of Postgres that are now out of support?

robots.txt won't keep Google from indexing the page, if it's linked to from
anywhere on the web it will still appear in the index and search results.
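
To make the distinction concrete: robots.txt only answers the question "may this URL be fetched?", which is separate from whether a URL may appear in an index. The sketch below uses Python's standard urllib.robotparser; the rules shown are illustrative assumptions, not a copy of the real postgresql.org robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only; not copied from the real postgresql.org robots.txt.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /docs/devel/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# Crawling permission for two docs URLs under the example rules.
devel_ok = parser.can_fetch(
    "Googlebot", "https://www.postgresql.org/docs/devel/index.html"
)
old_ok = parser.can_fetch(
    "Googlebot", "https://www.postgresql.org/docs/9.1/index.html"
)
```

Under these example rules the devel URL may not be crawled and the 9.1 URL may. Nothing in the file format expresses "do not index", which is why a page-level directive is needed for that.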

If we want to keep outdated version away from the search results they need a
noindex attribute in <head>:

    <meta name="robots" content="noindex">

--
Daniel Gustafsson        https://vmware.com/




Re: once more: documentation search indexing

From
Peter Geoghegan
Date:
On Thu, Apr 14, 2022 at 1:25 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> If we want to keep outdated version away from the search results they need a
> noindex attribute in <head>:
>
>         <meta name="robots" content="noindex">

I see.

Do you think that doing so for out of support releases would improve
our search results? Do you see any potential downsides?

-- 
Peter Geoghegan



Re: once more: documentation search indexing

From
Daniel Gustafsson
Date:
> On 14 Apr 2022, at 18:23, Peter Geoghegan <pg@bowt.ie> wrote:
>
> On Thu, Apr 14, 2022 at 1:25 AM Daniel Gustafsson <daniel@yesql.se> wrote:
>> If we want to keep outdated version away from the search results they need a
>> noindex attribute in <head>:
>>
>>        <meta name="robots" content="noindex">
>
> I see.
>
> Do you think that doing so for out of support releases would improve
> our search results? Do you see any potential downsides?

I don't really have a good answer, Googlebot et al. act in mysterious ways.  It
shouldn't affect searching for up to date information since we identify
/current as the canonical version of pages in backbranches (supported or not).
But if an 8.4 page is indexed and linked to from a gazillion stack overflow
posts, then who knows how that shifts the results.

Given how it works right now, and what we know, I would err on the side of
caution and keep them indexed - but that's a highly unscientifically based
opinion.

--
Daniel Gustafsson        https://vmware.com/




Re: once more: documentation search indexing

From
Robert Treat
Date:
On Thu, Apr 14, 2022 at 6:21 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>
> > On 14 Apr 2022, at 18:23, Peter Geoghegan <pg@bowt.ie> wrote:
> >
> > On Thu, Apr 14, 2022 at 1:25 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> >> If we want to keep outdated version away from the search results they need a
> >> noindex attribute in <head>:
> >>
> >>        <meta name="robots" content="noindex">
> >
> > I see.
> >
> > Do you think that doing so for out of support releases would improve
> > our search results? Do you see any potential downsides?
>
> I don't really have a good answer, googlebot et.al acts in mysterious ways.  It
> shouldn't affect searching for up to date information since we identify
> /current as the canonical version of pages in backbranches (supported or not).
> But if an 8.4 page is indexed and linked to from a gazillion stack overflow
> posts, then who knows how that shifts the results.
>
> Given how it works right now, and what we know, I would err on the side of
> caution and keep them indexed - but that's a highly unscientifically based
> opinion.
>

The immediate use case that comes to mind is folks searching for
documentation in older versions that no longer exists in the /current/
documentation, which is perhaps a small use case but also a fairly
valid one. I reckon there are others if we think about it, so +1 on
leaving the old version indexed for now.


Robert Treat
https://xzilla.net



Re: once more: documentation search indexing

From
Magnus Hagander
Date:
On Sat, Apr 16, 2022 at 5:02 PM Robert Treat <rob@xzilla.net> wrote:
> On Thu, Apr 14, 2022 at 6:21 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> >
> > > On 14 Apr 2022, at 18:23, Peter Geoghegan <pg@bowt.ie> wrote:
> > >
> > > On Thu, Apr 14, 2022 at 1:25 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> > >> If we want to keep outdated version away from the search results they need a
> > >> noindex attribute in <head>:
> > >>
> > >>        <meta name="robots" content="noindex">
> > >
> > > I see.
> > >
> > > Do you think that doing so for out of support releases would improve
> > > our search results? Do you see any potential downsides?
> >
> > I don't really have a good answer, googlebot et.al acts in mysterious ways.  It
> > shouldn't affect searching for up to date information since we identify
> > /current as the canonical version of pages in backbranches (supported or not).
> > But if an 8.4 page is indexed and linked to from a gazillion stack overflow
> > posts, then who knows how that shifts the results.
> >
> > Given how it works right now, and what we know, I would err on the side of
> > caution and keep them indexed - but that's a highly unscientifically based
> > opinion.
> >
>
> The immediate use case that comes to mind is folks searching for
> documentation in older versions that no longer exists in the /current/
> documentation, which is perhaps a small use case but also a fairly
> valid one. I reckon there are others if we think about it, so +1 on
> leaving the old version indexed for now.

Yeah, losing that ability completely would definitely be a negative. We've already lost (I think) the ability to search for those words if they are on the same page as a new version which doesn't have it, losing the ability to search it off pages that don't even exist anymore seems even worse.

What would be the actual *advantage* of excluding them? 

//Magnus

Re: once more: documentation search indexing

From
Tom Lane
Date:
Magnus Hagander <magnus@hagander.net> writes:
> What would be the actual *advantage* of excluding them?

The immediate problem is that Google is still preferentially returning old
pages in some cases, e.g. top hit for "postgres gist gin index" is still

https://www.postgresql.org/docs/9.1/textsearch-indexes.html

Now maybe that just means they've not completely reindexed since we made
the canonical-version change, so I'm content to wait awhile longer
before concluding that that change wasn't sufficient.  But we should be
considering the possibility that it wasn't.

            regards, tom lane



Re: once more: documentation search indexing

From
Daniel Gustafsson
Date:
> On 18 Apr 2022, at 20:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Magnus Hagander <magnus@hagander.net> writes:
>> What would be the actual *advantage* of excluding them?
>
> The immediate problem is that Google is still preferentially returning old
> pages in some cases, e.g. top hit for "postgres gist gin index" is still
>
> https://www.postgresql.org/docs/9.1/textsearch-indexes.html
>
> Now maybe that just means they've not completely reindexed since we made
> the canonical-version change, so I'm content to wait awhile longer
> before concluding that that change wasn't sufficient.  But we should be
> considering the possibility that it wasn't.

That particular 9.1 page is the second hit for "postgres gin index" after the
/current/ page for the Gin Index chapter.  (I first thought it was the first
hit since I dismissed the "featured snippet" result as an ad.) DuckDuckGo
returns the 9.1 page or the current page seemingly at random for "postgres gin
gist index".

Searching for "postgres gist gin index <version>" on Google returns the correct
page for versions 8.3 through 9.4, for any other version (including lower) it
returns /current/.

Removing the old content might improve search results, but it might also just
remove it altogether bumping non-postgresql.org content higher.

--
Daniel Gustafsson        https://vmware.com/




Re: once more: documentation search indexing

From
Magnus Hagander
Date:


On Tue, Apr 19, 2022 at 11:18 AM Daniel Gustafsson <daniel@yesql.se> wrote:
> > On 18 Apr 2022, at 20:04, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >
> > Magnus Hagander <magnus@hagander.net> writes:
> >> What would be the actual *advantage* of excluding them?
> >
> > The immediate problem is that Google is still preferentially returning old
> > pages in some cases, e.g. top hit for "postgres gist gin index" is still
> >
> > https://www.postgresql.org/docs/9.1/textsearch-indexes.html
> >
> > Now maybe that just means they've not completely reindexed since we made
> > the canonical-version change, so I'm content to wait awhile longer
> > before concluding that that change wasn't sufficient.  But we should be
> > considering the possibility that it wasn't.
>
> That particular 9.1 page is the second hit for "postgres gin index" after the
> /current/ page for the Gin Index chapter.  (I first thought it was the first
> hit since I dismissed the "featured snippet" result as an ad.) DuckDuckGo
> returns the 9.1 page or the current page seemingly at random for "postgres gin
> gist index".
>
> Searching for "postgres gist gin index <version>" on Google returns the correct
> page for versions 8.3 through 9.4, for any other version (including lower) it
> returns /current/.

This seems to indicate it just hasn't picked that up yet? That's the behaviour we saw before it found the rel=canonical parts, isn't it?



> Removing the old content might improve search results, but it might also just
> remove it altogether bumping non-postgresql.org content higher.

Yeah, if we remove them completely then presumably they also stop counting as "link score" for us.
 
--

Re: once more: documentation search indexing

From
Daniel Gustafsson
Date:
> On 19 Apr 2022, at 15:17, Magnus Hagander <magnus@hagander.net> wrote:

> This seems to indicate it just hasn't picked that up yet?

Maybe, but I'm quickly running out of tea-leaves for reading search results in.
My impression is that searches are turning up current docs more frequently now,
but I might be wrong.

--
Daniel Gustafsson        https://vmware.com/




Re: once more: documentation search indexing

From
Peter Geoghegan
Date:
On Tue, Apr 19, 2022 at 12:00 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> > On 19 Apr 2022, at 15:17, Magnus Hagander <magnus@hagander.net> wrote:
>
> > This seems to indicate it just hasn't picked that up yet?
>
> Maybe, but I'm quickly running out of tea-leaves for reading search results in.
> My impression is that searches are turning up current docs more frequently now,
> but I might be wrong.

There is zero doubt that Google search results have changed utterly
following the rel=canonical update. It hasn't yet had the effect of
making 100% of all results from our documentation link to the
/current/ page, but it's not too far off. That's what I see, at least.

DuckDuckGo is a different matter entirely -- that still links to older
versions (though mostly versions that are still supported). I suspect
that we can't do much about that.

-- 
Peter Geoghegan



Re: once more: documentation search indexing

From
"Jonathan S. Katz"
Date:
On 4/19/22 3:10 PM, Peter Geoghegan wrote:
> On Tue, Apr 19, 2022 at 12:00 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>>> On 19 Apr 2022, at 15:17, Magnus Hagander <magnus@hagander.net> wrote:
>>
>>> This seems to indicate it just hasn't picked that up yet?
>>
>> Maybe, but I'm quickly running out of tea-leaves for reading search results in.
>> My impression is that searches are turning up current docs more frequently now,
>> but I might be wrong.

The initial data posted here seems to support this[1].

> There is zero doubt that Google search results have changed utterly
> following the rel=canonical update. It hasn't yet had the effect of
> making 100% of all results from our documentation link to the
> /current/ page, but it's not too far off. That's what I see, at least.

Agreed. Perhaps some of the lesser searched keywords don't have the docs 
indexed as such.

> DuckDuckGo is a different matter entirely -- that still links to older
> versions (though mostly versions that are still supported). I suspect
> that we can't do much about that.

A quick search for how to solve this in DuckDuckGo yields nothing. Also, 
the amount of traffic we get sourced from DuckDuckGo is pretty 
insignificant so it may not be worth the effort to optimize.

Jonathan

[1] https://www.postgresql.org/message-id/25aa516b-4fa7-5083-366e-09cf4f838e0f%40postgresql.org
