Re: CREATE DATABASE command for non-libc providers - Mailing list pgsql-hackers

From Daniel Verite
Subject Re: CREATE DATABASE command for non-libc providers
Msg-id eaafe5c4-a1eb-4028-92a1-722304875d86@manitou-mail.org
In response to Re: CREATE DATABASE command for non-libc providers  (Jeff Davis <pgsql@j-davis.com>)
Responses Re: CREATE DATABASE command for non-libc providers
List pgsql-hackers
Jeff Davis wrote:

> The main challenge is backwards compatibility. Users of FTS would need
> to recreate all of their tsvectors and indexes dependent on them. It's
> even possible that some users only have tsvectors and don't store the
> original data in the database, which would further complicate matters.

Why would it be that bad?
FTS indexes don't get corrupted that way. You may get different
lexemes before and after the upgrade for some documents, and then
what?

The FTS parser has seen user-visible changes in the past, and
regenerating tsvectors because of them was merely a suggestion.

commit 61d66c44f18c73094a50a2ef97d26cc03e171dc0
Author: Teodor Sigaev <teodor@sigaev.ru>
Date:    Tue Mar 29 17:59:58 2016 +0300

    Fix support of digits in email/hostnames.

    When tsearch was implemented I did several mistakes in hostname/email
    definition rules:
    1) allow underscore in hostname what prohibited by RFC
    2) forget to allow leading digits separated by hyphen (like 123-x.com)
       in hostname
    3) do no allow underscore/hyphen after leading digits in localpart of
       email

    Artur's patch resolves two last issues, but by the way allows hosts name
    like 123_x.com together with 123-x.com. RFC forbids underscore usage in
    hostname but pg allows that since initial tsearch version in core, although
    only for non-digits. Patch syncs support digits and nondigits in both
    hostname and email.

    Forbidding underscore in hostname may break existsing usage of tsearch
    and, anyhow, it should be done by separate patch.

    Author: Artur Zakirov
    BUG: #13964

In the release notes:

  Fix the default text search parser to allow leading digits in email
  and host tokens (Artur Zakirov)

  In most cases this will result in few changes in the parsing of
  text. But if you have data where such addresses occur frequently, it
  may be worth rebuilding dependent tsvector columns and indexes so
  that addresses of this form will be found properly by text searches.


commit 2c265adea3129c917296b46a82786d67988ece2c
Author: Tom Lane <tgl@sss.pgh.pa.us>
Date:    Wed Apr 28 02:04:16 2010 +0000

    Modify the built-in text search parser to handle URLs more nearly
    according to RFC 3986.  In particular, these characters now terminate the
    path part of a URL: '"', '<', '>', '\', '^', '`', '{', '|', '}'.  The
    previous behavior was inconsistent and depended on whether a "?" was
    present in the path.
    Per gripe from Donald Fraser and spec research by Kevin Grittner.

    This is a pre-existing bug, but not back-patching since the risks of
    breaking existing applications seem to outweigh the benefits.
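
To make the effect of that commit concrete, here is a toy model in Python of
the termination rule it describes. This is only an illustrative sketch, not
the parser's actual code: the names URL_PATH_TERMINATORS and url_path_prefix
are hypothetical, and treating whitespace as a terminator is an added
assumption so that the example works on free text.

```python
# Characters that, per the commit message quoted above, terminate the
# path part of a URL: " < > \ ^ ` { | }
URL_PATH_TERMINATORS = frozenset('"<>\\^`{|}')


def url_path_prefix(text: str) -> str:
    """Return the leading part of `text` up to the first terminator.

    Hypothetical helper: a simplified model of the rule, not the
    state machine used by the real text search parser.
    """
    for i, ch in enumerate(text):
        # Assumption: whitespace also ends the token.
        if ch in URL_PATH_TERMINATORS or ch.isspace():
            return text[:i]
    return text


if __name__ == "__main__":
    for sample in ("/docs/a{b}", "/search?q=x|y", "/plain/path"):
        print(repr(sample), "->", repr(url_path_prefix(sample)))
```

Under this model the cut-off point is the same whether or not a "?" appears
earlier in the string, which is the consistency the commit message is after.
A tsvector built before such a change could therefore hold a longer url_path
lexeme than one built after it for the same document, without either index
being corrupted.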

https://www.postgresql.org/docs/release/9.0.0/

E.24.3.5.1. Full Text Search

    Use more standards-compliant rules for parsing URL tokens (Tom Lane)


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/


