Thread: POC: contrib/unaccent as IMMUTABLE

POC: contrib/unaccent as IMMUTABLE

From

Petru Ghita

Date:

03 October 2020, 10:53:35

Hi,

Would it be possible to mark functions in contrib/unnaccent [1] as IMMUTABLE?

We would gain the possibility to use it to enforce uniqueness.

I understand that the main drawback is that unicode rules are
periodically updated. So it would be needed to reindex indices on
unaccent() after upgrade.

Thank you,
petru ghita

[1] https://github.com/postgres/postgres/tree/master/contrib/unaccent

RE: POC: contrib/unaccent as IMMUTABLE

From

Petru Ghita

Date:

03 October 2020, 12:13:01

As instructed by Alexander Korotkov, I'm linking the patch:

https://github.com/petru-ghita/postgres/commit/d6d1f340cd34db8176744ede7f0fe588d870a33f

Kind regards,
petru ghita

Re: POC: contrib/unaccent as IMMUTABLE

From

Tomas Vondra

Date:

03 October 2020, 13:01:19

On Sat, Oct 03, 2020 at 02:13:01PM +0200, Petru Ghita wrote:
>As instructed by Alexander Korotkov, I'm linking the patch:
>
>https://github.com/petru-ghita/postgres/commit/d6d1f340cd34db8176744ede7f0fe588d870a33f
>

Please submit an actual patch, not just a link to github commit. That
may disappear in the future, and we want an actual copy here in the
archives. Also, we have a bot that tests patches sent as attachments,
and the github commit can't benefit from that either.

Also, add the patch to the next commitfest (i.e. 2020-11) here:

     https://commitfest.postgresql.org/

As for the patch, I wonder if we want to make this change. I'm not very
familiar with how unaccent works, but if changes to unicode rules would
really silently break indexes, it's kinda similar to the collation
issues caused by glibc updates. And we've generally considered that a
case of data corruption, I think, so it'd be strange to allow that here.

If a user really wants to allow this, it's possible to create a stable
wrapper function - the difference is that if it breaks, it's the user's
responsibility to deal with that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: POC: contrib/unaccent as IMMUTABLE

From

Alexander Korotkov

Date:

03 October 2020, 15:14:14

On Sat, Oct 3, 2020 at 4:01 PM Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On Sat, Oct 03, 2020 at 02:13:01PM +0200, Petru Ghita wrote:
> >As instructed by Alexander Korotkov, I'm linking the patch:
> >
> >https://github.com/petru-ghita/postgres/commit/d6d1f340cd34db8176744ede7f0fe588d870a33f
> >
>
> Please submit an actual patch, not just a link to github commit. That
> may disappear in the future, and we want an actual copy here in the
> archives. Also, we have a bot that tests patches sent as attachments,
> and the github commit can't benefit from that either.

+1

> Also, add the patch to the next commitfest (i.e. 2020-11) here:
>
>      https://commitfest.postgresql.org/

+1

> As for the patch, I wonder if we want to make this change. I'm not very
> familiar with how unaccent works, but if changes to unicode rules would
> really silently break indexes, it's kinda similar to the collation
> issues caused by glibc updates. And we've generally considered that a
> case of data corruption, I think, so it'd be strange to allow that here.
>
> If a user really wants to allow this, it's possible to create a stable
> wrapper function - the difference is that if it breaks, it's the user's
> responsibility to deal with that.

I personally don't have an exact understanding of how strict we are
about marking functions immutable.  For example,
to_tsvector(regconfig, text) is immutable.  However, underlying
dictionaries in fulltext search configuration may change behavior in
major releases.  Even unaccent defines a dictionary.  So, it might be
used inside an immutable function in a "legal way" without explicitly
placing responsibility for that to the user.  I've raised a similar
question about jsonpath[1], because it's immutable functions are going
to change their behavior in future.  An answer by Tom Lane [2], shows
that it's not completely unacceptable to sometimes change behavior of
immutable function in a major release.

As I get currently there is no rule "function behavior might be
changed in major release" => "function can't be marked as immutable".
So, we've to consider risks in each case individually.

Links
1. https://www.postgresql.org/message-id/CAPpHfdvDci4iqNF9fhRkTqhe-5_8HmzeLt56drH%2B_Rv2rNRqfg%40mail.gmail.com
2. https://www.postgresql.org/message-id/9219.1564410991%40sss.pgh.pa.us

------
Regards,
Alexander Korotkov

Re: POC: contrib/unaccent as IMMUTABLE

From

Tom Lane

Date:

03 October 2020, 15:19:07

Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> As for the patch, I wonder if we want to make this change. I'm not very
> familiar with how unaccent works, but if changes to unicode rules would
> really silently break indexes, it's kinda similar to the collation
> issues caused by glibc updates. And we've generally considered that a
> case of data corruption, I think, so it'd be strange to allow that here.

Yeah.  The fact that we have a problem with collation updates doesn't
mean that it's okay to introduce the same problem elsewhere.

Note the extremely large amount of work that's ongoing to try to
track collation changes so we can know whether such an update has
invalidated indexes.  Unless this patch comes with equivalent
infrastructure to detect when the unaccent mapping has changed,
I think it should be rejected.

The problem here is somewhat different than for collations, in
that collation changes are hidden behind more-or-less-obscure
library APIs.  Here I think we'd "just" have to track whether
the user has changed the associated unaccent mapping file.
However, detecting which indexes are invalidated by such a
change still involves a lot of infrastructure that's not there.
And it'd be qualitatively different, because a config file
change could happen at any time --- we could not relegate
the checks to pg_upgrade or the like.

            regards, tom lane

Re: POC: contrib/unaccent as IMMUTABLE

From

Tom Lane

Date:

03 October 2020, 15:37:17

Alexander Korotkov <aekorotkov@gmail.com> writes:
> I personally don't have an exact understanding of how strict we are
> about marking functions immutable.  For example,
> to_tsvector(regconfig, text) is immutable.

Yeah.  This is in fact wrong, because a TS configuration can *very*
easily be changed by the user.  We held our noses and did it anyway
to allow tsvector results to be stored in indexes, figuring that
(a) we'd put it on the user's head to reindex when necessary, and
(b) in most situations, small changes in a TS configuration don't
make an old tsvector index unusable --- at worst, some searches
will fail that should have succeeded.  (I'm not entirely sure that
I believe (b), but that was the argument that was advanced.)

I do not think we'd accept such a compromise if it were put forward
today.  The project's standards for such things have tightened over
time, as evidenced by the work that's going towards collation
change detection.

It's also worth noting that the consequences of an out-of-date
index for unaccent seem likely to be worse than they are for
tsvectors.  It's not hard to imagine somebody making a unique
index on an unaccent result, and then setting themselves up for
dump/reload failures by changing the unaccent rules.  Nobody
builds unique indexes on tsvectors.

            regards, tom lane

Re: POC: contrib/unaccent as IMMUTABLE

From

Alexander Korotkov

Date:

03 October 2020, 20:13:28

On Sat, Oct 3, 2020 at 6:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Alexander Korotkov <aekorotkov@gmail.com> writes:
> > I personally don't have an exact understanding of how strict we are
> > about marking functions immutable.  For example,
> > to_tsvector(regconfig, text) is immutable.
>
> Yeah.  This is in fact wrong, because a TS configuration can *very*
> easily be changed by the user.  We held our noses and did it anyway
> to allow tsvector results to be stored in indexes, figuring that
> (a) we'd put it on the user's head to reindex when necessary, and
> (b) in most situations, small changes in a TS configuration don't
> make an old tsvector index unusable --- at worst, some searches
> will fail that should have succeeded.  (I'm not entirely sure that
> I believe (b), but that was the argument that was advanced.)
>
> I do not think we'd accept such a compromise if it were put forward
> today.  The project's standards for such things have tightened over
> time, as evidenced by the work that's going towards collation
> change detection.
>
> It's also worth noting that the consequences of an out-of-date
> index for unaccent seem likely to be worse than they are for
> tsvectors.  It's not hard to imagine somebody making a unique
> index on an unaccent result, and then setting themselves up for
> dump/reload failures by changing the unaccent rules.  Nobody
> builds unique indexes on tsvectors.

Tom, thank you for the clarification.  Now I get more understanding on
the project policy here.  I agree that we shouldn't mark unaccent() as
immutable in the current infrastructure.

------
Regards,
Alexander Korotkov

Re: POC: contrib/unaccent as IMMUTABLE

From

Thomas Munro

Date:

25 October 2020, 22:48:03

On Sun, Oct 4, 2020 at 4:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> > As for the patch, I wonder if we want to make this change. I'm not very
> > familiar with how unaccent works, but if changes to unicode rules would
> > really silently break indexes, it's kinda similar to the collation
> > issues caused by glibc updates. And we've generally considered that a
> > case of data corruption, I think, so it'd be strange to allow that here.
>
> Yeah.  The fact that we have a problem with collation updates doesn't
> mean that it's okay to introduce the same problem elsewhere.
>
> Note the extremely large amount of work that's ongoing to try to
> track collation changes so we can know whether such an update has
> invalidated indexes.  Unless this patch comes with equivalent
> infrastructure to detect when the unaccent mapping has changed,
> I think it should be rejected.
>
> The problem here is somewhat different than for collations, in
> that collation changes are hidden behind more-or-less-obscure
> library APIs.  Here I think we'd "just" have to track whether
> the user has changed the associated unaccent mapping file.
> However, detecting which indexes are invalidated by such a
> change still involves a lot of infrastructure that's not there.
> And it'd be qualitatively different, because a config file
> change could happen at any time --- we could not relegate
> the checks to pg_upgrade or the like.

As a potential next use for refobjversion (the system I'm hoping to
commit soon for collation version tracking), I have wondered about
declarative function versions.  For those not following that thread,
the basic idea is that you get WARNING messages telling you to REINDEX
if certain things change, and the warnings only stop for each
dependent index once you've actually done it.  Once the collation
stuff is in the tree, it wouldn't be too hard to do a naive function
version tracking system, but there are a couple of tricky problems to
make it really useful:  (1) plpgsql function f1() calls f2(), f2()
changed, so an index on ((f1(x))) needs to be rebuilt, (2) unaccent()
and ts_parse() access other complicated objects, perhaps even
depending on arguments passed in to them.  You'd probably need to
design some kind of dependency analyser handler that PLs could
provide, and likewise, individual functions could perhaps provide
their own analyser, or maybe you could redesign the functions so that
a naive single version scheme could work.  Or something like
that.</handwaving>