Thread: POC: contrib/unaccent as IMMUTABLE
Hi, Would it be possible to mark functions in contrib/unnaccent [1] as IMMUTABLE? We would gain the possibility to use it to enforce uniqueness. I understand that the main drawback is that unicode rules are periodically updated. So it would be needed to reindex indices on unaccent() after upgrade. Thank you, petru ghita [1] https://github.com/postgres/postgres/tree/master/contrib/unaccent
As instructed by Alexander Korotkov, I'm linking the patch: https://github.com/petru-ghita/postgres/commit/d6d1f340cd34db8176744ede7f0fe588d870a33f Kind regards, petru ghita
On Sat, Oct 03, 2020 at 02:13:01PM +0200, Petru Ghita wrote: >As instructed by Alexander Korotkov, I'm linking the patch: > >https://github.com/petru-ghita/postgres/commit/d6d1f340cd34db8176744ede7f0fe588d870a33f > Please submit an actual patch, not just a link to github commit. That may disappear in the future, and we want an actual copy here in the archives. Also, we have a bot that tests patches sent as attachments, and the github commit can't benefit from that either. Also, add the patch to the next commitfest (i.e. 2020-11) here: https://commitfest.postgresql.org/ As for the patch, I wonder if we want to make this change. I'm not very familiar with how unaccent works, but if changes to unicode rules would really silently break indexes, it's kinda similar to the collation issues caused by glibc updates. And we've generally considered that a case of data corruption, I think, so it'd be strange to allow that here. If a user really wants to allow this, it's possible to create a stable wrapper function - the difference is that if it breaks, it's the user's responsibility to deal with that. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Oct 3, 2020 at 4:01 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote: > On Sat, Oct 03, 2020 at 02:13:01PM +0200, Petru Ghita wrote: > >As instructed by Alexander Korotkov, I'm linking the patch: > > > >https://github.com/petru-ghita/postgres/commit/d6d1f340cd34db8176744ede7f0fe588d870a33f > > > > Please submit an actual patch, not just a link to github commit. That > may disappear in the future, and we want an actual copy here in the > archives. Also, we have a bot that tests patches sent as attachments, > and the github commit can't benefit from that either. +1 > Also, add the patch to the next commitfest (i.e. 2020-11) here: > > https://commitfest.postgresql.org/ +1 > As for the patch, I wonder if we want to make this change. I'm not very > familiar with how unaccent works, but if changes to unicode rules would > really silently break indexes, it's kinda similar to the collation > issues caused by glibc updates. And we've generally considered that a > case of data corruption, I think, so it'd be strange to allow that here. > > If a user really wants to allow this, it's possible to create a stable > wrapper function - the difference is that if it breaks, it's the user's > responsibility to deal with that. I personally don't have an exact understanding of how strict we are about marking functions immutable. For example, to_tsvector(regconfig, text) is immutable. However, underlying dictionaries in fulltext search configuration may change behavior in major releases. Even unaccent defines a dictionary. So, it might be used inside an immutable function in a "legal way" without explicitly placing responsibility for that to the user. I've raised a similar question about jsonpath[1], because it's immutable functions are going to change their behavior in future. An answer by Tom Lane [2], shows that it's not completely unacceptable to sometimes change behavior of immutable function in a major release. As I get currently there is no rule "function behavior might be changed in major release" => "function can't be marked as immutable". So, we've to consider risks in each case individually. Links 1. https://www.postgresql.org/message-id/CAPpHfdvDci4iqNF9fhRkTqhe-5_8HmzeLt56drH%2B_Rv2rNRqfg%40mail.gmail.com 2. https://www.postgresql.org/message-id/9219.1564410991%40sss.pgh.pa.us ------ Regards, Alexander Korotkov
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > As for the patch, I wonder if we want to make this change. I'm not very > familiar with how unaccent works, but if changes to unicode rules would > really silently break indexes, it's kinda similar to the collation > issues caused by glibc updates. And we've generally considered that a > case of data corruption, I think, so it'd be strange to allow that here. Yeah. The fact that we have a problem with collation updates doesn't mean that it's okay to introduce the same problem elsewhere. Note the extremely large amount of work that's ongoing to try to track collation changes so we can know whether such an update has invalidated indexes. Unless this patch comes with equivalent infrastructure to detect when the unaccent mapping has changed, I think it should be rejected. The problem here is somewhat different than for collations, in that collation changes are hidden behind more-or-less-obscure library APIs. Here I think we'd "just" have to track whether the user has changed the associated unaccent mapping file. However, detecting which indexes are invalidated by such a change still involves a lot of infrastructure that's not there. And it'd be qualitatively different, because a config file change could happen at any time --- we could not relegate the checks to pg_upgrade or the like. regards, tom lane
Alexander Korotkov <aekorotkov@gmail.com> writes: > I personally don't have an exact understanding of how strict we are > about marking functions immutable. For example, > to_tsvector(regconfig, text) is immutable. Yeah. This is in fact wrong, because a TS configuration can *very* easily be changed by the user. We held our noses and did it anyway to allow tsvector results to be stored in indexes, figuring that (a) we'd put it on the user's head to reindex when necessary, and (b) in most situations, small changes in a TS configuration don't make an old tsvector index unusable --- at worst, some searches will fail that should have succeeded. (I'm not entirely sure that I believe (b), but that was the argument that was advanced.) I do not think we'd accept such a compromise if it were put forward today. The project's standards for such things have tightened over time, as evidenced by the work that's going towards collation change detection. It's also worth noting that the consequences of an out-of-date index for unaccent seem likely to be worse than they are for tsvectors. It's not hard to imagine somebody making a unique index on an unaccent result, and then setting themselves up for dump/reload failures by changing the unaccent rules. Nobody builds unique indexes on tsvectors. regards, tom lane
On Sat, Oct 3, 2020 at 6:37 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Alexander Korotkov <aekorotkov@gmail.com> writes: > > I personally don't have an exact understanding of how strict we are > > about marking functions immutable. For example, > > to_tsvector(regconfig, text) is immutable. > > Yeah. This is in fact wrong, because a TS configuration can *very* > easily be changed by the user. We held our noses and did it anyway > to allow tsvector results to be stored in indexes, figuring that > (a) we'd put it on the user's head to reindex when necessary, and > (b) in most situations, small changes in a TS configuration don't > make an old tsvector index unusable --- at worst, some searches > will fail that should have succeeded. (I'm not entirely sure that > I believe (b), but that was the argument that was advanced.) > > I do not think we'd accept such a compromise if it were put forward > today. The project's standards for such things have tightened over > time, as evidenced by the work that's going towards collation > change detection. > > It's also worth noting that the consequences of an out-of-date > index for unaccent seem likely to be worse than they are for > tsvectors. It's not hard to imagine somebody making a unique > index on an unaccent result, and then setting themselves up for > dump/reload failures by changing the unaccent rules. Nobody > builds unique indexes on tsvectors. Tom, thank you for the clarification. Now I get more understanding on the project policy here. I agree that we shouldn't mark unaccent() as immutable in the current infrastructure. ------ Regards, Alexander Korotkov
On Sun, Oct 4, 2020 at 4:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Tomas Vondra <tomas.vondra@2ndquadrant.com> writes: > > As for the patch, I wonder if we want to make this change. I'm not very > > familiar with how unaccent works, but if changes to unicode rules would > > really silently break indexes, it's kinda similar to the collation > > issues caused by glibc updates. And we've generally considered that a > > case of data corruption, I think, so it'd be strange to allow that here. > > Yeah. The fact that we have a problem with collation updates doesn't > mean that it's okay to introduce the same problem elsewhere. > > Note the extremely large amount of work that's ongoing to try to > track collation changes so we can know whether such an update has > invalidated indexes. Unless this patch comes with equivalent > infrastructure to detect when the unaccent mapping has changed, > I think it should be rejected. > > The problem here is somewhat different than for collations, in > that collation changes are hidden behind more-or-less-obscure > library APIs. Here I think we'd "just" have to track whether > the user has changed the associated unaccent mapping file. > However, detecting which indexes are invalidated by such a > change still involves a lot of infrastructure that's not there. > And it'd be qualitatively different, because a config file > change could happen at any time --- we could not relegate > the checks to pg_upgrade or the like. As a potential next use for refobjversion (the system I'm hoping to commit soon for collation version tracking), I have wondered about declarative function versions. For those not following that thread, the basic idea is that you get WARNING messages telling you to REINDEX if certain things change, and the warnings only stop for each dependent index once you've actually done it. Once the collation stuff is in the tree, it wouldn't be too hard to do a naive function version tracking system, but there are a couple of tricky problems to make it really useful: (1) plpgsql function f1() calls f2(), f2() changed, so an index on ((f1(x))) needs to be rebuilt, (2) unaccent() and ts_parse() access other complicated objects, perhaps even depending on arguments passed in to them. You'd probably need to design some kind of dependency analyser handler that PLs could provide, and likewise, individual functions could perhaps provide their own analyser, or maybe you could redesign the functions so that a naive single version scheme could work. Or something like that.</handwaving>