Thread: BUG #18057: unaccent removes intentional spaces

BUG #18057: unaccent removes intentional spaces

From

PG Bug reporting form

Date:

15 August 2023, 22:54:57

The following bug has been logged on the website:

Bug reference:      18057
Logged by:          Martin Schlossarek
Email address:      martin@schlossarek.me
PostgreSQL version: 15.1
Operating system:   Fedora 38
Description:

I discovered that the unaccent extension also removes intentional spaces
that are explicitly specified in the accent.rules. As far as I see it
correctly, all fraction characters are affected, for example:

```sql
# select unaccent('1½');
--- expected output: 1 1/2
--- actual output: 11/2
```

Affected characters:
```bash
$ curl -s
"https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=contrib/unaccent/unaccent.rules;hb=HEAD"
| grep -E " "
¼        1/4
½        1/2
¾        3/4
⅐        1/7
⅑        1/9
⅒        1/10
⅓        1/3
⅔        2/3
⅕        1/5
⅖        2/5
⅗        3/5
⅘        4/5
⅙        1/6
⅚        5/6
⅛        1/8
⅜        3/8
⅝        5/8
⅞        7/8
⅟        1/
↉        0/3
```

Re: BUG #18057: unaccent removes intentional spaces

From

Michael Paquier

Date:

16 August 2023, 03:00:43

On Tue, Aug 15, 2023 at 07:54:57PM +0000, PG Bug reporting form wrote:
> I discovered that the unaccent extension also removes intentional spaces
> that are explicitly specified in the accent.rules. As far as I see it
> correctly, all fraction characters are affected, for example:
>
> ```sql
> # select unaccent('1½');
> --- expected output: 1 1/2
> --- actual output: 11/2
> ```

Agreed that this looks incorrect as-is.  This goes as far as 9a206d0
when these has been introduced, and it looks like the culprit is
around initTrie() where the entries are loaded.  See around t_isspace,
for example.
--
Michael

Attachment

signature.asc

Re: BUG #18057: unaccent removes intentional spaces

From

Michael Paquier

Date:

18 August 2023, 09:18:58

On Wed, Aug 16, 2023 at 09:00:43AM +0900, Michael Paquier wrote:
> Agreed that this looks incorrect as-is.  This goes as far as 9a206d0
> when these has been introduced, and it looks like the culprit is
> around initTrie() where the entries are loaded.  See around t_isspace,
> for example.

I was looking at the code, and my first impression was right.  All
leading and trailing whitespaces between the two characters listed in
the rule file are discarded.  The thing is that we clearly document
the parsing rules for the sake of any custom files one can feed to the
extension:
https://www.postgresql.org/docs/devel/unaccent.html

I am not sure what we can do here.  Doing nothing is certainly an
option, but I am wondering if we could put in place an extra rule
where whitespaces can be part of the translated character if it uses
double quotes, for example.  Thoughts?
--
Michael

Attachment

signature.asc

Re: BUG #18057: unaccent removes intentional spaces

From

Tom Lane

Date:

19 August 2023, 19:30:04

Michael Paquier <michael@paquier.xyz> writes:
> I was looking at the code, and my first impression was right.  All
> leading and trailing whitespaces between the two characters listed in
> the rule file are discarded.  The thing is that we clearly document
> the parsing rules for the sake of any custom files one can feed to the
> extension:
> https://www.postgresql.org/docs/devel/unaccent.html

> I am not sure what we can do here.  Doing nothing is certainly an
> option, but I am wondering if we could put in place an extra rule
> where whitespaces can be part of the translated character if it uses
> double quotes, for example.  Thoughts?

Yeah, we could extend the parsing rules that way.  It would break
any rules files that currently use double quote as a plain character,
but it seems unlikely that there are any.  In any case, as long as
this wasn't back-patched I think such a change would be acceptable.

            regards, tom lane

Re: BUG #18057: unaccent removes intentional spaces

From

Michael Paquier

Date:

20 August 2023, 02:57:29

On Sat, Aug 19, 2023 at 12:30:04PM -0400, Tom Lane wrote:
> Yeah, we could extend the parsing rules that way.  It would break
> any rules files that currently use double quote as a plain character,
> but it seems unlikely that there are any.  In any case, as long as
> this wasn't back-patched I think such a change would be acceptable.

Okay, thanks.  Note that we do use double-quotes as a translated
character in a few cases, but as long as these are only handled as a
single character we could be OK.  Or actually, wouldn't it be better
to always force escaping even for double quotes listed as single
characters?  Based on what unaccent.rules has, that's not necessary,
but it could simplify the python code generating the file or the C
parsing.  For example, the existing '”' would become listed as "\"" in
our unaccent.rules.
--
Michael

Attachment

signature.asc

Re: BUG #18057: unaccent removes intentional spaces

From

Tom Lane

Date:

20 August 2023, 03:08:26

Michael Paquier <michael@paquier.xyz> writes:
> On Sat, Aug 19, 2023 at 12:30:04PM -0400, Tom Lane wrote:
>> Yeah, we could extend the parsing rules that way.  It would break
>> any rules files that currently use double quote as a plain character,
>> but it seems unlikely that there are any.  In any case, as long as
>> this wasn't back-patched I think such a change would be acceptable.

> Okay, thanks.  Note that we do use double-quotes as a translated
> character in a few cases, but as long as these are only handled as a
> single character we could be OK.

Oh, shame on me for not actually looking into unaccent.rules :-(
That does make it more complicated, although I think it's still the
case that an incompatible change could be OK.

> Or actually, wouldn't it be better
> to always force escaping even for double quotes listed as single
> characters?

I was envisioning following SQL identifier rules, that is you double
the double quote.  Then the existing " entries would have to become
"""", which isn't much fun.  But what you're suggesting would make
both " and \ into magic characters, doubling the chance of problems
with existing rules files.

We could perhaps allow standalone " to still be considered valid,
but that adds ambiguity that I think we'd eventually regret.
For example, what is the interpretation of

"    "

Is that a (rather useless) translation of " to itself, or is it
a command to delete tab characters?  So I think a clean break
would be better.

            regards, tom lane

Re: BUG #18057: unaccent removes intentional spaces

From

Michael Paquier

Date:

20 August 2023, 03:20:48

On Sat, Aug 19, 2023 at 08:08:26PM -0400, Tom Lane wrote:
> I was envisioning following SQL identifier rules, that is you double
> the double quote.  Then the existing " entries would have to become
> """", which isn't much fun.  But what you're suggesting would make
> both " and \ into magic characters, doubling the chance of problems
> with existing rules files.

Apologies for the confusion.  I was thinking to also escape \ in
quoted strings.  Your suggestion to use a second double-quote for the
escaping is fine by me.  """" feels a bit ugly-ish in the rules file,
for sure, but that does not look like a huge issue to me as long as
the python script generates consistent contents ;)
--
Michael

Attachment

signature.asc

Re: BUG #18057: unaccent removes intentional spaces

From

Michael Paquier

Date:

21 August 2023, 10:14:20

On Sun, Aug 20, 2023 at 09:20:48AM +0900, Michael Paquier wrote:
> Apologies for the confusion.  I was thinking to also escape \ in
> quoted strings.  Your suggestion to use a second double-quote for the
> escaping is fine by me.  """" feels a bit ugly-ish in the rules file,
> for sure, but that does not look like a huge issue to me as long as
> the python script generates consistent contents ;)

Please find attached a patch to achieve that.  This includes tweaks
for the python script to update unaccent.rules, docs about the rules
for the quotes and tests.

The patch also includes a custom rule file that I have used to stress
more the parsing logic, but I intend to remove it in the final version
of the patch (and it fails with meson).  It would be possible to have
a TAP test that sneaks a custom rule file in the installation tree,
but that's not worth the extra cycles IMO.

What do you think?
--
Michael

Attachment

Re: BUG #18057: unaccent removes intentional spaces

From

Michael Paquier

Date:

19 September 2023, 10:26:43

On Mon, Aug 21, 2023 at 04:14:20PM +0900, Michael Paquier wrote:
> The patch also includes a custom rule file that I have used to stress
> more the parsing logic, but I intend to remove it in the final version
> of the patch (and it fails with meson).  It would be possible to have
> a TAP test that sneaks a custom rule file in the installation tree,
> but that's not worth the extra cycles IMO.

So, it's been a couple of weeks here.

I have given this patch a second look and something that stood out is
that we'd still report "more than two strings in unaccent rule" as
warning when loading a rule where the target string is an unfinished
quoted area, which was kind of confusing.

The patch has been updated to improve that by making the error state a
bit smarter, and I have removed the previous tests with the custom
file and its fancy rules.  Any comments and/or objections?
--
Michael

Attachment

Re: BUG #18057: unaccent removes intentional spaces

From

Michael Paquier

Date:

21 September 2023, 02:24:49

On Tue, Sep 19, 2023 at 04:26:43PM +0900, Michael Paquier wrote:
> The patch has been updated to improve that by making the error state a
> bit smarter, and I have removed the previous tests with the custom
> file and its fancy rules.  Any comments and/or objections?

And applied as of 59f47fb98dab.
--
Michael

Attachment

signature.asc