Thread: BUG #18057: unaccent removes intentional spaces
The following bug has been logged on the website: Bug reference: 18057 Logged by: Martin Schlossarek Email address: martin@schlossarek.me PostgreSQL version: 15.1 Operating system: Fedora 38 Description: I discovered that the unaccent extension also removes intentional spaces that are explicitly specified in the accent.rules. As far as I see it correctly, all fraction characters are affected, for example: ```sql # select unaccent('1½'); --- expected output: 1 1/2 --- actual output: 11/2 ``` Affected characters: ```bash $ curl -s "https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob_plain;f=contrib/unaccent/unaccent.rules;hb=HEAD" | grep -E " " ¼ 1/4 ½ 1/2 ¾ 3/4 ⅐ 1/7 ⅑ 1/9 ⅒ 1/10 ⅓ 1/3 ⅔ 2/3 ⅕ 1/5 ⅖ 2/5 ⅗ 3/5 ⅘ 4/5 ⅙ 1/6 ⅚ 5/6 ⅛ 1/8 ⅜ 3/8 ⅝ 5/8 ⅞ 7/8 ⅟ 1/ ↉ 0/3 ```
On Tue, Aug 15, 2023 at 07:54:57PM +0000, PG Bug reporting form wrote: > I discovered that the unaccent extension also removes intentional spaces > that are explicitly specified in the accent.rules. As far as I see it > correctly, all fraction characters are affected, for example: > > ```sql > # select unaccent('1½'); > --- expected output: 1 1/2 > --- actual output: 11/2 > ``` Agreed that this looks incorrect as-is. This goes as far as 9a206d0 when these has been introduced, and it looks like the culprit is around initTrie() where the entries are loaded. See around t_isspace, for example. -- Michael
Attachment
On Wed, Aug 16, 2023 at 09:00:43AM +0900, Michael Paquier wrote: > Agreed that this looks incorrect as-is. This goes as far as 9a206d0 > when these has been introduced, and it looks like the culprit is > around initTrie() where the entries are loaded. See around t_isspace, > for example. I was looking at the code, and my first impression was right. All leading and trailing whitespaces between the two characters listed in the rule file are discarded. The thing is that we clearly document the parsing rules for the sake of any custom files one can feed to the extension: https://www.postgresql.org/docs/devel/unaccent.html I am not sure what we can do here. Doing nothing is certainly an option, but I am wondering if we could put in place an extra rule where whitespaces can be part of the translated character if it uses double quotes, for example. Thoughts? -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > I was looking at the code, and my first impression was right. All > leading and trailing whitespaces between the two characters listed in > the rule file are discarded. The thing is that we clearly document > the parsing rules for the sake of any custom files one can feed to the > extension: > https://www.postgresql.org/docs/devel/unaccent.html > I am not sure what we can do here. Doing nothing is certainly an > option, but I am wondering if we could put in place an extra rule > where whitespaces can be part of the translated character if it uses > double quotes, for example. Thoughts? Yeah, we could extend the parsing rules that way. It would break any rules files that currently use double quote as a plain character, but it seems unlikely that there are any. In any case, as long as this wasn't back-patched I think such a change would be acceptable. regards, tom lane
On Sat, Aug 19, 2023 at 12:30:04PM -0400, Tom Lane wrote: > Yeah, we could extend the parsing rules that way. It would break > any rules files that currently use double quote as a plain character, > but it seems unlikely that there are any. In any case, as long as > this wasn't back-patched I think such a change would be acceptable. Okay, thanks. Note that we do use double-quotes as a translated character in a few cases, but as long as these are only handled as a single character we could be OK. Or actually, wouldn't it be better to always force escaping even for double quotes listed as single characters? Based on what unaccent.rules has, that's not necessary, but it could simplify the python code generating the file or the C parsing. For example, the existing '”' would become listed as "\"" in our unaccent.rules. -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > On Sat, Aug 19, 2023 at 12:30:04PM -0400, Tom Lane wrote: >> Yeah, we could extend the parsing rules that way. It would break >> any rules files that currently use double quote as a plain character, >> but it seems unlikely that there are any. In any case, as long as >> this wasn't back-patched I think such a change would be acceptable. > Okay, thanks. Note that we do use double-quotes as a translated > character in a few cases, but as long as these are only handled as a > single character we could be OK. Oh, shame on me for not actually looking into unaccent.rules :-( That does make it more complicated, although I think it's still the case that an incompatible change could be OK. > Or actually, wouldn't it be better > to always force escaping even for double quotes listed as single > characters? I was envisioning following SQL identifier rules, that is you double the double quote. Then the existing " entries would have to become """", which isn't much fun. But what you're suggesting would make both " and \ into magic characters, doubling the chance of problems with existing rules files. We could perhaps allow standalone " to still be considered valid, but that adds ambiguity that I think we'd eventually regret. For example, what is the interpretation of " " Is that a (rather useless) translation of " to itself, or is it a command to delete tab characters? So I think a clean break would be better. regards, tom lane
On Sat, Aug 19, 2023 at 08:08:26PM -0400, Tom Lane wrote: > I was envisioning following SQL identifier rules, that is you double > the double quote. Then the existing " entries would have to become > """", which isn't much fun. But what you're suggesting would make > both " and \ into magic characters, doubling the chance of problems > with existing rules files. Apologies for the confusion. I was thinking to also escape \ in quoted strings. Your suggestion to use a second double-quote for the escaping is fine by me. """" feels a bit ugly-ish in the rules file, for sure, but that does not look like a huge issue to me as long as the python script generates consistent contents ;) -- Michael
Attachment
On Sun, Aug 20, 2023 at 09:20:48AM +0900, Michael Paquier wrote: > Apologies for the confusion. I was thinking to also escape \ in > quoted strings. Your suggestion to use a second double-quote for the > escaping is fine by me. """" feels a bit ugly-ish in the rules file, > for sure, but that does not look like a huge issue to me as long as > the python script generates consistent contents ;) Please find attached a patch to achieve that. This includes tweaks for the python script to update unaccent.rules, docs about the rules for the quotes and tests. The patch also includes a custom rule file that I have used to stress more the parsing logic, but I intend to remove it in the final version of the patch (and it fails with meson). It would be possible to have a TAP test that sneaks a custom rule file in the installation tree, but that's not worth the extra cycles IMO. What do you think? -- Michael
Attachment
On Mon, Aug 21, 2023 at 04:14:20PM +0900, Michael Paquier wrote: > The patch also includes a custom rule file that I have used to stress > more the parsing logic, but I intend to remove it in the final version > of the patch (and it fails with meson). It would be possible to have > a TAP test that sneaks a custom rule file in the installation tree, > but that's not worth the extra cycles IMO. So, it's been a couple of weeks here. I have given this patch a second look and something that stood out is that we'd still report "more than two strings in unaccent rule" as warning when loading a rule where the target string is an unfinished quoted area, which was kind of confusing. The patch has been updated to improve that by making the error state a bit smarter, and I have removed the previous tests with the custom file and its fancy rules. Any comments and/or objections? -- Michael
Attachment
On Tue, Sep 19, 2023 at 04:26:43PM +0900, Michael Paquier wrote: > The patch has been updated to improve that by making the error state a > bit smarter, and I have removed the previous tests with the custom > file and its fancy rules. Any comments and/or objections? And applied as of 59f47fb98dab. -- Michael