Thread: [PATCH] Expand character set for ltree labels

[PATCH] Expand character set for ltree labels

From
Garen Torikian
Date:
Dear hackers,

I am submitting a patch to expand the label requirements for ltree. 

The current format is restricted to alphanumeric characters, plus _. Unfortunately, for non-English labels, this set is insufficient. Rather than figure out how to expand this set to include characters beyond the ASCII limit, I have instead opted to provide users with some mechanism for storing encoded UTF-8 characters which is widely used: punycode (https://en.wikipedia.org/wiki/Punycode).

The punycode range of characters is the exact same set as the existing ltree range, with the addition of a hyphen (-). Within this system, any human language can be encoded using just A-Za-z0-9-. 

On top of this, I added support for two more characters: # and ;, which are used for HTML entities. Note that & and % have special significance in the existing ltree logic; users would have to encode items as #20; (rather than %20). This seems a fair compromise.

Since the encoding could make a regular slug even longer, I have also doubled the character limit, from 256 to 512.

Please let me know if I can provide any more information or changes.

Very sincerely,
Garen
Attachment

Re: [PATCH] Expand character set for ltree labels

From
Nathan Bossart
Date:
On Tue, Oct 04, 2022 at 12:54:46PM -0400, Garen Torikian wrote:
> The punycode range of characters is the exact same set as the existing
> ltree range, with the addition of a hyphen (-). Within this system, any
> human language can be encoded using just A-Za-z0-9-.

IIUC ASCII characters like '!' and '<' are valid Punycode characters, but
even with your proposal, those wouldn't be allowed.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: [PATCH] Expand character set for ltree labels

From
Garen Torikian
Date:
No, not quite.

Valid Punycode characters are `[A-Za-z0-9-]`. This proposal includes `-`, as well as `#` and `;` for HTML entities.  

I double-checked the RFC to see the valid Punycode characters and the set above is indeed correct: https://datatracker.ietf.org/doc/html/draft-ietf-idn-punycode-02#section-5

While it would be nice for ltree labels to support *any* printable character, it can't because symbols like `!` and `%` already have special meaning in the querying. This proposal leaves those as is and does not depend on any existing special character.

On Tue, Oct 4, 2022 at 6:32 PM Nathan Bossart <nathandbossart@gmail.com> wrote:
On Tue, Oct 04, 2022 at 12:54:46PM -0400, Garen Torikian wrote:
> The punycode range of characters is the exact same set as the existing
> ltree range, with the addition of a hyphen (-). Within this system, any
> human language can be encoded using just A-Za-z0-9-.

IIUC ASCII characters like '!' and '<' are valid Punycode characters, but
even with your proposal, those wouldn't be allowed.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

Re: [PATCH] Expand character set for ltree labels

From
Tom Lane
Date:
Garen Torikian <gjtorikian@gmail.com> writes:
> I am submitting a patch to expand the label requirements for ltree.

> The current format is restricted to alphanumeric characters, plus _.
> Unfortunately, for non-English labels, this set is insufficient.

Hm?  Perhaps the docs are a bit unclear about that, but it's not
restricted to ASCII alphanumerics.  AFAICS the code will accept
whatever iswalpha() and iswdigit() will accept in the database's
default locale.  There's certainly work that could/should be done
to allow use of not-so-default locales, but that's not specific
to ltree.  I'm not sure that doing an application-side encoding
is attractive compared to just using that ability directly.

If you do want to do application-side encoding, I'm unsure why
punycode would be the choice anyway, as opposed to something
that can fit in the existing restrictions.

> On top of this, I added support for two more characters: # and ;, which are
> used for HTML entities.

That seems really pretty random.

            regards, tom lane



Re: [PATCH] Expand character set for ltree labels

From
Garen Torikian
Date:
Hi Tom,

> Perhaps the docs are a bit unclear about that, but it's not
> restricted to ASCII alphanumerics.  AFAICS the code will accept
> whatever iswalpha() and iswdigit() will accept in the database's
> default locale.  

Sorry but I don't think that is correct. Here is the single definition check of what constitutes a valid character: https://github.com/postgres/postgres/blob/c3315a7da57be720222b119385ed0f7ad7c15268/contrib/ltree/ltree.h#L129

As you can see, there are no `is_*` calls at all. Where in this contrib package do you see `iswalpha`? Perhaps I missed it.

> That seems really pretty random.

Ok. I am trying to avoid a situation where other users may wish to use other delimiters other than `-`, due to its commonplace presence in words (eg., compound ones).

On Wed, Oct 5, 2022 at 2:59 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Garen Torikian <gjtorikian@gmail.com> writes:
> I am submitting a patch to expand the label requirements for ltree.

> The current format is restricted to alphanumeric characters, plus _.
> Unfortunately, for non-English labels, this set is insufficient.

Hm?  Perhaps the docs are a bit unclear about that, but it's not
restricted to ASCII alphanumerics.  AFAICS the code will accept
whatever iswalpha() and iswdigit() will accept in the database's
default locale.  There's certainly work that could/should be done
to allow use of not-so-default locales, but that's not specific
to ltree.  I'm not sure that doing an application-side encoding
is attractive compared to just using that ability directly.

If you do want to do application-side encoding, I'm unsure why
punycode would be the choice anyway, as opposed to something
that can fit in the existing restrictions.

> On top of this, I added support for two more characters: # and ;, which are
> used for HTML entities.

That seems really pretty random.

                        regards, tom lane

Re: [PATCH] Expand character set for ltree labels

From
Tom Lane
Date:
Garen Torikian <gjtorikian@gmail.com> writes:
>> Perhaps the docs are a bit unclear about that, but it's not
>> restricted to ASCII alphanumerics.  AFAICS the code will accept
>> whatever iswalpha() and iswdigit() will accept in the database's
>> default locale.

> Sorry but I don't think that is correct. Here is the single
> definition check of what constitutes a valid character:
> https://github.com/postgres/postgres/blob/c3315a7da57be720222b119385ed0f7ad7c15268/contrib/ltree/ltree.h#L129

> As you can see, there are no `is_*` calls at all.

Did you chase down what t_isalpha and t_isdigit do?

            regards, tom lane



Re: [PATCH] Expand character set for ltree labels

From
Garen Torikian
Date:
After digging into it, you are completely correct. I had to do a bit more reading to understand the relationships between UTF-8 and wchar, but ultimately the existing locale support works for my use case.

Therefore I have updated the patch with three much smaller changes:

* Support for `-` in addition to `_`
* Expanding the limit to 512 chars (from the existing 256); again it's not uncommon for non-English strings to be much longer
* Fixed the documentation to expand on what the ltree label's relationship to the DB locale is

Thank you,
Garen 

On Wed, Oct 5, 2022 at 3:56 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Garen Torikian <gjtorikian@gmail.com> writes:
>> Perhaps the docs are a bit unclear about that, but it's not
>> restricted to ASCII alphanumerics.  AFAICS the code will accept
>> whatever iswalpha() and iswdigit() will accept in the database's
>> default locale.

> Sorry but I don't think that is correct. Here is the single
> definition check of what constitutes a valid character:
> https://github.com/postgres/postgres/blob/c3315a7da57be720222b119385ed0f7ad7c15268/contrib/ltree/ltree.h#L129

> As you can see, there are no `is_*` calls at all.

Did you chase down what t_isalpha and t_isdigit do?

                        regards, tom lane
Attachment

Re: [PATCH] Expand character set for ltree labels

From
Ian Lawrence Barwick
Date:
2022年10月6日(木) 7:05 Garen Torikian <gjtorikian@gmail.com>:
>
> After digging into it, you are completely correct. I had to do a bit more reading to understand the relationships
betweenUTF-8 and wchar, but ultimately the existing locale support works for my use case. 
>
> Therefore I have updated the patch with three much smaller changes:
>
> * Support for `-` in addition to `_`
> * Expanding the limit to 512 chars (from the existing 256); again it's not uncommon for non-English strings to be
muchlonger 
> * Fixed the documentation to expand on what the ltree label's relationship to the DB locale is
>
> Thank you,
> Garen

This entry was marked as "Needs review" in the CommitFest app but cfbot
reports the patch no longer applies.

We've marked it as "Waiting on Author". As CommitFest 2022-11 is
currently underway, this would be an excellent time update the patch.

Once you think the patchset is ready for review again, you (or any
interested party) can  move the patch entry forward by visiting

    https://commitfest.postgresql.org/40/3929/

and changing the status to "Needs review".


Thanks

Ian Barwick



Re: [PATCH] Expand character set for ltree labels

From
vignesh C
Date:
On Thu, 6 Oct 2022 at 03:35, Garen Torikian <gjtorikian@gmail.com> wrote:
>
> After digging into it, you are completely correct. I had to do a bit more reading to understand the relationships
betweenUTF-8 and wchar, but ultimately the existing locale support works for my use case.
 
>
> Therefore I have updated the patch with three much smaller changes:
>
> * Support for `-` in addition to `_`
> * Expanding the limit to 512 chars (from the existing 256); again it's not uncommon for non-English strings to be
muchlonger
 
> * Fixed the documentation to expand on what the ltree label's relationship to the DB locale is

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
e351f85418313e97c203c73181757a007dfda6d0 ===
=== applying patch ./0002-Expand-character-set-for-ltree-labels.patch
patching file contrib/ltree/expected/ltree.out
patching file contrib/ltree/ltree.h
Hunk #2 FAILED at 126.
1 out of 2 hunks FAILED -- saving rejects to file contrib/ltree/ltree.h.rej

[1] - http://cfbot.cputube.org/patch_41_3929.log

Regards,
Vignesh



Re: [PATCH] Expand character set for ltree labels

From
Garen Torikian
Date:
Sure. Rebased onto HEAD.

On Tue, Jan 3, 2023 at 7:27 AM vignesh C <vignesh21@gmail.com> wrote:
On Thu, 6 Oct 2022 at 03:35, Garen Torikian <gjtorikian@gmail.com> wrote:
>
> After digging into it, you are completely correct. I had to do a bit more reading to understand the relationships between UTF-8 and wchar, but ultimately the existing locale support works for my use case.
>
> Therefore I have updated the patch with three much smaller changes:
>
> * Support for `-` in addition to `_`
> * Expanding the limit to 512 chars (from the existing 256); again it's not uncommon for non-English strings to be much longer
> * Fixed the documentation to expand on what the ltree label's relationship to the DB locale is

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
e351f85418313e97c203c73181757a007dfda6d0 ===
=== applying patch ./0002-Expand-character-set-for-ltree-labels.patch
patching file contrib/ltree/expected/ltree.out
patching file contrib/ltree/ltree.h
Hunk #2 FAILED at 126.
1 out of 2 hunks FAILED -- saving rejects to file contrib/ltree/ltree.h.rej

[1] - http://cfbot.cputube.org/patch_41_3929.log

Regards,
Vignesh
Attachment

Re: [PATCH] Expand character set for ltree labels

From
vignesh C
Date:
On Wed, 4 Jan 2023 at 00:27, Garen Torikian <gjtorikian@gmail.com> wrote:
>
> Sure. Rebased onto HEAD.
>

There is one more merge conflict,  please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
eb5ad4ff05fd382ac98cab60b82f7fd6ce4cfeb8 ===
=== applying patch ./0003-Expand-character-set-for-ltree-labels.patch
patching file contrib/ltree/expected/ltree.out
Hunk #1 succeeded at 25 with fuzz 2.
Hunk #2 FAILED at 51.
Hunk #3 FAILED at 537.
Hunk #4 succeeded at 1201 with fuzz 2.
2 out of 4 hunks FAILED -- saving rejects to file
contrib/ltree/expected/ltree.out.rej

Regards,
Vignesh



Re: [PATCH] Expand character set for ltree labels

From
Andrew Dunstan
Date:
On 2022-10-05 We 18:05, Garen Torikian wrote:
> After digging into it, you are completely correct. I had to do a bit
> more reading to understand the relationships between UTF-8 and wchar,
> but ultimately the existing locale support works for my use case.
>
> Therefore I have updated the patch with three much smaller changes:
>
> * Support for `-` in addition to `_`
> * Expanding the limit to 512 chars (from the existing 256); again it's
> not uncommon for non-English strings to be much longer
> * Fixed the documentation to expand on what the ltree label's
> relationship to the DB locale is
>

Regardless of the punycode issue, allowing hyphens in ltree labels seems
quite reasonable. I haven't reviewed the patch yet, but if it's OK I
intend to commit it.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: [PATCH] Expand character set for ltree labels

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2022-10-05 We 18:05, Garen Torikian wrote:
>> Therefore I have updated the patch with three much smaller changes:
>> 
>> * Support for `-` in addition to `_`
>> * Expanding the limit to 512 chars (from the existing 256); again it's
>> not uncommon for non-English strings to be much longer
>> * Fixed the documentation to expand on what the ltree label's
>> relationship to the DB locale is

> Regardless of the punycode issue, allowing hyphens in ltree labels seems
> quite reasonable. I haven't reviewed the patch yet, but if it's OK I
> intend to commit it.

No objection to allowing hyphens.  If we're going to increase the length
limit, how about using 1000 characters?  AFAICS we could even get away
with 10K, but it's probably best to hold a bit or two in reserve in case
we ever want flags or something associated with labels.

(I've not read the patch.)

            regards, tom lane



Re: [PATCH] Expand character set for ltree labels

From
Andrew Dunstan
Date:
On 2023-01-06 Fr 11:29, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
>> Regardless of the punycode issue, allowing hyphens in ltree labels seems
>> quite reasonable. I haven't reviewed the patch yet, but if it's OK I
>> intend to commit it.
> No objection to allowing hyphens.  If we're going to increase the length
> limit, how about using 1000 characters?  AFAICS we could even get away
> with 10K, but it's probably best to hold a bit or two in reserve in case
> we ever want flags or something associated with labels.
>

OK, done that way.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com