Thread: Re: fixing tsearch locale support

Re: fixing tsearch locale support

From

Jeff Davis

Date:

12 December 2024, 21:14:21

On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:
> t_isdigit() and t_isspace() are just used to parse various
> configuration and data files, and surely we don't need support for
> encoding-dependent multibyte support for parsing ASCII digits and
> ASCII spaces.
> ... So these can be replaced by the normal isdigit() and isspace().

That would still call libc, and still depend on LC_CTYPE. Should we use
pure ASCII variants?

There was also some discussion about forcing LC_COLLATE and LC_CTYPE to
C, now that the default collation doesn't depend on them any more (cf.
option 1):

https://www.postgresql.org/message-id/CA+hUKGL82jG2PdgfQtwWG+_51TQ--6M9XNa3rtt7ub+S3Pmfsw@mail.gmail.com

If we do that, then it would be fine to use isdigit/isspace.

Regards,
    Jeff Davis

Re: fixing tsearch locale support

From

Peter Eisentraut

Date:

13 December 2024, 09:16:22

On 12.12.24 19:14, Jeff Davis wrote:
> On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:
>> t_isdigit() and t_isspace() are just used to parse various
>> configuration and data files, and surely we don't need support for
>> encoding-dependent multibyte support for parsing ASCII digits and
>> ASCII spaces.
>> ... So these can be replaced by the normal isdigit() and isspace().
> 
> That would still call libc, and still depend on LC_CTYPE. Should we use
> pure ASCII variants?

isdigit() and isspace() in particular are widely used throughout the 
backend code without such concerns.  I think the assumption is that this 
is not a problem in practice: For multibyte encodings, these functions 
would only be able to process the ASCII subset, and the character 
classification of that should be consistent across all locales.  For 
single-byte encodings, among the encodings that PostgreSQL supports, I 
don't think any of them actually provide non-ASCII digits or space 
characters.
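
To illustrate the multibyte half of that argument, here is a minimal sketch that assumes UTF-8 input. In UTF-8, every byte of a multibyte sequence has its high bit set, so a byte-wise scan that only accepts the ASCII range can never mistake part of a multibyte character for a digit. The function name count_ascii_digits() is hypothetical and does not come from the patch series.

```c
#include <stddef.h>

/*
 * Hypothetical helper, for illustration only: count the ASCII digits in a
 * NUL-terminated string by scanning it byte by byte.
 *
 * Assumes an encoding such as UTF-8, where every byte of a multibyte
 * character is >= 0x80 and therefore never falls in the range '0'..'9'.
 */
static size_t
count_ascii_digits(const char *s)
{
    size_t      n = 0;

    for (; *s != '\0'; s++)
    {
        /* Go through unsigned char so bytes >= 0x80 are not negative. */
        unsigned char c = (unsigned char) *s;

        if (c >= '0' && c <= '9')
            n++;
    }
    return n;
}
```

For example, count_ascii_digits("a1\xC3\xA9" "2") sees the two bytes of the UTF-8 encoded "é" (0xC3 0xA9) and skips both, counting only the "1" and the "2".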

Re: fixing tsearch locale support

From

Jeff Davis

Date:

13 December 2024, 20:07:54

On Fri, 2024-12-13 at 07:16 +0100, Peter Eisentraut wrote:
> isdigit() and isspace() in particular are widely used throughout the
> backend code without such concerns.  I think the assumption is that
> this is not a problem in practice: For multibyte encodings, these
> functions would only be able to process the ASCII subset, and the
> character classification of that should be consistent across all
> locales.  For single-byte encodings, among the encodings that
> PostgreSQL supports, I don't think any of them actually provide
> non-ASCII digits or space characters.

OK, that's fine with me for this patch series.

Eventually though, I think we should have built-in versions of these
ASCII functions. Even if there's no actual problem, it would more
clearly indicate that we only care about ASCII at that particular call
site, and eliminate questions about what libc might do on some platform
for some encoding/locale combination. It would also make it easier to
search for locale-sensitive functions in the codebase.

Regards,
    Jeff Davis
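
A minimal sketch of what such built-in ASCII-only helpers might look like. The names ascii_isdigit() and ascii_isspace() are hypothetical and are not taken from PostgreSQL or from this patch series. Neither function calls libc, so the result cannot depend on LC_CTYPE.

```c
#include <stdbool.h>

/*
 * Hypothetical ASCII-only classification helpers, for illustration only.
 * They assume an ASCII-compatible execution character set.
 */
static inline bool
ascii_isdigit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

static inline bool
ascii_isspace(unsigned char c)
{
    /*
     * The six whitespace characters of the "C" locale: space, plus the
     * contiguous ASCII range \t, \n, \v, \f, \r (0x09 through 0x0D).
     */
    return c == ' ' || (c >= '\t' && c <= '\r');
}
```

A call site would then read ascii_isdigit((unsigned char) *ptr) instead of isdigit((unsigned char) *ptr), which makes the ASCII-only intent visible and keeps such calls out of a search for locale-sensitive functions.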

Re: fixing tsearch locale support

From

Andreas Karlsson

Date:

17 December 2024, 18:25:22

On 12/13/24 6:07 PM, Jeff Davis wrote:
> OK, that's fine with me for this patch series.
> 
> Eventually though, I think we should have built-in versions of these
> ASCII functions. Even if there's no actual problem, it would more
> clearly indicate that we only care about ASCII at that particular call
> site, and eliminate questions about what libc might do on some platform
> for some encoding/locale combination. It would also make it easier to
> search for locale-sensitive functions in the codebase.

+1 I had exactly the same idea.

Andreas

Re: fixing tsearch locale support

From

Peter Eisentraut

Date:

17 December 2024, 21:27:07

On 17.12.24 16:25, Andreas Karlsson wrote:
> On 12/13/24 6:07 PM, Jeff Davis wrote:
>> OK, that's fine with me for this patch series.
>>
>> Eventually though, I think we should have built-in versions of these
>> ASCII functions. Even if there's no actual problem, it would more
>> clearly indicate that we only care about ASCII at that particular call
>> site, and eliminate questions about what libc might do on some platform
>> for some encoding/locale combination. It would also make it easier to
>> search for locale-sensitive functions in the codebase.
> 
> +1 I had exactly the same idea.

Yes, I think that could make sense.