Thread: Re: fixing tsearch locale support

On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:
> t_isdigit() and t_isspace() are just used to parse various configuration
> and data files, and surely we don't need support for encoding-dependent
> multibyte support for parsing ASCII digits and ASCII spaces.
> ... So these can be
> replaced by the normal isdigit() and isspace().

That would still call libc, and still depend on LC_CTYPE. Should we use
pure ASCII variants?

There was also some discussion about forcing LC_COLLATE and LC_CTYPE to
C, now that the default collation doesn't depend on them any more (cf.
option 1):

https://www.postgresql.org/message-id/CA+hUKGL82jG2PdgfQtwWG+_51TQ--6M9XNa3rtt7ub+S3Pmfsw@mail.gmail.com

If we do that, then it would be fine to use isdigit/isspace.

Regards,
Jeff Davis

On 12.12.24 19:14, Jeff Davis wrote:
> On Mon, 2024-12-02 at 11:57 +0100, Peter Eisentraut wrote:
>> t_isdigit() and t_isspace() are just used to parse various configuration
>> and data files, and surely we don't need support for encoding-dependent
>> multibyte support for parsing ASCII digits and ASCII spaces.
>> ... So these can be
>> replaced by the normal isdigit() and isspace().
>
> That would still call libc, and still depend on LC_CTYPE. Should we use
> pure ASCII variants?

isdigit() and isspace() in particular are widely used throughout the
backend code without such concerns. I think the assumption is that this
is not a problem in practice:

For multibyte encodings, these functions would only be able to process
the ASCII subset, and the character classification of that should be
consistent across all locales.

For single-byte encodings, among the encodings that PostgreSQL supports,
I don't think any of them actually provide non-ASCII digits or space
characters.

On Fri, 2024-12-13 at 07:16 +0100, Peter Eisentraut wrote:
> isdigit() and isspace() in particular are widely used throughout the
> backend code without such concerns. I think the assumption is that this
> is not a problem in practice: For multibyte encodings, these functions
> would only be able to process the ASCII subset, and the character
> classification of that should be consistent across all locales. For
> single-byte encodings, among the encodings that PostgreSQL supports, I
> don't think any of them actually provide non-ASCII digits or space
> characters.

OK, that's fine with me for this patch series.

Eventually though, I think we should have built-in versions of these
ASCII functions. Even if there's no actual problem, it would more
clearly indicate that we only care about ASCII at that particular call
site, and eliminate questions about what libc might do on some platform
for some encoding/locale combination. It would also make it easier to
search for locale-sensitive functions in the codebase.

Regards,
Jeff Davis

On 12/13/24 6:07 PM, Jeff Davis wrote:
> OK, that's fine with me for this patch series.
>
> Eventually though, I think we should have built-in versions of these
> ASCII functions. Even if there's no actual problem, it would more
> clearly indicate that we only care about ASCII at that particular call
> site, and eliminate questions about what libc might do on some platform
> for some encoding/locale combination. It would also make it easier to
> search for locale-sensitive functions in the codebase.

+1

I had exactly the same idea.

Andreas

On 17.12.24 16:25, Andreas Karlsson wrote:
> On 12/13/24 6:07 PM, Jeff Davis wrote:
>> OK, that's fine with me for this patch series.
>>
>> Eventually though, I think we should have built-in versions of these
>> ASCII functions. Even if there's no actual problem, it would more
>> clearly indicate that we only care about ASCII at that particular call
>> site, and eliminate questions about what libc might do on some platform
>> for some encoding/locale combination. It would also make it easier to
>> search for locale-sensitive functions in the codebase.
>
> +1 I had exactly the same idea.

Yes, I think that could make sense.
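
For illustration only, a minimal sketch of what built-in ASCII-only
helpers of the kind discussed above might look like. The names
ascii_isdigit() and ascii_isspace() are hypothetical and were not proposed
in this thread. The point is that they never call libc's <ctype.h>
functions and so cannot depend on LC_CTYPE:

```c
#include <stdbool.h>

/*
 * Hypothetical ASCII-only classification helpers (names are assumptions,
 * not taken from the thread).  They compare byte values directly, so the
 * result is the same for every encoding/locale combination.
 */
static inline bool
ascii_isdigit(unsigned char c)
{
    return c >= '0' && c <= '9';
}

static inline bool
ascii_isspace(unsigned char c)
{
    /*
     * Space plus the contiguous range \t \n \v \f \r (0x09..0x0D): the
     * same six characters isspace() accepts in the "C" locale.
     */
    return c == ' ' || (c >= '\t' && c <= '\r');
}
```

Taking unsigned char also sidesteps the usual pitfall of passing a
negative plain char to isdigit()/isspace(), and a distinct name makes it
possible to grep for the remaining locale-sensitive call sites.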