Re: Remaining dependency on setlocale() - Mailing list pgsql-hackers

From Chao Li
Subject Re: Remaining dependency on setlocale()
Date
Msg-id 83F6E6A9-FF29-41C9-9E15-EA5360E912D0@gmail.com
Whole thread Raw
In response to Re: Remaining dependency on setlocale()  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers

> On Nov 26, 2025, at 01:20, Jeff Davis <pgsql@j-davis.com> wrote:
>
>
> New series attached with only these changes and a rebase.
>
> Regards,
> Jeff Davis
>
>
<v10-0001-Inline-pg_ascii_tolower-and-pg_ascii_toupper.patch><v10-0002-Add-define-for-UNICODE_CASEMAP_BUFSZ.patch><v10-0003-Change-some-callers-to-use-pg_ascii_toupper.patch><v10-0004-Allow-pg_locale_t-APIs-to-work-when-ctype_is_c.patch><v10-0005-Make-regex-max_chr-depend-on-encoding-not-provid.patch><v10-0006-Fix-inconsistency-between-ltree_strncasecmp-and-.patch><v10-0007-Remove-char_tolower-API.patch><v10-0008-Use-multibyte-aware-extraction-of-pattern-prefix.patch><v10-0009-fuzzystrmatch-use-pg_ascii_toupper.patch><v10-0010-downcase_identifier-use-method-table-from-locale.patch><v10-0011-Control-LC_COLLATE-with-GUC.patch>

I continued reviewing 0005-0008.

0005 - no comment. The change looks correct to me. Before the patch, pg_regex_locale->ctype->max_chr <= MAX_SIMPLE_CHR
shouldbe the more common cases, with the patch, MAX_SIMPLE_CHR >= UCHAR_MAX should be the more common cases, thus there
shouldbe not a behavior change. 

5 - 0006
```
@@ -77,10 +77,37 @@ compare_subnode(ltree_level *t, char *qn, int len, int (*cmpptr) (const char *,
 int
 ltree_strncasecmp(const char *a, const char *b, size_t s)
 {
-    char       *al = str_tolower(a, s, DEFAULT_COLLATION_OID);
-    char       *bl = str_tolower(b, s, DEFAULT_COLLATION_OID);
+    static pg_locale_t locale = NULL;
+    size_t        al_sz = s + 1;
+    char       *al = palloc(al_sz);
+    size_t        bl_sz = s + 1;
+    char       *bl = palloc(bl_sz);
+    size_t        needed;
     int            res;

+    if (!locale)
+        locale = pg_database_locale();
+
+    needed = pg_strfold(al, al_sz, a, s, locale);
+    if (needed + 1 > al_sz)
+    {
+        /* grow buffer if needed and retry */
+        al_sz = needed + 1;
+        al = repalloc(al, al_sz);
+        needed = pg_strfold(al, al_sz, a, s, locale);
+        Assert(needed + 1 <= al_sz);
+    }
+
+    needed = pg_strfold(bl, bl_sz, b, s, locale);
+    if (needed + 1 > bl_sz)
+    {
+        /* grow buffer if needed and retry */
+        bl_sz = needed + 1;
+        bl = repalloc(bl, bl_sz);
+        needed = pg_strfold(bl, bl_sz, b, s, locale);
+        Assert(needed + 1 <= bl_sz);
+    }
+
     res = strncmp(al, bl, s);

     pfree(al);
```

I do think the new implementation has some problem.

* The retry logic implies that a single-byte char may become multiple bytes after folding, otherwise retry is not
neededbecause you have allocated s+1 bytes for dest buffers. From this perspective, we should use two needed variables:
neededAand neededB, if neededA != neededB, then the two strings are different; if neededA == neededB, then we should be
performstrncmp, but here we should pass neededA (or neededB as they are identical) to strncmp(al, bl, neededA). 

* Based on the logic you implemented in 0004, first pg_strfold() has copied as many chars as possible to dest buffer,
sowhen retry, ideally we can should resume instead of start over. However, if single-byte->multi-byte folding happens,
wehave no information to decide from where to resume. From this perspective, in 0004, do we really need to take the
try-the-beststrategy for strlower_c()? If there are some other use cases that require data to be placed in dest buffer
evenif dest buffer doesn’t have enough space, then my patch [1] of changing strlower_libc_sb() should be considered. 

6 - 0007
```
     /*
-     * For efficiency reasons, in the single byte case we don't call lower()
-     * on the pattern and text, but instead call SB_lower_char on each
-     * character.  In the multi-byte case we don't have much choice :-(. Also,
-     * ICU does not support single-character case folding, so we go the long
-     * way.
+     * For efficiency reasons, in the C locale we don't call lower() on the
+     * pattern and text, but instead call SB_lower_char on each character.
      */
```

SB_lower_char should be changed to C_IMatchText.

7 - 0007
```
/* setup to compile like_match.c for single byte case insensitive matches */
-#define MATCH_LOWER(t, locale) SB_lower_char((unsigned char) (t), locale)
+#define MATCH_LOWER
 ```

I think the comment should be updated accordingly, like “for ILIKE in the C locale”.


8 - 0008
```
+    /* for text types, use pg_wchar; for BYTEA, use char */
     if (typeid != BYTEAOID)
     {
-        patt = TextDatumGetCString(patt_const->constvalue);
-        pattlen = strlen(patt);
+        text       *val = DatumGetTextPP(patt_const->constvalue);
+        pg_wchar   *wpatt;
+        pg_wchar   *wmatch;
+        char       *match;
+
+        patt = VARDATA_ANY(val);
+        pattlen = VARSIZE_ANY_EXHDR(val);
+        wpatt = palloc((pattlen + 1) * sizeof(pg_wchar));
+        wmatch = palloc((pattlen + 1) * sizeof(pg_wchar));
+        pg_mb2wchar_with_len(patt, wpatt, pattlen);
+
+        match = palloc(pattlen + 1);
```

* match is allocated with pattlen+1 bytes, is it long enough to hold pattlen multiple-byte chars?

* match is not freed, but looks like it should be:

*prefix_const = string_to_const(match, typeid);
 -> string_to_datum(str, datatype);
 -> CStringGetTextDatum(str);
 -> cstring_to_text(s)
 -> cstring_to_text_with_len(s, strlen(s));
 -> *result = (text *) palloc(len + VARHDRSZ);

So, match has been copied, it should be freed.

9 - 0008
```
-    }
+            /* Backslash escapes the next character */
+            if (patt[pos] == '\\')
+            {
+                pos++;
+                if (pos >= pattlen)
+                    break;
+            }

-    match[match_pos] = '\0';
+            match[match_pos++] = pos;
+        }
```

Should “pos” be “part[pos]” assigning to match[match_pos++]?

I will review the rest 3 commits tomorrow.

[1] https://postgr.es/m/CAEoWx2m9mUN397neL=p9x0vaVcj5EGiKD53F1MNTwTDXizxiaA@mail.gmail.com

--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/







pgsql-hackers by date:

Previous
From: Masahiko Sawada
Date:
Subject: Re: POC: enable logical decoding when wal_level = 'replica' without a server restart
Next
From: Michael Paquier
Date:
Subject: Re: Extended Statistics set/restore/clear functions.