Thread: Performance degradation in Index searches with special characters

Performance degradation in Index searches with special characters

From: Andrey Stikheev

Dear PostgreSQL Community,

I am facing significant performance issues when executing queries that involve string comparisons with special characters, such as <, #, !, @, etc., especially when dealing with long strings. The query execution time increases drastically when these characters are used, whereas queries with alphabetic characters do not show such degradation. This behavior is observed both on macOS (using the official postgres:17 image via Docker) and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.

Here is a minimal example:

testdb=# SELECT version();
                                                          version
---------------------------------------------------------------------------------------------------------------------------
 PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)

testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
CREATE TABLE
Time: 3.562 ms

testdb=# CREATE INDEX idx_test ON test (value);
CREATE INDEX
Time: 3.080 ms

testdb=# INSERT INTO test (value) VALUES ('<');
INSERT 0 1
Time: 3.365 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
 ?column?
----------
(0 rows)
Time: 4454.535 ms (00:04.455)

testdb=# SELECT 1 FROM test WHERE value = repeat('a', 65536);
 ?column?
----------
(0 rows)
Time: 3.772 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) || 'a';
 ?column?
----------
(0 rows)
Time: 4.352 ms


testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
 ?column?
----------
(0 rows)
Time: 3.299 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('@', 8192);
 ?column?
----------
(0 rows)
Time: 77.171 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('@', 16384);
 ?column?
----------
(0 rows)
Time: 325.190 ms

testdb=# SELECT 1 FROM test WHERE value = repeat('@', 32768);
 ?column?
----------
(0 rows)
Time: 1154.850 ms (00:01.155)

testdb=# SELECT 1 FROM test WHERE value = repeat('@', 65536);
 ?column?
----------
(0 rows)
Time: 4490.206 ms (00:04.490)


testdb=# explain (analyze, verbose, buffers, costs, settings, timing, wal) SELECT 1 FROM test WHERE value = repeat('<', 65000);

                                          QUERY PLAN
---------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test  (cost=4.20..13.67 rows=6 width=4) (actual time=4425.459..4425.459 rows=0 loops=1)
   Output: 1
   Recheck Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ...
   <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
   Buffers: shared hit=1
   ->  Bitmap Index Scan on idx_test  (cost=0.00..4.20 rows=6 width=0) (actual time=4425.432..4425.432 rows=0 loops=1)
         Index Cond: ((test.value)::text = '<<<<<<<<<<<<<<<<<<<<<<<<<<<< ... a lot symbols ...<<<<<<<<<<<<<<<<<<<<<<<<<<'::text)
         Buffers: shared hit=1
 Planning Time: 1.082 ms
 Execution Time: 4425.602 ms
(9 rows)

Time: 4433.001 ms (00:04.433)

Observations:

  • The performance degradation occurs with certain special characters such as <, !, >, @, #, etc.
  • Queries using alphabetic characters, or with an extra character appended or prepended, execute much faster.
  • The execution time grows roughly quadratically with the length of a string composed of special characters: 77 ms at 8192 characters, 325 ms at 16384, 1155 ms at 32768, 4490 ms at 65536 (about 4x per doubling).
  • Changing the collation to 'C' in the query significantly improves performance.

Questions:

  • Is this performance degradation expected due to collation handling of certain special characters in PostgreSQL?
  • Are there any recommendations to improve performance without changing the column or database collation?
--
Best regards,
Andrey Stikheev
Andrey Stikheev <andrey.stikheev@gmail.com> writes:
>    - Changing the collation to 'C' in the query significantly improves
>    performance.

What collation are you using, pray tell?  (And what database encoding?)

>    - Is this performance degradation expected due to collation handling of
>    certain special characters in PostgreSQL?

It seems like a performance bug indeed, but not ours --- I'm thinking
it must be in strcoll() or ICU.

Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
something similar but not quite as bad:

u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
 ?column?
----------
(0 rows)

Time: 1383.033 ms (00:01.383)

Poking into it with gdb says that the time is indeed spent inside
strcoll:

#0  get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900 "",
    table=0x7fabb490e2b0, weights=<optimized out>,
    rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
    seq=<synthetic pointer>) at strcoll_l.c:111
#1  __GI___strcoll_l (s1=0x1785878 "<",
    s2=0x178587a '<' <repeats 200 times>..., l=<optimized out>)
    at strcoll_l.c:338
#2  0x00000000009527a6 in strncoll_libc (arg1=<optimized out>, len1=1,
    arg2=<optimized out>, len2=65536, locale=<optimized out>,
    locale=<optimized out>) at pg_locale.c:1964
#3  0x00000000009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
    arg2=0x17958cc '<' <repeats 200 times>..., len2=65536,
    collid=<optimized out>) at varlena.c:1567
#4  0x00000000009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at varlena.c:1820
#5  0x00000000009d75fa in FunctionCall2Coll (
    flinfo=flinfo@entry=0x7ffddd3b10e8, collation=<optimized out>,
    arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1161
#6  0x0000000000594948 in _bt_compare (rel=0x7fabcde7eed0, key=0x7ffddd3b10c0,
    page=<optimized out>, offnum=<optimized out>) at nbtsearch.c:762
#7  0x0000000000594e32 in _bt_binsrch (rel=rel@entry=0x7fabcde7eed0,
    key=key@entry=0x7ffddd3b10c0, buf=<optimized out>) at nbtsearch.c:394

It's not the fault of the index machinery, because a single comparison
takes the same amount of time:

u8=# select '<' <= repeat('<', 65536);
 ?column?
----------
 t
(1 row)

Time: 1391.550 ms (00:01.392)

I didn't try it with ICU.

            regards, tom lane



Re: Performance degradation in Index searches with special characters

From: Andrey Stikheev

Hi, Tom!

Thanks for your feedback. After looking into it further, the performance issue does indeed seem to be related to the default collation settings, specifically how the glibc strcoll_l function handles certain special characters such as <. I confirmed this during my testing on Debian 12 with glibc 2.36, the OS and glibc used by the official postgres Docker image: https://hub.docker.com/_/postgres

My test database settings:

test_db=# \d+ test
                                                  Table "public.test"
 Column |         Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 value  | character varying(10) |           | not null |         | extended |             |              |
Indexes:
    "idx_test" btree (value)
Access method: heap


test_db=# \l test_db
                                              List of databases
  Name   |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | Locale | ICU Rules | Access privileges
---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------
 test_db | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |        |           |
(1 row)



strcoll_l tests:

root@715b19170a89:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

root@715b19170a89:~# ldd --version
ldd (Debian GLIBC 2.36-9+deb12u8) 2.36
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

root@715b19170a89:~# cat test.c
/* Minimal reproduction: time a single strcoll_l() comparison under
 * en_US.UTF-8, outside PostgreSQL. Build with: gcc -O2 -o test test.c */
#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <string.h>

int main() {
    char str1[] = "<";
    char *str2 = malloc(65536); // 65535 characters + 1 for null terminator
    if (!str2) return 1;

    memset(str2, '<', 65535);
    str2[65535] = '\0';

    locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);
    if (locale == (locale_t) 0) { free(str2); return 1; } // locale unavailable

    int result = strcoll_l(str1, str2, locale);
    printf("Comparison result: %d\n", result);

    freelocale(locale);
    free(str2);
    return 0;
}


root@715b19170a89:~# time ./test
Comparison result: -1

real	0m4.487s
user	0m4.483s
sys	0m0.003s


I'm considering switching to ICU collations, as they might handle this more efficiently. However, as far as I know, ICU isn't the default collation provider in PostgreSQL, and switching to it in a live environment isn't a straightforward process. The main concern is that glibc's default collation (en_US.UTF-8) is widely used, and this behavior opens up the potential for a denial-of-service (DoS) attack: if user input includes long strings of repeated characters such as <, each string comparison can take seconds, which is severe in high-traffic scenarios.
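
To make the mitigation concrete, a minimal sketch (illustrative only; repeat() stands in for an untrusted parameter):

-- Guard: value is varchar(10), so longer input can never match anyway;
-- rejecting it up front keeps huge strings away from strcoll_l entirely
-- (the cheap guard is constant-folded here, so the comparison never runs):
SELECT 1 FROM test
 WHERE char_length(repeat('<', 65536)) <= 10
   AND value = repeat('<', 65536);

-- Or pin just this comparison to a byte-wise collation; for a
-- deterministic collation this does not change equality semantics:
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";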



--
Best regards,
Andrey Stikheev
On 10/6/24 13:28, Andrey Stikheev wrote:
> Thanks for your feedback. After looking into it further, the
> performance issue does indeed seem to be related to the default
> collation settings, specifically how the glibc strcoll_l function
> handles certain special characters such as <. I confirmed this during
> my testing on Debian 12 with glibc 2.36, the OS and glibc used by the
> official postgres Docker image: https://hub.docker.com/_/postgres

This is not surprising. There is a performance regression that started 
in glibc 2.21 with regard to sorting unicode. Test with RHEL 7.x (glibc 
2.17) and I bet you will see comparable results to ICU. The best answer 
in the long term, IMHO, is likely to use the new built-in collation just 
released in Postgres 17.
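
For anyone wanting to try that, a rough sketch (PostgreSQL 17 syntax; the builtin provider ships a predefined "pg_c_utf8" collation; assumes a UTF8 database, adapt names as needed):

-- whole database on the builtin provider:
CREATE DATABASE test_builtin
  TEMPLATE = template0
  ENCODING = 'UTF8'
  LOCALE_PROVIDER = builtin
  BUILTIN_LOCALE = 'C.UTF-8';

-- or per column, keeping the rest of the database on libc:
ALTER TABLE test ALTER COLUMN value TYPE varchar(10) COLLATE "pg_c_utf8";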

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Joe Conway <mail@joeconway.com> writes:
> This is not surprising. There is a performance regression that started 
> in glibc 2.21 with regard to sorting unicode. Test with RHEL 7.x (glibc 
> 2.17) and I bet you will see comparable results to ICU. The best answer 
> in the long term, IMHO, is likely to use the new built-in collation just 
> released in Postgres 17.

It seems unrelated to unicode though --- I also reproduced the issue
in a database with LATIN1 encoding.

Whatever, it is pretty awful, but the place to be complaining to
is the glibc maintainers.  Not much we can do about it.

            regards, tom lane



Hi Andrey, 

I have tried my best to answer your queries below: 


### Performance Degradation with Special Characters in PostgreSQL

#### **Explanation**:
The performance degradation you're experiencing with special characters like `<`, `@`, `#`, etc., is most likely due to how PostgreSQL handles **collations**. By default, PostgreSQL uses locale-aware libc collations (such as en_US.UTF-8) for string comparisons, which involve multi-pass character weighting rather than simple byte comparison. Special characters may be weighted differently under certain locales, resulting in slower comparisons, especially for long strings.

#### **Key Observations**:
1. **Alphabetic vs. Special Characters**:
   - Queries with alphabetic characters perform significantly faster than those with special characters.
   - This is due to differences in how the collation handles comparisons for alphabetic and non-alphabetic characters. Special characters often require more computational resources for comparison, resulting in longer execution times.

2. **Impact of Collation**:
   - When you switched the collation to **"C"**, the performance improved substantially. This is because the `"C"` collation uses **byte-wise** comparisons rather than locale-aware sorting, which simplifies the comparison logic, especially for special characters.

3. **String Length Impact**:
   - As the string length increases, the comparison cost grows super-linearly (roughly quadratically in the timings you reported) when using special characters, due to the collation's per-character processing.

#### **Recommendations**:

1. **Use the "C" Collation**:
   - The most effective improvement you observed came from using the `"C"` collation, which performs simple byte-by-byte comparisons. You can apply the `"C"` collation at the query level or alter the column definition:
     ```sql
     SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
     ```
     or
     ```sql
     ALTER TABLE test ALTER COLUMN value SET DATA TYPE VARCHAR(10) COLLATE "C";
     ```

2. **Collation-Specific Indexing**:
   - Create an additional index under the `"C"` collation, so PostgreSQL can use byte-wise comparisons for queries that pin the comparison to `"C"` (see the usage sketch after this list):
     ```sql
     CREATE INDEX idx_test_special_chars ON test (value COLLATE "C");
     ```

3. **Reduce String Length**:
   - If possible, reduce the string length for comparisons involving special characters. The performance impact grows with longer strings due to the nature of locale-aware collation.

4. **Use Functional Indexes**:
   - You can use functional indexes that apply transformations (e.g., trimming special characters or converting to ASCII) before comparison:
     ```sql
     CREATE INDEX idx_test_transformed ON test (lower(value));
     ```

5. **Optimize Queries**:
   - For repeated queries involving long strings with special characters, try caching results or using materialized views where the collation overhead is reduced by pre-computing.

6. **Custom Collations**:
   - If you need locale-aware sorting but can compromise on certain aspects, consider creating a **custom collation** (for example, an ICU-based one) that avoids the slow glibc path; a sketch follows this list.
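
To make items 2 and 6 concrete, here is a hedged sketch (index and collation names are illustrative; the ICU example assumes an ICU-enabled build):

```sql
-- Item 2: build the index under "C", then pin the comparison to "C"
-- so the planner can use it (verify with EXPLAIN; plans may vary):
CREATE INDEX idx_test_c ON test (value COLLATE "C");
EXPLAIN SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";

-- Item 6: a custom ICU collation bypasses glibc's strcoll_l entirely:
CREATE COLLATION und_icu (provider = icu, locale = 'und');
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "und_icu";
```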

#### **Answers to Your Questions**:
1. **Is the performance degradation expected due to collation handling of certain special characters?**
   - Yes, this behavior is expected. Locale-aware collations can be complex for special characters, leading to longer comparison times.

2. **Are there any recommendations to improve performance without changing the column or database collation?**
   - Without changing the collation entirely, you can use:
     - **Query-level collation adjustments** (using the `"C"` collation in specific queries).
     - **Collation-specific indexes** built with the `"C"` collation.
     - **Functional indexes** or optimizations like materialized views.



--

Best Regards 
Shiv Iyer 

Re: Performance degradation in Index searches with special characters

From: Thomas Munro

On Mon, Oct 7, 2024 at 9:02 AM Shiv Iyer <shiv@minervadb.com> wrote:
>    - As the string length increases, the comparison cost grows super-linearly (roughly quadratically in the
> timings you reported) when using special characters, due to the collation's per-character processing.

That's a pretty interesting observation, worthy of a bug report.  I
don't know the details offhand, but the algorithm used should be
basically the same as, or more likely a simpler subset of, what other
collation implementations are using, IIRC (don't quote me, but I think
it's supposed to be ISO 14651, which is essentially a subset of the UCA
stuff that ICU is using (it is "aligned with" UCA DUCET), without CLDR
customisations and perhaps various other complications), so it doesn't
sound like it should be fundamentally required to be *more* expensive
than ICU...



On 10/6/24 14:13, Tom Lane wrote:
> Joe Conway <mail@joeconway.com> writes:
>> This is not surprising. There is a performance regression that started 
>> in glibc 2.21 with regard to sorting unicode. Test with RHEL 7.x (glibc 
>> 2.17) and I bet you will see comparable results to ICU. The best answer 
>> in the long term, IMHO, is likely to use the new built-in collation just 
>> released in Postgres 17.
> 
> It seems unrelated to unicode though --- I also reproduced the issue
> in a database with LATIN1 encoding.
> 
> Whatever, it is pretty awful, but the place to be complaining to
> is the glibc maintainers.  Not much we can do about it.

Yeah, my reply was imprecise.

The regression was to strcoll in general. Specifically this commit which 
purports to improve performance but demonstrably causes massive regressions:

https://sourceware.org/git/?p=glibc.git;a=commit;h=0742aef6

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com



Re: Performance degradation in Index searches with special characters

From: Andrey Stikheev

Hi Joe,

Yes, it seems that the issue is indeed related to that glibc commit. I also found a bug report that highlights a similar problem: https://sourceware.org/bugzilla/show_bug.cgi?id=18441

Unfortunately, the situation doesn't look promising. In his last comment on that report (2019-05-09), Carlos O'Donell was asked:

> Hello. Is there any chance that the issues will be fixed? Unfortunately
> PostgreSQL is unable to use ICU for some basic features (e.g., in the
> analyze operation).

and replied:

"We haven't had anyone working on strcoll_l performance improvements. So it's unlikely that this will get merged or reviewed any time soon."




--
Best regards,
Andrey Stikheev