Re: Performance degradation in Index searches with special characters - Mailing list pgsql-performance

From Andrey Stikheev
Subject Re: Performance degradation in Index searches with special characters
Date
Msg-id CALM5VP-GPRXcxs5BEfyxmsYDsHb4K4y+VST00wZGzizhEVvUsA@mail.gmail.com
Whole thread Raw
In response to Re: Performance degradation in Index searches with special characters  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Performance degradation in Index searches with special characters
List pgsql-performance
Hi, Tom!

Thanks for your feedback. After looking into it further, it seems the performance issue is indeed related to the default collation settings, particularly when handling certain special characters like < in the glibc strcoll_l function. This was confirmed during my testing on Debian 12  with glibc version 2.36 (this OS and glibc are being used in our office's Docker image: https://hub.docker.com/_/postgres).

My test database settings:

test_db=# \d+ test

                                                  Table "public.test"

 Column |         Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description 

--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------

 value  | character varying(10) |           | not null |         | extended |             |              | 

Indexes:

    "idx_test" btree (value)

Access method: heap


test_db=# \l test_db

                                                 List of databases

  Name   |  Owner   | Encoding | Locale Provider |  Collate     Ctype    | Locale | ICU Rules | Access privileges 

---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------

 test_db | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |                  | 

(1 row)



strcoll_l tests:

root@715b19170a89:~# cat /etc/os-release 

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

NAME="Debian GNU/Linux"

VERSION_ID="12"

VERSION="12 (bookworm)"

VERSION_CODENAME=bookworm

ID=debian

HOME_URL="https://www.debian.org/"

SUPPORT_URL="https://www.debian.org/support"

BUG_REPORT_URL="https://bugs.debian.org/"

root@715b19170a89:~# ldd --version

ldd (Debian GLIBC 2.36-9+deb12u8) 2.36

Copyright (C) 2022 Free Software Foundation, Inc.

This is free software; see the source for copying conditions.  There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Written by Roland McGrath and Ulrich Drepper.

root@715b19170a89:~# cat test.c 

#include <stdio.h>

#include <stdlib.h>

#include <locale.h>

#include <string.h>


int main() {

    char str1[] = "<";

    char *str2 = malloc(65536); // 65535 characters + 1 for null terminator


    if (!str2) return 1;


    memset(str2, '<', 65535);

    str2[65535] = '\0';


    locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);

    int result = strcoll_l(str1, str2, locale);


    printf("Comparison result: %d\n", result);


    freelocale(locale);

    free(str2);


    return 0;

}


root@715b19170a89:~# time ./test

Comparison result: -1


real 0m4.487s

user 0m4.483s

sys 0m0.003s


I'm considering switching to ICU collations, as they might handle this more efficiently. However, as I know ICU isn’t the default collation provider in PostgreSQL, and switching to it in a live environment isn’t a straightforward process. The main concern is that glibc’s default collation (en_US.UTF-8) is widely used, and this opens up the potential for a Denial of Service (DoS) attack. For instance, if user input includes long strings of repeated characters like <, it can severely degrade performance due to the extended processing time for string comparisons, especially in high-traffic scenarios.

On Sun, 6 Oct 2024 at 19:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Andrey Stikheev <andrey.stikheev@gmail.com> writes:
>    - Changing the collation to 'C' in the query significantly improves
>    performance.

What collation are you using, pray tell?  (And what database encoding?) 

>    - Is this performance degradation expected due to collation handling of
>    certain special characters in PostgreSQL?

It seems like a performance bug indeed, but not ours --- I'm thinking
it must be in strcoll() or ICU.

Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
something similar but not quite as bad:

u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
 ?column?
----------
(0 rows)

Time: 1383.033 ms (00:01.383)

Poking into it with gdb says that the time is indeed spent inside
strcoll:

#0  get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900 "",
    table=0x7fabb490e2b0, weights=<optimized out>,
    rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
    seq=<synthetic pointer>) at strcoll_l.c:111
#1  __GI___strcoll_l (s1=0x1785878 "<",
    s2=0x178587a '<' <repeats 200 times>..., l=<optimized out>)
    at strcoll_l.c:338
#2  0x00000000009527a6 in strncoll_libc (arg1=<optimized out>, len1=1,
    arg2=<optimized out>, len2=65536, locale=<optimized out>,
    locale=<optimized out>) at pg_locale.c:1964
#3  0x00000000009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
    arg2=0x17958cc '<' <repeats 200 times>..., len2=65536,
    collid=<optimized out>) at varlena.c:1567
#4  0x00000000009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at varlena.c:1820
#5  0x00000000009d75fa in FunctionCall2Coll (
    flinfo=flinfo@entry=0x7ffddd3b10e8, collation=<optimized out>,
    arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1161
#6  0x0000000000594948 in _bt_compare (rel=0x7fabcde7eed0, key=0x7ffddd3b10c0,
    page=<optimized out>, offnum=<optimized out>) at nbtsearch.c:762
#7  0x0000000000594e32 in _bt_binsrch (rel=rel@entry=0x7fabcde7eed0,
    key=key@entry=0x7ffddd3b10c0, buf=<optimized out>) at nbtsearch.c:394

It's not the fault of the index machinery, because a single comparison
takes the same amount of time:

u8=# select '<' <= repeat('<', 65536);
 ?column?
----------
 t
(1 row)

Time: 1391.550 ms (00:01.392)

I didn't try it with ICU.

                        regards, tom lane


--
Best regards,
Andrey Stikheev

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Performance degradation in Index searches with special characters
Next
From: Joe Conway
Date:
Subject: Re: Performance degradation in Index searches with special characters