Home > mailing lists

Re: Performance degradation in Index searches with special characters - Mailing list pgsql-performance

From	Andrey Stikheev
Subject	Re: Performance degradation in Index searches with special characters
Date	October 6, 2024 20:28:29
Msg-id	CALM5VP-GPRXcxs5BEfyxmsYDsHb4K4y+VST00wZGzizhEVvUsA@mail.gmail.com Whole thread Raw
In response to	Re: Performance degradation in Index searches with special characters (Tom Lane <tgl@sss.pgh.pa.us>)
Responses	Re: Performance degradation in Index searches with special characters
List	pgsql-performance

Tree view

Hi, Tom!

Thanks for your feedback. After looking into it further, it seems the performance issue is indeed related to the default collation settings, particularly when handling certain special characters like < in the glibc strcoll_l function. This was confirmed during my testing on Debian 12 with glibc version 2.36 (this OS and glibc are being used in our office's Docker image: https://hub.docker.com/_/postgres).

My test database settings:

test_db=# \d+ test

Table "public.test"

--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------

Indexes:

"idx_test" btree (value)

Access method: heap

test_db=# \l test_db

List of databases

---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------

(1 row)

strcoll_l tests:

root@715b19170a89:~# cat /etc/os-release

PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"

NAME="Debian GNU/Linux"

VERSION_ID="12"

VERSION="12 (bookworm)"

VERSION_CODENAME=bookworm

ID=debian

HOME_URL="https://www.debian.org/"

SUPPORT_URL="https://www.debian.org/support"

BUG_REPORT_URL="https://bugs.debian.org/"

root@715b19170a89:~# ldd --version

ldd (Debian GLIBC 2.36-9+deb12u8) 2.36

This is free software; see the source for copying conditions. There is NO

warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Written by Roland McGrath and Ulrich Drepper.

root@715b19170a89:~# cat test.c

#include <stdio.h>

#include <stdlib.h>

#include <locale.h>

#include <string.h>

int main() {

char str1[] = "<";

char *str2 = malloc(65536); // 65535 characters + 1 for null terminator

if (!str2) return 1;

memset(str2, '<', 65535);

str2[65535] = '\0';

locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);

int result = strcoll_l(str1, str2, locale);

printf("Comparison result: %d\n", result);

freelocale(locale);

free(str2);

return 0;

}

root@715b19170a89:~# time ./test

Comparison result: -1

real 0m4.487s

user 0m4.483s

sys 0m0.003s

I'm considering switching to ICU collations, as they might handle this more efficiently. However, as I know ICU isn’t the default collation provider in PostgreSQL, and switching to it in a live environment isn’t a straightforward process. The main concern is that glibc’s default collation (en_US.UTF-8) is widely used, and this opens up the potential for a Denial of Service (DoS) attack. For instance, if user input includes long strings of repeated characters like <, it can severely degrade performance due to the extended processing time for string comparisons, especially in high-traffic scenarios.

On Sun, 6 Oct 2024 at 19:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Andrey Stikheev <andrey.stikheev@gmail.com> writes:
> - Changing the collation to 'C' in the query significantly improves
> performance.

What collation are you using, pray tell? (And what database encoding?)

> - Is this performance degradation expected due to collation handling of
> certain special characters in PostgreSQL?

It seems like a performance bug indeed, but not ours --- I'm thinking
it must be in strcoll() or ICU.

Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
something similar but not quite as bad:

u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
?column?
----------
(0 rows)

Time: 1383.033 ms (00:01.383)

Poking into it with gdb says that the time is indeed spent inside
strcoll:

#0 get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900 "",
table=0x7fabb490e2b0, weights=<optimized out>,
rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
seq=<synthetic pointer>) at strcoll_l.c:111
#1 __GI___strcoll_l (s1=0x1785878 "<",
s2=0x178587a '<' <repeats 200 times>..., l=<optimized out>)
at strcoll_l.c:338
#2 0x00000000009527a6 in strncoll_libc (arg1=<optimized out>, len1=1,
arg2=<optimized out>, len2=65536, locale=<optimized out>,
locale=<optimized out>) at pg_locale.c:1964
#3 0x00000000009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
arg2=0x17958cc '<' <repeats 200 times>..., len2=65536,
collid=<optimized out>) at varlena.c:1567
#4 0x00000000009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at varlena.c:1820
#5 0x00000000009d75fa in FunctionCall2Coll (
flinfo=flinfo@entry=0x7ffddd3b10e8, collation=<optimized out>,
arg1=<optimized out>, arg2=<optimized out>) at fmgr.c:1161
#6 0x0000000000594948 in _bt_compare (rel=0x7fabcde7eed0, key=0x7ffddd3b10c0,
page=<optimized out>, offnum=<optimized out>) at nbtsearch.c:762
#7 0x0000000000594e32 in _bt_binsrch (rel=rel@entry=0x7fabcde7eed0,
key=key@entry=0x7ffddd3b10c0, buf=<optimized out>) at nbtsearch.c:394

It's not the fault of the index machinery, because a single comparison
takes the same amount of time:

u8=# select '<' <= repeat('<', 65536);
?column?
----------
t
(1 row)

Time: 1391.550 ms (00:01.392)

I didn't try it with ICU.

regards, tom lane

Best regards,
Andrey Stikheev

pgsql-performance by date:

From: Tom Lane
Date: 06 October 2024, 19:39:38
Subject: Re: Performance degradation in Index searches with special characters

From: Joe Conway
Date: 06 October 2024, 20:45:15
Subject: Re: Performance degradation in Index searches with special characters

Re: Performance degradation in Index searches with special characters - Mailing list pgsql-performance

Previous

Next