Thread: GUCs to control abbreviated sort keys
The attached patch adds GUCs to control the use of the abbreviated keys optimization when sorting. Also, I changed the TRUST_STRXFRM from a #define into a GUC. One reason for these GUCs is to make it easier to diagnose any issues that come up with my collation work. Another is that I observed cases with ICU where the abbreviated keys optimization resulted in a ~8-10% regression, and it would be good to have some basic control over it. I made them developer options because they are more about diagnosing and I don't expect users to change these in production. If the issues with abbreviated keys get more complex (where maybe we need to consider costing each provider?), we can make it more user-facing. This is fairly simple, so I plan to commit soon. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Sat, Jan 21, 2023 at 05:16:01PM -0800, Jeff Davis wrote: > + <varlistentry id="guc-sort-abbreviated-keys" xreflabel="sort_abbreviated_keys"> > + <term><varname>sort_abbreviated_keys</varname> (<type>boolean</type>) > + <indexterm> > + <primary><varname>sort_abbreviated_keys</varname> configuration parameter</primary> > + </indexterm> > + </term> > + <listitem> > + <para> > + Enables or disables the use of abbreviated sort keys, an optimization, > + if applicable. The default is <literal>true</literal>. Disabling may I think "an optimization, if applicable" is either too terse, or somehow wrong. Maybe: | Enables or disables the use of abbreviated keys, a sort optimization... > + optimization could return wrong results. Set to > + <literal>true</literal> if certain that <function>strxfrm()</function> > + can be trusted. "if you are certain"; or "if it is ..." -- Justin
On Sat, Jan 21, 2023 at 8:16 PM Jeff Davis <pgsql@j-davis.com> wrote: > This is fairly simple, so I plan to commit soon. I find it a bit premature to include this comment in the very first email.... what if other people don't like the idea? I would like to hear about the cases where abbreviated keys resulted in a regression. I'd also like to know whether there's a realistic possibility that making this a run-time test could itself result in a regression. -- Robert Haas EDB: http://www.enterprisedb.com
On Tue, 2023-01-24 at 21:42 -0500, Robert Haas wrote: > I find it a bit premature to include this comment in the very first > email.... what if other people don't like the idea? The trust_strxfrm GUC was pulled from the larger collation refactoring patch, which has been out for a while. The sort_abbreviated_keys GUC is new, and I posted these both in a new thread because they started to look independently useful. If someone doesn't like the idea, they are free to comment, like in every other case (though this patch doesn't seem very controversial to me?). I suppose the wording was off-putting, so I'll choose different words next time. > I would like to hear about the cases where abbreviated keys resulted > in a regression. I want to be clear that this is not a general criticism of the abbreviated keys optimization, nor a comprehensive analysis of its performance. I am highlighting this case because the existence of a single non- contrived case or regression suggests that we may want to explore further and tweak heuristics. That's quite natural when the heuristics are based on a complex dependency like a collation provider. The sort_abbreviated_keys GUC makes that kind of exploration and tweaking a lot easier. Built with meson on linux, gcc 11.3.0, opt -O3. Times are the middle of three runs, taken from the sort operator's "first returned tuple" time in EXPLAIN ANALYZE. Total runtime (as reported in EXPLAIN ANALYZE) is pretty much the same story, but I think there was slightly more noise in that number. $ perl text_generator.pl 10000000 10 > /tmp/strings.txt CREATE TABLE s (t TEXT); COPY s FROM '/tmp/strings.txt'; VACUUM FREEZE s; CHECKPOINT; SET work_mem='10GB'; SET max_parallel_workers = 0; SET max_parallel_workers_per_gather = 0; SET sort_abbreviated_keys = false; EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu"; -- 20875ms SET sort_abbreviated_keys = true; EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu"; -- 22931ms Regression for abbreviated keys optimization in this case: 9.8% > I'd also like to know whether there's a realistic possibility that > making this a run-time test could itself result in a regression. The sort_abbreviated_keys branch is happening after tuplesort_begin_common (which creates memory contexts, etc.) and before preparing the sort keys (which involves catalog lookups). The trust_strxfrm branch is happening in the type-specific sort support function, which needs to be looked up in the catalog before being called (using V1 calling convention). It doesn't look likely that a single branch in that path will have a perf impact. Do you have a more specific concern? -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On Tue, 2023-01-24 at 19:43 -0600, Justin Pryzby wrote: > I think "an optimization, if applicable" is either too terse, or > somehow > wrong. Maybe: > > > Enables or disables the use of abbreviated keys, a sort > > optimization... Done. > > + optimization could return wrong results. Set to > > + <literal>true</literal> if certain that > > <function>strxfrm()</function> > > + can be trusted. > > "if you are certain"; or "if it is ..." Done. Thank you, rebased patch attached. -- Jeff Davis PostgreSQL Contributor Team - AWS
Attachment
On 25.01.23 22:16, Jeff Davis wrote: > I am highlighting this case because the existence of a single non- > contrived case or regression suggests that we may want to explore > further and tweak heuristics. That's quite natural when the heuristics > are based on a complex dependency like a collation provider. The > sort_abbreviated_keys GUC makes that kind of exploration and tweaking a > lot easier. Maybe an easier way to enable or disable it in the source code with a #define would serve this. Making it a GUC right away seems a bit heavy-handed. Further exploration and tweaking might well require further source code changes, so relying on a source code level toggle would seem appropriate.
On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote: > Maybe an easier way to enable or disable it in the source code with a > #define would serve this. Making it a GUC right away seems a bit > heavy-handed. Further exploration and tweaking might well require > further source code changes, so relying on a source code level toggle > would seem appropriate. I am using these GUCs for testing the various collation paths in my collation refactoring branch. I find them pretty useful, and when I saw a regression, I thought others might think it was useful, too. But if not I'll just leave them in my branch and withdraw from this thread. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Wed, Jan 25, 2023 at 4:16 PM Jeff Davis <pgsql@j-davis.com> wrote: > $ perl text_generator.pl 10000000 10 > /tmp/strings.txt > > CREATE TABLE s (t TEXT); > COPY s FROM '/tmp/strings.txt'; > VACUUM FREEZE s; > CHECKPOINT; > SET work_mem='10GB'; > SET max_parallel_workers = 0; > SET max_parallel_workers_per_gather = 0; > > SET sort_abbreviated_keys = false; > EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu"; > -- 20875ms > > SET sort_abbreviated_keys = true; > EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu"; > -- 22931ms > > Regression for abbreviated keys optimization in this case: 9.8% That's interesting. Do you have any idea why this happens? I've been a bit busy the last few days so haven't had a chance to look at the test case until now. It seems like it's just a lorum ipsum generator, except that each line is made to contain a random number of words, and certain letters from the Latin alphabet are replaced with other symbols. But why is that a problem for abbreviated keys? The most obvious way for things to go wrong is for the first 8 bytes of the strxfrm() blob to be very low-entropy, but it's not really clear to me what about your test case would make that more likely. I guess another explanation could be if having a few non-ASCII characters mixed into the string makes strxfrm() a lot slower. -- Robert Haas EDB: http://www.enterprisedb.com
On Thu, Jan 26, 2023 at 3:29 PM Jeff Davis <pgsql@j-davis.com> wrote: > On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote: > > Maybe an easier way to enable or disable it in the source code with a > > #define would serve this. Making it a GUC right away seems a bit > > heavy-handed. Further exploration and tweaking might well require > > further source code changes, so relying on a source code level toggle > > would seem appropriate. > > I am using these GUCs for testing the various collation paths in my > collation refactoring branch. I'm fine with adding the GUC as a developer option. I think that there is zero chance of the changes to tuplesort.c having appreciable overhead. > I find them pretty useful, and when I saw a regression, I thought > others might think it was useful, too. But if not I'll just leave them > in my branch and withdraw from this thread. I cannot recreate the issue you describe. With abbreviated keys, your exact test case takes 00:16.620 on my system. Without abbreviated keys, it takes 00:21.255. To me it appears to be a moderately good case for abbreviated keys, though certainly not as good as some cases that I've seen -- ~3x improvements are common enough. As a point of reference, the same test case with the C collation and with abbreviated keys takes 00:10.822. When I look at the "trace_sort" output for the C collation with abbreviated keys, and compare it to the equivalent "trace_sort" output for the original "en-US-x-icu" collation from your test case, it is clear that the overhead of generating collated abbreviated keys within ICU is relatively high -- the initial scan of the table (which is where we generate all abbreviated keys here) takes 4.45 seconds in the ICU case, and only 1.65 seconds in the "C" locale case. I think that you should look into that same difference on your own system, so that we can compare and contrast. The underlying issue might well have something to do with the ICU version you're using, or some other detail of your environment. I'm using Debian unstable here. Postgres links to the system ICU, which is ICU 72. It's not impossible that the perl program you wrote produces non-deterministic output, which should be controlled for, since it might just be significant. I see this on my system, having run the perl program as outlined in your test case: $ ls -l /tmp/strings.txt -rw-r--r-- 1 pg pg 431886574 Jan 27 11:13 /tmp/strings.txt $ sha1sum /tmp/strings.txt 22f60dc12527c215c8e3992e49d31dc531261a83 /tmp/strings.txt Does that match what you see on your system? -- Peter Geoghegan
On Fri, 2023-01-27 at 11:41 -0800, Peter Geoghegan wrote: > I cannot recreate the issue you describe. Interesting. For my test: glibc 2.35 ICU 70.1 gcc 11.3.0 LLVM 14.0.0 > It's not impossible that the perl program you wrote produces > non-deterministic output It is non-deterministic, but I tried with two generated files, and got similar results. Right now I suspect the ICU version might be the reason. I'll try with 72. -- Jeff Davis PostgreSQL Contributor Team - AWS
On Fri, Jan 27, 2023 at 12:34 PM Jeff Davis <pgsql@j-davis.com> wrote: > It is non-deterministic, but I tried with two generated files, and got > similar results. Jeff and I coordinated off-list. It turned out that the nondeterministic nature of the program to generate test data was behind my initial inability to recreate Jeff's results. Once Jeff provided me with the exact data that he saw the problem with, I was able to recreate the problematic case for abbreviated keys. It turns out that this was due to aborting abbreviation way too late in the process. It would happen relatively late in the process, when more than 50% of all tuples had already had abbreviations generated by ICU. This was a marginal case for abbreviated keys, which is precisely why it only happened this long into the process. That factor is also likely why I couldn't recreate the problem at first, even though I had test data that was substantially the same as the data required to show the problem. Attached patch fixes the issue. It teaches varstr_abbrev_abort to do something similar to every other abbreviated keys abort function: stop estimating cardinality entirely (give up on giving up) once there are a certain number of distinct abbreviated keys, regardless of any other factor. This is very closely based on existing code from numeric_abbrev_abort, though I use a cutoff of 10k rather than a cutoff of 100k. This difference is justified by the special considerations for text, where we authoritative comparisons have further optimizations such as strcoll caching and the memcmp equality fast path. It's also required to actually fix the test case at hand -- 100k isn't enough to avoid the performance issue Jeff reported. I think that this should be committed to HEAD only. -- Peter Geoghegan
Attachment
On Fri, Jan 27, 2023 at 12:29 PM Jeff Davis <pgsql@j-davis.com> wrote: > I am using these GUCs for testing the various collation paths in my > collation refactoring branch. Speaking of testing, has anyone ever tried porting Tom's random test program[1] to ICU? [1] https://www.postgresql.org/message-id/31913.1458747836@sss.pgh.pa.us