Thread: GUCs to control abbreviated sort keys

GUCs to control abbreviated sort keys

From
Jeff Davis
Date:
The attached patch adds GUCs to control the use of the abbreviated keys
optimization when sorting. Also, I changed the TRUST_STRXFRM from a
#define into a GUC.

One reason for these GUCs is to make it easier to diagnose any issues
that come up with my collation work. Another is that I observed cases
with ICU where the abbreviated keys optimization resulted in a ~8-10%
regression, and it would be good to have some basic control over it.

I made them developer options because they are more about diagnosing
and I don't expect users to change these in production. If the issues
with abbreviated keys get more complex (where maybe we need to consider
costing each provider?), we can make it more user-facing.

This is fairly simple, so I plan to commit soon.

--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: GUCs to control abbreviated sort keys

From
Justin Pryzby
Date:
On Sat, Jan 21, 2023 at 05:16:01PM -0800, Jeff Davis wrote:

> +     <varlistentry id="guc-sort-abbreviated-keys" xreflabel="sort_abbreviated_keys">
> +      <term><varname>sort_abbreviated_keys</varname> (<type>boolean</type>)
> +      <indexterm>
> +       <primary><varname>sort_abbreviated_keys</varname> configuration parameter</primary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        Enables or disables the use of abbreviated sort keys, an optimization,
> +        if applicable. The default is <literal>true</literal>. Disabling may

I think "an optimization, if applicable" is either too terse, or somehow
wrong.  Maybe:

| Enables or disables the use of abbreviated keys, a sort optimization...

> +        optimization could return wrong results. Set to
> +        <literal>true</literal> if certain that <function>strxfrm()</function>
> +        can be trusted.

"if you are certain"; or "if it is ..."

-- 
Justin



Re: GUCs to control abbreviated sort keys

From
Robert Haas
Date:
On Sat, Jan 21, 2023 at 8:16 PM Jeff Davis <pgsql@j-davis.com> wrote:
> This is fairly simple, so I plan to commit soon.

I find it a bit premature to include this comment in the very first
email.... what if other people don't like the idea?

I would like to hear about the cases where abbreviated keys resulted
in a regression.

I'd also like to know whether there's a realistic possibility that
making this a run-time test could itself result in a regression.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: GUCs to control abbreviated sort keys

From
Jeff Davis
Date:
On Tue, 2023-01-24 at 21:42 -0500, Robert Haas wrote:
> I find it a bit premature to include this comment in the very first
> email.... what if other people don't like the idea?

The trust_strxfrm GUC was pulled from the larger collation refactoring
patch, which has been out for a while. The sort_abbreviated_keys GUC is
new, and I posted these both in a new thread because they started to
look independently useful.

If someone doesn't like the idea, they are free to comment, like in
every other case (though this patch doesn't seem very controversial to
me?). I suppose the wording was off-putting, so I'll choose different
words next time.

> I would like to hear about the cases where abbreviated keys resulted
> in a regression.

I want to be clear that this is not a general criticism of the
abbreviated keys optimization, nor a comprehensive analysis of its
performance.

I am highlighting this case because the existence of a single non-
contrived case or regression suggests that we may want to explore
further and tweak heuristics. That's quite natural when the heuristics
are based on a complex dependency like a collation provider. The
sort_abbreviated_keys GUC makes that kind of exploration and tweaking a
lot easier.

Built with meson on linux, gcc 11.3.0, opt -O3. Times are the middle of
three runs, taken from the sort operator's "first returned tuple" time
in EXPLAIN ANALYZE. Total runtime (as reported in EXPLAIN ANALYZE) is
pretty much the same story, but I think there was slightly more noise
in that number.

$ perl text_generator.pl 10000000 10 > /tmp/strings.txt

CREATE TABLE s (t TEXT);
COPY s FROM '/tmp/strings.txt';
VACUUM FREEZE s;
CHECKPOINT;
SET work_mem='10GB';
SET max_parallel_workers = 0;
SET max_parallel_workers_per_gather = 0;

SET sort_abbreviated_keys = false;
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";
-- 20875ms

SET sort_abbreviated_keys = true;
EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";
-- 22931ms

Regression for abbreviated keys optimization in this case: 9.8%

> I'd also like to know whether there's a realistic possibility that
> making this a run-time test could itself result in a regression.

The sort_abbreviated_keys branch is happening after
tuplesort_begin_common (which creates memory contexts, etc.) and before
preparing the sort keys (which involves catalog lookups). The
trust_strxfrm branch is happening in the type-specific sort support
function, which needs to be looked up in the catalog before being
called (using V1 calling convention).

It doesn't look likely that a single branch in that path will have a
perf impact. Do you have a more specific concern?

--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: GUCs to control abbreviated sort keys

From
Jeff Davis
Date:
On Tue, 2023-01-24 at 19:43 -0600, Justin Pryzby wrote:
> I think "an optimization, if applicable" is either too terse, or
> somehow
> wrong.  Maybe:
>
> > Enables or disables the use of abbreviated keys, a sort
> > optimization...

Done.

> > +        optimization could return wrong results. Set to
> > +        <literal>true</literal> if certain that
> > <function>strxfrm()</function>
> > +        can be trusted.
>
> "if you are certain"; or "if it is ..."

Done.

Thank you, rebased patch attached.

--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: GUCs to control abbreviated sort keys

From
Peter Eisentraut
Date:
On 25.01.23 22:16, Jeff Davis wrote:
> I am highlighting this case because the existence of a single non-
> contrived case or regression suggests that we may want to explore
> further and tweak heuristics. That's quite natural when the heuristics
> are based on a complex dependency like a collation provider. The
> sort_abbreviated_keys GUC makes that kind of exploration and tweaking a
> lot easier.

Maybe an easier way to enable or disable it in the source code with a 
#define would serve this.  Making it a GUC right away seems a bit 
heavy-handed.  Further exploration and tweaking might well require 
further source code changes, so relying on a source code level toggle 
would seem appropriate.




Re: GUCs to control abbreviated sort keys

From
Jeff Davis
Date:
On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote:
> Maybe an easier way to enable or disable it in the source code with a
> #define would serve this.  Making it a GUC right away seems a bit
> heavy-handed.  Further exploration and tweaking might well require
> further source code changes, so relying on a source code level toggle
> would seem appropriate.

I am using these GUCs for testing the various collation paths in my
collation refactoring branch.

I find them pretty useful, and when I saw a regression, I thought
others might think it was useful, too. But if not I'll just leave them
in my branch and withdraw from this thread.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: GUCs to control abbreviated sort keys

From
Robert Haas
Date:
On Wed, Jan 25, 2023 at 4:16 PM Jeff Davis <pgsql@j-davis.com> wrote:
> $ perl text_generator.pl 10000000 10 > /tmp/strings.txt
>
> CREATE TABLE s (t TEXT);
> COPY s FROM '/tmp/strings.txt';
> VACUUM FREEZE s;
> CHECKPOINT;
> SET work_mem='10GB';
> SET max_parallel_workers = 0;
> SET max_parallel_workers_per_gather = 0;
>
> SET sort_abbreviated_keys = false;
> EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";
> -- 20875ms
>
> SET sort_abbreviated_keys = true;
> EXPLAIN ANALYZE SELECT t FROM s ORDER BY t COLLATE "en-US-x-icu";
> -- 22931ms
>
> Regression for abbreviated keys optimization in this case: 9.8%

That's interesting. Do you have any idea why this happens?

I've been a bit busy the last few days so haven't had a chance to look
at the test case until now. It seems like it's just a lorum ipsum
generator, except that each line is made to contain a random number of
words, and certain letters from the Latin alphabet are replaced with
other symbols. But why is that a problem for abbreviated keys? The
most obvious way for things to go wrong is for the first 8 bytes of
the strxfrm() blob to be very low-entropy, but it's not really clear
to me what about your test case would make that more likely. I guess
another explanation could be if having a few non-ASCII characters
mixed into the string makes strxfrm() a lot slower.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: GUCs to control abbreviated sort keys

From
Peter Geoghegan
Date:
On Thu, Jan 26, 2023 at 3:29 PM Jeff Davis <pgsql@j-davis.com> wrote:
> On Thu, 2023-01-26 at 22:39 +0100, Peter Eisentraut wrote:
> > Maybe an easier way to enable or disable it in the source code with a
> > #define would serve this.  Making it a GUC right away seems a bit
> > heavy-handed.  Further exploration and tweaking might well require
> > further source code changes, so relying on a source code level toggle
> > would seem appropriate.
>
> I am using these GUCs for testing the various collation paths in my
> collation refactoring branch.

I'm fine with adding the GUC as a developer option. I think that there
is zero chance of the changes to tuplesort.c having appreciable
overhead.

> I find them pretty useful, and when I saw a regression, I thought
> others might think it was useful, too. But if not I'll just leave them
> in my branch and withdraw from this thread.

I cannot recreate the issue you describe. With abbreviated keys, your
exact test case takes 00:16.620 on my system. Without abbreviated
keys, it takes 00:21.255.

To me it appears to be a moderately good case for abbreviated keys,
though certainly not as good as some cases that I've seen -- ~3x
improvements are common enough.

As a point of reference, the same test case with the C collation and
with abbreviated keys takes 00:10.822. When I look at the "trace_sort"
output for the C collation with abbreviated keys, and compare it to
the equivalent "trace_sort" output for the original "en-US-x-icu"
collation from your test case, it is clear that the overhead of
generating collated abbreviated keys within ICU is relatively high --
the initial scan of the table (which is where we generate all
abbreviated keys here) takes 4.45 seconds in the ICU case, and only
1.65 seconds in the "C" locale case. I think that you should look into
that same difference on your own system, so that we can compare and
contrast.

The underlying issue might well have something to do with the ICU
version you're using, or some other detail of your environment. I'm
using Debian unstable here. Postgres links to the system ICU, which is
ICU 72.

It's not impossible that the perl program you wrote produces
non-deterministic output, which should be controlled for, since it
might just be significant. I see this on my system, having run the
perl program as outlined in your test case:

$ ls -l /tmp/strings.txt
-rw-r--r-- 1 pg pg 431886574 Jan 27 11:13 /tmp/strings.txt
$ sha1sum /tmp/strings.txt
22f60dc12527c215c8e3992e49d31dc531261a83  /tmp/strings.txt

Does that match what you see on your system?

--
Peter Geoghegan



Re: GUCs to control abbreviated sort keys

From
Jeff Davis
Date:
On Fri, 2023-01-27 at 11:41 -0800, Peter Geoghegan wrote:
> I cannot recreate the issue you describe.

Interesting. For my test:

glibc  2.35      ICU 70.1
gcc    11.3.0    LLVM 14.0.0

> It's not impossible that the perl program you wrote produces
> non-deterministic output

It is non-deterministic, but I tried with two generated files, and got
similar results.

Right now I suspect the ICU version might be the reason. I'll try with
72.


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: GUCs to control abbreviated sort keys

From
Peter Geoghegan
Date:
On Fri, Jan 27, 2023 at 12:34 PM Jeff Davis <pgsql@j-davis.com> wrote:
> It is non-deterministic, but I tried with two generated files, and got
> similar results.

Jeff and I coordinated off-list. It turned out that the
nondeterministic nature of the program to generate test data was
behind my initial inability to recreate Jeff's results. Once Jeff
provided me with the exact data that he saw the problem with, I was
able to recreate the problematic case for abbreviated keys.

It turns out that this was due to aborting abbreviation way too late
in the process. It would happen relatively late in the process, when
more than 50% of all tuples had already had abbreviations generated by
ICU. This was a marginal case for abbreviated keys, which is precisely
why it only happened this long into the process. That factor is also
likely why I couldn't recreate the problem at first, even though I had
test data that was substantially the same as the data required to show
the problem.

Attached patch fixes the issue. It teaches varstr_abbrev_abort to do
something similar to every other abbreviated keys abort function: stop
estimating cardinality entirely (give up on giving up) once there are
a certain number of distinct abbreviated keys, regardless of any other
factor.

This is very closely based on existing code from numeric_abbrev_abort,
though I use a cutoff of 10k rather than a cutoff of 100k. This
difference is justified by the special considerations for text, where
we authoritative comparisons have further optimizations such as
strcoll caching and the memcmp equality fast path. It's also required
to actually fix the test case at hand -- 100k isn't enough to avoid
the performance issue Jeff reported.

I think that this should be committed to HEAD only.

-- 
Peter Geoghegan

Attachment

Re: GUCs to control abbreviated sort keys

From
Thomas Munro
Date:
On Fri, Jan 27, 2023 at 12:29 PM Jeff Davis <pgsql@j-davis.com> wrote:
> I am using these GUCs for testing the various collation paths in my
> collation refactoring branch.

Speaking of testing, has anyone ever tried porting Tom's random test
program[1] to ICU?

[1] https://www.postgresql.org/message-id/31913.1458747836@sss.pgh.pa.us