Re: [PATCH] Incremental sort (was: PoC: Partial sort) - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [PATCH] Incremental sort (was: PoC: Partial sort)
Date
Msg-id 20190625160235.6ctryjzvw5je7cet@development
Whole thread Raw
In response to Re: [PATCH] Incremental sort (was: PoC: Partial sort)  (James Coleman <jtc331@gmail.com>)
Responses Re: [PATCH] Incremental sort (was: PoC: Partial sort)
List pgsql-hackers
On Mon, Jun 24, 2019 at 07:34:19PM -0400, James Coleman wrote:
>On Mon, Jun 24, 2019 at 4:16 PM Tomas Vondra
><tomas.vondra@2ndquadrant.com> wrote:
>>
>> On Mon, Jun 24, 2019 at 01:00:50PM -0400, James Coleman wrote:
>> >On Mon, Jun 24, 2019 at 12:56 PM Simon Riggs <simon@2ndquadrant.com> wrote:
>> >>
>> >> On Mon, 24 Jun 2019 at 16:10, James Coleman <jtc331@gmail.com> wrote:
>> >>>
>> >>> On Thu, Jun 13, 2019 at 11:38:12PM -0400, James Coleman wrote:
>...
>> >>> As I see it the two most significant concerning cases right now are:
>> >>> 1. Very large batches (in particular where the batch is effectively
>> >>> all of the matching rows such that we're really just doing a standard
>> >>> sort).
>> >>> 2. Many very small batches.
>> >>
>> >>
>> >> What is the specific use case for this? This sounds quite general case.
>> >
>> >They are both general cases in some sense, but the concerns lie mostly
>> >with what happens when they're unexpectedly encountered. For example,
>> >if the expected row count or group size is off by a good bit and we
>> >effectively have to perform a sort of all (or most) possible rows.
>> >
>> >If we can get the performance to a point where that misestimated row
>> >count or group size doesn't much matter, then ISTM including the patch
>> >becomes a much more obvious total win.
>> >
>>
>> Yes, that seems like a reasonable approach. Essentially, we're trying to
>> construct plausible worst case examples, and then minimize the overhead
>> compared to regular sort. If we get sufficiently close, then it's fine
>> to rely on somewhat shaky stats - like group size estimates.
>
>I have a bit of a mystery in my performance testing. I've been setting
>up a table like so:
>
>create table foo(pk serial primary key, owner_fk integer, created_at timestamp);
>insert into foo(owner_fk, created_at)
>select fk_t.i, now() - (time_t.i::text || ' minutes')::interval
>from generate_series(1, 10000) time_t(i)
>cross join generate_series(1, 1000) fk_t(i);
>-- double up on one set to guarantee matching prefixes
>insert into foo (owner_fk, created_at) select owner_fk, created_at
>from foo where owner_fk = 23;
>create index idx_foo_on_owner_and_created_at on foo(owner_fk, created_at);
>analyze foo;
>
>and then I have the following query:
>
>select *
>from foo
>where owner_fk = 23
>order by created_at desc, pk desc
>limit 20000;
>
>The idea here is to force a bit of a worst case for small groups: we
>have 10,000 batches (i.e., equal prefix groups) of 2 tuples each and
>then query with a limit matching the actual number of rows we know
>will match the query -- so even though there's a limit we're forcing a
>total sort (and also guaranteeing both plans have to touch the same
>number of rows). Note: I know that batches of size is actually the
>worst case, but I chose batches of two because I've also been testing
>a change that would skip the sort entirely for single tuple batches.
>
>On master (really the commit right before the current revision of the
>patch), I get:
>latency average = 14.271 ms
>tps = 70.075243 (excluding connections establishing)
>
>With the patch (and incremental sort enabled):
>latency average = 11.975 ms
>tps = 83.512090 (excluding connections establishing)
>
>With the patch (but incremental sort disabled):
>latency average = 11.884 ms
>tps = 84.149834 (excluding connections establishing)
>
>All of those are 60 seconds runs on pgbench with a single thread.
>
>So we have a very substantial speedup with patch *even if the new
>feature isn't enabled*. I've confirmed the plan looks the same on
>patched with incremental sort disabled and master. The only changes
>that would seem to really effect execution time would be the changes
>to tuplesort.c, but looking through them I don't see anything I'd
>expect to change things so dramatically.
>
>Any thoughts on this?
>

I can reproduce the same thing, so it's not just you. On my machine, I see
these tps numbers (average of 10 runs, 60 seconds each):

  master:        65.177
  patched (on):  80.368
  patched (off): 80.750

The numbers are very consistent (within 1 tps).

I've done a bit of CPU profiling, and on master I see this:

    13.84%  postgres  postgres            [.] comparetup_heap
     4.83%  postgres  postgres            [.] qsort_tuple
     3.87%  postgres  postgres            [.] pg_ltostr_zeropad
     3.55%  postgres  postgres            [.] pg_ltoa
     3.19%  postgres  postgres            [.] AllocSetAlloc
     2.68%  postgres  libc-2.28.so        [.] __GI___strlen_sse2
     2.38%  postgres  postgres            [.] LWLockRelease
     2.38%  postgres  postgres            [.] AppendSeconds.constprop.9
     2.22%  postgres  libc-2.28.so        [.] __memmove_sse2_unaligned_erms
     2.17%  postgres  postgres            [.] GetPrivateRefCountEntry
     2.03%  postgres  postgres            [.] j2date
     ...

while on patched versions I see this:

     4.60%  postgres  postgres            [.] pg_ltostr_zeropad
     4.51%  postgres  postgres            [.] pg_ltoa
     3.50%  postgres  postgres            [.] AllocSetAlloc
     3.34%  postgres  libc-2.28.so        [.] __GI___strlen_sse2
     2.99%  postgres  postgres            [.] LWLockRelease
     2.84%  postgres  postgres            [.] AppendSeconds.constprop.9
     2.65%  postgres  postgres            [.] GetPrivateRefCountEntry
     2.64%  postgres  postgres            [.] j2date
     2.60%  postgres  postgres            [.] printtup
     2.56%  postgres  postgres            [.] heap_hot_search_buffer
     ...
     1.35%  postgres  postgres            [.] comparetup_heap
     ...

So either we're calling comparetup_heap less often, or it's cheaper.

But it seems to be very dependent on the data set. If you do this:

    create table foo_2 as select * from foo order by random();
    alter table foo_2 add primary key (pk);
    create index idx_foo_2_on_owner_and_created_at on foo_2 (owner_fk, created_at);

and then run the test against this table, there's no difference.

So my guess is this particular data set triggers slightly different
behavior in tuplesort, reducing the cost of comparetup_heap. The speedup
is quite significant (~20% on my system), the question is how widely
applicable can it be.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: psql UPDATE field [tab] expands to DEFAULT?
Next
From: James Coleman
Date:
Subject: Re: [PATCH] Incremental sort (was: PoC: Partial sort)