hashagg slowdown due to spill changes - Mailing list pgsql-hackers

From Andres Freund
Subject hashagg slowdown due to spill changes
Msg-id 20200606041146.slqfg7cuptx27tuy@alap3.anarazel.de
Responses Re: hashagg slowdown due to spill changes  (Alvaro Herrera <alvherre@2ndquadrant.com>)
Re: hashagg slowdown due to spill changes  (Jeff Davis <pgsql@j-davis.com>)
Re: hashagg slowdown due to spill changes  (Jeff Davis <pgsql@j-davis.com>)
Re: hashagg slowdown due to spill changes  (Jeff Davis <pgsql@j-davis.com>)
List pgsql-hackers
Hi,

While preparing my PGCon talk I noticed that our hash-agg performance
has degraded noticeably. It looks to me like it's due to the hashagg
spilling changes.

Sample benchmark:

config:
-c huge_pages=on -c shared_buffers=32GB -c jit=0 -c max_parallel_workers_per_gather=0
(largely just to reduce variance)

data prep:
CREATE TABLE fewgroups_many_rows AS
    SELECT (random() * 4)::int cat, (random()*10000)::int val
    FROM generate_series(1, 100000000);
VACUUM (FREEZE, ANALYZE) fewgroups_many_rows;

test prep:
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('fewgroups_many_rows', 'buffer');

test:
SELECT cat, count(*) FROM fewgroups_many_rows GROUP BY 1;

Using best-of-three timing:

12                  12221.031 ms
master              13855.129 ms

While not the end of the world, that's a clearly noticeable and
reproducible slowdown (~12%).

I don't think this is actually an inherent cost, but a question of how
the code ended up being organized. Here's a perf diff of profiles for
both versions:

# Baseline  Delta Abs  Shared Object     Symbol
# ........  .........  ................  .........................................
#
               +6.70%  postgres          [.] LookupTupleHashEntryHash
               +6.37%  postgres          [.] prepare_hash_slot
               +4.74%  postgres          [.] TupleHashTableHash_internal.isra.0
    20.36%     -2.89%  postgres          [.] ExecInterpExpr
     6.31%     -2.73%  postgres          [.] lookup_hash_entries
               +2.36%  postgres          [.] lookup_hash_entry
               +2.14%  postgres          [.] ExecJustAssignScanVar
     2.28%     +1.97%  postgres          [.] ExecScan
     2.54%     +1.93%  postgres          [.] MemoryContextReset
     3.84%     -1.86%  postgres          [.] SeqNext
    10.19%     -1.50%  postgres          [.] tts_buffer_heap_getsomeattrs
               +1.42%  postgres          [.] hash_bytes_uint32
               +1.39%  postgres          [.] TupleHashTableHash
               +1.10%  postgres          [.] tts_virtual_clear
     3.36%     -0.74%  postgres          [.] ExecAgg
               +0.45%  postgres          [.] CheckForSerializableConflictOutNeeded
     0.25%     +0.44%  postgres          [.] hashint4
     5.80%     -0.35%  postgres          [.] tts_minimal_getsomeattrs
     1.91%     -0.33%  postgres          [.] heap_getnextslot
     4.86%     -0.32%  postgres          [.] heapgettup_pagemode
     1.46%     -0.32%  postgres          [.] tts_minimal_clear

While some of this is likely just noise, it's pretty clear that we
spend a substantial amount of additional time below
lookup_hash_entries().

And looking at the code, I'm not too surprised:

Before, there was basically one call from nodeAgg.c into execGrouping.c
for each tuple and hash table. Now it's a lot more complicated:
1) nodeAgg.c: prepare_hash_slot()
2) execGrouping.c: TupleHashTableHash()
3) nodeAgg.c: lookup_hash_entry()
4) execGrouping.c: LookupTupleHashEntryHash()

For each of these calls, data needs to be peeled out of one or more of
AggState / AggStatePerHashData / TupleHashTable. There's no way the
compiler can know that nothing inside those structures changes across
the calls, so it has to reload their contents repeatedly. From my
reading of the profiles, that's where most of the additional time is
going.
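
To make that concrete, here's a minimal standalone illustration; the
types and functions (LookupState, helper(), process()) are made up for
the example and are not PostgreSQL code:

/*
 * Because helper() is defined in another translation unit, the compiler
 * has to assume each call may modify *state, so state->nkeys and
 * state->table are re-read from memory around every call instead of
 * staying in registers.
 */
struct table;                                   /* opaque, like TupleHashTable */

typedef struct LookupState
{
    struct table *table;
    int           nkeys;
} LookupState;

extern unsigned int helper(struct table *tab, int key);    /* lives elsewhere */

unsigned int
process(LookupState *state)
{
    unsigned int h = 0;

    for (int key = 0; key < state->nkeys; key++)    /* nkeys reloaded each time */
        h += helper(state->table, key);             /* table reloaded each time */

    return h;
}

With everything in one function (or inlinable within one translation
unit), the compiler could keep those fields in registers instead.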

There's also the issue that the signalling of whether or not to insert
got unnecessarily complicated. There are several checks (condensed
sketch after the list):
1) lookup_hash_entry() (p_isnew = aggstate->hash_spill_mode ? NULL : &isnew;)
2) LookupTupleHashEntry_internal() (if (isnew))
3) lookup_hash_entry() (if (entry == NULL) and if (isnew))
4) lookup_hash_entries() if (!in_hash_table)
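
Condensed, and with stand-in declarations rather than the real types
and functions (lookup_or_insert() playing the role of the
execGrouping.c entry point, where a NULL isnew means lookup-only), the
way those checks nest looks roughly like this:

#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Entry Entry;
typedef struct HashTable HashTable;
typedef struct Slot Slot;

/* stand-in for the execGrouping.c call: NULL isnew means "lookup only,
 * never insert"; check (2) happens inside this call */
extern Entry *lookup_or_insert(HashTable *ht, Slot *slot, bool *isnew,
                               uint32_t hash);
extern void initialize_group(Entry *entry);
extern void spill_tuple(Slot *slot, uint32_t hash);

/* mirrors the shape of lookup_hash_entry() */
Entry *
lookup_one(HashTable *ht, Slot *slot, uint32_t hash,
           bool spill_mode, bool *in_hash_table)
{
    bool    isnew = false;
    bool   *p_isnew = spill_mode ? NULL : &isnew;       /* check (1) */
    Entry  *entry = lookup_or_insert(ht, slot, p_isnew, hash);

    if (entry == NULL)                                  /* check (3a) */
    {
        *in_hash_table = false;
        return NULL;
    }

    *in_hash_table = true;
    if (isnew)                                          /* check (3b) */
        initialize_group(entry);

    return entry;
}

/* mirrors the per-tuple part of lookup_hash_entries() */
void
lookup_all(HashTable *ht, Slot *slot, uint32_t hash, bool spill_mode)
{
    bool    in_hash_table;

    (void) lookup_one(ht, slot, hash, spill_mode, &in_hash_table);

    if (!in_hash_table)                                 /* check (4) */
        spill_tuple(slot, hash);
}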

Not performance related: I am a bit confused why the new per-hash stuff
in lookup_hash_entries() isn't in lookup_hash_entry()? I assume that's
because of agg_refill_hash_table()?


Why isn't the flow more like this:
1) prepare_hash_slot()
2) if (aggstate->hash_spill_mode) goto 3; else goto 4
3) entry = LookupTupleHashEntry(&hash); if (!entry) hashagg_spill_tuple();
4) InsertTupleHashEntry(&hash, &isnew); if (isnew) initialize(entry)

That way there's again exactly one call into execGrouping.c per tuple,
there's no need for nodeAgg to separately compute the hash, and there
are far fewer branches (rough sketch below).
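
Roughly, and again with made-up stand-ins (hash_lookup() /
hash_insert()) rather than a concrete execGrouping.c API, the per-tuple
flow would look like:

#include <stdbool.h>
#include <stdint.h>

typedef struct Entry Entry;
typedef struct HashTable HashTable;
typedef struct Slot Slot;

/* hypothetical single-call entry points; the hash is computed inside
 * execGrouping.c and handed back for the spill path to reuse */
extern Entry *hash_lookup(HashTable *ht, Slot *slot, uint32_t *hash);
extern Entry *hash_insert(HashTable *ht, Slot *slot, uint32_t *hash,
                          bool *isnew);
extern void initialize_group(Entry *entry);
extern void spill_tuple(Slot *slot, uint32_t hash);

void
process_tuple(HashTable *ht, Slot *slot, bool spill_mode)
{
    uint32_t  hash;
    Entry    *entry;

    /* 1) the hash slot has already been prepared at this point */

    if (spill_mode)
    {
        /* 3) lookup only; if the group isn't already present, spill */
        entry = hash_lookup(ht, slot, &hash);
        if (entry == NULL)
            spill_tuple(slot, hash);
    }
    else
    {
        /* 4) insert (or find) the group; initialize it if it's new */
        bool isnew;

        entry = hash_insert(ht, slot, &hash, &isnew);
        if (isnew)
            initialize_group(entry);
    }
}

Since the hash computed inside the execGrouping.c call is handed back,
the spill path can reuse it without hashing the tuple twice.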

Doing things this way might make agg_refill_hash_table() a tiny bit
more complicated, but it'll also avoid the slowdown for the vast
majority of cases where we're not spilling.

Greetings,

Andres Freund


