Thread: Macro customizable hashtable / bitmapscan & aggregation perf
Hi,

As previously mentioned, dynahash isn't particularly fast. In many cases that doesn't particularly matter. But e.g. hash aggregation and bitmap index scans are quite sensitive to hash performance.

The biggest problems with dynahash (also discussed at [1]) are

a) two level structure doubling the amount of indirect lookups
b) indirect function calls
c) using separate chaining based conflict resolution
d) being too general.

In the post referenced above I'd implemented an open-coded hashtable addressing these issues for the tidbitmap.c case, with quite some success. It's not particularly desirable to have various slightly differing hash-table implementations across the backend though. The only solution I could think of that preserves the efficiency is a hash-table implementation which is customizable to the appropriate types etc. via macros. In the attached patch series I've added simplehash.h, which can be customized by a bunch of macros before being inlined. There's also a patch using this for tidbitmap.c and for nodeAgg/nodeSubplan/... via execGrouping.c.
To show the motivation:

Bitmapscan, before:

tpch[22005][1]=# EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                              QUERY PLAN                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=942330.27..942330.28 rows=1 width=8) (actual time=5283.026..5283.026 rows=1 loops=1)                                │
│   ->  Bitmap Heap Scan on lineitem  (cost=193670.02..919511.38 rows=9127557 width=8) (actual time=3041.072..4436.569 rows=9113219 loops=1) │
│         Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))                                    │
│         Heap Blocks: exact=585542                                                                                                    │
│         ->  Bitmap Index Scan on i_l_shipdate  (cost=0.00..191388.13 rows=9127557 width=0) (actual time=2807.246..2807.246 rows=9113219 loops=1) │
│               Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))                                │
│ Planning time: 0.077 ms                                                                                                              │
│ Execution time: 5284.038 ms                                                                                                          │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

after:

tpch[21953][1]=# EXPLAIN ANALYZE SELECT SUM(l_extendedprice) FROM lineitem WHERE (l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date);
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                              QUERY PLAN                                                              │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate  (cost=942330.27..942330.28 rows=1 width=8) (actual time=3431.602..3431.602 rows=1 loops=1)                                │
│   ->  Bitmap Heap Scan on lineitem  (cost=193670.02..919511.38 rows=9127557 width=8) (actual time=1158.783..2588.357 rows=9113219 loops=1) │
│         Recheck Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))                                    │
│         Heap Blocks: exact=585542                                                                                                    │
│         ->  Bitmap Index Scan on i_l_shipdate  (cost=0.00..191388.13 rows=9127557 width=0) (actual time=958.341..958.341 rows=9113219 loops=1) │
│               Index Cond: ((l_shipdate >= '1995-01-01'::date) AND (l_shipdate <= '1996-12-31'::date))                                │
│ Planning time: 0.070 ms                                                                                                              │
│ Execution time: 3435.259 ms                                                                                                          │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(8 rows)

Hash-Agg, before:

tpch[22005][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT l_partkey, SUM(l_extendedprice) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                      QUERY PLAN                                                      │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=1069653.05..1069653.08 rows=10 width=12) (actual rows=10 loops=1)                                       │
│   ->  Sort  (cost=1069653.05..1072083.33 rows=972112 width=12) (actual rows=10 loops=1)                              │
│         Sort Key: (sum(l_extendedprice)) DESC                                                                        │
│         Sort Method: top-N heapsort  Memory: 25kB                                                                    │
│         ->  HashAggregate  (cost=1038924.94..1048646.06 rows=972112 width=12) (actual rows=1000000 loops=1)          │
│               Group Key: l_partkey                                                                                   │
│               ->  Seq Scan on lineitem  (cost=0.00..888925.96 rows=29999796 width=12) (actual rows=29999795 loops=1) │
│ Planning time: 0.210 ms                                                                                              │
│ Execution time: 20887.906 ms                                                                                         │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

after:

tpch[22342][1]=# EXPLAIN (ANALYZE, TIMING OFF) SELECT l_partkey, SUM(l_extendedprice) FROM lineitem GROUP BY 1 ORDER BY 2 DESC LIMIT 10;
┌──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                      QUERY PLAN                                                      │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Limit  (cost=1069653.05..1069653.08 rows=10 width=12) (actual rows=10 loops=1)                                       │
│   ->  Sort  (cost=1069653.05..1072083.33 rows=972112 width=12) (actual rows=10 loops=1)                              │
│         Sort Key: (sum(l_extendedprice)) DESC                                                                        │
│         Sort Method: top-N heapsort  Memory: 25kB                                                                    │
│         ->  HashAggregate  (cost=1038924.94..1048646.06 rows=972112 width=12) (actual rows=1000000 loops=1)          │
│               Group Key: l_partkey                                                                                   │
│               ->  Seq Scan on lineitem  (cost=0.00..888925.96 rows=29999796 width=12) (actual rows=29999795 loops=1) │
│ Planning time: 0.104 ms                                                                                              │
│ Execution time: 14617.481 ms                                                                                         │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(9 rows)

The hash-table performance difference itself is bigger in both cases, but other things, notably slot deforming and expression evaluation, become the bottleneck.

Does anybody have a better idea how to approach the hash-table problem? While it'd be nice not to resort to such macro-magic, I can't think of an alternative short of switching to C++ and using templates.

Currently this is used like:

#define SH_PREFIX pagetable
#define SH_KEYTYPE BlockNumber
#define SH_KEY blockno
#define SH_CONTAINS PagetableEntry
#define SH_HASH_KEY(tb, key) hash_blockno(key)
#define SH_EQUAL(tb, a, b) a == b
#define SH_SCOPE static
#define SH_DEFINE
#define SH_DECLARE
#include "lib/simplehash.h"

which then creates functions like pagetable_create/insert/lookup/delete/...

I strongly suspect there are some other cases that could benefit from another hash-table implementation. Particularly catcache.c seems like a candidate (instead of its hand-rolled stuff, which frequently shows up in profiles).

Note that these patches are *NOT* in a shape for in-detail review. I'd first like to know what people think about the general approach, and about better alternatives.
Also note that some queries in the regression test change their result, because the ordering is unspecified... Greetings, Andres Freund [1] http://archives.postgresql.org/message-id/20160714030616.dvlkboz6cw555ixb%40alap3.anarazel.de
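[Editor's illustration] The "templates via macros" technique the mail describes - defining customization macros and then generating concrete, inlinable functions via token pasting - can be sketched in miniature like this. All names here (DEFINE_SIMPLEHASH, hash_u32, the fixed-size table) are invented for illustration; simplehash.h's real interface is the SH_* macro set shown above, and it supports growth, deletion, etc.

```c
#include <stdint.h>

/* A toy stand-in for the macro-template approach: one macro invocation
 * emits a complete, statically dispatched linear-probing hash set with
 * prefix##_insert / prefix##_lookup functions. No function pointers are
 * involved, so the compiler can inline and specialize everything. */
#define DEFINE_SIMPLEHASH(prefix, keytype, hashfn, SIZE)                  \
    typedef struct { keytype key; int used; } prefix##_entry;             \
    static prefix##_entry prefix##_tab[SIZE];                             \
    static int prefix##_insert(keytype key)                               \
    {                                                                     \
        uint32_t i = hashfn(key) % SIZE;                                  \
        for (int probes = 0; probes < SIZE; probes++)                     \
        {                                                                 \
            if (!prefix##_tab[i].used)                                    \
            {                                                             \
                prefix##_tab[i].key = key;                                \
                prefix##_tab[i].used = 1;                                 \
                return 1;       /* inserted */                            \
            }                                                             \
            if (prefix##_tab[i].key == key)                               \
                return 1;       /* already present */                     \
            i = (i + 1) % SIZE; /* linear probing */                      \
        }                                                                 \
        return 0;               /* table full */                          \
    }                                                                     \
    static int prefix##_lookup(keytype key)                               \
    {                                                                     \
        uint32_t i = hashfn(key) % SIZE;                                  \
        for (int probes = 0; probes < SIZE; probes++)                     \
        {                                                                 \
            if (!prefix##_tab[i].used)                                    \
                return 0;       /* hit a hole: not present */             \
            if (prefix##_tab[i].key == key)                               \
                return 1;                                                 \
            i = (i + 1) % SIZE;                                           \
        }                                                                 \
        return 0;                                                         \
    }

/* trivial multiplicative hash, purely for the example */
static uint32_t hash_u32(uint32_t x) { return x * 2654435761u; }

/* instantiate a table keyed by uint32_t, analogous to the pagetable case */
DEFINE_SIMPLEHASH(pagetable, uint32_t, hash_u32, 64)
```

The real header instead uses repeated #include with SH_PREFIX/SH_KEYTYPE et al., which keeps the generated code readable and debuggable, but the dispatch-free result is the same.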
On Tue, Jul 26, 2016 at 8:43 PM, Andres Freund <andres@anarazel.de> wrote:
> As previously mentioned, dynahash isn't particularly fast. In many cases
> that doesn't particularly matter. But e.g. hash aggregation and bitmap
> index scans are quite sensitive to hash performance.
>
> The biggest problems with dynahash (also discussed at [1]) are
>
> a) two level structure doubling the amount of indirect lookups
> b) indirect function calls
> c) using separate chaining based conflict resolution
> d) being too general.

I am ... skeptical about open addressing. It's less unappealing for this application than in general because we don't actually need to delete anything, but that wouldn't be true for the catcache. All the same, I feel that we need to assess the risk that we're improving average-case performance while creating large regressions in other cases (i.e. where there is clustering).

Do likely() and unlikely() actually buy us anything here?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
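[Editor's illustration] For readers unfamiliar with the likely()/unlikely() question: these are thin wrappers around GCC's __builtin_expect that tell the compiler which way a branch usually goes, so it can lay out the hot path as straight-line code. A plausible sketch of such macros and a typical use site (checked_div is an invented example, not from the patch):

```c
#include <stdbool.h>

/* branch-prediction hints, roughly as patch 0001 would define them;
 * on non-GCC-compatible compilers they degrade to plain conditions */
#if defined(__GNUC__)
#define likely(x)   __builtin_expect((x) != 0, 1)
#define unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define likely(x)   ((x) != 0)
#define unlikely(x) ((x) != 0)
#endif

/* typical use: annotate a rarely taken error path so the common
 * case falls through without a jump */
static int checked_div(int a, int b, int *result)
{
    if (unlikely(b == 0))
        return -1;              /* cold error path */
    *result = a / b;
    return 0;
}

/* small wrapper for demonstration: returns the quotient, or -999 on error */
static int div_ok(int a, int b)
{
    int r;
    return checked_div(a, b, &r) == 0 ? r : -999;
}
```

The hints change code layout and register allocation, not semantics, which is why (as discussed below) their measured benefit is workload dependent.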
On 2016-07-27 10:04:52 -0400, Robert Haas wrote:
> On Tue, Jul 26, 2016 at 8:43 PM, Andres Freund <andres@anarazel.de> wrote:
> > As previously mentioned, dynahash isn't particularly fast. In many cases
> > that doesn't particularly matter. But e.g. hash aggregation and bitmap
> > index scans are quite sensitive to hash performance.
> >
> > The biggest problems with dynahash (also discussed at [1]) are
> >
> > a) two level structure doubling the amount of indirect lookups
> > b) indirect function calls
> > c) using separate chaining based conflict resolution
> > d) being too general.
>
> I am ... skeptical about open addressing. It's less unappealing for
> this application than in general because we don't actually need to
> delete anything, but that wouldn't be true for the catcache.

I don't think deletions really change the picture for open addressing - you don't need tombstones, you can instead move elements closer to their origin at deletion. I just hadn't implemented that yet, because it didn't seem like a crucial point.

Unfortunately open addressing, particularly the linear variety, has such vastly superior cache access patterns that it's hard to come close with a chained hash table. I've previously tried a chained hash table, and the performance gains were a lot smaller. For that reason many hash-table implementations used in performance-critical paths appear to use open addressing.

Don't get me wrong, I do certainly think that there are good cases for using separate chaining in places where performance is less of a concern, and possibly when you want lock-free concurrency.

> All the same, I feel that we need to assess the risk that we're
> improving average-case performance while creating large regressions in
> other cases (i.e. where there is clustering).

The actual algorithmic worst-case complexity isn't different. It's obviously important to resize appropriately. It's not that hard to beat dynahash's memory efficiency, so fillfactors don't have to be kept super high to avoid regressing in memory usage.

> Do likely() and unlikely() actually buy us anything here?

It's only a few percent here, but overall, especially when used around error checks, it's above 10%. The reason I used it here is primarily that it made the assembly easier to read for me... ;)

Greetings,

Andres Freund
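[Editor's illustration] The tombstone-free deletion Andres describes - "move elements closer to their origin at deletion" - is the classic backward-shift scheme for linear probing. A minimal sketch on a toy fixed-size table (identity-style hash, invented names; the patch's actual code differs):

```c
#include <stdint.h>
#include <stdbool.h>

#define N 8                     /* toy table size, power of two */

typedef struct { uint32_t key; bool used; } slot_t;
static slot_t tab[N];

static uint32_t ideal(uint32_t key) { return key % N; }

static void ht_insert(uint32_t key)
{
    uint32_t i = ideal(key);
    while (tab[i].used && tab[i].key != key)
        i = (i + 1) % N;        /* linear probing */
    tab[i].key = key;
    tab[i].used = true;
}

static bool ht_lookup(uint32_t key)
{
    uint32_t i = ideal(key);
    while (tab[i].used)
    {
        if (tab[i].key == key)
            return true;
        i = (i + 1) % N;
    }
    return false;               /* hit a hole: not present */
}

static void ht_delete(uint32_t key)
{
    uint32_t i = ideal(key);
    while (tab[i].used && tab[i].key != key)
        i = (i + 1) % N;
    if (!tab[i].used)
        return;                 /* not present */
    tab[i].used = false;

    /* backward shift: pull displaced entries toward their origin so no
     * tombstone is needed and probe chains stay short */
    uint32_t j = i;
    for (;;)
    {
        j = (j + 1) % N;
        if (!tab[j].used)
            break;              /* chain ends at the first hole */
        uint32_t k = ideal(tab[j].key);
        /* if k lies cyclically in (i, j], the entry must stay at j */
        bool stays = (i <= j) ? (i < k && k <= j) : (i < k || k <= j);
        if (stays)
            continue;
        tab[i] = tab[j];        /* move entry back into the freed slot */
        tab[j].used = false;
        i = j;
    }
}
```

Because probe chains are compacted rather than littered with tombstones, lookups keep their "stop at the first empty slot" fast path even after heavy deletion.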
Hi,

On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> In the attached patch I've attached simplehash.h, which can be
> customized by a bunch of macros, before being inlined. There's also a
> patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> execGrouping.c.

Attached is a significantly improved version of this series. The main changes are:

- The hash table now uses robin-hood style hash collision handling. See the commit message of 0002 (or simplehash.h) for an introduction. That significantly reduces the impact of "clustering" due to linear addressing.
- Significant comment and correctness fixes, both in simplehash.h itself, and in 0003/0004.
- A lot of other random performance improvements for the hashing code.

Unfortunately I'm running out of battery right now, so I don't want to re-run the benchmarks posted upthread (loading takes a while). But the last time I ran them all the results after the patches were better than before.

This patch series currently consists of four patches:
- 0001 - add unlikely/likely() macros. The use here is not entirely mandatory, but they do provide a quite consistent improvement. There are other threads where they proved to be beneficial, so I see little reason not to introduce them.
- 0002 - the customizable hashtable itself. It now contains documentation.
- 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix for the issue mentioned in [1], to improve performance in heavily lossy scenarios (otherwise we could regress in some extreme cases).
- 0004 - convert execGrouping.c, e.g. used by hash aggregates.

While not quite perfect yet, I do think this is at a state where input is needed. I don't want to go a lot deeper into this rabbit hole before we have some agreement that this is a sensible course of action.

I think there are several possible additional users of simplehash.h, e.g. catcache.c - which has an open-coded, not particularly fast hash table and frequently shows up in profiles - but I think the above two conversions are plenty to start with.

Comments?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20160923205843.zcs533sqdzlh4cpo%40alap3.anarazel.de
On 10/01/2016 02:44 AM, Andres Freund wrote:
> Hi,
>
> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
>> In the attached patch I've attached simplehash.h, which can be customized by a bunch of macros, before being inlined. There's also a patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via execGrouping.c.
>
> Attached is a significantly improved version of this series. The main changes are:
>
> - The hash table now uses robin-hood style hash collision handling. See the commit message of 0002 (or simplehash.h) for an introduction. That significantly reduces the impact of "clustering" due to linear addressing.

Interesting. Have you looked at cuckoo hashing? That seems to be the go-to hashing in several database-related papers I've read recently, so I guess there's a reason for that (IIRC it has constant worst case for lookups, constant amortized time for inserts/deletes). Not sure how it compares to robin-hood hashing regarding cache-friendliness though.

> - Significant comment and correctness fixes, both in simplehash.h itself, and in 0003/0004.
> - A lot of other random performance improvements for the hashing code.
>
> Unfortunately I'm running out of battery right now, so I don't want to re-run the benchmarks posted upthread (loading takes a while). But the last time I ran them all the results after the patches were better than before.

Well, I have rather bad experience with running benchmarks on laptops anyway - a lot of noise due to power management, etc.

What about running a bigger benchmark - say, TPC-DS - and evaluating the impact?

I think a crucial part of the benchmarking will be identifying and measuring corner cases, e.g. a lot of rows with the same key, etc. Although that probably is not a major issue for the two places switched to the new implementation (e.g. execGrouping merges the duplicates to a single group, by definition).

> This patch series currently consists of four patches:
> - 0001 - add unlikely/likely() macros. The use here is not entirely mandatory, but they do provide a quite consistent improvement. There are other threads where they proved to be beneficial, so I see little reason not to introduce them.
> - 0002 - the customizable hashtable itself. It now contains documentation.
> - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix for the issue mentioned in [1], to improve performance in heavily lossy scenarios (otherwise we could regress in some extreme cases).
> - 0004 - convert execGrouping.c, e.g. used by hash aggregates.
>
> While not quite perfect yet, I do think this is at a state where input is needed. I don't want to go a lot deeper into this rabbit hole before we have some agreement that this is a sensible course of action.

So, is it the right time to do some benchmarking?

> I think there are several possible additional users of simplehash.h, e.g. catcache.c - which has an open-coded, not particularly fast hash table and frequently shows up in profiles - but I think the above two conversions are plenty to start with.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund <andres@anarazel.de> wrote:
> Unfortunately I'm running out of battery right now, so I don't want to re-run the benchmarks posted upthread (loading takes a while). But the last time I ran them all the results after the patches were better than before.

I have a machine sitting idle now too if you have specific ideas of what to benchmark.

-- 
greg
Hi,

On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
> On 10/01/2016 02:44 AM, Andres Freund wrote:
>> - The hash table now uses robin-hood style hash collision handling. See the commit message of 0002 (or simplehash.h) for an introduction. That significantly reduces the impact of "clustering" due to linear addressing.
>
> Interesting. Have you looked at cuckoo hashing?

Yes. I don't think it's a good fit for modern CPUs. Because it doesn't probe linearly you constantly have random uncached memory accesses.

I've played with a few schemes, and they all seemed to be slower than linear probing variations, unless you intentionally used a wrong hash-distribution, or intentionally "unbalanced" the hash-space by iterating over either end of the keyspace and moving the entries around.

> Well, I have rather bad experience with running benchmarks on laptops anyway - a lot of noise due to power management, etc.

Well, with factor ~2x improvements that's not really a big detractor. Using a new CPU makes some forms of analysis easier (better PMUs). For long-running tests I agree.

> What about running a bigger benchmark - say, TPC-DS - and evaluating the impact?

Worthwhile, although obviously the impact will be a lot smaller, since they're not entirely bottlenecked on hash-aggs and bitmap scans.

> I think a crucial part of the benchmarking will be identifying and measuring corner cases, e.g. a lot of rows with the same key, etc. Although that probably is not a major issue for the two places switched to the new implementation (e.g. execGrouping merges the duplicates to a single group, by definition).

Hm. I don't really see a case where that's going to cause issues - all the execGrouping.c users only store one key, and either ignore later ones, or add the data to the initial tuple in the group. I don't really see any case where repeated keys should cause an issue for a hash table?

> So, is it the right time to do some benchmarking?

That's one thing that's required, yes. The other is whether we can live with the ugliness of implementing templates via macros. I do think we can, but...

Greetings,

Andres Freund
On 2016-10-01 20:04:18 +0100, Greg Stark wrote:
> On Sat, Oct 1, 2016 at 1:44 AM, Andres Freund <andres@anarazel.de> wrote:
>> Unfortunately I'm running out of battery right now, so I don't want to re-run the benchmarks posted upthread (loading takes a while). But the last time I ran them all the results after the patches were better than before.
>
> I have a machine sitting idle now too if you have specific ideas of what to benchmark.

Last time I just extracted interesting parts of queries from tpc-h (to be able to look at bitmapscans/hash-aggs in isolation) and ran tpc-h as a whole. During development I also tried to come up with adverse cases (e.g. huge bitmapscans, with very low work_mem). I don't really have a lot better ideas than that.

Andres
On 10/01/2016 09:59 PM, Andres Freund wrote:
> Hi,
>
> On 2016-10-01 20:19:21 +0200, Tomas Vondra wrote:
>> Interesting. Have you looked at cuckoo hashing?
>
> Yes. I don't think it's a good fit for modern CPUs. Because it doesn't probe linearly you constantly have random uncached memory accesses.
>
> I've played with a few schemes, and they all seemed to be slower than linear probing variations, unless you intentionally used a wrong hash-distribution, or intentionally "unbalanced" the hash-space by iterating over either end of the keyspace and moving the entries around.

OK, understood.

>> Well, I have rather bad experience with running benchmarks on laptops anyway - a lot of noise due to power management, etc.
>
> Well, with factor ~2x improvements that's not really a big detractor. Using a new CPU makes some forms of analysis easier (better PMUs).

True.

> For long-running tests I agree.
>
>> What about running a bigger benchmark - say, TPC-DS - and evaluating the impact?
>
> Worthwhile, although obviously the impact will be a lot smaller, since they're not entirely bottlenecked on hash-aggs and bitmap scans.

Sure, the improvement won't be as significant as on the simple queries, but it's interesting IMHO.

>> I think a crucial part of the benchmarking will be identifying and measuring corner cases, e.g. a lot of rows with the same key, etc. Although that probably is not a major issue for the two places switched to the new implementation (e.g. execGrouping merges the duplicates to a single group, by definition).
>
> Hm. I don't really see a case where that's going to cause issues - all the execGrouping.c users only store one key, and either ignore later ones, or add the data to the initial tuple in the group. I don't really see any case where repeated keys should cause an issue for a hash table?

Yep, that's pretty much what I suggested. I was wondering more about the other places where this hash table might be used - I've been thinking about hashjoins, but now that I think of it - that's a fairly specialized and tuned code. In any case, let's not complicate the discussion for now.

>> So, is it the right time to do some benchmarking?
>
> That's one thing that's required, yes. The other is whether we can live with the ugliness of implementing templates via macros. I do think we can, but...

Hmmm ... not sure. If one of the points is to get rid of function calls determined at runtime (which make it impossible for the compiler to optimize the code), then I can think of three options:

(a) copy-paste the code for each place
(b) use some templating
(c) use JIT

I think (b) is way better than (a), and I don't think we're using JIT anywhere at this point. So while the macro-based templates look a bit awkward, I'm not aware of a better C thing.

A few comments, after quickly skimming through the first two patches:

1) SH_CREATE does this:

    /* round up size to the next power of 2, eases lookups */
    if (size < 2)
        size = 2;
    else
        size = sh_pow2_int(size);

I very much doubt a hash table with 2 buckets is very useful. What about using some reasonable lower boundary - IIRC hashjoins use 1024 buckets, which might be a bit too much, but perhaps 256 would work?

2) tb->resize_above

I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined as a constant somewhere (not sure if it makes sense to make it part of SH_TYPE, so that hash tables may use different load factors).

3) 'size' is a rather poor name for a parameter (e.g. in SH_CREATE or SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total size in bytes' or 'number of entries'.

4) SH_CONTAINS sounds more like a function checking whether a hash table contains a key, not like a 'type of hash table entry'. What about SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.

5) SH_RESIZE

Do we expect to shrink hash tables? If not, SH_RESIZE should probably check that newsize > oldsize. If yes, it should check that there's enough space for all entries (with the load factor).

It's not entirely clear why it is guaranteed that there's always an element with optimal position (when load factor < 1)? Adding an explanation or a link to a paper would be helpful, I guess.

6) SH_STAT

This bit is a bit suspicious:

    uint32 max_collisions = 0;

    ...

    /* single contained element is not a collision */
    curcoll--;
    total_collisions += curcoll;
    if (curcoll > max_collisions)
        max_collisions = curcoll - 1;

Shouldn't the last line be just "max_collisions = curcoll"?

7) Should hash agg size estimation for the planner consider the fillfactor?

I think we should account for fillfactor - we should account for the allocated memory, even if there's free space in the hash table. After all, that's what hashjoins do.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
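[Editor's illustration] Regarding item 1: rounding the table size to a power of two lets the bucket index be computed with a mask instead of a modulo. The helper presumably does the usual bit-smearing round-up; a sketch (next_pow2 is an invented name - the actual sh_pow2_int in the patch may be implemented differently):

```c
#include <stdint.h>

/* round v up to the next power of two by smearing the highest set bit
 * into all lower positions, then adding one */
static uint64_t next_pow2(uint64_t v)
{
    v--;                        /* so exact powers of two map to themselves */
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    return v + 1;
}
```

With a power-of-two size, `hash & (size - 1)` replaces `hash % size`, which matters on the hot lookup path.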
Hi,

On 2016-10-02 01:37:35 +0200, Tomas Vondra wrote:
> Sure, the improvement won't be as significant as on the simple queries, but it's interesting IMHO.

Agreed.

> Yep, that's pretty much what I suggested. I was wondering more about the other places where this hash table might be used - I've been thinking about hashjoins, but now that I think of it - that's a fairly specialized and tuned code. In any case, let's not complicate the discussion for now.

My point is that I don't really see any potential usecase where that's an issue.

> A few comments, after quickly skimming through the first two patches:
>
> 1) SH_CREATE does this:
>
>     /* round up size to the next power of 2, eases lookups */
>     if (size < 2)
>         size = 2;
>     else
>         size = sh_pow2_int(size);
>
> I very much doubt a hash table with 2 buckets is very useful. What about using some reasonable lower boundary - IIRC hashjoins use 1024 buckets, which might be a bit too much, but perhaps 256 would work?

For some cases that'll waste a good bit of space. Consider e.g. catcache.c. Callers can (and do) specify more appropriate sizes for the individual uses.

> 2) tb->resize_above
>
> I'd say 'resize_threshold' is a better name. Also, 0.8 should be defined as a constant somewhere (not sure if it makes sense to make it part of SH_TYPE, so that hash tables may use different load factors).

Fair point.

> 3) 'size' is a rather poor name for a parameter (e.g. in SH_CREATE or SH_RESIZE), as it's unclear whether it's 'element size in bytes', 'total size in bytes' or 'number of entries'.

Ok.

> 4) SH_CONTAINS sounds more like a function checking whether a hash table contains a key, not like a 'type of hash table entry'. What about SH_ENTRY_TYPE instead? Also, SH_KEYTYPE should be SH_KEY_TYPE I guess.

Hm, ok.

> 5) SH_RESIZE
>
> Do we expect to shrink hash tables?

Not at the moment. Should be doable, but I'm not sure it's worth developing and testing - I don't see any usecases in pg atm.

> It's not entirely clear why it is guaranteed that there's always an element with optimal position (when load factor < 1)? Adding an explanation or a link to a paper would be helpful, I guess.

As the load factor always has to be below 1, there always has to be an empty bucket. Every bucket after an empty bucket has to be at its optimal position.

> 6) SH_STAT
>
> This bit is a bit suspicious:
>
>     uint32 max_collisions = 0;
>
>     ...
>
>     /* single contained element is not a collision */
>     curcoll--;
>     total_collisions += curcoll;
>     if (curcoll > max_collisions)
>         max_collisions = curcoll - 1;
>
> Shouldn't the last line be just "max_collisions = curcoll"?

Hm. That's probably a refactoring artefact. Nice catch.

> 7) Should hash agg size estimation for the planner consider the fillfactor?
>
> I think we should account for fillfactor - we should account for the allocated memory, even if there's free space in the hash table. After all, that's what hashjoins do.

We didn't really for dynahash (as it basically assumed a fillfactor of 1 all the time), not sure if changing it won't also cause issues.

Thanks!

Greetings,

Andres Freund
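[Editor's illustration] The invariant Andres states - with load factor below 1 there is always an empty bucket, and the entry immediately following an empty bucket must sit in its optimal (hash) position - can be expressed as a checker one might wire into an assert. The bucket layout and field names below are invented for illustration:

```c
#include <stdint.h>
#include <stdbool.h>

/* toy bucket: just the stored hash plus an occupied flag */
typedef struct { uint32_t hash; bool used; } bucket_t;

/* verify: (1) at least one empty bucket exists (load factor < 1), and
 * (2) any occupied bucket whose cyclic predecessor is empty must be at
 * its optimal position hash % size */
static bool check_invariant(const bucket_t *tab, uint32_t size)
{
    bool found_empty = false;

    for (uint32_t i = 0; i < size; i++)
    {
        if (!tab[i].used)
        {
            found_empty = true;
            continue;
        }
        uint32_t prev = (i + size - 1) % size;
        if (!tab[prev].used && tab[i].hash % size != i)
            return false;       /* entry after a hole, not at optimal slot */
    }
    return found_empty;
}

/* tiny example layouts used by the checks below: in ok_layout the entry
 * at bucket 2 is displaced (optimal slot 1) but follows an occupied
 * bucket, which is fine; in bad_layout the entry at bucket 1 follows a
 * hole yet its optimal slot is 2, violating the invariant */
static bucket_t ok_layout[4]  = {{0, false}, {1, true}, {1, true}, {0, false}};
static bucket_t bad_layout[4] = {{0, false}, {2, true}, {0, false}, {0, false}};
```

This is exactly the property that backward-shift deletion (moving entries toward their origin on delete) preserves, which answers Tomas's follow-up question below.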
On 10/02/2016 02:17 AM, Andres Freund wrote:
> Hi,
> ...
>>>> I think a crucial part of the benchmarking will be identifying and
>>>> measuring corner cases, e.g. a lot of rows with the same key, etc.
>>>> Although that probably is not a major issue for the two places
>>>> switched to the new implementation (e.g. execGrouping merges the
>>>> duplicates to a single group, by definition).
>>>
>>> Hm. I don't really see a case where that's going to cause issues - all
>>> the execGrouping.c users only store one key, and either ignore later
>>> ones, or add the data to the initial tuple in the group. I don't really
>>> see any case where repeated keys should cause an issue for a hash table?
>>>
>>
>> Yep, that's pretty much what I suggested. I was wondering more about the
>> other places where this hash table might be used - I've been thinking about
>> hashjoins, but now that I think of it - that's a fairly specialized and
>> tuned code. In any case, let's not complicate the discussion for now.
>
> My point is that I don't really see any potential usecase where
> that's an issue.
>

OK, I think we agree the two places modified by the submitted patch series
are fine. Let's leave discussion about places modified by future patches
for the future ;-)

>
>> A few comments, after quickly skimming through the first two patches:
>>
>> 1) SH_CREATE does this:
>>
>>     /* round up size to the next power of 2, eases lookups */
>>     if (size < 2)
>>         size = 2;
>>     else
>>         size = sh_pow2_int(size);
>>
>> I very much doubt a hash table with 2 buckets is very useful. What about
>> using some reasonable lower boundary - IIRC hashjoins use 1024 buckets,
>> which might be a bit too much, but perhaps 256 would work?
>
> For some cases that'll waste a good bit of space. Consider
> e.g. catcache.c. Callers can (and do) specify more appropriate sizes for
> the individual uses.
>

Hmmm, OK. I have my doubts about those hardcoded constants, but you're
right 256 is probably an overkill.
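The power-of-two sizing discussed above is what makes bucket lookup a cheap bit-mask (`hash & (size - 1)`) instead of a modulo. A minimal sketch of such a rounding helper - the name next_pow2 is illustrative; the patch's actual helper is sh_pow2_int, whose implementation may differ:

```c
#include <assert.h>
#include <stdint.h>

/* Round v up to the next power of two by smearing the highest set bit
 * into all lower positions, then adding one.  Illustrative stand-in
 * for the sh_pow2_int() helper mentioned in the review. */
static uint64_t
next_pow2(uint64_t v)
{
    v--;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    v |= v >> 32;
    return v + 1;
}
```

With power-of-two sizes, `hash & (size - 1)` selects a bucket without a division instruction.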
> >> 5) SH_RESIZE >> >> Do we expect to shrink hash tables? > > Not at the moment. Should be doable, but I'm not sure it's worth > developing and testing - I don't see any usecases in pg atm. > OK, sure. Then what about adding this to SH_RESIZE? Assert(oldsize <= newsize); I assumed we might shrink the catcache after invalidation, for example (but maybe I don't really know how that works). > >> It's not entirely clear why is it guaranteed that there's always >> an element with optimal position (when load factor < 1)? Adding an >> explanation or a link to a paper would be helpful, I guess. > > As the load factor always has to be below 1, there always has to be > an empty bucket. Every bucket after an empty bucket has to be at > it's optimal position. > Hmmm, thanks to moving the entries after delete? Adding an assert() enforcing this seems like a good idea. > >> 7) Should hash agg size estimation for the planner consider the fillfactor? >> >> I think we should account for fillfactor - we should account for the >> allocated memory, even if there's free space in the hash table. After all, >> that's what hashjoins do. > > We didn't really for dynahash (as it basically assumed a fillfactor of 1 > all the time), not sure if changing it won't also cause issues. > That's a case of glass half-full vs. half-empty, I guess. If we assumed load factor 1.0, then I see it as accounting for load factor (while you see it as not accounting of it). I don't see why this should cause issues - of course, the hash table may be a bit larger, exceed work_mem and make the tidbitmap lossy, or switch HashAggregate to GroupAggregate. But I don't think that's a major issue, as those decisions depend on estimates anyway, so we can't really guarantee them. Also, with load factor 0.8 the table is only ~20% larger, so even if we don't include that into work_mem it's a reasonably small error (easily smaller than errors in the pre-9.5 HashJoin accounting, which did not include chunk headers etc.). 
So I won't fight for this, but I don't see why not to account for it. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-10-02 02:59:04 +0200, Tomas Vondra wrote: > On 10/02/2016 02:17 AM, Andres Freund wrote: > > Hi, > > ... > > > > > > > I think a crucial part of the benchmarking will be identifying and > > > > > measuring corner cases, e.g. a lot of rows with the same key, etc. > > > > > Although that probably is not a major issue for the two places > > > > > switched to the new implementation (e.g. execGrouping merges the > > > > > duplicates to a single group, by definition). > > > > > > > > Hm. I don't really see a case where that's going to cause issues - all > > > > the execGrouping.c users only store one key, and either ignore later > > > > ones, or add the data to the initial tuple in the group. I don't really > > > > see any case where repeated keys should cause an issue for a hash table? > > > > > > > > > > Yep, that's pretty much what I suggested. I was wondering more about the > > > other places where this hash table might be used - I've been thinking about > > > hashjoins, but now that I think of it - that's a fairly specialized and > > > tuned code. In any case, let's not complicate the discussion for now. > > > > My point is that I don't really see any potential usecase where > > that's an issue. > > > > OK, I think we agree the two places modified by the submitted patch series > are fine. Let's leave discussion about places modified by future patches for > the future ;-) I think my problem here is that I just can't see a potential use-case for hashtables where that's an issue ;) > > > 5) SH_RESIZE > > > > > > Do we expect to shrink hash tables? > > > > Not at the moment. Should be doable, but I'm not sure it's worth > > developing and testing - I don't see any usecases in pg atm. > > > > OK, sure. Then what about adding this to SH_RESIZE? > > Assert(oldsize <= newsize); Yes. And potentially rename it to SH_GROW. > I assumed we might shrink the catcache after invalidation, for example (but > maybe I don't really know how that works). 
They don't shrink so far, and in most invalidation cases we don't delete entries, just mark them as invalid (as they might be referenced on the outside at that moment). > > > It's not entirely clear why is it guaranteed that there's always > > > an element with optimal position (when load factor < 1)? Adding an > > > explanation or a link to a paper would be helpful, I guess. > > > > As the load factor always has to be below 1, there always has to be > > an empty bucket. Every bucket after an empty bucket has to be at > > it's optimal position. > Hmmm, thanks to moving the entries after delete? Pretty much, yes. > Adding an assert() enforcing this seems like a good idea. Probably requires some additional state to be meaningfully enforced. Not sure that's worth the effort. If that invariant is broken, not a lot of stuff works (believe me, I have broken it ;)). Otherwise searches won't necessarily work anymore, if they didn't end up in their original position (as the cell would now be empty, a lookup/insert/delete would not find the existing element). > > > 7) Should hash agg size estimation for the planner consider the fillfactor? > > > > > > I think we should account for fillfactor - we should account for the > > > allocated memory, even if there's free space in the hash table. After all, > > > that's what hashjoins do. > > > > We didn't really for dynahash (as it basically assumed a fillfactor of 1 > > all the time), not sure if changing it won't also cause issues. > > > > That's a case of glass half-full vs. half-empty, I guess. If we assumed load > factor 1.0, then I see it as accounting for load factor (while you see it as > not accounting of it). Well, even before we grow by factors of two, so 1 wasn't accurate for most of the time. My issue isn't really that I don't want to do it, but that I'm not entirely sure that there's a good way. TupleHashEntryData itself isn't exactly large, and we'd have to multiply it by the load factor. 
At the same time, after the patch, AggStatePerGroupData isn't allocated for empty elements anymore, and that's the largest part of the memory. So I'm kind of inclined to just disregard the fillfactor (and document that). > Also, with load factor 0.8 the table is only ~20% larger, so even if we > don't include that into work_mem it's a reasonably small error (easily > smaller than errors in the pre-9.5 HashJoin accounting, which did not > include chunk headers etc.). I think in either cases the entries themselves are smaller after the patch by enough that the overhead doesn't matter once you crossed a few dozen entries. Regards, Andres
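The invariant discussed above - with load factor below 1 there is always an empty bucket, and every bucket after an empty one holds an entry at its optimal position - is what lets a lookup stop at the first empty cell. A toy sketch of robin-hood insertion and lookup under that invariant (all names and the fixed-size layout are illustrative, not simplehash.h's actual code; deletion would additionally need the backward shifting mentioned above to preserve the invariant):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TBL_SIZE 16             /* power of two; load factor kept below 1 */

typedef struct
{
    bool        used;
    uint32_t    hash;
    int         key;
} Bucket;

static Bucket table[TBL_SIZE];

static uint32_t
hash_int(int key)
{
    return (uint32_t) key * 2654435761u;    /* multiplicative hash */
}

/* how far the entry at position 'pos' sits from its optimal bucket */
static uint32_t
probe_distance(uint32_t hash, uint32_t pos)
{
    return (pos - (hash & (TBL_SIZE - 1))) & (TBL_SIZE - 1);
}

/* robin-hood insert: displace residents that are closer to home than us */
static void
rh_insert(int key)
{
    Bucket      cur = {true, hash_int(key), key};
    uint32_t    pos = cur.hash & (TBL_SIZE - 1);

    for (;;)
    {
        if (!table[pos].used)
        {
            table[pos] = cur;
            return;
        }
        if (probe_distance(table[pos].hash, pos) <
            probe_distance(cur.hash, pos))
        {
            /* "steal from the rich": swap with the closer-to-home resident */
            Bucket      tmp = table[pos];

            table[pos] = cur;
            cur = tmp;
        }
        pos = (pos + 1) & (TBL_SIZE - 1);
    }
}

/* lookup relying on the invariant: the first empty bucket ends the probe */
static bool
rh_lookup(int key)
{
    uint32_t    h = hash_int(key);
    uint32_t    pos = h & (TBL_SIZE - 1);

    while (table[pos].used)
    {
        if (table[pos].hash == h && table[pos].key == key)
            return true;
        pos = (pos + 1) & (TBL_SIZE - 1);
    }
    return false;
}
```

If the invariant were broken - an entry sitting past an empty bucket - rh_lookup would stop at the empty cell and miss it, which is exactly the failure mode described above.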
On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2016-07-26 17:43:33 -0700, Andres Freund wrote:
> > In the attached patch I've attached simplehash.h, which can be
> > customized by a bunch of macros, before being inlined. There's also a
> > patch using this for tidbitmap.c and nodeAgg/nodeSubplan/... via
> > execGrouping.c.
>
> Attached is a significantly improved version of this series. The main
> changes are:
>
> - The hash table now uses robin-hood style hash collision handling. See the
>   commit message of 0002 (or simplehash.h) for an introduction. That
>   significantly reduces the impact of "clustering" due to linear
>   addressing.
> - Significant comment and correctness fixes, both in simplehash.h
>   itself, and 0003/0004.
> - a lot of other random performance improvements for the hashing code.
>
> Unfortunately I'm running out of battery right now, so I don't want to
> re-run the benchmarks posted upthread (loading takes a while). But the
> last time I ran them all the results after the patches were better than
> before.
>
> This patch series currently consists of four patches:
> - 0001 - add unlikely/likely() macros. The use here is not entirely
>   mandatory, but they do provide a quite consistent improvement. There
>   are other threads where they proved to be beneficial, so I see little
>   reason not to introduce them.
> - 0002 - the customizable hashtable itself. It now contains documentation.
> - 0003 - convert tidbitmap.c to use the new hashmap. This includes a fix
>   for the issue mentioned in [1], to improve performance in heavily lossy
>   scenarios (otherwise we could regress in some extreme cases)
> - 0004 - convert execGrouping.c, e.g. used by hash aggregates
>
> While not quite perfect yet, I do think this is at a state where input
> is needed. I don't want to go a lot deeper into this rabbit hole,
> before we have some agreement that this is a sensible course of action.
>
> I think there's several possible additional users of simplehash.h,
> e.g. catcache.c - which has an open coded, not particularly fast hash
> table and frequently shows up in profiles - but I think the above two
> conversions are plenty to start with.
>
> Comments?
>
> Greetings,
>
> Andres Freund
>
> [1] http://archives.postgresql.org/message-id/20160923205843.zcs533sqdzlh4cpo%40alap3.anarazel.de

This is a great addition.

A couple of comments.

* 80% occupancy is a bit conservative for RH hashing, it works well up to
90% if you use the early stops due to distance. So that TODO is worth
pursuing.

* An optimized memory layout for RH hashmap is along the lines of
HHHKVKVKV, using one bit of the hash as the present/absent flag. That
plays specially well with large occupancies (~90%). Due to RH the
probings on the H array are very short and within a single cache line.
Even with a 31bit hash it's reallyyy rare to have to probe more than 1
entry in the KV array. Essentially guaranteeing that the 99% percentile
of lookups will only hit a couple of cache lines (if you ignore crossing
cache lines and other details).
Hi,

On 2016-10-03 13:26:09 +0200, Arthur Silva wrote:
> On Sat, Oct 1, 2016 at 2:44 AM, Andres Freund <andres@anarazel.de> wrote:
> A couple of comments.

> * 80% occupancy is a bit conservative for RH hashing, it works well up to
> 90% if you use the early stops due to distance. So that TODO is worth
> pursuing.

I found 90% a tiny bit slower during modifications, due to the increased
moving of cells around. I actually implemented that TODO at some point;
it's not actually a very clear win for narrow elements and mid-sized
tables - the additional instructions for distance computations cost.

I've played with a different version of robin hood hashing, where the
buckets are ordered by hash value. I.e. the hash value is shifted right,
instead of being masked, to get the hash bucket. That allows having a
hash table which is ordered by hash value, so distance checks simply
become >=. The problem with that is that it only works if you have an
"overflow" area at the end of the table, which is hard to size correctly.

> * An optimized memory layout for RH hashmap is along the lines of
> HHHKVKVKV, using one bit of the hash as the present/absent flag. That plays
> specially well with large occupancies (~90%). Due to RH the probings on the
> H array are very short and within a single cache line. Even with a 31bit
> hash it's reallyyy rare to have to probe more than 1 entry in the KV array.
> Essentially guaranteeing that the 99% percentile of lookups will only hit a
> couple of cache lines (if you ignore crossing cache lines and other
> details).

That seems likely to end up being noticeably more complicated - I'm not
sure the benefits outweigh the cpu overhead of the more complex lookups.
I'd like to get the basics in; we can optimize further later on. Based
on my instrumentation the majority of time is currently spent in the
cache miss for the initial bucket; everything else inside the hash table
code is quite small.
After that it's hash value computations in the execGrouping case. Greetings, Andres Freund
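The shift-instead-of-mask idea Andres describes above can be shown in a few lines: masking keeps a hash's low bits, while shifting keeps its high bits, so bucket numbers grow monotonically with the hash value - the ordering property the overflow-area variant relies on. The helper names here are illustrative, not from the patch:

```c
#include <assert.h>
#include <stdint.h>

/* Two ways to map a 32-bit hash onto 2^log2_size buckets.  Masking
 * keeps the low bits; shifting keeps the high bits, which makes the
 * bucket number a monotonic function of the hash value. */
static uint32_t
bucket_mask(uint32_t hash, int log2_size)
{
    return hash & ((1u << log2_size) - 1);
}

static uint32_t
bucket_shift(uint32_t hash, int log2_size)
{
    return hash >> (32 - log2_size);
}
```

With the shifted mapping, an entry stored "too far right" can be detected with a simple >= comparison on hash values, as described above.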
Hi, Attached is an updated version of the patchset. The main changes are - address most of Tomas' feedback - address regression test output changes by adding explicit ORDER BYs, in a separate commit. - fix issues around hash tables sized up to PG_UINT32_MAX - fix a bug in iteration with "concurrent" deletions > > We didn't really for dynahash (as it basically assumed a fillfactor of 1 > > all the time), not sure if changing it won't also cause issues. > > > > That's a case of glass half-full vs. half-empty, I guess. If we assumed load > factor 1.0, then I see it as accounting for load factor (while you see it as > not accounting of it). Well, that load factor is almost never achieved, because we'd have grown since... I added a comment about disregarding fill factor and growth policy to estimate_hashagg_tablesize, which is actually the place where it'd seemingly make sense to handle it. Tomas, did you have time to run a benchmark? I think this is starting to look good. Regards, Andres
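To make the fill-factor question above concrete: accounting for fill factor would simply scale the live-entry space by its reciprocal. A purely illustrative back-of-the-envelope helper (this is not how estimate_hashagg_tablesize works - per the discussion, fill factor and growth policy are deliberately disregarded there):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Illustrative arithmetic only: if a planner estimate did account for
 * fill factor, the allocated space would be the live-entry space divided
 * by the fill factor - e.g. 1/0.8 = 1.25x for a 0.8 fill factor. */
static double
est_hashtable_bytes(double ngroups, size_t entry_size, double fillfactor)
{
    return ngroups * (double) entry_size / fillfactor;
}
```
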
Hi, On 2016-10-11 02:38:26 +0200, Tomas Vondra wrote: > Yes, I've done a bunch of TPC-H and TPC-DS tests on the patch, but it took > quite a bit of time. These tests were done on a fairly small machine (the > usual i5-2500k with 8GB of RAM), with only 1GB data sets for both > benchmarks, to keep it completely in memory as I presume once we start > hitting I/O, it becomes the dominant part. Thanks! > TPC-DS (tpcds.ods) > ------------------ > > In this case, I'd say the results are less convincing. There are quite a few > queries that got slower by ~10%, which is well above - for example queries > 22 and 67. There are of course queries that got ~10% faster, and in total > the patched version executed more queries (so overall the result is slightly > positive, but not significantly). That's interesting. I wonder whether that's plan changes just due to the changing memory estimates, or what's causing that. I'll look into it. Greetings, Andres Freund
On 10/10/2016 01:38 AM, Andres Freund wrote: > Hi, > > Attached is an updated version of the patchset. The main changes are > - address most of Tomas' feedback > - address regression test output changes by adding explicit ORDER BYs, > in a separate commit. > - fix issues around hash tables sized up to PG_UINT32_MAX > - fix a bug in iteration with "concurrent" deletions > >>> We didn't really for dynahash (as it basically assumed a fillfactor of 1 >>> all the time), not sure if changing it won't also cause issues. >>> >> >> That's a case of glass half-full vs. half-empty, I guess. If we assumed load >> factor 1.0, then I see it as accounting for load factor (while you see it as >> not accounting of it). > > Well, that load factor is almost never achieved, because we'd have grown > since... I added a comment about disregarding fill factor and growth > policy to estimate_hashagg_tablesize, which is actually the place where > it'd seemingly make sense to handle it. > > Tomas, did you have time to run a benchmark? > Yes, I've done a bunch of TPC-H and TPC-DS tests on the patch, but it took quite a bit of time. These tests were done on a fairly small machine (the usual i5-2500k with 8GB of RAM), with only 1GB data sets for both benchmarks, to keep it completely in memory as I presume once we start hitting I/O, it becomes the dominant part. The machine was tuned a bit (shared_buffers=1GB, work_mem=256MB). There was no parallelism enabled, and the tests were done first with a single client and then with 4 clients running the queries for 4 hours. I plan to run the tests on some larger machine (with more RAM, allowing to use larger data sets while still keeping it in memory). But the machine is busy doing something else, so it'll have to wait. I didn't have time to do any sort of in-depth analysis (and I don't expect to have the time in foreseeable future) - in particular I have not looked at execution plans or anything like that. 
But let me show some quick summary: TPC-H (tpch.ods) ---------------- The results are either neutral or slightly positive - both in the case of single stream (summary) and 4 parallel streams (summary-4-streams). BTW I've happened to run the single stream tests twice while debugging some tooling issues, so that's why there are two sets of columns. Each stream executed ~9200 queries in total, which means ~450 executions for each query template. So, a lot of data, and the results look quite stable (despite the query parameters being random). There are a few queries that got a tiny bit (~3%) slower, but most of those queries are very short anyway (0.1s) making them susceptible to noise. On the other hand, there are a few queries that got ~10% faster, which is nice. It's not something that would significantly change our TPC-H scores, but not bad either. TPC-DS (tpcds.ods) ------------------ In this case, I'd say the results are less convincing. There are quite a few queries that got slower by ~10%, which is well above - for example queries 22 and 67. There are of course queries that got ~10% faster, and in total the patched version executed more queries (so overall the result is slightly positive, but not significantly). regards Tomas -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
> > TPC-DS (tpcds.ods)
> > ------------------
> >
> > In this case, I'd say the results are less convincing. There are quite a few
> > queries that got slower by ~10%, which is well above - for example queries
> > 22 and 67. There are of course queries that got ~10% faster, and in total
> > the patched version executed more queries (so overall the result is slightly
> > positive, but not significantly).
>
> That's interesting. I wonder whether that's plan changes just due to the
> changing memory estimates, or what's causing that. I'll look into it.

Hm. Based on an initial look those queries aren't planned with any of the
affected codepaths. Could this primarily be a question of randomness? Would
it perhaps make sense to run the tests in a comparable order? Looking at
tpcds.py and the output files, it seems that the query order differs
between the runs, which can easily explain a bigger difference than the
above. For me a scale=1 run creates a database of approximately 4.5GB, thus
with shared_buffers=1GB execution order is likely to have a significant
performance impact.

Greetings,

Andres Freund
On 10/11/2016 04:07 AM, Andres Freund wrote: > On 2016-10-10 17:46:22 -0700, Andres Freund wrote: >>> TPC-DS (tpcds.ods) >>> ------------------ >>> >>> In this case, I'd say the results are less convincing. There are quite a few >>> queries that got slower by ~10%, which is well above - for example queries >>> 22 and 67. There are of course queries that got ~10% faster, and in total >>> the patched version executed more queries (so overall the result is slightly >>> positive, but not significantly). >> >> That's interesting. I wonder whether that's plan changes just due to the >> changing memory estimates, or what's causing that. I'll look into it. > > Hm. Based on an initial look those queries aren't planned with any of > the affected codepaths. Could this primarily be a question of > randomness? Would it perhaps make sense to run the tests in a comparable > order? Looking at tpcds.py and the output files, it seems that the query > order differes between the runs, that can easily explain bigger > difference than the above. For me a scale=1 run creates a database of > approximately 4.5GB, thus with shared_buffers=1GB execution order is > likely to have a significant performance impact. > Yes, I see similar plans (no bitmap index scans or hash aggregates). But the difference is there, even when running the query alone (so it's not merely due to the randomized ordering). I wonder whether this is again due to compiler moving stuff around. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-10-11 04:29:31 +0200, Tomas Vondra wrote:
> On 10/11/2016 04:07 AM, Andres Freund wrote:
> > On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
> > > > TPC-DS (tpcds.ods)
> > > > ------------------
> > > >
> > > > In this case, I'd say the results are less convincing. There are quite a few
> > > > queries that got slower by ~10%, which is well above - for example queries
> > > > 22 and 67. There are of course queries that got ~10% faster, and in total
> > > > the patched version executed more queries (so overall the result is slightly
> > > > positive, but not significantly).
> > >
> > > That's interesting. I wonder whether that's plan changes just due to the
> > > changing memory estimates, or what's causing that. I'll look into it.
> >
> > Hm. Based on an initial look those queries aren't planned with any of
> > the affected codepaths. Could this primarily be a question of
> > randomness? Would it perhaps make sense to run the tests in a comparable
> > order? Looking at tpcds.py and the output files, it seems that the query
> > order differes between the runs, that can easily explain bigger
> > difference than the above. For me a scale=1 run creates a database of
> > approximately 4.5GB, thus with shared_buffers=1GB execution order is
> > likely to have a significant performance impact.
> >
>
> Yes, I see similar plans (no bitmap index scans or hash aggregates). But the
> difference is there, even when running the query alone (so it's not merely
> due to the randomized ordering).
>
> I wonder whether this is again due to compiler moving stuff around.

It looks like that. I looked through a significant set of plans where
there were time differences (generated on my machine), and none of them
had bitmap or hash groupings to any significant degree. Comparing
profiles in those cases usually shows a picture like:

24.98%  +0.32%  postgres  [.] slot_deform_tuple
16.58%  -0.05%  postgres  [.] ExecMakeFunctionResultNoSets
12.41%  -0.01%  postgres  [.] slot_getattr
 6.10%  +0.39%  postgres  [.] heap_getnext
 4.41%  -0.37%  postgres  [.] ExecQual
 3.08%  +0.12%  postgres  [.] ExecEvalScalarVarFast
 2.85%  -0.11%  postgres  [.] check_stack_depth
 2.48%  +0.42%  postgres  [.] ExecEvalConst
 2.44%  -0.33%  postgres  [.] heapgetpage
 2.34%  +0.11%  postgres  [.] ExecScan
 2.14%  -0.20%  postgres  [.] ExecStoreTuple

I.e. pretty random performance changes. This indeed looks like binary
layout changes. Looking at these plans and at profiles spanning a run
of all queries shows that bitmap scans and hash aggregations, while
present, account for a very small amount of time in total. So tpc-ds
doesn't look particularly interesting to evaluate these patches - but
very interesting for my slot deforming and qual evaluation patches.

Btw, query_14.sql as generated by your templates (in pgperffarm) doesn't
seem to work here. And I never had the patience to run query_1.sql to
completion... Looks like we could very well use some planner
improvements here.

Greetings,

Andres Freund
On 10/11/2016 05:56 PM, Andres Freund wrote:
> On 2016-10-11 04:29:31 +0200, Tomas Vondra wrote:
>> On 10/11/2016 04:07 AM, Andres Freund wrote:
>>> On 2016-10-10 17:46:22 -0700, Andres Freund wrote:
>>>>> TPC-DS (tpcds.ods)
>>>>> ------------------
>>>>>
>>>>> In this case, I'd say the results are less convincing. There are quite a few
>>>>> queries that got slower by ~10%, which is well above - for example queries
>>>>> 22 and 67. There are of course queries that got ~10% faster, and in total
>>>>> the patched version executed more queries (so overall the result is slightly
>>>>> positive, but not significantly).
>>>>
>>>> That's interesting. I wonder whether that's plan changes just due to the
>>>> changing memory estimates, or what's causing that. I'll look into it.
>>>
>>> Hm. Based on an initial look those queries aren't planned with any of
>>> the affected codepaths. Could this primarily be a question of
>>> randomness? Would it perhaps make sense to run the tests in a comparable
>>> order? Looking at tpcds.py and the output files, it seems that the query
>>> order differes between the runs, that can easily explain bigger
>>> difference than the above. For me a scale=1 run creates a database of
>>> approximately 4.5GB, thus with shared_buffers=1GB execution order is
>>> likely to have a significant performance impact.
>>>
>>
>> Yes, I see similar plans (no bitmap index scans or hash aggregates). But the
>> difference is there, even when running the query alone (so it's not merely
>> due to the randomized ordering).
>
>> I wonder whether this is again due to compiler moving stuff around.
>
> It looks like that. I looked through a significant set of plans where
> there we time differences (generated on my machine), and none of them
> had bitmap or hash groupings to any significant degree. Comparing
> profiles in those cases usually shows a picture like:
>
> 24.98%  +0.32%  postgres  [.] slot_deform_tuple
> 16.58%  -0.05%  postgres  [.] ExecMakeFunctionResultNoSets
> 12.41%  -0.01%  postgres  [.] slot_getattr
>  6.10%  +0.39%  postgres  [.] heap_getnext
>  4.41%  -0.37%  postgres  [.] ExecQual
>  3.08%  +0.12%  postgres  [.] ExecEvalScalarVarFast
>  2.85%  -0.11%  postgres  [.] check_stack_depth
>  2.48%  +0.42%  postgres  [.] ExecEvalConst
>  2.44%  -0.33%  postgres  [.] heapgetpage
>  2.34%  +0.11%  postgres  [.] ExecScan
>  2.14%  -0.20%  postgres  [.] ExecStoreTuple
>
> I.e. pretty random performance changes. This indeed looks like binary
> layout changes. Looking at these plans and at profiles spanning a run
> of all queries shows that bitmap scans and hash aggregations, while
> present, account for a very small amount of time in total. So tpc-ds
> doesn't look particularly interesting to evaluate these patches - but
> vey interesting for my slot deforming and qual evaluation patches.
>

Meh, that's annoying. Anyway, the reason why many of the TPC-DS queries
can't really benefit from the patch is that a lot of the queries use
grouping sets, and we only have a sorted implementation for that. I wonder
whether the TPC-H differences are also affected by this ...

> Btw, query_14.sql as generated by your templates (in pgperffarm) doesn't
> seem to work here. And I never had the patience to run query_1.sql to
> completion... Looks like we could very well use some planner
> improvements here.
>

Yeah, planner improvements would be great ;-) Query 14 needs an explicit
cast to text, otherwise you get something like a "can't cast unknown to
text" error. Not sure if that's an expected issue or not, but the cast
fixes it. I haven't pushed that to the pgperffarm repo yet, along with
several other tooling fixes.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Andres, Your patch, below, appears to have been applied to master in commit 5dfc198146b49ce7ecc8a1fc9d5e171fb75f6ba5. It makes mention of a function, tuplehash_start_iterate, in a macro, but the function is not defined or declared, and its signature and intention is not clear. Is there any chance you could add some documentation about how this function is intended to be used and defined? See InitTupleHashIterator in src/include/nodes/execnodes.h Thanks, Mark Dilger > On Oct 9, 2016, at 4:38 PM, Andres Freund <andres@anarazel.de> wrote: > > Hi, > > Attached is an updated version of the patchset. The main changes are > - address most of Tomas' feedback > - address regression test output changes by adding explicit ORDER BYs, > in a separate commit. > - fix issues around hash tables sized up to PG_UINT32_MAX > - fix a bug in iteration with "concurrent" deletions > >>> We didn't really for dynahash (as it basically assumed a fillfactor of 1 >>> all the time), not sure if changing it won't also cause issues. >>> >> >> That's a case of glass half-full vs. half-empty, I guess. If we assumed load >> factor 1.0, then I see it as accounting for load factor (while you see it as >> not accounting of it). > > Well, that load factor is almost never achieved, because we'd have grown > since... I added a comment about disregarding fill factor and growth > policy to estimate_hashagg_tablesize, which is actually the place where > it'd seemingly make sense to handle it. > > Tomas, did you have time to run a benchmark? > > I think this is starting to look good. 
> > > Regards, > > Andres > <0001-Add-likely-unlikely-branch-hint-macros.patch><0002-Make-regression-tests-less-dependent-on-hash-table-o.patch><0003-Add-a-macro-customizable-hashtable.patch><0004-Use-more-efficient-hashtable-for-tidbitmap.c-to-spee.patch><0005-Use-more-efficient-hashtable-for-execGrouping.c-to-s.patch> > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
Hi,

On 2016-12-09 15:21:36 -0800, Mark Dilger wrote:
> Andres,
>
> Your patch, below, appears to have been applied to master in commit
> 5dfc198146b49ce7ecc8a1fc9d5e171fb75f6ba5. It makes mention of a
> function, tuplehash_start_iterate, in a macro, but the function is not
> defined or declared, and its signature and intention is not clear. Is there
> any chance you could add some documentation about how this function
> is intended to be used and defined?
>
> See InitTupleHashIterator in src/include/nodes/execnodes.h

The patch generates functions based on the defined prefix. E.g.

/* define parameters necessary to generate the tuple hash table interface */
#define SH_PREFIX tuplehash
#define SH_ELEMENT_TYPE TupleHashEntryData
#define SH_KEY_TYPE MinimalTuple
#define SH_SCOPE extern
#define SH_DECLARE
#include "lib/simplehash.h"

makes tuplehash_start_iterate out of

#define SH_START_ITERATE SH_MAKE_NAME(start_iterate)
...
SH_SCOPE void SH_START_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter);
SH_SCOPE void
SH_START_ITERATE(SH_TYPE *tb, SH_ITERATOR *iter)
{
    ...
}

See simplehash.h's header for some explanation.

Hope that helps,

Andres
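The name-generation scheme described above can be reproduced in miniature. This is only an illustration of the token-pasting technique - the helper macro names and the placeholder function body are made up for the example, not copied from simplehash.h:

```c
#include <assert.h>

/* A minimal illustration of how simplehash.h builds symbol names: the
 * includer defines SH_PREFIX, and every generated name is pasted from
 * it.  Two expansion levels are needed because ## suppresses macro
 * expansion of its operands. */
#define SH_MAKE_NAME_INTERNAL(prefix, name) prefix##_##name
#define SH_MAKE_NAME_EXPAND(prefix, name) SH_MAKE_NAME_INTERNAL(prefix, name)
#define SH_MAKE_NAME(name) SH_MAKE_NAME_EXPAND(SH_PREFIX, name)

#define SH_PREFIX tuplehash

/* expands to: static int tuplehash_start_iterate(void) { ... }
 * (placeholder body - the real function initializes an iterator) */
static int
SH_MAKE_NAME(start_iterate)(void)
{
    return 0;
}
```

So tuplehash_start_iterate exists only after preprocessing the instantiating translation unit; grepping the source tree for the literal name finds nothing, which is what prompted the question above.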