Thread: [HACKERS] Performance degradation in TPC-H Q18
Hello everyone,

While conducting the experiments for parallelism, Rafia came across a
hang in Q18 when the plan uses partial and finalize hash aggregate.
This could be seen on both scale factors - 20 and 300 - on setting
work_mem high enough so that the query uses hash aggregate. It seems
that commit b81b5a96f424531b97cdd1dba97d9d1b9c9d372e does not solve
the issue completely. I've tested on scale factor 10 with work_mem
100MB and the issue is reproducible. Below is the simplified part of
the query which uses the hash aggregate.

tpch10GB=# explain select l_orderkey from lineitem group by l_orderkey;
                                           QUERY PLAN
------------------------------------------------------------------------------------------------
 Finalize HashAggregate  (cost=1612872.73..1616801.32 rows=392859 width=8)
   Group Key: l_orderkey
   ->  Gather  (cost=1528408.04..1610908.43 rows=785718 width=8)
         Workers Planned: 2
         ->  Partial HashAggregate  (cost=1527408.04..1531336.63 rows=392859 width=8)
               Group Key: l_orderkey
               ->  Parallel Seq Scan on lineitem  (cost=0.00..1464922.43 rows=24994243 width=8)
(7 rows)

I've also tried to execute the same query with a different grouping column.

tpch10GB=# explain analyze select l_suppkey from lineitem group by l_suppkey;
                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
 Finalize HashAggregate  (cost=1549511.80..1550493.37 rows=98157 width=8) (actual time=22363.901..22380.361 rows=100000 loops=1)
   Group Key: l_suppkey
   ->  Gather  (cost=1528408.04..1549021.01 rows=196314 width=8) (actual time=22107.387..22244.848 rows=300000 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial HashAggregate  (cost=1527408.04..1528389.61 rows=98157 width=8) (actual time=22100.776..22135.063 rows=100000 loops=3)
               Group Key: l_suppkey
               ->  Parallel Seq Scan on lineitem  (cost=0.00..1464922.43 rows=24994243 width=8) (actual time=0.846..10664.258 rows=19995351 loops=3)
 Planning time: 0.171 ms
 Execution time: 22393.030 ms
(10 rows)

This one finished in time, even though the two plans have the same
shape and similar stats. Hence, the problem is specific to the first
query (the l_orderkey grouping) itself. As Robert suggested offline,
I've produced a perf report for the query:

  99.44%  0.01%  postgres  postgres      [.] LookupTupleHashEntry
  74.31% 53.55%  postgres  postgres      [.] tuplehash_insert
  17.05% 17.04%  postgres  libc-2.17.so  [.] __memcpy_ssse3_back
  11.06% 11.05%  postgres  postgres      [.] tuplehash_prev
  10.81% 10.80%  postgres  postgres      [.] tuplehash_next

It's clear from the perf report that something terrible is going on in
the simple hash insert. I've enabled logging of the hash stats
(SH_STAT()) during the hash-growing phase:

2017-02-21 16:42:39.634 IST [9372] LOG:  size: 1048576, members: 838860, filled: 0.799999, total chain: 2008236860, max chain: 66743, avg chain: 2394.007176, total_collisions: 278348, max_collisions: 8, avg_collisions: 0.331817
2017-02-21 16:42:39.634 IST [9372] STATEMENT:  explain analyze select l_orderkey from lineitem group by l_orderkey;
2017-02-21 16:42:39.741 IST [9372] LOG:  size: 2097152, members: 838860, filled: 0.400000, total chain: 788175, max chain: 124, avg chain: 0.939579, total_collisions: 216052, max_collisions: 6, avg_collisions: 0.257554
2017-02-21 16:42:39.741 IST [9372] STATEMENT:  explain analyze select l_orderkey from lineitem group by l_orderkey;

The max chain length is really high, which hampers the query's
performance.
To further understand the scenario, I've collected the following results:

tpch10GB=# select count(distinct l_orderkey) from lineitem;
  count
----------
 15000000

tpch10GB=# select correlation from pg_stats where tablename='lineitem' and attname='l_orderkey';
 correlation
-------------
           1

tpch10GB=# select count(distinct l_suppkey) from lineitem;
 count
--------
 100000

tpch10GB=# select correlation from pg_stats where tablename='lineitem' and attname='l_suppkey';
 correlation
-------------
  0.00900259

It seems that l_orderkey produces a pretty bad case for hashing with
linear probing and a bad choice of hash function (linear probing is
sensitive to elements with nearby hash codes). In [1], Andres proposed
a hack to resize the hashtable when a chain is growing quickly. Below
is a comment from the patch explaining the objective:

+		/*
+		 * To avoid negative consequences from overly imbalanced
+		 * hashtables, grow the hashtable if collisions lead to large
+		 * runs. The most likely cause of such imbalance is filling a
+		 * (currently) small table, from a currently big one, in
+		 * hash-table order.
+		 *
+		 * FIXME: compute boundaries in a more principled manner.
+		 */

After applying the patch, both TPC-H Q18 and the simplified query
complete in time. Although the patch works in this scenario, we need a
smarter way to trigger the hash expansion. The problem with the patch
is that it triggers a hash expansion even when the filled percentage
is pretty low, causing unnecessary memory consumption.

Thoughts?

[1] https://www.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de

0001-Resize-simplehash.h-table-in-case-of-long-runs.patch

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
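To make the failure mode - and the patch's grow-on-long-run defense -
concrete, here is a minimal, self-contained sketch of linear-probing
insertion with that defense added. This is illustrative C, not the
simplehash.h code: the table layout, the hash function, and the
MAX_RUN threshold are all made up for the example.

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdlib.h>

    typedef struct Table
    {
        uint32_t   *keys;   /* 0 == empty slot, for simplicity */
        uint32_t    size;   /* a power of two, initially > MAX_RUN */
        uint32_t    mask;   /* size - 1 */
    } Table;

    #define MAX_RUN 64      /* made-up threshold; the real patch tunes this */

    static uint32_t
    hash_u32(uint32_t k)
    {
        return k * 2654435769u; /* Fibonacci hashing, illustration only */
    }

    static bool insert(Table *tb, uint32_t key);

    static void
    grow(Table *tb)
    {
        uint32_t    oldsize = tb->size;
        uint32_t   *oldkeys = tb->keys;

        tb->size *= 2;
        tb->mask = tb->size - 1;
        tb->keys = calloc(tb->size, sizeof(uint32_t));
        for (uint32_t i = 0; i < oldsize; i++)
            if (oldkeys[i] != 0)
                insert(tb, oldkeys[i]);
        free(oldkeys);
    }

    static bool
    insert(Table *tb, uint32_t key)
    {
        uint32_t    bucket = hash_u32(key) & tb->mask;
        uint32_t    run = 0;

        while (tb->keys[bucket] != 0)
        {
            if (tb->keys[bucket] == key)
                return false;   /* already present */

            /*
             * A long run of occupied slots means this region is badly
             * clustered even if the table as a whole is mostly empty -
             * grow and retry instead of probing further.  This is the
             * patch's idea, and it also explains the growth at low
             * fill: one dense region resizes the whole table.
             */
            if (++run > MAX_RUN)
            {
                grow(tb);
                return insert(tb, key);
            }
            bucket = (bucket + 1) & tb->mask;
        }
        tb->keys[bucket] = key;
        return true;
    }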
On Wed, Feb 22, 2017 at 11:23 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> While conducting the experiments for parallelism, Rafia came across a
> hang in Q18 when the plan uses partial and finalize hash aggregate.
> This could be seen on both scale factors - 20 and 300 - on setting
> work_mem high enough so that the query uses hash aggregate. It seems
> that commit b81b5a96f424531b97cdd1dba97d9d1b9c9d372e does not solve
> the issue completely.

Andres, any thoughts? Isn't this the same issue that we were discussing in
https://www.postgresql.org/message-id/CA+TgmoYNO8qouPVO=1Q2aXzuxe942d_T5bcvZd9iKOC9tb3uLg@mail.gmail.com
over a month ago?

To me, it seems like a big problem that we have large, unfixed
performance regressions with simplehash four months after it went in.
I hate to suggest ripping the whole thing out, and it seems like
overkill, but it's pretty clear to me that the current state of things
is unacceptable, and that we're going to have a lot of unhappy users
if we ship it the way it is right now.

I want to point out that the kinds of problems we're hitting here are
exactly what I told you I was worried about before it went in - that
the average-case performance would be better, but that there would be
all-too-easy-to-hit cases where things got much worse because the
whole thing degenerated into a linear search. Not only did that
happen, but there seem to be multiple ways of producing it without
half trying, of which b81b5a96f424531b97cdd1dba97d9d1b9c9d372e fixed
only one. Something that's 5-10% faster in common cases but 2x or 10x
slower when things go wrong is not an improvement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2017-02-26 19:30:32 +0530, Robert Haas wrote:
> On Wed, Feb 22, 2017 at 11:23 AM, Kuntal Ghosh
> <kuntalghosh.2007@gmail.com> wrote:
> > While conducting the experiments for parallelism, Rafia came across a
> > hang in Q18 when the plan uses partial and finalize hash aggregate.
> > This could be seen on both scale factors - 20 and 300 - on setting
> > work_mem high enough so that the query uses hash aggregate. It seems
> > that commit b81b5a96f424531b97cdd1dba97d9d1b9c9d372e does not solve
> > the issue completely.
>
> Andres, any thoughts? Isn't this the same issue that we were discussing in
> https://www.postgresql.org/message-id/CA+TgmoYNO8qouPVO=1Q2aXzuxe942d_T5bcvZd9iKOC9tb3uLg@mail.gmail.com
> over a month ago?

Yes, I presume that it is.

> To me, it seems like a big problem that we have large, unfixed
> performance regressions with simplehash four months after it went in.

Yea, I agree. I'm fairly sure that the patch I posted in that thread
actually fixes the issue (and would also have made the already-applied
hash patch of yours a small-ish optimization). I held back because I
disliked the idea of magic constants, and I couldn't figure out a way
to properly "derive" them - but I'm inclined to simply live with the
magic constants.

> I hate to suggest ripping the whole thing out, and it seems like
> overkill

Yea, I don't think we're there yet. Let me push the patch that resizes
adaptively, and we can see how things are going from there.

> , but it's pretty clear to me that the current state of things
> is unacceptable, and that we're going to have a lot of unhappy users
> if we ship it the way it is right now.

Yea, if we can't improve upon the current state, we'd need to revert.

> I want to point out that the kinds of problems we're hitting here are
> exactly what I told you I was worried about before it went in - that
> the average-case performance would be better, but that there would be
> all-too-easy-to-hit cases where things got much worse because the
> whole thing degenerated into a linear search. Not only did that
> happen, but there seem to be multiple ways of producing it without
> half trying, of which b81b5a96f424531b97cdd1dba97d9d1b9c9d372e fixed
> only one.

Note that that one is also fixed with what I'm proposing (but it's a
worthwhile small-ish improvement nonetheless).

> Something that's 5-10% faster in common cases but 2x or 10x
> slower when things go wrong is not an improvement.

The hash-table part is more like 3x faster - but as so often when
making things faster, the next bottleneck is just around the corner...

- Andres
On Tue, Feb 28, 2017 at 11:16 PM, Andres Freund <andres@anarazel.de> wrote:
>> To me, it seems like a big problem that we have large, unfixed
>> performance regressions with simplehash four months after it went in.
>
> Yea, I agree. I'm fairly sure that the patch I posted in that thread
> actually fixes the issue (and would also have made the already-applied
> hash patch of yours a small-ish optimization). I held back because I
> disliked the idea of magic constants, and I couldn't figure out a way
> to properly "derive" them - but I'm inclined to simply live with the
> magic constants.

I think it's better to push in a proposed fix and see how it holds up
than to leave the tree in an unfixed state for long periods of time.
If the fix is good, then the fact that there's a magic constant
doesn't really matter all that much, and it can always be improved
later. On the other hand, if the fix is bad, pushing it improves the
chances of finding the problems. Not many people are going to test an
uncommitted fix.

BTW, another option to consider is lowering the target fillfactor.
IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
issue. Obviously, it's better to detect when we need a lower
fillfactor than to always use a lower one, but obviously the tighter
you pack the hash table, the more likely it is that you're going to
have these kinds of problems.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-03-01 08:42:35 +0530, Robert Haas wrote:
> On Tue, Feb 28, 2017 at 11:16 PM, Andres Freund <andres@anarazel.de> wrote:
> >> To me, it seems like a big problem that we have large, unfixed
> >> performance regressions with simplehash four months after it went in.
> >
> > Yea, I agree. I'm fairly sure that the patch I posted in that thread
> > actually fixes the issue (and would also have made the already-applied
> > hash patch of yours a small-ish optimization). I held back because I
> > disliked the idea of magic constants, and I couldn't figure out a way
> > to properly "derive" them - but I'm inclined to simply live with the
> > magic constants.
>
> I think it's better to push in a proposed fix and see how it holds up
> than to leave the tree in an unfixed state for long periods of time.

Yea, that was a mistake. I kind of was hoping for an epiphany that has
not yet come.

> BTW, another option to consider is lowering the target fillfactor.
> IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
> issue. Obviously, it's better to detect when we need a lower
> fillfactor than to always use a lower one, but obviously the tighter
> you pack the hash table, the more likely it is that you're going to
> have these kinds of problems.

Yea, that'd be one approach, but I feel better about dynamically
increasing the fillfactor like in the patch. Even with a lower
fillfactor you could see issues, and as you say a higher fillfactor is
nice [TM]; after the patch I played with *increasing* the fillfactor,
without being able to measure a downside.

Greetings,

Andres Freund
On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:
>> BTW, another option to consider is lowering the target fillfactor.
>> IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
>> issue. Obviously, it's better to detect when we need a lower
>> fillfactor than to always use a lower one, but obviously the tighter
>> you pack the hash table, the more likely it is that you're going to
>> have these kinds of problems.
>
> Yea, that'd be one approach, but I feel better about dynamically
> increasing the fillfactor like in the patch. Even with a lower
> fillfactor you could see issues, and as you say a higher fillfactor is
> nice [TM]; after the patch I played with *increasing* the fillfactor,
> without being able to measure a downside.

I've tested with different fill factors, and the query completed with
fill factor 0.6. With linear probing, the performance degrades more
quickly at high fill factors because of primary clustering, a tendency
for one collision to cause more nearby collisions. So, increasing the
fill factor further doesn't seem to be a good idea.

Besides, when I tested with the patch, I saw the hash table grow even
when only 10-12% of it was filled:

LOG:  size: 8388608, members: 845056, filled: 0.100739, total chain: 735723, max chain: 37, avg chain: 0.870620, total_collisions: 219101, max_collisions: 6, avg_collisions: 0.259274
STATEMENT:  explain analyze select l_orderkey from lineitem group by l_orderkey;
LOG:  size: 16777216, members: 845056, filled: 0.050369, total chain: 197047, max chain: 6, avg chain: 0.233176, total_collisions: 121185, max_collisions: 5, avg_collisions: 0.143405
STATEMENT:  explain analyze select l_orderkey from lineitem group by l_orderkey;
LOG:  size: 16777216, members: 2127649, filled: 0.126818, total chain: 1471327, max chain: 35, avg chain: 0.691527, total_collisions: 486310, max_collisions: 6, avg_collisions: 0.228567
STATEMENT:  explain analyze select l_orderkey from lineitem group by l_orderkey;
LOG:  size: 33554432, members: 2127649, filled: 0.063409, total chain: 413073, max chain: 7, avg chain: 0.194145, total_collisions: 265766, max_collisions: 5, avg_collisions: 0.124911
STATEMENT:  explain analyze select l_orderkey from lineitem group by l_orderkey;

This seems bad. We're wasting a lot of memory in the hash table.

Apart from that, I was thinking about the following logic in the code:

		/*
		 * If the bucket is not empty, we either found a match (in which case
		 * we're done), or we have to decide whether to skip over or move the
		 * colliding entry. When the colliding element's distance to its
		 * optimal position is smaller than the to-be-inserted entry's, we
		 * shift the colliding entry (and its followers) forward by one.
		 */

I'm worried about the fact that we are increasing the chain length of
each follower. It may happen that we've increased the chain length of
the last element by a huge factor. So, I was looking for alternatives
and found one called Robin Hood hashing. In this approach, when you
probe for a position to insert a new element, if the probe length for
the existing element is less than the current probe length for the
element being inserted, you swap the two elements and keep going. It
leads to a low variance of chain lengths, rather than pushing all the
displacement onto the last elements. Is this approach already
considered?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On 2017-03-01 09:13:15 +0530, Kuntal Ghosh wrote:
> On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:
> >> BTW, another option to consider is lowering the target fillfactor.
> >> IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
> >> issue. Obviously, it's better to detect when we need a lower
> >> fillfactor than to always use a lower one, but obviously the tighter
> >> you pack the hash table, the more likely it is that you're going to
> >> have these kinds of problems.
> >
> > Yea, that'd be one approach, but I feel better about dynamically
> > increasing the fillfactor like in the patch. Even with a lower
> > fillfactor you could see issues, and as you say a higher fillfactor is
> > nice [TM]; after the patch I played with *increasing* the fillfactor,
> > without being able to measure a downside.

> I've tested with different fill factors, and the query completed with
> fill factor 0.6.

That's without the patch in
http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de
? With that patch it should complete without that change, right?

> With linear probing, the performance degrades more quickly at high
> fill factors because of primary clustering, a tendency for one
> collision to cause more nearby collisions. So, increasing the fill
> factor further doesn't seem to be a good idea.

Well, the idea is increasing it, but *also* detecting clustering and
adaptively resizing earlier.

> So, I was looking for alternatives and found one called Robin Hood
> hashing.

simplehash.h implements robin hood hashing.

- Andres
On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:
>> BTW, another option to consider is lowering the target fillfactor.
>> IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
>> issue. Obviously, it's better to detect when we need a lower
>> fillfactor than to always use a lower one, but obviously the tighter
>> you pack the hash table, the more likely it is that you're going to
>> have these kinds of problems.
>
> Yea, that'd be one approach, but I feel better about dynamically
> increasing the fillfactor like in the patch. Even with a lower
> fillfactor you could see issues, and as you say a higher fillfactor is
> nice [TM]; after the patch I played with *increasing* the fillfactor,
> without being able to measure a downside.

I am going to bet that increasing the fillfactor would be a mistake.
The expected length of a clump in the table is going to be about
1/(1-ff), which increases very quickly beyond the current value of
0.8. For example, if the actual fillfactor is 0.9 and the keys are
perfectly distributed -- nine in a row and then one empty slot,
lather, rinse, repeat -- probing for a key that's not there has a 10%
chance of hitting an empty slot, a 10% chance of hitting the slot just
before an empty slot, and so on. So the expected number of probes to
find that there's no match is 4.5. However, it could easily happen
that you have 3 or 4 empty slots in a row in one place and 27 or 36
occupied slots in a row in another place, and that could cause all
kinds of grief. Moreover, real-world hash functions on real-world
data aren't always perfect, which increases the chances of things
going off track. I'm not exactly sure how high a fillfactor is too
high, but note that
https://en.wikipedia.org/wiki/Hash_table#Open_addressing claims that
"performance dramatically degrades when the load factor grows beyond
0.7 or so", which isn't cited, but the same text appears in
https://www.unf.edu/~wkloster/3540/wiki_book2.pdf for whatever that's
worth.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
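For reference, the classical uniform-hashing analysis of linear
probing (Knuth, TAOCP vol. 3) makes the same point quantitatively: the
expected probe counts for successful and unsuccessful searches at load
factor $\alpha$ are approximately

    C_{\mathrm{hit}} \approx \frac{1}{2}\left(1 + \frac{1}{1-\alpha}\right),
    \qquad
    C_{\mathrm{miss}} \approx \frac{1}{2}\left(1 + \frac{1}{(1-\alpha)^2}\right)

At $\alpha = 0.8$ an unsuccessful search - which every insert of a new
key performs - touches about 13 slots on average; at $\alpha = 0.9$ it
is about 50. The miss cost grows with the square of $1/(1-\alpha)$,
which is why pushing the fillfactor past 0.8 gets expensive quickly.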
On 2017-03-01 09:21:47 +0530, Robert Haas wrote:
> On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:
> > Yea, that'd be one approach, but I feel better about dynamically
> > increasing the fillfactor like in the patch. Even with a lower
> > fillfactor you could see issues, and as you say a higher fillfactor is
> > nice [TM]; after the patch I played with *increasing* the fillfactor,
> > without being able to measure a downside.
>
> I am going to bet that increasing the fillfactor would be a mistake.
> The expected length of a clump in the table is going to be about
> 1/(1-ff), which increases very quickly beyond the current value of
> 0.8. [...] I'm not exactly sure how high a fillfactor is too high,
> but note that
> https://en.wikipedia.org/wiki/Hash_table#Open_addressing claims that
> "performance dramatically degrades when the load factor grows beyond
> 0.7 or so", which isn't cited, but the same text appears in
> https://www.unf.edu/~wkloster/3540/wiki_book2.pdf for whatever that's
> worth.

That's without the "trick" of "robin hood" hashing though:
https://en.wikipedia.org/wiki/Hash_table#Robin_Hood_hashing
http://codecapsule.com/2013/11/17/robin-hood-hashing-backward-shift-deletion/
https://www.pvk.ca/Blog/2013/11/26/the-other-robin-hood-hashing/

How high you want the fillfactor probably depends a bit on the
scenario - for hashaggs a high fillfactor would probably be good
(they're often "read" heavy), for bitmapscans less so. But I guess
there's not much point in immediately increasing the fillfactor; we
can do that in a second step.

Greetings,

Andres Freund
On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-01 09:13:15 +0530, Kuntal Ghosh wrote:
>> On Wed, Mar 1, 2017 at 8:48 AM, Andres Freund <andres@anarazel.de> wrote:
>> >> BTW, another option to consider is lowering the target fillfactor.
>> >> IIRC, Kuntal mentioned to me that cranking it down seemed to fix the
>> >> issue. Obviously, it's better to detect when we need a lower
>> >> fillfactor than to always use a lower one, but obviously the tighter
>> >> you pack the hash table, the more likely it is that you're going to
>> >> have these kinds of problems.
>> >
>> > Yea, that'd be one approach, but I feel better about dynamically
>> > increasing the fillfactor like in the patch. Even with a lower
>> > fillfactor you could see issues, and as you say a higher fillfactor is
>> > nice [TM]; after the patch I played with *increasing* the fillfactor,
>> > without being able to measure a downside.
>
>> I've tested with different fill factors, and the query completed with
>> fill factor 0.6.
>
> That's without the patch in
> http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de
> ? With that patch it should complete without that change, right?

No, it's with the patch in the above link, which is
0001-Resize-simplehash.h-table-in-case-of-long-runs.patch. But I've
kept SH_FILLFACTOR at 0.8 and SH_MAX_FILLFACTOR at 0.95. I'll do
another round of testing with the constants provided by you.

>> With linear probing, the performance degrades more quickly at high
>> fill factors because of primary clustering, a tendency for one
>> collision to cause more nearby collisions. So, increasing the fill
>> factor further doesn't seem to be a good idea.
>
> Well, the idea is increasing it, but *also* detecting clustering and
> adaptively resizing earlier.
>
>> So, I was looking for alternatives and found one called Robin Hood
>> hashing.
>
> simplehash.h implements robin hood hashing.

But, it doesn't implement the swapping idea, right?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Hi,

On 2017-03-01 09:33:07 +0530, Kuntal Ghosh wrote:
> On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> >> So, I was looking for alternatives and found one called Robin Hood
> >> hashing.
> >
> > simplehash.h implements robin hood hashing.

> But, it doesn't implement the swapping idea, right?

It does, that's the if (insertdist > curdist) block in SH_INSERT.
Unless I misunderstand what you're proposing?

- Andres
On Wed, Mar 1, 2017 at 9:33 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
>> That's without the patch in
>> http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de
>> ? With that patch it should complete without that change, right?
>
> No, it's with the patch in the above link, which is
> 0001-Resize-simplehash.h-table-in-case-of-long-runs.patch. But I've
> kept SH_FILLFACTOR at 0.8 and SH_MAX_FILLFACTOR at 0.95. I'll do
> another round of testing with the constants provided by you.

I've tested with the patch
0001-Resize-simplehash.h-table-in-case-of-long-runs.patch, and the
results are the same, i.e., the hash table grows even when it is only
10-12% filled. I've attached the logfile for reference.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On Wed, Mar 1, 2017 at 9:38 AM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2017-03-01 09:33:07 +0530, Kuntal Ghosh wrote:
> >> So, I was looking for alternatives and found one called Robin Hood
> >> hashing.
> >
> > simplehash.h implements robin hood hashing.
>
>> But, it doesn't implement the swapping idea, right?
>
> It does, that's the if (insertdist > curdist) block in SH_INSERT.
> Unless I misunderstand what you're proposing?

I think the idea of the existing implementation is:

if (insertdist > curdist)
{
    find an empty space at the end of the cluster;
    shift all the followers by 1 position to make space for the
        element to be inserted;
    insert the element at the current place;
}

What I'm trying to say is:

if (insertdist > curdist)
{
    swap the entry to be inserted with the current entry;
    try to insert the current entry using the same logic;
}

So, the second approach will not cause all the followers to be
shifted by 1.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On 2017-03-01 10:47:45 +0530, Kuntal Ghosh wrote:
> if (insertdist > curdist)
> {
>     swap the entry to be inserted with the current entry;
>     try to insert the current entry using the same logic;
> }
>
> So, the second approach will not cause all the followers to be shifted by 1.

How not? You'll have to play that game until you find a free slot.
On Wed, Mar 1, 2017 at 10:53 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-03-01 10:47:45 +0530, Kuntal Ghosh wrote:
>> if (insertdist > curdist)
>> {
>>     swap the entry to be inserted with the current entry;
>>     try to insert the current entry using the same logic;
>> }
>>
>> So, the second approach will not cause all the followers to be shifted by 1.
>
> How not? You'll have to play that game until you find a free slot.

You can skip increasing the curdist of some elements for which it is
already pretty high. Suppose we have the following hash table and an
element e:

_,_,_,i,_,_,j,j+1,j+2,_,_

We want to insert e at i, but there is a collision and we start
probing. At j, the condition (insertdist > curdist) is satisfied and
we swap e with the element at j. Now we try to insert e (= the old
element at j), and so on. In this process, if the elements at j+1 and
j+2 already have a high distance from their optimal locations, we can
skip those elements and go ahead. But in the existing approach, we
increase their distances further. Thoughts?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
Hi,

On 2017-03-01 10:39:11 +0530, Kuntal Ghosh wrote:
> On Wed, Mar 1, 2017 at 9:33 AM, Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:
> > On Wed, Mar 1, 2017 at 9:19 AM, Andres Freund <andres@anarazel.de> wrote:
> >> That's without the patch in
> >> http://archives.postgresql.org/message-id/20161123083351.5vramz52nmdokhzz%40alap3.anarazel.de
> >> ? With that patch it should complete without that change, right?
> >
> > No, it's with the patch in the above link, which is
> > 0001-Resize-simplehash.h-table-in-case-of-long-runs.patch. But I've
> > kept SH_FILLFACTOR at 0.8 and SH_MAX_FILLFACTOR at 0.95. I'll do
> > another round of testing with the constants provided by you.
>
> I've tested with the patch
> 0001-Resize-simplehash.h-table-in-case-of-long-runs.patch and the
> results are the same, i.e., the hash table grows even when it is only
> 10-12% filled. I've attached the logfile for reference.

So, I observed similar things, where the *leader backend* reports a
lot of "long runs":

WARNING: clumping, growing this thing
LOG:  size: 1048576, members: 158071, filled: 0.150748, total chain: 95622, max chain: 35, avg chain: 0.604931, total_collisions: 35909, max_collisions: 5, avg_collisions: 0.227170
STATEMENT:  EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;
WARNING: clumping, growing this thing
LOG:  size: 2097152, members: 238992, filled: 0.113960, total chain: 1850141, max chain: 422, avg chain: 7.741435, total_collisions: 55363, max_collisions: 5, avg_collisions: 0.231652
STATEMENT:  EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;
WARNING: clumping, growing this thing
LOG:  size: 4194304, members: 881089, filled: 0.210068, total chain: 567907, max chain: 29, avg chain: 0.644551, total_collisions: 174681, max_collisions: 6, avg_collisions: 0.198256
STATEMENT:  EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;
WARNING: clumping, growing this thing
LOG:  size: 8388608, members: 5397649, filled: 0.643450, total chain: 5752690, max chain: 22, avg chain: 1.065777, total_collisions: 1437957, max_collisions: 8, avg_collisions: 0.266404
STATEMENT:  EXPLAIN ANALYZE select l_orderkey from lineitem group by l_orderkey;

The pertinent part is that there are "forced" resizes due to long runs
at fill factors of 0.15, 0.11, 0.21, 0.64. Note that there's a good
number of "ordinary" resizes, and at the end there are 7500000 tuples
in the hashtable. I.e. there are a number of resizes after the last
one due to clumping, and all of those are entirely guided through
fillfactor; the final hash-table is entirely reasonably sized.

Trying to do the math, that just didn't sit well with me. There's just
no way there could be so many and such long runs without something
being broken. And indeed: thinking about this even more, I realized
the reason this happens is that b81b5a96f4 didn't actually address the
problem of inserting in hash-table order.
To understand, let's first look at the query plan:

 Finalize HashAggregate  (cost=899227.75..903176.55 rows=394880 width=4) (actual time=6255.957..7415.152 rows=7500000 loops=1)
   Group Key: l_orderkey
   ->  Gather  (cost=652427.75..893304.55 rows=2369280 width=4) (actual time=1741.706..3699.248 rows=7940881 loops=1)
         Workers Planned: 6
         Workers Launched: 6
         ->  Partial HashAggregate  (cost=651427.75..655376.55 rows=394880 width=4) (actual time=1731.468..1956.693 rows=11344
               Group Key: l_orderkey
               ->  Parallel Seq Scan on lineitem  (cost=0.00..638927.80 rows=4999980 width=4) (actual time=0.025..711.771 rows
 Planning time: 0.192 ms
 Execution time: 7700.027 ms

The important part is the estimated rows=394880 and actual
rows=7500000. That means that the hash-table on the leader backend
will first be created suitably for 394880 rows. But we obviously get a
lot more.

The problem then is that even after introducing the hash_iv stuff
(b81b5a96f), we essentially still iterate over the tuples in
hash-order. TupleHashTableHash() computes the hash for a single-column
group condition as:

	uint32		hashkey = hashtable->hash_iv;

	hashkey = (hashkey << 1) | ((hashkey & 0x80000000) ? 1 : 0);
	attr = slot_getattr(slot, att, &isNull);
	if (!isNull)			/* treat nulls as having hash key 0 */
	{
		uint32		hkey;

		hkey = DatumGetUInt32(FunctionCall1(&hashfunctions[i], attr));
		hashkey ^= hkey;
	}

The resulting hash values aren't actually meaningfully influenced by
the IV. Because we just xor with the IV, most hash values that would
have fallen into a single hash bucket without the IV still fall into a
single hash bucket afterwards as well; just somewhere else in the hash
range.

That leads to the issue that we insert hundreds of rows into the
leader's hashtable that all fall into the same hash bucket (as the
leader's hashtable is quite small at this point, the hash buckets on
the worker tables have *far* fewer entries per bucket / are far
smaller on the workers than on the leader). Which then leads to
super-long runs.

As soon as the table gets to its "real" size, the issue of the
overlong runs is gone, because now the hash-table has the appropriate
size for its contents. So that explains why there are a lot of resizes
when the fillfactor is low (as you've observed, in the 10-20% range) -
in specific hash-ranges the fillfactor is extremely high, but in other
parts the table is effectively empty.

The question remains what to do about that...

I think it's pretty clear that we need a "defense" against such
unbalanced hashtables - now that I understand how we can get into the
situation of these being so unbalanced, I'm a lot less uncomfortable
adding the necessary logic.

In addition to that, it seems quite worthwhile to provide an iterator
that's not vulnerable to this. An approach that I am, seemingly
successfully, testing is to iterate the hashtable in multiple (in my
case 23, because why not) passes, accessing only every nth element.
That allows the data to be inserted in a far less "dense" fashion. But
that's more an optimization, so I'll just push something like the
patch mentioned in the thread already.

Makes some sense?

- Andres
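A tiny standalone program (hypothetical values, not PostgreSQL code)
shows the core of the problem: if two hash values land in the same
bucket of a small table, XORing both with the same IV leaves them in
the same bucket as each other, because XOR perturbs each bit position
independently.

    #include <stdio.h>
    #include <stdint.h>

    int
    main(void)
    {
        uint32_t    mask = (1 << 10) - 1;   /* 1024-bucket table */
        uint32_t    iv = 0xdeadbeefu;       /* arbitrary per-worker IV */
        uint32_t    h1 = 0x00013000u;       /* two hashes, equal low bits */
        uint32_t    h2 = 0x7ffff000u;

        /* same bucket without the IV... */
        printf("no IV:  %u %u\n", h1 & mask, h2 & mask);
        /* ...and still the same bucket as each other with it */
        printf("xor IV: %u %u\n", (h1 ^ iv) & mask, (h2 ^ iv) & mask);
        return 0;
    }

Any bucket-sized cluster in a worker's table therefore reappears
intact, as one cluster, in the leader's initially much smaller table.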
On 2017-03-01 11:05:33 +0530, Kuntal Ghosh wrote:
> On Wed, Mar 1, 2017 at 10:53 AM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-03-01 10:47:45 +0530, Kuntal Ghosh wrote:
> >> if (insertdist > curdist)
> >> {
> >>     swap the entry to be inserted with the current entry;
> >>     try to insert the current entry using the same logic;
> >> }
> >>
> >> So, the second approach will not cause all the followers to be shifted by 1.
> >
> > How not? You'll have to play that game until you find a free slot.

> You can skip increasing the curdist of some elements for which it is
> already pretty high. Suppose we have the following hash table and an
> element e:
>
> _,_,_,i,_,_,j,j+1,j+2,_,_
>
> We want to insert e at i, but there is a collision and we start
> probing. At j, the condition (insertdist > curdist) is satisfied and
> we swap e with the element at j. Now we try to insert e (= the old
> element at j), and so on. In this process, if the elements at j+1 and
> j+2 already have a high distance from their optimal locations, we can
> skip those elements and go ahead. But in the existing approach, we
> increase their distances further. Thoughts?

All the following elements already are at their "optimal" position
given the previous occupation of the hash-table. As we only start to
move once the insert-distance of the existing element is lower than
that of the to-be-inserted one, the following elements will all be
moved. Doing the swap on an element-by-element basis would be
possible, but it just achieves the same result at a higher cost.

Regards,

Andres
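For readers following along, here is a condensed sketch of the logic
being discussed - modeled on simplehash.h's SH_INSERT but heavily
simplified (no growth, no deletion, and entries carry only their
hash):

    #include <stdint.h>
    #include <stdbool.h>

    typedef struct Entry
    {
        uint32_t    hash;
        bool        used;
    } Entry;

    static void
    rh_insert(Entry *tab, uint32_t mask, uint32_t hash)
    {
        uint32_t    cur = hash & mask;  /* optimal bucket */
        uint32_t    insertdist = 0;     /* distance from optimal bucket */

        for (;;)
        {
            Entry      *e = &tab[cur];
            uint32_t    curdist;

            if (!e->used)
            {
                e->hash = hash;
                e->used = true;
                return;
            }

            /* resident entry's distance from its own optimal bucket */
            curdist = (cur - (e->hash & mask)) & mask;

            if (insertdist > curdist)
            {
                /*
                 * The incoming entry is "poorer" than the resident
                 * one: shift the resident entry and all its followers
                 * forward by one slot and take this position.
                 * Element-by-element swapping would move exactly the
                 * same entries, just with more copying.
                 */
                uint32_t    end = cur;

                while (tab[end].used)
                    end = (end + 1) & mask;
                while (end != cur)
                {
                    uint32_t    prev = (end - 1) & mask;

                    tab[end] = tab[prev];
                    end = prev;
                }
                e->hash = hash;
                e->used = true;
                return;
            }
            cur = (cur + 1) & mask;
            insertdist++;
        }
    }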
On Fri, Mar 3, 2017 at 1:22 AM, Andres Freund <andres@anarazel.de> wrote:
> the resulting hash values aren't actually meaningfully influenced by
> the IV. Because we just xor with the IV, most hash values that would
> have fallen into a single hash bucket without the IV still fall into
> a single hash bucket afterwards as well; just somewhere else in the
> hash range.

Wow, OK. I had kind of assumed (without looking) that setting the
hash IV did something a little more useful than that. Maybe we should
do something like:

    struct blah { int iv; int hv; };
    newhv = hash_any(&blah, sizeof(blah));

> In addition to that, it seems quite worthwhile to provide an iterator
> that's not vulnerable to this. An approach that I am, seemingly
> successfully, testing is to iterate the hashtable in multiple (in my
> case 23, because why not) passes, accessing only every nth element.
> That allows the data to be inserted in a far less "dense" fashion.
> But that's more an optimization, so I'll just push something like the
> patch mentioned in the thread already.
>
> Makes some sense?

Yep.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 3, 2017 at 8:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 3, 2017 at 1:22 AM, Andres Freund <andres@anarazel.de> wrote:
>> the resulting hash values aren't actually meaningfully influenced by
>> the IV. Because we just xor with the IV, most hash values that would
>> have fallen into a single hash bucket without the IV still fall into
>> a single hash bucket afterwards as well; just somewhere else in the
>> hash range.
>
> Wow, OK. I had kind of assumed (without looking) that setting the
> hash IV did something a little more useful than that. Maybe we should
> do something like:
>
>     struct blah { int iv; int hv; };
>     newhv = hash_any(&blah, sizeof(blah));

Sounds good. I've also seen a post from Thomas Munro suggesting an
alternative approach for combining hash values in execGrouping.c [1].

>> In addition to that, it seems quite worthwhile to provide an iterator
>> that's not vulnerable to this. An approach that I am, seemingly
>> successfully, testing is to iterate the hashtable in multiple (in my
>> case 23, because why not) passes, accessing only every nth element.
>> That allows the data to be inserted in a far less "dense" fashion.
>> But that's more an optimization, so I'll just push something like the
>> patch mentioned in the thread already.
>>
>> Makes some sense?
>
> Yep.

Yes, it makes sense. Quadratic probing is another approach, but it
would require an extra shift op every time we want to find the next or
previous location during a collision.

[1] https://www.postgresql.org/message-id/CAEepm%3D3rdgjfxW4cKvJ0OEmya2-34B0qHNG1xV0vK7TGPJGMUQ@mail.gmail.com

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
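As an aside, the triangular-number variant of quadratic probing that
suits power-of-two tables is only a couple of extra ops per probe - a
sketch, with a hypothetical occupancy bitmap rather than anything from
simplehash.h:

    #include <stdint.h>
    #include <stdbool.h>

    /*
     * Probe offsets 0, 1, 3, 6, 10, ... (triangular numbers); with a
     * power-of-two table size this sequence visits every slot exactly
     * once before repeating, so the search terminates if any slot is
     * free.
     */
    static uint32_t
    quadratic_probe(const bool *used, uint32_t mask, uint32_t hash)
    {
        uint32_t    bucket = hash & mask;
        uint32_t    step = 0;

        while (used[bucket])
        {
            step++;
            bucket = (bucket + step) & mask;
        }
        return bucket;
    }

The real cost, as Andres notes below, isn't the arithmetic: successive
probes jump farther and farther apart instead of walking one cache
line, so collisions turn into unpredictable memory accesses.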
Hi,

attached is a patch to address this problem, and the one reported by
Dilip. I ran a lot of TPC-H and other benchmarks, and so far this
addresses all the performance issues, often being noticeably faster
than with the dynahash code.

Comments?

On 2017-03-03 11:23:00 +0530, Kuntal Ghosh wrote:
> On Fri, Mar 3, 2017 at 8:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Mar 3, 2017 at 1:22 AM, Andres Freund <andres@anarazel.de> wrote:
> >> the resulting hash values aren't actually meaningfully influenced by
> >> the IV. Because we just xor with the IV, most hash values that would
> >> have fallen into a single hash bucket without the IV still fall into
> >> a single hash bucket afterwards as well; just somewhere else in the
> >> hash range.
> >
> > Wow, OK. I had kind of assumed (without looking) that setting the
> > hash IV did something a little more useful than that. Maybe we should
> > do something like:
> >
> >     struct blah { int iv; int hv; };
> >     newhv = hash_any(&blah, sizeof(blah));

The hash invocations are already noticeable performance-wise, so I'm a
bit hesitant to go there. I'd rather introduce a decent 'hash_combine'
function or such.

> Sounds good. I've also seen a post from Thomas Munro suggesting an
> alternative approach for combining hash values in execGrouping.c [1].

Yea, I played with that too - it's a bit better, but doesn't really
address the problem fully either. It seems to generally reduce
collisions in multi-column keys a bit, but not that much.

> >> In addition to that, it seems quite worthwhile to provide an iterator
> >> that's not vulnerable to this. An approach that I am, seemingly
> >> successfully, testing is to iterate the hashtable in multiple (in my
> >> case 23, because why not) passes, accessing only every nth element.
> >> That allows the data to be inserted in a far less "dense" fashion.
> >> But that's more an optimization, so I'll just push something like the
> >> patch mentioned in the thread already.
> >>
> >> Makes some sense?
> >
> > Yep.
>
> Yes, it makes sense. Quadratic probing is another approach, but it
> would require an extra shift op every time we want to find the next or
> previous location during a collision.

That's not really the big problem with quadratic probing. The issue is
that it leads to further random, unpredictable memory accesses in case
of collisions, whereas linear probing has easy-to-predict accesses.

Greetings,

Andres Freund
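For concreteness, a "decent hash_combine" could follow the well-known
Boost recipe, possibly paired with a full-avalanche finalizer such as
Murmur3's - both shown here as illustrative sketches, not as anything
from the attached patch:

    #include <stdint.h>

    /*
     * Boost-style combiner: cheap (no multiplies), but each
     * application only folds a few of the seed's high bits back into
     * its low bits.
     */
    static inline uint32_t
    hash_combine(uint32_t seed, uint32_t extra)
    {
        seed ^= extra + 0x9e3779b9u + (seed << 6) + (seed >> 2);
        return seed;
    }

    /*
     * Murmur3 finalizer: full avalanche - every output bit depends on
     * every input bit - at the cost of two multiplies.
     */
    static inline uint32_t
    mix32(uint32_t h)
    {
        h ^= h >> 16;
        h *= 0x85ebca6bu;
        h ^= h >> 13;
        h *= 0xc2b2ae35u;
        h ^= h >> 16;
        return h;
    }

With something like hashkey = mix32(hashkey ^ hash_iv) (hypothetical
usage), a set of values co-bucketed in a worker's small table would be
scattered across the leader's table instead of arriving as one run,
which a plain XOR of the IV cannot achieve.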
On Sat, Mar 4, 2017 at 5:56 AM, Andres Freund <andres@anarazel.de> wrote:
> attached is a patch to address this problem, and the one reported by
> Dilip. I ran a lot of TPC-H and other benchmarks, and so far this
> addresses all the performance issues, often being noticeably faster
> than with the dynahash code.
>
> Comments?

I'm still not convinced that raising the fillfactor like this is going
to hold up in testing, but I don't mind you committing it and we'll
see what happens. If it is possible to survive with a fillfactor that
high, it certainly has some advantages, and it's obviously more likely
to work out with the other logic you've added. I think we need a lot
more people playing with this to know whether any given approach is
right, and I think getting something committed will help more than any
amount of theoretical discussion.

I think DEBUG1 is far too high for something that could occur with
some frequency on a busy system; I'm fairly strongly of the opinion
that you ought to downgrade that by a couple of levels, say to DEBUG3
or so.

> The hash invocations are already noticeable performance-wise, so I'm a
> bit hesitant to go there. I'd rather introduce a decent 'hash_combine'
> function or such.

Yes, that might be better. I wasn't too sure the approach I proposed
would actually do a sufficiently good job of mixing in the bits from
the IV anyway. It's important to keep in mind that the values we're
using as IVs aren't necessarily going to be uniformly distributed in
any meaningful way. They're just PIDs, so you might only have 1-3 bits
of difference between one value and another within the same parallel
query. If you don't do something fairly aggressive to make that change
perturb the final hash value, it probably won't.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 4, 2017 at 11:09 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sat, Mar 4, 2017 at 5:56 AM, Andres Freund <andres@anarazel.de> wrote:
>> attached is a patch to address this problem, and the one reported by
>> Dilip. I ran a lot of TPC-H and other benchmarks, and so far this
>> addresses all the performance issues, often being noticeably faster
>> than with the dynahash code.
>>
>> Comments?

+				 errdetail("size %f/members %f: factor %.2f",
+						   (double) tb->size, (double) tb->members,
+						   (double) tb->members / tb->size),

In SH_STAT, we use 'filled' instead of 'factor'. Also, for displaying
size and members, there is no need to convert those into double.

> I'm still not convinced that raising the fillfactor like this is going
> to hold up in testing, but I don't mind you committing it and we'll
> see what happens. If it is possible to survive with a fillfactor that
> high, it certainly has some advantages, and it's obviously more likely
> to work out with the other logic you've added. I think we need a lot
> more people playing with this to know whether any given approach is
> right, and I think getting something committed will help more than any
> amount of theoretical discussion.

+1

> I think DEBUG1 is far too high for something that could occur with
> some frequency on a busy system; I'm fairly strongly of the opinion
> that you ought to downgrade that by a couple of levels, say to DEBUG3
> or so.

+1

I've tested with TPC-H query 18 and it's working fine.

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com
On 2017-03-04 11:09:40 +0530, Robert Haas wrote:
> On Sat, Mar 4, 2017 at 5:56 AM, Andres Freund <andres@anarazel.de> wrote:
> > attached is a patch to address this problem, and the one reported by
> > Dilip. I ran a lot of TPC-H and other benchmarks, and so far this
> > addresses all the performance issues, often being noticeably faster
> > than with the dynahash code.
> >
> > Comments?
>
> I'm still not convinced that raising the fillfactor like this is going
> to hold up in testing, but I don't mind you committing it and we'll
> see what happens.

I didn't see anything in testing, but I agree that it's debatable. But
I'd rather commit it now, when we all know it's new code. Raising it
in a new release will be a lot harder.

> I think DEBUG1 is far too high for something that could occur with
> some frequency on a busy system; I'm fairly strongly of the opinion
> that you ought to downgrade that by a couple of levels, say to DEBUG3
> or so.

I actually planned to remove it entirely before committing. It was
more left in for testers to be able to see when the code triggers.

> > The hash invocations are already noticeable performance-wise, so I'm a
> > bit hesitant to go there. I'd rather introduce a decent 'hash_combine'
> > function or such.
>
> Yes, that might be better. I wasn't too sure the approach I proposed
> would actually do a sufficiently good job of mixing in the bits from
> the IV anyway. It's important to keep in mind that the values we're
> using as IVs aren't necessarily going to be uniformly distributed in
> any meaningful way. They're just PIDs, so you might only have 1-3 bits
> of difference between one value and another within the same parallel
> query. If you don't do something fairly aggressive to make that change
> perturb the final hash value, it probably won't.

FWIW, I played with some better mixing, and it helps a bit with
accurately sized hashtables and multiple columns.

What's however more interesting is that a better-mixed IV and/or
better iteration now *slightly* *hurts* performance with grossly
misestimated sizes, because resizing has to copy more rows... Not what
I predicted.

Greetings,

Andres Freund
On Mon, Mar 6, 2017 at 3:32 PM, Andres Freund <andres@anarazel.de> wrote:
>> I think DEBUG1 is far too high for something that could occur with
>> some frequency on a busy system; I'm fairly strongly of the opinion
>> that you ought to downgrade that by a couple of levels, say to DEBUG3
>> or so.
>
> I actually planned to remove it entirely before committing. It was
> more left in for testers to be able to see when the code triggers.

Oh, OK. That'd be fine too.

> FWIW, I played with some better mixing, and it helps a bit with
> accurately sized hashtables and multiple columns.
>
> What's however more interesting is that a better-mixed IV and/or
> better iteration now *slightly* *hurts* performance with grossly
> misestimated sizes, because resizing has to copy more rows... Not
> what I predicted.

I don't quite follow this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-03-06 16:58:23 -0500, Robert Haas wrote:
> On Mon, Mar 6, 2017 at 3:32 PM, Andres Freund <andres@anarazel.de> wrote:
> >> I think DEBUG1 is far too high for something that could occur with
> >> some frequency on a busy system; I'm fairly strongly of the opinion
> >> that you ought to downgrade that by a couple of levels, say to DEBUG3
> >> or so.
> >
> > I actually planned to remove it entirely before committing. It was
> > more left in for testers to be able to see when the code triggers.
>
> Oh, OK. That'd be fine too.

And pushed that way.

> > FWIW, I played with some better mixing, and it helps a bit with
> > accurately sized hashtables and multiple columns.
> >
> > What's however more interesting is that a better-mixed IV and/or
> > better iteration now *slightly* *hurts* performance with grossly
> > misestimated sizes, because resizing has to copy more rows... Not
> > what I predicted.
>
> I don't quite follow this.

The whole performance issue triggering this thread only exists when
the hashtable sizes are mis-estimated, right? It turns out that after
applying the just-committed changes, "fixing" the bad mixing and/or
doing iteration that's not entirely in hash order slightly degrades
performance. The difference is that without either of those additional
changes, we resize to the "right" size very early, when the hashtable
is barely filled (i.e. only a few entries need to be moved), because
the imbalance is observed at the start. With the changes, however, the
resizing happens when the table is pretty full (i.e. a lot of entries
need to be moved). So the early imbalance ends up actually not hurting
performance...

- Andres
On Mon, Mar 6, 2017 at 5:19 PM, Andres Freund <andres@anarazel.de> wrote:
> The whole performance issue triggering this thread only exists when
> the hashtable sizes are mis-estimated, right? It turns out that after
> applying the just-committed changes, "fixing" the bad mixing and/or
> doing iteration that's not entirely in hash order slightly degrades
> performance. The difference is that without either of those additional
> changes, we resize to the "right" size very early, when the hashtable
> is barely filled (i.e. only a few entries need to be moved), because
> the imbalance is observed at the start. With the changes, however, the
> resizing happens when the table is pretty full (i.e. a lot of entries
> need to be moved). So the early imbalance ends up actually not hurting
> performance...

Hmm. I don't know what to do about that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2017-03-06 18:59:02 -0500, Robert Haas wrote:
> On Mon, Mar 6, 2017 at 5:19 PM, Andres Freund <andres@anarazel.de> wrote:
> > The whole performance issue triggering this thread only exists when
> > the hashtable sizes are mis-estimated, right? It turns out that after
> > applying the just-committed changes, "fixing" the bad mixing and/or
> > doing iteration that's not entirely in hash order slightly degrades
> > performance. The difference is that without either of those additional
> > changes, we resize to the "right" size very early, when the hashtable
> > is barely filled (i.e. only a few entries need to be moved), because
> > the imbalance is observed at the start. With the changes, however, the
> > resizing happens when the table is pretty full (i.e. a lot of entries
> > need to be moved). So the early imbalance ends up actually not hurting
> > performance...
>
> Hmm. I don't know what to do about that.

Oh, I don't think we need to do much about it either way - I think
it's more an interesting curiosity than anything. It seems to reduce
the urgency for better iteration a bit, that's it.

- Andres