Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash - Mailing list pgsql-bugs
From | Tomas Vondra |
---|---|
Subject | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash |
Date | |
Msg-id | 20191110172925.mlkglpwzwego3wtp@development |
In response to | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash (James Coleman <jtc331@gmail.com>) |
Responses | Re: BUG #16104: Invalid DSA Memory Alloc Request in Parallel Hash |
List | pgsql-bugs |
On Sun, Nov 10, 2019 at 09:15:06AM -0500, James Coleman wrote:
>On Sat, Nov 9, 2019 at 10:44 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>>
>> On Sun, Nov 10, 2019 at 3:25 PM James Coleman <jtc331@gmail.com> wrote:
>> > So I should have run the earlier attached plan with VERBOSE, but
>> > here's the interesting thing: the parallel hash node's seq scan node
>> > outputs two columns: let's call them (from the redacted plan)
>> > items.system_id and items.account_id. The first (system_id) is both
>> > not null and unique; the second (account_id) definitely has massive
>> > skew. I'm not very up-to-speed on how the hash building works, but I
>> > would have (perhaps naïvely?) assumed that the first column being
>> > unique would make the hash keys very likely not to collide in any
>> > significantly skewed way. Am I missing something here?
>>
>> Hrm. So the compound key is unique then. I was assuming up until now
>> that it had duplicates. The hashes of the individual keys are
>> combined (see ExecHashGetHashValue()), so assuming there is nothing
>> funky about the way citext gets hashed (and it's not jumping out at
>> me), your unique keys should give you uniform hash values and thus
>> partition size, and repartitioning should be an effective way of
>> reducing hash table size. So now it sounds like you have a simple
>> case of underestimation, but now I'm confused about how you got a
>> 344MB hash table with work_mem = 150MB:
>

Why would the compound key matter? The join only does this

    Hash Cond: (group_details.group_number = items.system_id)

so the only thing that really matters when it comes to skew is
system_id, i.e. the citext column. The other column is used by
aggregation, but that's irrelevant here, no?

>Looking at the source, citext's hash is a pretty standard call to
>hash_any, so I don't see how that would lead to any oddities (barring
>an intentional hash collision etc., but this is real data).
>
>Do you have any theories about where the underestimation is happening?
>It knows the number of rows reasonably well. The one thing I'm not
>sure about yet is the row width = 16. The account_id is a bigint, so
>we can assume it knows the size properly there. The system_id being
>citext...I think the 8 bytes it has for that is probably a bit low on
>average, but I'm not sure yet by how much (I'm going to run a query to
>find out). Hmm. Query came back, and average length is just shy of
>7.9...so the 16 byte row size is looking pretty good. So I'm at a loss
>of why/what it would be underestimating (does it know about
>uniqueness/non-uniqueness? could that be a factor?)?
>

I think the easiest thing we can do is to run a couple of queries
collecting useful stats, like

-- the most frequent system_id values
select system_id, count(*), sum(length(system_id)) from items
group by system_id
order by count(*) desc limit 100;

-- the largest groups by system_id
select system_id, count(*), sum(length(system_id)) from items
group by system_id
order by sum(length(system_id)) desc limit 100;

That should tell us if there's something off.
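It might also be worth checking what the planner itself believes about
that column. Something along these lines should do (only a sketch, and
the table/column names here are the redacted ones from the plan, so
adjust as needed):

-- planner statistics for items.system_id
select null_frac, avg_width, n_distinct, most_common_vals, most_common_freqs
from pg_stats
where tablename = 'items' and attname = 'system_id';

If avg_width or most_common_freqs look surprising, that would tie in
with the row-width and skew questions above.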
>I also don't know why it seems to regularly fail on the primary, but
>not on the sync, unless we adjust the work_mem up. I've double-checked
>all GUCs and the only differences are things related to replication
>that you'd expect to be different on primary/replica.
>
>I know there are some things that execute differently on replicas, so
>I assume it's something like that, but I don't know what specifically
>would cause this here.
>

Hmm, I hadn't realized it fails on the primary more often. That's
certainly strange, and I don't have a very good idea why that could be
happening. Do the query plans look the same? What does the EXPLAIN
ANALYZE output look like?

>> Buckets: 4194304 (originally 4194304)  Batches: 32768 (originally
>> 4096)  Memory Usage: 344448kB
>>
>> And I'm confused about what was different when it wanted the crazy
>> number of batches.
>
>I'm not quite sure how to find out; if you have any ideas, I'd love to
>hear them. The one thing I can think of to try is to slowly increase
>work_mem (since setting it to 500MB reproduced the error on the
>replica) and see if the bucket info starts to trend up.
>

Chances are it actually happened to detect skew and disabled the
growth, hence the table grew above work_mem. We should probably add
info about this (skew, disabling growth) to EXPLAIN ANALYZE verbose
mode.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
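For reference, the experiment discussed above (gradually raising work_mem
and watching how the hash node's bucket/batch numbers respond) would look
roughly like the following sketch. The identifiers are the redacted ones
from the plan in this thread, and the real query is more involved.

-- raise work_mem for this session, then inspect the hash sizing
set work_mem = '150MB';

explain (analyze, verbose)
select items.system_id, items.account_id
from items
join group_details on group_details.group_number = items.system_id;

The line to watch under the Parallel Hash node is the one quoted earlier,
i.e. "Buckets: ... Batches: ... Memory Usage: ...", across runs with
increasing work_mem settings.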