Re: Parallel Seq Scan vs kernel read ahead - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: Parallel Seq Scan vs kernel read ahead
Msg-id: CA+TgmoY7M4wxoLgc-+-hZV2fFHooQ0JDDYLLoWVbX8QUK5rWmg@mail.gmail.com
In response to: Re: Parallel Seq Scan vs kernel read ahead (David Rowley <dgrowleyml@gmail.com>)
Responses: Re: Parallel Seq Scan vs kernel read ahead
List: pgsql-hackers

On Mon, Jun 15, 2020 at 5:09 PM David Rowley <dgrowleyml@gmail.com> wrote:
> To summarise what's been proposed so far:
>
> 1. Use a constant (e.g. 64) as the parallel step size
> 2. Ramp up the step size over time
> 3. Ramp down the step size towards the end of the scan.
> 4. Auto-determine a good step size based on the size of the relation.
> 5. Add a GUC to allow users to control or override the step size.
> 6. Add a reloption to allow users to control or override the step size.
>
> Here are my thoughts on each of those:
>
> #1 is a bad idea as there are legitimate use-cases for using parallel
> query on small tables, e.g. calling some expensive parallel-safe
> function. Small tables are more likely to be cached.

I agree.

> #2 I don't quite understand why this is useful

I was thinking that if the query had a small LIMIT, you'd want to avoid handing out excessively large chunks, but actually that seems like it might just be fuzzy thinking on my part. We're not committing to scanning the entirety of the chunk just because we've assigned it to a worker.

> #3 I understand this is to try to make it so workers all complete
> around about the same time.
> #4 We really should be doing it this way.
> #5 Having a global knob to control something that is very specific to
> the size of a relation does not make much sense to me.
> #6 I imagine someone will have some weird use-case that works better
> when parallel workers get 1 page at a time. I'm not convinced that
> they're not doing something else wrong.

Agree with all of that.

> So my vote is for 4, with possibly 3 if we can come up with something
> smart enough * that works well in parallel. I think there's less of a
> need for this if we divided the relation into more chunks, e.g. 8192
> or 16384.

I agree with that too.

> * Perhaps when there are less than 2 full chunks remaining, workers
> can just take half of what is left, or more specifically
> Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work
> out to allocating an amount of pages proportional to the amount of beer
> each mathematician receives in the "An infinite number of
> mathematicians walk into a bar" joke, obviously with the exception
> that we stop dividing when we get to 1. However, I'm not quite sure
> how well that can be made to work with multiple bartenders working in
> parallel.

That doesn't sound nearly aggressive enough to me. Let's suppose we're concerned about the scenario where one chunk takes 50x as long as all the other chunks. If we have 1024 chunks total and we hit the problem chunk near the beginning, there will be no problem: in effect, there are 1073 units of work instead of 1024, and we accidentally assigned one worker 50 units of work when we thought we were assigning 1. If there's enough work left that we can assign each of the other workers 49 units more than they would have received had that chunk cost the same as all the others, then there's still no problem. So, for instance, if there are 4 workers, we can still even things out if we hit the problematic chunk more than ~150 chunks from the end. If we're closer to the end than that, there's no way to avoid the slow chunk delaying the overall completion time, and the problem gets worse as the problem chunk gets closer to the end.
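(To spell out that arithmetic, here's a throwaway C fragment -- not proposed code, and the numbers are only this example's:)

    #include <stdio.h>

    int
    main(void)
    {
        int     nchunks = 1024;     /* total chunks in the scan */
        double  slow_cost = 50.0;   /* one chunk costs 50x a normal one */
        int     nworkers = 4;

        /* 1023 normal chunks plus one chunk worth 50 units => 1073 units */
        double  total_units = (nchunks - 1) + slow_cost;

        /*
         * The worker that draws the slow chunk ends up with 49 more units
         * than planned, so each of the other 3 workers must still have ~49
         * ordinary chunks left to pick up, i.e. the slow chunk has to show
         * up more than about 3 * 49 = 147 chunks before the end of the scan.
         */
        double  breakeven = (nworkers - 1) * (slow_cost - 1.0);

        printf("total work: %.0f units; breakeven: ~%.0f chunks from the end\n",
               total_units, breakeven);
        return 0;
    }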
How can we improve? Well, if we reduce the chunk size by 2x once we're less than 150 chunks from the end, then instead of having 1 chunk that is 50x as expensive, hopefully we'll have 2 smaller chunks that are each 50x as expensive. They'll get assigned to 2 different workers, and the remaining 2 workers now need enough extra work from other chunks to even out the work distribution, which should still be possible. It gets tough, though, if breaking the one expensive chunk in half produces 1 regular-price half-chunk and 1 half-chunk that is 50x as expensive as all the others. Because we have <150 chunks left, there's no way to keep everybody else busy until the sole expensive half-chunk completes. In a sufficiently extreme scenario, assigning even a single full block to a worker is too much, and you really want to hand the tuples out individually.

Anyway, if we don't do anything special until we get down to the last 2 chunks, it's only going to make a difference when one of those last 2 chunks happens to be the expensive one. If, say, the third-to-last chunk is the expensive one, subdividing the last 2 chunks just lets all the workers who didn't get the expensive chunk fight over the scraps, which is not an improvement. If anything it's worse, because there's more communication overhead and you don't gain anything vs. just assigning each chunk to a worker straight up.

In a real example we don't know that we have a single expensive chunk -- each chunk just has its own cost, and they could all be the same, or some could be much more expensive. When we have a lot of work left, we can be fairly cavalier about handing out larger chunks of it, with full confidence that even if some of those chunks turn out to be way more expensive than others, we'll still be able to equalize the finishing times by our choice of how to distribute the remaining work. But as there's less and less work left, I think we need to hand out the work in smaller increments to maximize the chances of obtaining an equal work distribution.

So maybe one idea is to base the chunk size on the number of blocks remaining to be scanned. Say the chunk size is limited to 1/512 of the *remaining* blocks in the relation, probably with some upper limit. I doubt going beyond 1GB makes any sense, if only because that's how large the underlying files are, and it might be too big for other reasons; but let's just use that as an example. Say you have a 20TB relation. You hand out 1GB segments until you get down to 512GB remaining, then 512MB segments until you get down to 256GB remaining, then 256MB segments until you get down to 128GB remaining, and so on. Once you get down to the last 4MB you're handing out individual blocks, just as you would do from the beginning if the whole relation were only 4MB.

This kind of thing is a bit overcomplicated, and it doesn't really help if the first 1GB you hand out at the very beginning turns out to be the 1GB chunk of death: it takes a bazillion times longer than anything else, and whichever worker gets it is going to be the last one to finish no matter what else you do. The increasing granularity near the end is just fighting over scraps in that case. The only way to avoid that kind of problem is to use a lower maximum chunk size from the beginning, and I think we might want to consider doing that, because I suspect the incremental benefit of going from 64MB chunks to 1GB chunks is pretty small. But in more normal cases, where you have some somewhat-expensive chunks mixed in with the regular-price chunks, I think this sort of thing should work pretty well.
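(As a very rough sketch of that rule -- the names are made up, it assumes 8kB blocks and a 1GB cap, and it says nothing about how this would actually be wired into the parallel scan machinery:)

    #include <stdint.h>

    typedef uint32_t BlockNumber;   /* stand-in for the real typedef */

    #define CHUNK_DIVISOR       512     /* hand out at most 1/512 of what's left */
    #define MAX_CHUNK_BLOCKS    ((1024 * 1024 * 1024) / 8192)   /* 1GB in 8kB blocks */

    /*
     * Hypothetical helper: size the next chunk as 1/512 of the blocks still
     * to be scanned, clamped to [1, MAX_CHUNK_BLOCKS].  Rounding the result
     * down to a power of two would give the stepped 1GB / 512MB / 256MB / ...
     * behavior described above; this version just shrinks smoothly.  With
     * ~20TB remaining it returns 131072 blocks (1GB); with 4MB remaining,
     * a single block.
     */
    static BlockNumber
    choose_chunk_size(BlockNumber remaining_blocks)
    {
        BlockNumber chunk = remaining_blocks / CHUNK_DIVISOR;

        if (chunk > MAX_CHUNK_BLOCKS)
            chunk = MAX_CHUNK_BLOCKS;
        if (chunk < 1)
            chunk = 1;
        return chunk;
    }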
If you never give out more than 1/512 of the remaining blocks, then you can still achieve an equal work distribution as long as you don't hit a chunk whose cost relative to the others is more than 512/(# of processes you have - 1). So, for example, with 6 processes you'd need a single chunk that's more than 100x as expensive as the others to break it. That can definitely happen, because we can construct arbitrarily bad cases for this sort of thing, but hopefully they wouldn't come up all that frequently...
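(Put differently, the tolerable cost ratio for a single chunk is roughly divisor / (processes - 1); a quick check with made-up names, just restating the division above:)

    #include <stdio.h>

    int
    main(void)
    {
        int     divisor = 512;  /* never hand out more than 1/512 of what's left */

        /* worst single-chunk cost ratio we can still absorb, per process count */
        for (int nprocs = 2; nprocs <= 8; nprocs += 2)
            printf("%d processes: ~%.0fx\n", nprocs,
                   (double) divisor / (nprocs - 1));
        return 0;
    }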
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company