Re: Parallel Seq Scan vs kernel read ahead - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: Parallel Seq Scan vs kernel read ahead
Msg-id: CA+TgmoY7M4wxoLgc-+-hZV2fFHooQ0JDDYLLoWVbX8QUK5rWmg@mail.gmail.com
In response to: Re: Parallel Seq Scan vs kernel read ahead (David Rowley <dgrowleyml@gmail.com>)
List: pgsql-hackers

On Mon, Jun 15, 2020 at 5:09 PM David Rowley <dgrowleyml@gmail.com> wrote:
> To summarise what's all been proposed so far:
>
> 1. Use a constant (e.g. 64) as the parallel step size.
> 2. Ramp up the step size over time.
> 3. Ramp down the step size towards the end of the scan.
> 4. Auto-determine a good step size based on the size of the relation.
> 5. Add a GUC to allow users to control or override the step size.
> 6. Add a reloption to allow users to control or override the step size.
>
> Here are my thoughts on each of those:
>
> #1 is a bad idea as there are legitimate use-cases for using parallel
> query on small tables, e.g. calling some expensive parallel-safe
> function. Small tables are more likely to be cached.

I agree.

> #2 I don't quite understand why this is useful

I was thinking that if the query had a small LIMIT, you'd want to
avoid handing out excessively large chunks, but actually that seems
like it might just be fuzzy thinking on my part. We're not committing
to scanning the entirety of the chunk just because we've assigned it
to a worker.

> #3 I understand this is to try to make it so workers all complete
> around about the same time.
> #4 We really should be doing it this way.
> #5 Having a global knob to control something that is very specific to
> the size of a relation does not make much sense to me.
> #6. I imagine someone will have some weird use-case that works better
> when parallel workers get 1 page at a time. I'm not convinced that
> they're not doing something else wrong.

Agree with all of that.

> So my vote is for 4 with possibly 3, if we can come up with something
> smart enough * that works well in parallel.  I think there's less of a
> need for this if we divided the relation into more chunks, e.g. 8192
> or 16384.

I agree with that too.

> * Perhaps when there are less than 2 full chunks remaining, workers
> can just take half of what is left. Or more specifically
> Max(pg_next_power2(remaining_blocks) / 2, 1), which ideally would work
> out allocating an amount of pages proportional to the amount of beer
> each mathematician receives in the "An infinite number of
> mathematicians walk into a bar" joke, obviously with the exception
> that we stop dividing when we get to 1. However, I'm not quite sure
> how well that can be made to work with multiple bartenders working in
> parallel.

That doesn't sound nearly aggressive enough to me. I mean, let's
suppose that we're concerned about the scenario where one chunk takes
50x as long as all the other chunks. Well, if we have 1024 chunks
total, and we hit the problem chunk near the beginning, there will be
no problem. In effect, there are 1073 units of work instead of 1024,
and we accidentally assigned one guy 50 units of work when we thought
we were assigning 1 unit of work. If there's enough work left that we
can assign each other worker 49 units more than what we would have
done had that chunk been the same cost as all the others, then there's
no problem. So for instance if there are 4 workers, we can still even
things out if we hit the problematic chunk more than ~150 chunks from
the end. If we're closer to the end than that, there's no way to avoid
the slow chunk delaying the overall completion time, and the problem
gets worse as the problem chunk gets closer to the end.
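
To make that arithmetic concrete, here's the back-of-the-envelope
version as a tiny C program (the numbers are just the ones from the
example above, nothing that exists in the tree):

#include <stdio.h>

int main(void)
{
    int cost_ratio = 50;    /* the problem chunk costs 50x a normal chunk */
    int nworkers = 4;

    /*
     * Each of the other workers needs (cost_ratio - 1) extra normal-cost
     * chunks to finish at the same time as the worker that got stuck with
     * the expensive chunk.
     */
    int breakeven = (cost_ratio - 1) * (nworkers - 1);

    printf("need at least %d normal chunks left to even out\n", breakeven);
    return 0;               /* prints 147, i.e. the ~150 figure above */
}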

How can we improve? Well, if when we're less than 150 chunks from the
end, we reduce the chunk size by 2x, then instead of having 1 chunk
that is 50x as expensive, hopefully we'll have 2 smaller chunks that
are each 50x as expensive. They'll get assigned to 2 different
workers, and the remaining 2 workers now need enough extra work from
other chunks to even out the work distribution, which should still be
possible. It gets tough, though, if breaking the one expensive chunk
in half produces one regular-price half-chunk and one half-chunk that
is 50x as expensive as all the others. Because we have <150 chunks
left, there's no way to keep everybody else busy until the sole
expensive half-chunk completes. In a sufficiently extreme scenario,
assigning even a single full block to a worker is too much, and you
really want to hand the tuples out individually.
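
Roughly what I have in mind for the halving, purely as an illustration
(the 150-chunk threshold and the function name are just placeholders
taken from the example above, not anything in an actual patch):

static int
next_chunk_size(int chunks_remaining, int chunk_size)
{
    /*
     * Once we're inside the window where a single expensive chunk could no
     * longer be balanced out by giving the other workers extra normal-cost
     * chunks (~150 chunks in the 4-worker, 50x example above), start handing
     * out half-size chunks so that an expensive region is more likely to be
     * split across workers.
     */
    if (chunks_remaining < 150 && chunk_size > 1)
        return chunk_size / 2;
    return chunk_size;
}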

Anyway, if we don't do anything special until we get down to the last
2 chunks, it's only going to make a difference when one of those last
2 chunks happens to be the expensive one. If, say, the third-to-last
chunk is the expensive one, subdividing the last 2 chunks lets all the
workers who didn't get the expensive chunk fight over the scraps, but
that's not an improvement. If anything it's worse, because there's
more communication overhead and you don't gain anything vs. just
assigning each chunk to a worker straight up.

In a real example we don't know that we have a single expensive chunk
-- each chunk just has its own cost, and they could all be the same,
or some could be much more expensive. When we have a lot of work left,
we can be fairly cavalier in handing out larger chunks of it with the
full confidence that even if some of those chunks turn out to be way
more expensive than others, we'll still be able to equalize the
finishing times by our choice of how to distribute the remaining work.
But as there's less and less work left, I think you need to hand out
the work in smaller increments to maximize the chances of obtaining an
equal work distribution.

So maybe one idea is to base the chunk size on the number of blocks
remaining to be scanned. Say that the chunk size is limited to 1/512
of the *remaining* blocks in the relation, probably with some upper
limit. I doubt going beyond 1GB makes any sense just because that's
how large the files are, for example, and that might be too big for
other reasons. But let's just use that as an example. Say you have a
20TB relation. You hand out 1GB segments until you get down to 512GB
remaining. Then you hand out 512MB segments until you get down to
256GB remaining, and then 256MB segments until you get down to 128GB
remaining, and so on. Once you get down to the last 4MB you're handing
out individual blocks, just as you would do from the beginning if the
whole relation size was 4MB.
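
In code, that works out to roughly the following. This is only a
sketch, assuming 8kB blocks and the 1GB cap discussed above; the names
and the 1/512 divisor are just the ones from this example, not an
actual API:

#include <stdint.h>

#define BLCKSZ 8192                      /* standard 8kB block size */
#define MAX_CHUNK_BLOCKS ((1024u * 1024u * 1024u) / BLCKSZ)    /* 1GB cap */

static uint32_t
chunk_size_for(uint64_t blocks_remaining)
{
    uint64_t chunk = blocks_remaining / 512;    /* 1/512 of what's left */

    if (chunk > MAX_CHUNK_BLOCKS)
        chunk = MAX_CHUNK_BLOCKS;        /* never hand out more than 1GB */
    if (chunk < 1)
        chunk = 1;                       /* bottom out at a single block */
    return (uint32_t) chunk;
}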

This kind of thing is a bit overcomplicated, and it doesn't really help
if the first 1GB you hand out at the very beginning turns out to be the
1GB chunk of death: it takes a bazillion times longer than anything
else, and the worker that got it is just going to be the last one to
finish no matter what you do about anything else. The increasing granularity
near the end is just fighting over scraps in that case. The only thing
you can do to avoid this kind of problem is use a lower maximum chunk
size from the beginning, and I think we might want to consider doing
that, because I suspect that the incremental benefits from 64MB chunks
to 1GB chunks are pretty small, for example.

But, in more normal cases where you have some somewhat-expensive
chunks mixed in with the regular-price chunks, I think this sort of
thing should work pretty well. If you never give out more than 1/512
of the remaining blocks, then you can still achieve an equal work
distribution as long as you don't hit a chunk whose cost relative to
others is more than 512/(# of processes you have - 1). So for example
with 6 processes, you need a single chunk that's more than 100x as
expensive as the others to break it. That can definitely happen,
because we can construct arbitrarily bad cases for this sort of thing,
but hopefully they wouldn't come up all that frequently...
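
Or, expressed as a throwaway program rather than prose (again just
illustrating the 1/512 rule, with made-up names):

#include <stdio.h>

int main(void)
{
    int nprocs = 6;

    /*
     * With chunks capped at 1/512 of the remaining blocks, the scan can
     * still be evened out as long as no single chunk costs more than about
     * 512 / (nprocs - 1) times a normal chunk.
     */
    double max_ratio = 512.0 / (nprocs - 1);

    printf("tolerable single-chunk cost ratio with %d processes: ~%.0fx\n",
           nprocs, max_ratio);           /* ~102x */
    return 0;
}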

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


