Re: Parallel Seq Scan vs kernel read ahead - Mailing list pgsql-hackers

From David Rowley
Subject Re: Parallel Seq Scan vs kernel read ahead
Msg-id CAApHDvrfJfYH51_WY-iQqPw8yGR4fDoTxAQKqn+Sa7NTKEVWtg@mail.gmail.com
In response to Re: Parallel Seq Scan vs kernel read ahead  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Thu, 21 May 2020 at 14:32, Thomas Munro <thomas.munro@gmail.com> wrote:
> Thanks.  So it seems like Linux, Windows and anything using ZFS are
> OK, which probably explains why we hadn't heard complaints about it.

I tried out a different test on a Windows 8.1 machine I have here.  I
was concerned that the test used earlier in the thread produces tuples
that are too narrow, so the executor would spend quite a bit of time
going between nodes and performing the actual aggregation.  I thought
it would be good to add some padding so that there are far fewer
tuples on each page.

I ended up with:

create table t (a int, b text);
-- create a table of 100GB in size.
insert into t select x,md5(x::text) from
generate_series(1,1000000*1572.7381809)x; -- took 1 hr 18 mins
vacuum freeze t;

query = select count(*) from t;
Disk = Samsung SSD 850 EVO mSATA 1TB.

Master:
workers = 0 : Time: 269104.281 ms (04:29.104)  380MB/s
workers = 1 : Time: 741183.646 ms (12:21.184)  138MB/s
workers = 2 : Time: 656963.754 ms (10:56.964)  155MB/s

Patched:

workers = 0 : Should be the same as before, as the code for this didn't change.
workers = 1 : Time: 300299.364 ms (05:00.299) 340MB/s
workers = 2 : Time: 270213.726 ms (04:30.214) 379MB/s

(A better query would likely have been just: SELECT * FROM t WHERE a =
1; but I'd already run the test by the time I thought of that.)

So, this shows that Windows, at least 8.1, does suffer from this too.

Regarding the patch: I know you just put it together quickly, but I
don't think you can do the ramp-up the way you have. It looks like
there's a risk of torn reads and torn writes, and I'm unsure how much
that could affect the test results here. There's a risk that a worker
gets some garbage number of pages to read rather than what you think
it will. Also, I don't quite understand the need for a ramp-up in
pages per serving. Shouldn't you start at some size right away and
hold it, then perhaps only ramp down at the end so that workers all
finish at close to the same time?  However, I did have other ideas,
which I'll explain below.
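
To illustrate the torn-read point, here is a minimal sketch (not the
actual tableam code) of handing out whole chunks of blocks with a
single atomic fetch-add.  Because each worker claims an entire chunk
with one pg_atomic_fetch_add_u64() call, no worker can ever observe a
torn or overlapping block range.  All type and function names below
are made up for illustration:

#include "postgres.h"
#include "port/atomics.h"
#include "storage/block.h"

#define SCAN_CHUNK_SIZE 64          /* blocks claimed per atomic op */

typedef struct SharedScanState
{
    pg_atomic_uint64 next_block;    /* initialised with pg_atomic_init_u64 */
    BlockNumber nblocks;            /* total blocks in the relation */
} SharedScanState;

typedef struct WorkerScanState
{
    BlockNumber chunk_next;         /* next block in our current chunk */
    BlockNumber chunk_end;          /* one past the last block we own */
} WorkerScanState;

/* Return the next block this worker should scan, or InvalidBlockNumber. */
static BlockNumber
scan_next_block(SharedScanState *shared, WorkerScanState *worker)
{
    if (worker->chunk_next >= worker->chunk_end)
    {
        /* Our chunk is used up; claim a new one atomically. */
        uint64 claimed = pg_atomic_fetch_add_u64(&shared->next_block,
                                                 SCAN_CHUNK_SIZE);

        if (claimed >= shared->nblocks)
            return InvalidBlockNumber;      /* scan finished */

        worker->chunk_next = (BlockNumber) claimed;
        worker->chunk_end = (BlockNumber)
            Min(claimed + SCAN_CHUNK_SIZE, (uint64) shared->nblocks);
    }

    return worker->chunk_next++;
}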

From my previous work on that function to add the atomics, I did think
it would be better to dish out more than one page at a time. However,
there is the risk that the workload is not evenly distributed between
the workers.  My thought was that we could divide the total pages by
the number of workers, then again by 100, and dish out blocks based on
that. That way workers get about 1/100th of their fair share of pages
at once, so assuming there's an even amount of work to do per serving
of pages, the last worker should run on at most 1% longer.  Perhaps
that 100 should be 1000; then the run-on time for the last worker is
just 0.1%.  Perhaps the serving size could also be capped at some
maximum, like 64. We'll certainly need to ensure it's at least 1!  I
imagine that would eliminate the need for any ramp-down of pages per
serving near the end of the scan.
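
As a sketch of that serving-size arithmetic (the 100 divisor and the
64-block cap are just the numbers floated above, not tested values,
and choose_chunk_size is a made-up helper name):

/*
 * Illustrative only: pick a fixed serving size up front so the last
 * worker runs on by at most ~1% of the scan.
 */
static int
choose_chunk_size(BlockNumber nblocks, int nworkers)
{
    BlockNumber chunk;

    Assert(nworkers >= 1);

    chunk = nblocks / nworkers / 100;   /* ~1/100th of a fair share */

    if (chunk > 64)
        chunk = 64;                     /* cap the serving size */
    if (chunk < 1)
        chunk = 1;                      /* always at least one block */

    return (int) chunk;
}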

David


