RE: Parallel Seq Scan vs kernel read ahead - Mailing list pgsql-hackers

From k.jamison@fujitsu.com
Subject RE: Parallel Seq Scan vs kernel read ahead
Msg-id OSBPR01MB2341378F27AD79873177F853EF780@OSBPR01MB2341.jpnprd01.prod.outlook.com
In response to Re: Parallel Seq Scan vs kernel read ahead  (Amit Kapila <amit.kapila16@gmail.com>)
List pgsql-hackers
On Tuesday, July 21, 2020 12:18 PM, Amit Kapila wrote:
> On Tue, Jul 21, 2020 at 8:06 AM k.jamison@fujitsu.com <k.jamison@fujitsu.com>
> wrote:
> >
> > Thank you for the advice. I repeated the test as per your advice, taking
> > the average of 3 runs per planned number of workers.
> > It still shows similar performance results between Master and Patch V2.
> > I wonder why there's no difference though.
> >
> > The test on my machine is roughly like this:
> >
> > createdb test
> > psql -d test
> > create table t_heap as select generate_series(1, 100000000) i;
> > \q
> >
> > pg_ctl restart
> > psql -d test
> > SET track_io_timing = on;
> > SET max_parallel_workers_per_gather = 0;
> > SHOW max_parallel_workers_per_gather;
> > EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;
> > \timing
> > SELECT count(*) from t_heap;
> >
> > drop table t_heap;
> > \q
> > dropdb test
> > pg_ctl restart
> >
> > Below are the results. Again, almost no discernible difference between
> > the master and patch.
> > Also, the results when max_parallel_workers_per_gather is more than 4
> > could be inaccurate due to my machine's limitation of only having 4
> > CPUs. Even so, the query planner capped it at 6 workers.
> >
> > Query Planner I/O Timings (track_io_timing = on) in ms:
> > | Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
> > |--------|-------------------|------------------|--------------------|-------------------|
> > | 0      | 1,130.777         | 1,250.821        | 1,698.051          | 1,733.439         |
> > | 1      | 1,603.016         | 1,660.767        | 2,312.248          | 2,291.661         |
> > | 2      | 2,036.269         | 2,107.066        | 2,698.216          | 2,796.893         |
> > | 3      | 2,298.811         | 2,307.254        | 5,695.991          | 5,894.183         |
> > | 4      | 2,098.642         | 2,135.960        | 23,837.088         | 26,537.158        |
> > | 5      | 1,956.536         | 1,997.464        | 45,891.851         | 48,049.338        |
> > | 6      | 2,201.816         | 2,219.001        | 61,937.828         | 67,809.486        |
> >
> > Query Planner Execution Time (ms):
> > | Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
> > |--------|-----------------------|----------------------|
> > | 0      | 40,454.252            | 40,521.578           |
> > | 1      | 21,332.067            | 21,205.068           |
> > | 2      | 14,266.756            | 14,385.539           |
> > | 3      | 11,597.936            | 11,722.055           |
> > | 4      | 12,937.468            | 13,439.247           |
> > | 5      | 14,383.083            | 14,782.866           |
> > | 6      | 14,671.336            | 15,507.581           |
> >
> > Based on the results above, the I/O latency increases as the number of
> > workers increases. Despite that, the query planner execution time stays
> > roughly the same (11~14 s) when 2 or more workers are used. The results
> > are the same between Master and Patch V2.
> >
> > As for buffers, the same results are shown per worker (both Master and Patch).
> > | Worker | Buffers                                          |
> > |--------|--------------------------------------------------|
> > | 0      | shared read=442478 dirtied=442478 written=442446 |
> > | 1      | shared read=442478 dirtied=442478 written=442414 |
> > | 2      | shared read=442478 dirtied=442478 written=442382 |
> > | 3      | shared read=442478 dirtied=442478 written=442350 |
> > | 4      | shared read=442478 dirtied=442478 written=442318 |
> > | 5      | shared read=442478 dirtied=442478 written=442286 |
> > | 6      | shared read=442478 dirtied=442478 written=442254 |
> >
> >
> > SQL Query Execution Time (ms):
> > | Worker | SQL (Master) | SQL (Patch) |
> > |--------|--------------|-------------|
> > | 0      | 10,418.606   | 10,377.377  |
> > | 1      | 5,427.460    | 5,402.727   |
> > | 2      | 3,662.998    | 3,650.277   |
> > | 3      | 2,718.837    | 2,692.871   |
> > | 4      | 2,759.802    | 2,693.370   |
> > | 5      | 2,761.834    | 2,682.590   |
> > | 6      | 2,711.434    | 2,726.332   |
> >
> > The SQL query execution time clearly benefited from the preceding
> > EXPLAIN ANALYZE run (the data was already cached), so the results are
> > faster. But again, both Master and Patched have almost the same results.
> > Nonetheless, the execution time is almost constant when
> > max_parallel_workers_per_gather is 2 (the default) and above.
> >
> > I am definitely missing something. I cannot understand why there's no
> > I/O difference between the Master and Patched (V2). Or has this already
> > been improved even without the patch?
> >
> 
> I don't think it is strange that you are not seeing much difference because, as
> per the initial email by Thomas, this patch is not supposed to give benefits on
> all systems.  I think we wanted to check that the patch should not regress
> performance in cases where it doesn't give benefits.  I think it might be okay
> to run with a higher number of workers than you have CPUs in the system, as we
> wanted to check if such cases regress, as shown by Soumyadeep above [1].  Can
> you try once with 8 and/or 10 workers as well?
> 
> [1] - https://www.postgresql.org/message-id/CADwEdoqirzK3H8bB%3DxyJ192EZCNwGfcCa_WJ5GHVM7Sv8oenuA%40mail.gmail.com

You are right, and kindly excuse me on that part. It only means the patch may or
may not have benefits on the filesystem I am using. For other filesystems, as we
can see from David's benchmarks, there are significant benefits.

Following your advice on the regression test case, I increased the number of
workers, but the query planner still capped the workers at 6, so the results
from 6 workers onwards are almost the same.
As my test results below reconfirm, I don't see a significant difference between
master and patched on my machine.
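
(As an aside, for anyone who wants to push past that cap: I believe the limit
comes from the planner's table-size heuristic together with max_parallel_workers,
which defaults to 8. A rough sketch of how it could be lifted, which I have not
re-run myself, would be:

ALTER TABLE t_heap SET (parallel_workers = 32);  -- per-table override of the planner's heuristic
SET max_parallel_workers = 32;                   -- raise the global parallel-worker cap (default 8)
SET max_parallel_workers_per_gather = 32;
EXPLAIN (ANALYZE, BUFFERS) SELECT count(*) from t_heap;

Note that max_worker_processes, which also defaults to 8, may need raising as
well, and that one requires a server restart.)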

Query Planner I/O Timings (ms):
| Worker | I/O READ (Master) | I/O READ (Patch) | I/O WRITE (Master) | I/O WRITE (Patch) |
|--------|-------------------|------------------|--------------------|-------------------|
| 0      | 1,130.78          | 1,250.82         | 1,698.05           | 1,733.44          |
| 1      | 1,603.02          | 1,660.77         | 2,312.25           | 2,291.66          |
| 2      | 2,036.27          | 2,107.07         | 2,698.22           | 2,796.89          |
| 3      | 2,298.81          | 2,307.25         | 5,695.99           | 5,894.18          |
| 4      | 2,098.64          | 2,135.96         | 23,837.09          | 26,537.16         |
| 5      | 1,956.54          | 1,997.46         | 45,891.85          | 48,049.34         |
| 6      | 2,201.82          | 2,219.00         | 61,937.83          | 67,809.49         |
| 8      | 2,117.80          | 2,169.67         | 60,671.22          | 68,676.36         |
| 16     | 2,052.73          | 2,134.86         | 60,635.17          | 66,462.82         |
| 32     | 2,036.00          | 2,200.98         | 60,833.92          | 67,702.49         |

Query Planner Execution Time (ms):
| Worker | QueryPlanner (Master) | QueryPlanner (Patch) |
|--------|-----------------------|----------------------|
| 0      | 40,454.25             | 40,521.58            |
| 1      | 21,332.07             | 21,205.07            |
| 2      | 14,266.76             | 14,385.54            |
| 3      | 11,597.94             | 11,722.06            |
| 4      | 12,937.47             | 13,439.25            |
| 5      | 14,383.08             | 14,782.87            |
| 6      | 14,671.34             | 15,507.58            |
| 8      | 14,679.50             | 15,615.69            |
| 16     | 14,474.78             | 15,274.61            |
| 32     | 14,462.11             | 15,470.68            |

Buffers:
| Worker | Buffers                                          |
|--------|--------------------------------------------------| 
| 0      | shared read=442478 dirtied=442478 written=442446 | 
| 1      | shared read=442478 dirtied=442478 written=442414 | 
| 2      | shared read=442478 dirtied=442478 written=442382 | 
| 3      | shared read=442478 dirtied=442478 written=442350 | 
| 4      | shared read=442478 dirtied=442478 written=442318 | 
| 5      | shared read=442478 dirtied=442478 written=442286 | 
| 6      | shared read=442478 dirtied=442478 written=442254 | 
| 8      | shared read=442478 dirtied=442478 written=442254 | 
| 16     | shared read=442478 dirtied=442478 written=442254 | 
| 32     | shared read=442478 dirtied=442478 written=442254 |

I also re-ran the query and measured the execution time (ms) with \timing:
| Worker | SQL (Master) | SQL (Patch) | 
|--------|--------------|-------------| 
| 0      | 15476.458    | 15278.772   | 
| 1      | 8292.702     | 8426.435    | 
| 2      | 6256.673     | 6232.456    | 
| 3      | 6357.217     | 6340.013    | 
| 4      | 7591.311     | 7913.881    | 
| 5      | 8165.315     | 8070.592    | 
| 6      | 8065.578     | 8200.076    | 
| 8      | 7988.302     | 8609.138    | 
| 16     | 8025.170     | 8469.895    | 
| 32     | 8019.393     | 8645.150    |

Again, tested on:
XFS filesystem, HDD virtual machine, 8GB RAM,
RHEL4, 64-bit,
4 CPUs, Intel Core Processor (Haswell, IBRS),
PostgreSQL 14devel on x86_64-pc-linux-gnu
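
(Since the effect here hinges on kernel read ahead, the VM's readahead setting
is probably worth recording as well. A quick way to check it, assuming the data
directory sits on /dev/sda, which is just my guess for this VM:

blockdev --getra /dev/sda                 # readahead window, in 512-byte sectors
cat /sys/block/sda/queue/read_ahead_kb    # same setting, reported in kB

I did not capture this value for the runs above.)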

So I guess the patch simply does not benefit the filesystem I am using, and
since there is no regression either, I think it's OK.

Kind regards,
Kirk Jamison

