Thread: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
Huge performance penalty with parallel queries in Windows x64 v. Linux x64
From
Hans Buschmann
Date:
When developing a solution for a new customer request I created a new query over the production data. Despite the relatively low row counts of the involved tables (all < 100k) I noticed quite a long execution time of about85 ms to 100 ms. The explain anaylze plan showed a parallel execution plan with 2 parallels. The data structure and index structure was not quite optimal for this kind of query (which does not matter in this case). The comparison of the explain analyze plans on win-x64 and Linux x64 showed about 3 times longer execution on windows. For comparison I reinstalled the production data on two test databases on different virtual machines on the same hardware(the very same machine with Hyper-V virtualization). The steps were only (on a mostly complete idle machine): 1. create test database 2. pg_restore of the production data from same dump file 3. analyze on the database 4. run the query multiple times (about 5 times) and took the fastest explain analyze. On fedora 34 64 bit, PG 13.2 unmodified self compiled the query took about 33 ms. On Windows Server 2019 64 bit, PG 13.2 from EDB download packages the query took about 85 ms. version ---------------------------------------------------------------------------------------------------------- PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.1.1 20210428 (Red Hat 11.1.1-1), 64-bit version ------------------------------------------------------------ PostgreSQL 13.2, compiled by Visual C++ build 1914, 64-bit The corresponding explain plans are available at explain.depesz.com -- fedora https://explain.depesz.com/s/Mq3P -- windows https://explain.depesz.com/s/VLtZ The main difference is the time shown for the Gather Merge step (65 ms vs. 7 ms) The explain plans and the non-standard GUCs are included in the attachments, the configuration for the databases seems quitecomparable. Unfortunately I cannot disclose the query and the table data. My experience with parallel queries is not very wide, but this massive execution time difference of the exact same queryon the exact same data on the exact same hardware with the same, unmodified last stable Postgres version is very astonishing. BTW I generally observed slower execution under Windows, so production has moved now to Linux. There seem no relevant GUC differences concerning query execution, so the performance penalty of 300% to 900% (one step only)is not easily explainable. The databases remain on the system to repeat the queries on request in the queue of further investigation. Thanks for looking. Hans Buschmann
Attachment
Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
From
Thomas Munro
Date:
On Tue, May 4, 2021 at 4:05 AM Hans Buschmann <buschmann@nidsa.net> wrote: > The main difference is the time shown for the Gather Merge step (65 ms vs. 7 ms) No Windows here, but could it be super slow at launching workers? How does a trivial parallel query compare, something like? SET force_parallel_mode = on; EXPLAIN ANALYZE SELECT 42;
AW: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
From
Hans Buschmann
Date:
>No Windows here, but could it be super slow at launching workers? How >does a trivial parallel query compare, something like? >SET force_parallel_mode = on; >EXPLAIN ANALYZE SELECT 42; indeed this query takes about 40ms in windows and 7ms on Linux (lowest values). Due to remoting the machine the reported times vary quite a bit. The problem seems that this (probably inherent) performance disadvantage of windows is not reflected in the cost model. This causes little to middle complex queries to prioritize parallel execution on windows which is certainly not the bestoption in these cases. The starting of processes should have an adequate cost penalty to guide the planner in the right direction. Generally disabling parallel queries seems not a viable option with mixed loads. Here are the query plans: QUERY PLAN Windows -------------------------------------------------------------------------------------------- Gather (cost=1000.00..1000.11 rows=1 width=4) (actual time=34.995..38.207 rows=1 loops=1) Workers Planned: 1 Workers Launched: 1 Single Copy: true -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=1) Planning Time: 0.016 ms Execution Time: 39.136 ms (7 Zeilen) QUERY PLAN Linux ------------------------------------------------------------------------------------------ Gather (cost=1000.00..1000.11 rows=1 width=4) (actual time=6.864..7.764 rows=1 loops=1) Workers Planned: 1 Workers Launched: 1 Single Copy: true -> Result (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1) Planning Time: 0.026 ms Execution Time: 7.812 ms (7 rows) Hans Buschmann
Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
From
Thomas Munro
Date:
On Tue, May 4, 2021 at 7:40 PM Hans Buschmann <buschmann@nidsa.net> wrote: > The problem seems that this (probably inherent) performance disadvantage of windows is not reflected in the cost model. https://www.postgresql.org/docs/13/runtime-config-query.html#GUC-PARALLEL-SETUP-COST is for that. It might be interesting to do some profiling to see exactly what is slow. Presumably CreateProcess(), but I wonder why exactly. It'd be nice if we had a way to reuse parallel workers, but alas we do not, yet. Or threads.
AW: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
From
Hans Buschmann
Date:
Thank you Thomas for pointing me to this GUC which I haven't realized before. From the documentation I take that a cost of 1.0 is set for a sequential page fetch. In my opinion, even for Linux the default for parallel_setup_cost is set too low (1000). It should reflect the sequentialaccess of 1000 pages, which normally is faster from buffer cache on modern hardware. For Windows, these costs are much higher, so I would propose to set the default to at least 10000, perhaps 25000 to reflectthe real parallel overhead. (BTW: Is this cost multiplied by the real count of workers choosen (max_parallel_workers_per_gather) or only a value independentof the number of workers?. This would matter in windows-high-parallel scenarios) The inadequate default gives more and more slower-then-necessary plans when people are moving to newer PG versions with goodparallel support. For them it's like for me a little surprise, which most won't even notice or remedy nor full understand. For bigger installations the knowledge of query tuning is more probable and people can react on their real situation. Perhaps someone with more knowledge with parallel queries can make some profiling / performance tests to justify my proposals(e.g. what is the sequential page access equivalent of 40 ms on selected platforms): New defaults proposal: -- Linux and comparable architectures with fast process creation: parallel_setup_cost = 2500 -- Windows parallel_setup_cost = 25000 Thanks Hans Buschmann
Re: AW: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
From
Tom Lane
Date:
Hans Buschmann <buschmann@nidsa.net> writes: > In my opinion, even for Linux the default for parallel_setup_cost is set too low (1000). It should reflect the sequentialaccess of 1000 pages, which normally is faster from buffer cache on modern hardware. I tend to agree with you that the default costs for parallel queries are far too optimistic --- we've seen enough complaints about parallelism making things slower to make that clear. However, proposing some new numbers without gathering some hard evidence to support them is not helpful. regards, tom lane
Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
From
Thomas Munro
Date:
On Wed, May 5, 2021 at 3:50 AM Hans Buschmann <buschmann@nidsa.net> wrote: > (BTW: Is this cost multiplied by the real count of workers choosen (max_parallel_workers_per_gather) or only a value independentof the number of workers?. This would matter in windows-high-parallel scenarios) It's not multiplied: https://github.com/postgres/postgres/blob/50e17ad281b8d1c1b410c9833955bc80fbad4078/src/backend/optimizer/path/costsize.c#L398 It might be interesting to know how that 40ms time scales as you add more workers. For what it's worth, I see that the following query takes around about ~6ms + ~1.5ms per worker on my FreeBSD machine, and on Linux it's harder to pin down, varying a lot, usually a bit slower (sorry I didn't have time to do proper statistics). create table t (); alter table t set (parallel_workers=8); set min_parallel_table_scan_size = 0; set parallel_setup_cost = 0; set parallel_tuple_cost = 0; set max_parallel_workers_per_gather = 1; explain analyze select count(*) from t; ... set max_parallel_workers_per_gather = 7; explain analyze select count(*) from t;
Re: Huge performance penalty with parallel queries in Windows x64 v. Linux x64
From
Thomas Munro
Date:
On Wed, May 5, 2021 at 2:12 PM Thomas Munro <thomas.munro@gmail.com> wrote: > It might be interesting to know how that 40ms time scales as you add > more workers. ... Another thought: I'd also try tests like that in large databases (ie large virtual memory) vs small ones, and with and without huge/locked memory pages configured[1], to see how much of the overhead is due to virtual memory work (though I know nothing about Windows VM, it's just an idea). [1] https://www.postgresql.org/docs/13/runtime-config-resource.html#GUC-HUGE-PAGES