Thread: Huge performance penalty with parallel queries in Windows x64 v. Linux x64

From: Hans Buschmann
When developing a solution for a new customer request I created a new query over the production data.

Despite the relatively low row counts of the involved tables (all < 100k) I noticed quite a long execution time of about 85 ms to 100 ms.

The explain analyze plan showed parallel execution with two workers.

The data and index structures were not optimal for this kind of query (which does not matter in this case).

The comparison of the explain analyze plans on Windows x64 and Linux x64 showed about three times longer execution on Windows.

For comparison I restored the production data into two test databases on different virtual machines on the same hardware (the very same machine, virtualized with Hyper-V).

The steps were only (on a mostly idle machine):
1. create test database
2. pg_restore of the production data from same dump file
3. analyze on the database
4. run the query multiple times (about 5 times) and take the fastest explain analyze

On Fedora 34 (64-bit), with an unmodified, self-compiled PG 13.2, the query took about 33 ms.
On Windows Server 2019 (64-bit), with PG 13.2 from the EDB download packages, the query took about 85 ms.

                                                 version
----------------------------------------------------------------------------------------------------------
 PostgreSQL 13.2 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 11.1.1 20210428 (Red Hat 11.1.1-1), 64-bit

                          version
------------------------------------------------------------
 PostgreSQL 13.2, compiled by Visual C++ build 1914, 64-bit

The corresponding explain plans are available at explain.depesz.com:

-- fedora
https://explain.depesz.com/s/Mq3P

-- windows
https://explain.depesz.com/s/VLtZ

The main difference is the time shown for the Gather Merge step (65 ms vs. 7 ms).

The explain plans and the non-standard GUCs are included in the attachments; the configuration of the two databases seems quite comparable.

Unfortunately I cannot disclose the query and the table data.

My experience with parallel queries is not very broad, but this massive execution-time difference for the exact same query on the exact same data on the exact same hardware, with the same unmodified latest stable Postgres version, is very astonishing.

BTW, I have generally observed slower execution under Windows, so production has now moved to Linux.

There seem to be no relevant GUC differences concerning query execution, so the performance penalty of 300% to 900% (for one step alone) is not easily explained.

The databases remain on the system, so the queries can be repeated on request in the course of further investigation.


Thanks for looking.

Hans Buschmann


On Tue, May 4, 2021 at 4:05 AM Hans Buschmann <buschmann@nidsa.net> wrote:
> The main difference is the time shown for the Gather Merge step (65 ms vs. 7 ms)

No Windows here, but could it be super slow at launching workers?  How
does a trivial parallel query compare, something like?

SET force_parallel_mode = on;
EXPLAIN ANALYZE SELECT 42;



> No Windows here, but could it be super slow at launching workers?  How
> does a trivial parallel query compare, something like?

> SET force_parallel_mode = on;
> EXPLAIN ANALYZE SELECT 42;

Indeed, this query takes about 40 ms on Windows and 7 ms on Linux (lowest values).

Because I access the machine remotely, the reported times vary quite a bit.

The problem seems to be that this (probably inherent) performance disadvantage of Windows is not reflected in the cost model.

This causes queries of low to middling complexity to be planned for parallel execution on Windows, which is certainly not the best option in these cases.

Starting worker processes should carry an adequate cost penalty to guide the planner in the right direction.

Globally disabling parallel queries does not seem a viable option with mixed loads.
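
(As a stopgap, parallelism could be switched off only where it hurts rather than globally; a minimal sketch, where the role name is hypothetical:)

SET max_parallel_workers_per_gather = 0;                            -- this session: no parallel plans
ALTER ROLE reporting_user SET max_parallel_workers_per_gather = 0;  -- scoped to one role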

Here are the query plans:

                                         QUERY PLAN Windows
--------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1000.11 rows=1 width=4) (actual time=34.995..38.207 rows=1 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.004..0.004 rows=1 loops=1)
 Planning Time: 0.016 ms
 Execution Time: 39.136 ms
(7 rows)

                                        QUERY PLAN Linux
------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..1000.11 rows=1 width=4) (actual time=6.864..7.764 rows=1 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   Single Copy: true
   ->  Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.003 rows=1 loops=1)
 Planning Time: 0.026 ms
 Execution Time: 7.812 ms
(7 rows)


Hans Buschmann


On Tue, May 4, 2021 at 7:40 PM Hans Buschmann <buschmann@nidsa.net> wrote:
> The problem seems to be that this (probably inherent) performance disadvantage of Windows is not reflected in the cost model.

https://www.postgresql.org/docs/13/runtime-config-query.html#GUC-PARALLEL-SETUP-COST
is for that.
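
For example, it can be raised per session for experimentation, or cluster-wide; a sketch, with an arbitrary value:

SET parallel_setup_cost = 10000;               -- this session only
ALTER SYSTEM SET parallel_setup_cost = 10000;  -- whole cluster
SELECT pg_reload_conf();                       -- no restart needed for this GUC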

It might be interesting to do some profiling to see exactly what is
slow.  Presumably CreateProcess(), but I wonder why exactly.

It'd be nice if we had a way to reuse parallel workers, but alas we do
not, yet.  Or threads.



Thank you Thomas for pointing me to this GUC, which I hadn't been aware of before.

From the documentation I take it that a cost of 1.0 corresponds to one sequential page fetch (seq_page_cost).

In my opinion, even on Linux the default for parallel_setup_cost (1000) is set too low. It corresponds to the sequential access of 1000 pages, which from the buffer cache is normally faster on modern hardware than the real worker setup.
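
(For reference, the relevant defaults can be read straight from pg_settings; the values below are the stock PG 13 defaults:)

SELECT name, setting FROM pg_settings
 WHERE name IN ('seq_page_cost', 'parallel_setup_cost', 'parallel_tuple_cost');
-- seq_page_cost       | 1     (the unit the other costs are expressed in)
-- parallel_setup_cost | 1000  (i.e. the equivalent of 1000 sequential page fetches)
-- parallel_tuple_cost | 0.1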

On Windows these costs are much higher, so I would propose setting the default to at least 10000, perhaps 25000, to reflect the real parallel setup overhead.

(BTW: Is this cost multiplied by the actual number of workers chosen (max_parallel_workers_per_gather), or is it a single value independent of the number of workers? This would matter in highly parallel scenarios on Windows.)

The inadequate default produces more and more slower-than-necessary plans as people move to newer PG versions with good parallel support. For them, as for me, it comes as a little surprise, one that most won't even notice, remedy, or fully understand.

In bigger installations query-tuning knowledge is more likely to be at hand, and people can react to their actual situation.

Perhaps someone with more parallel-query knowledge can run some profiling / performance tests to justify my proposals (e.g. what is the sequential-page-access equivalent of 40 ms on selected platforms?).

New defaults proposal:
-- Linux and comparable architectures with fast process creation:
parallel_setup_cost  =  2500

-- Windows
parallel_setup_cost = 25000
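
(A crude way to measure the overhead these numbers should reflect: compare a forced trivial parallel query against its serial form, running each several times and taking the best. Assuming a cached sequential page fetch takes on the order of 1-2 µs, the 40 ms measured on Windows would correspond to roughly 20000-40000 cost units:)

SET force_parallel_mode = on;
EXPLAIN ANALYZE SELECT 42;   -- execution time ~ one worker launch + teardown
SET force_parallel_mode = off;
EXPLAIN ANALYZE SELECT 42;   -- serial baseline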


Thanks

Hans Buschmann


Hans Buschmann <buschmann@nidsa.net> writes:
> In my opinion, even on Linux the default for parallel_setup_cost (1000) is set too low. It corresponds to the sequential access of 1000 pages, which from the buffer cache is normally faster on modern hardware than the real worker setup.

I tend to agree with you that the default costs for parallel queries
are far too optimistic --- we've seen enough complaints about parallelism
making things slower to make that clear.  However, proposing some new
numbers without gathering some hard evidence to support them is not
helpful.

            regards, tom lane



On Wed, May 5, 2021 at 3:50 AM Hans Buschmann <buschmann@nidsa.net> wrote:
> (BTW: Is this cost multiplied by the actual number of workers chosen (max_parallel_workers_per_gather), or is it a single value independent of the number of workers? This would matter in highly parallel scenarios on Windows.)

It's not multiplied:


https://github.com/postgres/postgres/blob/50e17ad281b8d1c1b410c9833955bc80fbad4078/src/backend/optimizer/path/costsize.c#L398

It might be interesting to know how that 40 ms scales as you add more
workers.  For what it's worth, I see that the following query takes
about ~6 ms + ~1.5 ms per worker on my FreeBSD machine; on Linux it's
harder to pin down, varying a lot and usually a bit slower (sorry, I
didn't have time to do proper statistics).

create table t ();
alter table t set (parallel_workers=8);
set min_parallel_table_scan_size = 0;
set parallel_setup_cost = 0;
set parallel_tuple_cost = 0;

set max_parallel_workers_per_gather = 1;
explain analyze select count(*) from t;

...

set max_parallel_workers_per_gather = 7;
explain analyze select count(*) from t;



On Wed, May 5, 2021 at 2:12 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> It might be interesting to know how that 40ms time scales as you add
> more workers.  ...

Another thought: I'd also try tests like that in large databases (i.e.
large virtual memory) vs small ones, and with and without huge/locked
memory pages configured[1], to see how much of the overhead is due to
virtual memory work (though I know nothing about Windows VM, it's just
an idea).

[1] https://www.postgresql.org/docs/13/runtime-config-resource.html#GUC-HUGE-PAGES
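
A sketch of the settings for such a comparison (huge_pages takes effect only after a server restart; on Windows it maps to large pages and requires the Lock Pages in Memory privilege):

ALTER SYSTEM SET huge_pages = 'on';   -- fail at startup if huge pages are unavailable
ALTER SYSTEM SET huge_pages = 'off';  -- force ordinary pages for the comparison run
-- 'try' is the default: use huge pages if possible, fall back silently otherwise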