Thread: Re: COPY performance on Windows

Re: COPY performance on Windows

From
Robert Haas
Date:
Hello Takahashi-san,

I am reluctant to draw conclusions about the general performance of
this patch from one example. It seems that the performance could
depend on many things: table size, column definitions, row width,
hardware, OS version, shared_buffers, max_wal_size. I don't think we
can say from your test here that performance is always worse on
Windows. If it is, then I agree that we should think of what to do
about it; but it seems possible to me that the behavior will be
different in other circumstances.

What I don't understand is why increasing the number of blocks should
be worse. The comment before the FileZero() call has a comment
explaining why a larger extension is thought to be better. If it's
wrong, we should try to figure out why it's wrong. But it seems quite
surprising that doing more work at once would be less efficient.
That's not usually how things work.

--
Robert Haas
EDB: http://www.enterprisedb.com



RE: COPY performance on Windows

From
"Ryohei Takahashi (Fujitsu)"
Date:
Hi,


Thank you for your reply.


I don't want to "speed up" the COPY command.
I just want to "prevent speed down" compared with PG16.

But anyway, my current analysis is not convincing.
So, I will do more analysis and get back to you.


Regards,
Ryohei Takahashi


Re: COPY performance on Windows

From
Vladlen Popolitov
Date:
Ryohei Takahashi (Fujitsu) писал(а) 2024-12-11 15:18:
> Hi,
> 
> 
> I continuously investigate the performance problem of COPY on Windows.
> 
> I noticed that not only PG17.0 but also PG16.6 have performance problem 
> compared to PG16.4.
> The performance is 2.5%-5.8% worse, especially when the number of 
> clients is 1 or 2.
> 
> I modified the performance measurement script of the thread in [1].
> * Enabled to run on Windows git bash
> * Enabled to compare PG16.4, PG16.6 and PG17.0
> * Increase the row number to 10 times (about 10GB)
> 
> I measured on Windows Server 2022 machine with 44 core CPU and 512GB 
> memory.
> The results are following.
> 

> 
> (1)
> I attach the performance measurement script.
> If you have a Windows environment, could you please reproduce the same 
> performance problem?
> 
> (2)
> The performance of PG16.6 and PG17.0 are worse than PG16.4.
> So, I think the commits between August and September affects the 
> performance.
> I will analyze these commits.
> 
> 
> [1] 
> https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com
> 
> 
> Regards,
> Ryohei Takahashi
COPY FROM code had changes with DEFAULT option and numbers conversion, 
and you suspect disk read/writes also.
I propose to isolate read-write case. You could create RAM DISK in 
Windows server
(Microsoft has documentation on its site how to configure it, googling I 
found this but not tested it
https://learn.microsoft.com/ru-ru/archive/blogs/windowsinternals/how-to-create-a-ram-disk-in-windows-server 
).
You could create database on RAM disk and make benchmarks.
You will exclude disk influence, and probably other changes affected 
performance.
The number conversion patch was tested on many processors, but not on 
all. Probably you run Windows
on one that has worse performance or other issues possible.

P.S. Could you point where is the performance measurement script you 
mantioned in you email?
I will try to play with it.

-- 
Best regards,

Vladlen Popolitov.



RE: COPY performance on Windows

From
"Ryohei Takahashi (Fujitsu)"
Date:
Hi,


Thank you for your reply.
I tried your patch and report in this e-mail.

> 1.  Those code paths finish up in pg_pwritev(), but it has a loop over
> 8kb writes on Windows.  Does it help if we just make "zbuffer" bigger?
>  How big?

This patch improves the performance.

I applied 0001-Use-bigger-writes-in-pg_pwrite_zeros-on-Windows.patch over REL_16_6.
I changed the value "zbuffer" from 2 to 32.
I measured with nclients = 1.

16.6:                      453s
16.6 + patch (zbuffer = 2):  442s
16.6 + patch (zbuffer = 4):  434s
16.6 + patch (zbuffer = 8):  430s
16.6 + patch (zbuffer = 16): 429s
16.6 + patch (zbuffer = 32): 428s


Performance improved up to 8KB and remained stable after that.


> 2.  While pondering the goals of posix_fallocate(), I had a
> realisation about how we might implement FileFallocate() on Windows.
> Does this idea work?  Well?

This patch degrades the performance.

16.6:         453s
16.6 + patch: 479s


Regards,
Ryohei Takahashi


RE: COPY performance on Windows

From
"Ryohei Takahashi (Fujitsu)"
Date:
Hi,


Thank you for your interest in this thread.


> You could create database on RAM disk and make benchmarks.

According to your advice, I created RAM disk and put input files and data directory on RAM disk.
But the result changed only a few seconds.

In this test case, the table is unlogged table and shared_buffers is enough.
So, I think the disk performance does not affect so much.


> P.S. Could you point where is the performance measurement script you
> mantioned in you email?
> I will try to play with it.

Thank you.
Please use the "test.sh" in the following e-mail.

https://www.postgresql.org/message-id/flat/TY3PR01MB11891C0FD066F069B113A2376823E2%40TY3PR01MB11891.jpnprd01.prod.outlook.com#8455c9f7b66780a356511f5cfe029d57


Regards,
Ryohei Takahashi


Re: COPY performance on Windows

From
Vladlen Popolitov
Date:
Ryohei Takahashi (Fujitsu) писал(а) 2024-12-16 15:10:

Hi
> According to your advice, I created RAM disk and put input files and 
> data directory on RAM disk.
> But the result changed only a few seconds.
> In this test case, the table is unlogged table and shared_buffers is 
> enough.
> So, I think the disk performance does not affect so much.
If test on RAM drive got the same result, it could mean, that other 
operations affect performance (not disk).
It is only idea, that numeric conversion gives some increase in time due 
to new functionality added.
I think, it could be checked, if table has text fields instead of 
numeric - we could exclude numeric conversion
and have the same input-output operations (really more IO-operation, but 
we need to compare)

> Please use the "test.sh" in the following e-mail.
>
https://www.postgresql.org/message-id/flat/TY3PR01MB11891C0FD066F069B113A2376823E2%40TY3PR01MB11891.jpnprd01.prod.outlook.com#8455c9f7b66780a356511f5cfe029d57
OK, I will use it.

By the way, do you use prebuild Postgres versions for this test or
build it by yourself with the same options? I am going to use built 
myself.

-- 
Best regards,

Vladlen Popolitov.



Re: COPY performance on Windows

From
Vladlen Popolitov
Date:
Ryohei Takahashi (Fujitsu) писал(а) 2024-12-16 15:10:
Hi
> Please use the "test.sh" in the following e-mail.
>
https://www.postgresql.org/message-id/flat/TY3PR01MB11891C0FD066F069B113A2376823E2%40TY3PR01MB11891.jpnprd01.prod.outlook.com#8455c9f7b66780a356511f5cfe029d57
I cannot reproduce your results. In all of my runs final result depends 
on run order -
benchmark for first versin get higher time, than time is smaller,
f.e. my last run (in start time order, time is in seconds):
PG164: nclients = 1, time  = 251
PG164: nclients = 2, time  = 210
PG164: nclients = 4, time  = 126
PG164: nclients = 8, time  = 107
PG164: nclients = 16, time  = 99
PG164: nclients = 32, time  = 109
PG164: nclients = 64, time  = 112
PG164: nclients = 128, time  = 113
PG164: nclients = 256, time  = 120
PG166: nclients = 1, time  = 244
PG166: nclients = 2, time  = 222
PG166: nclients = 4, time  = 131
PG166: nclients = 8, time  = 109
PG166: nclients = 16, time  = 101
PG166: nclients = 32, time  = 110
PG166: nclients = 64, time  = 115
PG166: nclients = 128, time  = 116
PG166: nclients = 256, time  = 123
PG170: nclients = 1, time  = 240
PG170: nclients = 2, time  = 213
PG170: nclients = 4, time  = 129
PG170: nclients = 8, time  = 110
PG170: nclients = 16, time  = 101
PG170: nclients = 32, time  = 112
PG170: nclients = 64, time  = 115
PG170: nclients = 128, time  = 116
PG170: nclients = 256, time  = 122


I slightly modified your script:
1) exclude creation of input files to the separate step to decrease 
influence of system disk cache.
2) run PostgreSQL servers on separate PC (Windows 10, 11th Gen Intel(R) 
Core(TM) i5-1135G7 @ 2.40GHz , RAM 16GB),
clients on separate PC
3) I added CHECKPOINT in the end of every COPY FROM to flush wal.
4) I used EDB build for Windows from their site. Unfortunatelly, they 
distribute
files without debug symbols like other distributions, it does not help 
during profiling.
5) I think, that better to decrease shared_buffers as small as possible 
to measure all IO time,
but I used 25% of RAM.


My observations
1) for 1-2 clients read time decreases every run (independent on 
Postgres version) -
looks like Windows disk cache (I think, HTFS system information like 
btree of file locations,
not the input file itself) - it contradicts to your main point, that 
17.0 version is slower.

2) 1-client - Postgres backend takes only 12% of CPU, the rest time it 
waits kernel operations.

3) 16-256 clients - I have not made any analysis of multiprocessor 
effect to time increase:
OS process implementation, waiting on PostgreSQL locks or spinlocks, 
parallel access to one
input file or other factors.

Could you confirm, that you receive you results on all execution orders 
(17.0 first and 17.0 last)?


-- 
Best regards,

Vladlen Popolitov.



RE: COPY performance on Windows

From
"Ryohei Takahashi (Fujitsu)"
Date:
Hi 


Thank you for your advice and testing.

> I think, it could be checked, if table has text fields instead of 
> numeric - we could exclude numeric conversion
> and have the same input-output operations (really more IO-operation, but 
> we need to compare)

I changed the column from int to text.
The performance becomes worse in each version,
but the rate of the difference of duration did not change.


> By the way, do you use prebuild Postgres versions for this test or
> build it by yourself with the same options? I am going to use built 
> myself.

In the mail in 2024-12-16 12:09:03, I used the modules which I build by myself.
In the other mail, I used the modules which community provides at following.
https://www.postgresql.org/download/windows/


> Could you confirm, that you receive you results on all execution orders 
> (17.0 first and 17.0 last)?

In my environment, the results do not depend on order.
(The performance of the order 16.4, 16.6, 17.0 is same as that of the order 17.0, 16.6, 16.4)


Regards,
Ryohei Takahashi

Re: COPY performance on Windows

From
Vladlen Popolitov
Date:
Ryohei Takahashi (Fujitsu) писал(а) 2024-12-19 16:13:
> Hi
> 
> 
> Thank you for your advice and testing.
> 
>> I think, it could be checked, if table has text fields instead of
>> numeric - we could exclude numeric conversion
>> and have the same input-output operations (really more IO-operation, 
>> but
>> we need to compare)
> 
> I changed the column from int to text.
> The performance becomes worse in each version,
> but the rate of the difference of duration did not change.
> 
> 
>> By the way, do you use prebuild Postgres versions for this test or
>> build it by yourself with the same options? I am going to use built
>> myself.
> 
> In the mail in 2024-12-16 12:09:03, I used the modules which I build by 
> myself.
> In the other mail, I used the modules which community provides at 
> following.
> https://www.postgresql.org/download/windows/
> 
> 
>> Could you confirm, that you receive you results on all execution 
>> orders
>> (17.0 first and 17.0 last)?
> 
> In my environment, the results do not depend on order.
> (The performance of the order 16.4, 16.6, 17.0 is same as that of the 
> order 17.0, 16.6, 16.4)
> 
> 
> Regards,
> Ryohei Takahashi

Hi!

  I tested with text (CREATE UNLOGGED TABLE test (c TEXT STORAGE PLAIN) 
WITH (autovacuum_enabled = off)).
Time is ~10% higher, but I also do not see strict dependence on version.

I think, antivirus can influence the performace, I did not switched it 
off.

How can I help you in testing? Probably better to test and compare 
result only with one quantity of clients.
For 1 client the read time of the input file has the big effect, for 
many clients their concurrency overhead has
big effect. Maybe make the series of the runs with 2 or 4 clients only 
and compare?

-- 
Best regards,

Vladlen Popolitov.