Thread: Re: COPY performance on Windows
Hello Takahashi-san, I am reluctant to draw conclusions about the general performance of this patch from one example. It seems that the performance could depend on many things: table size, column definitions, row width, hardware, OS version, shared_buffers, max_wal_size. I don't think we can say from your test here that performance is always worse on Windows. If it is, then I agree that we should think of what to do about it; but it seems possible to me that the behavior will be different in other circumstances. What I don't understand is why increasing the number of blocks should be worse. The comment before the FileZero() call has a comment explaining why a larger extension is thought to be better. If it's wrong, we should try to figure out why it's wrong. But it seems quite surprising that doing more work at once would be less efficient. That's not usually how things work. -- Robert Haas EDB: http://www.enterprisedb.com
Hi,

Thank you for your reply.

I don't want to "speed up" the COPY command; I just want to prevent a slowdown compared with PG16. But in any case, my current analysis is not convincing, so I will do more analysis and get back to you.

Regards,
Ryohei Takahashi
Ryohei Takahashi (Fujitsu) wrote on 2024-12-11 15:18:
> Hi,
>
> I am continuing to investigate the performance problem of COPY on Windows.
>
> I noticed that not only PG17.0 but also PG16.6 has a performance problem
> compared to PG16.4. The performance is 2.5%-5.8% worse, especially when
> the number of clients is 1 or 2.
>
> I modified the performance measurement script from the thread in [1]:
> * Enabled it to run in Windows git bash
> * Enabled it to compare PG16.4, PG16.6 and PG17.0
> * Increased the row count tenfold (about 10GB)
>
> I measured on a Windows Server 2022 machine with a 44-core CPU and 512GB
> of memory. The results are as follows.
>
> (1)
> I attach the performance measurement script. If you have a Windows
> environment, could you please try to reproduce the same performance
> problem?
>
> (2)
> The performance of PG16.6 and PG17.0 is worse than that of PG16.4, so I
> think the commits between August and September affect the performance.
> I will analyze these commits.
>
> [1] https://postgr.es/m/CAD21AoDvDmUQeJtZrau1ovnT_smN940=Kp6mszNGK3bq9yRN6g@mail.gmail.com
>
> Regards,
> Ryohei Takahashi

The COPY FROM code had changes around the DEFAULT option and number conversion, and you also suspect disk reads/writes. I propose isolating the read/write case. You could create a RAM disk on your Windows server (Microsoft documents on its site how to configure one; googling, I found this but have not tested it: https://learn.microsoft.com/ru-ru/archive/blogs/windowsinternals/how-to-create-a-ram-disk-in-windows-server ). You could create the database on the RAM disk and run the benchmarks there. That would exclude the influence of the disk; other changes probably affected performance as well.

The number conversion patch was tested on many processors, but not on all of them. Possibly the processor you run Windows on has worse performance for it, or there are other issues.

P.S. Could you point me to the performance measurement script you mentioned in your email? I will try to play with it.

--
Best regards,
Vladlen Popolitov.
Hi,

Thank you for your reply. I tried your patches and report the results in this e-mail.

> 1. Those code paths finish up in pg_pwritev(), but it has a loop over
> 8kb writes on Windows. Does it help if we just make "zbuffer" bigger?
> How big?

This patch improves the performance. I applied
0001-Use-bigger-writes-in-pg_pwrite_zeros-on-Windows.patch on top of
REL_16_6 and varied "zbuffer" from 2 to 32. I measured with nclients = 1.

16.6:                         453s
16.6 + patch (zbuffer = 2):   442s
16.6 + patch (zbuffer = 4):   434s
16.6 + patch (zbuffer = 8):   430s
16.6 + patch (zbuffer = 16):  429s
16.6 + patch (zbuffer = 32):  428s

Performance improved up to zbuffer = 8 and remained roughly stable beyond that.

> 2. While pondering the goals of posix_fallocate(), I had a
> realisation about how we might implement FileFallocate() on Windows.
> Does this idea work? Well?

This patch degrades the performance.

16.6:         453s
16.6 + patch: 479s

Regards,
Ryohei Takahashi
Hi,

Thank you for your interest in this thread.

> You could create the database on the RAM disk and run the benchmarks there.

Following your advice, I created a RAM disk and put the input files and the data directory on it, but the result changed by only a few seconds. In this test case the table is an unlogged table and shared_buffers is large enough, so I think disk performance does not have much effect.

> P.S. Could you point me to the performance measurement script you
> mentioned in your email? I will try to play with it.

Thank you. Please use the "test.sh" attached to the following e-mail:
https://www.postgresql.org/message-id/flat/TY3PR01MB11891C0FD066F069B113A2376823E2%40TY3PR01MB11891.jpnprd01.prod.outlook.com#8455c9f7b66780a356511f5cfe029d57

Regards,
Ryohei Takahashi
Ryohei Takahashi (Fujitsu) wrote on 2024-12-16 15:10:

Hi

> Following your advice, I created a RAM disk and put the input files and
> the data directory on it, but the result changed by only a few seconds.
> In this test case the table is an unlogged table and shared_buffers is
> large enough, so I think disk performance does not have much effect.

If the test on a RAM drive gives the same result, it could mean that operations other than disk I/O affect the performance. It is only an idea, but the numeric conversion may add some time because of the newly added functionality. I think this could be checked by giving the table text fields instead of numeric ones: that would exclude the numeric conversion while keeping the same input/output operations (actually more I/O, but we need to compare).

> Thank you. Please use the "test.sh" attached to the following e-mail:
> https://www.postgresql.org/message-id/flat/TY3PR01MB11891C0FD066F069B113A2376823E2%40TY3PR01MB11891.jpnprd01.prod.outlook.com#8455c9f7b66780a356511f5cfe029d57

OK, I will use it. By the way, do you use prebuilt Postgres versions for this test, or do you build them yourself with the same options? I am going to use builds of my own.

--
Best regards,
Vladlen Popolitov.
Ryohei Takahashi (Fujitsu) wrote on 2024-12-16 15:10:

Hi

> Thank you. Please use the "test.sh" attached to the following e-mail:
> https://www.postgresql.org/message-id/flat/TY3PR01MB11891C0FD066F069B113A2376823E2%40TY3PR01MB11891.jpnprd01.prod.outlook.com#8455c9f7b66780a356511f5cfe029d57

I cannot reproduce your results. In all of my runs the final result depends on run order: the benchmark for the first version gets a higher time, and then the times become smaller. For example, my last run (in start-time order, time in seconds):

PG164: nclients = 1, time = 251
PG164: nclients = 2, time = 210
PG164: nclients = 4, time = 126
PG164: nclients = 8, time = 107
PG164: nclients = 16, time = 99
PG164: nclients = 32, time = 109
PG164: nclients = 64, time = 112
PG164: nclients = 128, time = 113
PG164: nclients = 256, time = 120

PG166: nclients = 1, time = 244
PG166: nclients = 2, time = 222
PG166: nclients = 4, time = 131
PG166: nclients = 8, time = 109
PG166: nclients = 16, time = 101
PG166: nclients = 32, time = 110
PG166: nclients = 64, time = 115
PG166: nclients = 128, time = 116
PG166: nclients = 256, time = 123

PG170: nclients = 1, time = 240
PG170: nclients = 2, time = 213
PG170: nclients = 4, time = 129
PG170: nclients = 8, time = 110
PG170: nclients = 16, time = 101
PG170: nclients = 32, time = 112
PG170: nclients = 64, time = 115
PG170: nclients = 128, time = 116
PG170: nclients = 256, time = 122

I slightly modified your script:
1) Moved the creation of the input files into a separate step to decrease the influence of the system disk cache.
2) Ran the PostgreSQL servers on a separate PC (Windows 10, 11th Gen Intel(R) Core(TM) i5-1135G7 @ 2.40GHz, 16GB RAM) and the clients on another PC.
3) Added a CHECKPOINT at the end of every COPY FROM to flush WAL.
4) Used the EDB build for Windows from their site. Unfortunately, they distribute files without debug symbols, like other distributions, which does not help during profiling.
5) I think it would be better to decrease shared_buffers as much as possible in order to measure all the I/O time, but I used 25% of RAM.
My observations:

1) For 1-2 clients, the read time decreases on every run, independent of the Postgres version. This looks like the Windows disk cache (I suspect NTFS metadata such as the b-tree of file locations, not the input file itself). It contradicts your main point that version 17.0 is slower.
2) With 1 client, the Postgres backend takes only 12% of CPU; the rest of the time it waits on kernel operations.
3) With 16-256 clients, I have not analyzed what drives the time increase under multiprocessing: the OS process implementation, waiting on PostgreSQL locks or spinlocks, parallel access to one input file, or other factors.

Could you confirm that you get your results with all execution orders (17.0 first and 17.0 last)?

--
Best regards,
Vladlen Popolitov.
Hi

Thank you for your advice and testing.

> I think this could be checked by giving the table text fields instead of
> numeric ones: that would exclude the numeric conversion while keeping the
> same input/output operations (actually more I/O, but we need to compare).

I changed the column type from int to text. The performance became worse in each version, but the relative difference between versions did not change.

> By the way, do you use prebuilt Postgres versions for this test, or do
> you build them yourself with the same options? I am going to use builds
> of my own.

In the mail of 2024-12-16 12:09:03, I used modules that I built myself. In the other mail, I used the modules that the community provides at:
https://www.postgresql.org/download/windows/

> Could you confirm that you get your results with all execution orders
> (17.0 first and 17.0 last)?

In my environment, the results do not depend on the order. (The performance with the order 16.4, 16.6, 17.0 is the same as with the order 17.0, 16.6, 16.4.)

Regards,
Ryohei Takahashi
Ryohei Takahashi (Fujitsu) wrote on 2024-12-19 16:13:
> Hi
>
> Thank you for your advice and testing.
>
> I changed the column type from int to text. The performance became worse
> in each version, but the relative difference between versions did not
> change.
>
> In the mail of 2024-12-16 12:09:03, I used modules that I built myself.
> In the other mail, I used the modules that the community provides at:
> https://www.postgresql.org/download/windows/
>
> In my environment, the results do not depend on the order. (The
> performance with the order 16.4, 16.6, 17.0 is the same as with the
> order 17.0, 16.6, 16.4.)
>
> Regards,
> Ryohei Takahashi

Hi!

I tested with text (CREATE UNLOGGED TABLE test (c TEXT STORAGE PLAIN) WITH (autovacuum_enabled = off)). The time is ~10% higher, but I also do not see a strict dependence on the version. I think the antivirus can influence the performance; I did not switch it off.

How can I help you with testing? It is probably better to test and compare results with only one client count. For 1 client, the read time of the input file has a big effect; for many clients, their concurrency overhead has a big effect. Maybe make a series of runs with only 2 or 4 clients and compare?

--
Best regards,
Vladlen Popolitov.