Thread: insert performance for win32
Hi, I usually use PostgreSQL coupled with Linux, but I have to use Windows for a perticular project. So I wanted to do some tests to know if the performance will be acceptable (I don't need PostgreSQL to be as fast with windows as with linux, but it has to be usable...). I started with trying to do lots of inserts, and I'm quite astonished by the catastrophics results ... The test : The computer was the same (my workstation, a PIV Dell with SATA disk), dual boot The windows OS is XP. Both Oses are PostgreSQL 8.0.3 Both PostgreSQL clusters (windows and linux) have the same tuning (shared_buffers=20000, wal_buffers=128, checkpoint_segments=10) Before each test, the clusters are vacuum analyzed, and the test database is recreated. The script is quite dumb : BEGIN; CREATE TABLE test (col1 serial, col2 text); INSERT INTO test (col2) values ('test'); INSERT INTO test (col2) values ('test'); INSERT INTO test (col2) values ('test'); INSERT INTO test (col2) values ('test'); INSERT INTO test (col2) values ('test'); ...... 500,000 times Then COMMIT. I know it isn't realistic, but I needed to start with something :) The results are as follows : Linux : 1'9'' Windows : 9'38'' What I've tried to solve, and didn't work : - Deactivate antivirus on windows - fsync=no - raise the checkpoint_segments value (32) - remove hyperthreading (who knows...) I don't know what could cause this (I'm not a windows admin...at all). All I see is a very high kernel load during the execution of this script, but I can't determine where it comes from. I'd like to know if this is a know problem, if there is something I can do, etc... Thanks a lot.
> Hi, > > I usually use PostgreSQL coupled with Linux, but I have to use Windows for > a > perticular project. > > So I wanted to do some tests to know if the performance will be acceptable > (I > don't need PostgreSQL to be as fast with windows as with linux, but it has > to > be usable...). In my experience win32 is par with linux generally with a few gotchas on either side. Are your times with fsync=no? It's much harder to give apples-apples comparison with fsync=on for various reasons. Are you running stats_command_string=on? Try disabling and compare results. Is your loading app running locally or on the server? I am very interesting in discovering sources of high cpu load problems on win32. If you are still having problems could you get a gprof profile together? There is a recent thread on win32-hackers discussing how to do this. Merlin
> > In my experience win32 is par with linux generally with a few gotchas on > either side. Are your times with fsync=no? It's much harder to give > apples-apples comparison with fsync=on for various reasons. It is with fsync=off on windows, fsync=on on linux > > Are you running stats_command_string=on? Try disabling and compare > results. Deactivated on windows, activated on linux > Is your loading app running locally or on the server? Yes > > I am very interesting in discovering sources of high cpu load problems > on win32. If you are still having problems could you get a gprof > profile together? There is a recent thread on win32-hackers discussing > how to do this. I'll give it a look.... > > Merlin
> > In my experience win32 is par with linux generally with a few gotchas on > > either side. Are your times with fsync=no? It's much harder to give > > apples-apples comparison with fsync=on for various reasons. > It is with fsync=off on windows, fsync=on on linux well, inside a transaction this shouldn't have mattered anyways. > > Are you running stats_command_string=on? Try disabling and compare > > results. > Deactivated on windows, activated on linux > > Is your loading app running locally or on the server? > Yes hm :(. Well, you had me curious so I went ahead and re-ran your test case and profiled it (on windows). I got similar resultstime wise. It's interesting to note that the command I used to generate the test table before dumping w/inserts insert into test select nextval('test_id_seq'), 'test' from generate_series(1,500000) ran in just a few seconds. Well, I cut the #recs down to 50k and here is profile trace: % cumulative self self total time seconds seconds calls s/call s/call name 10.78 0.62 0.62 50001 0.00 0.00 yyparse 5.39 0.93 0.31 5101422 0.00 0.00 AllocSetAlloc 4.52 1.19 0.26 799970 0.00 0.00 base_yylex 2.78 1.35 0.16 299998 0.00 0.00 SearchCatCache 2.43 1.49 0.14 554245 0.00 0.00 hash_search 2.26 1.62 0.13 49998 0.00 0.00 XLogInsert 1.74 1.72 0.10 453363 0.00 0.00 LWLockAcquire 1.74 1.82 0.10 299988 0.00 0.00 ScanKeywordLookup This makes me wonder if we are looking in the wrong place. Maybe the problem is coming from psql? More results to follow. Merlin
> This makes me wonder if we are looking in the wrong place. Maybe the > problem is coming from psql? More results to follow. problem is not coming from psql. One thing I did notice that in a 250k insert transaction the insert time grows with #recs inserted. Time to insert first 50k recs is about 27 sec and last 50 k recs is 77 sec. I also confimed that size of table is not playing a role here. Marc, can you do select timeofday() every 50k recs from linux? Also a gprof trace from linux would be helpful. Merlin
On Tuesday 06 September 2005 19:11, Merlin Moncure wrote: > > This makes me wonder if we are looking in the wrong place. Maybe the > > problem is coming from psql? More results to follow. > > problem is not coming from psql. > > One thing I did notice that in a 250k insert transaction the insert time > grows with #recs inserted. Time to insert first 50k recs is about 27 > sec and last 50 k recs is 77 sec. I also confimed that size of table is > not playing a role here. > > Marc, can you do select timeofday() every 50k recs from linux? Also a > gprof trace from linux would be helpful. > Here's the timeofday ... i'll do the gprof as soon as I can. Every 50000 rows... Wed Sep 07 13:58:13.860378 2005 CEST Wed Sep 07 13:58:20.926983 2005 CEST Wed Sep 07 13:58:27.928385 2005 CEST Wed Sep 07 13:58:35.472813 2005 CEST Wed Sep 07 13:58:42.825709 2005 CEST Wed Sep 07 13:58:50.789486 2005 CEST Wed Sep 07 13:58:57.553869 2005 CEST Wed Sep 07 13:59:04.298136 2005 CEST Wed Sep 07 13:59:11.066059 2005 CEST Wed Sep 07 13:59:19.368694 2005 CEST > Merlin > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match
> > One thing I did notice that in a 250k insert transaction the insert time > > grows with #recs inserted. Time to insert first 50k recs is about 27 > > sec and last 50 k recs is 77 sec. I also confimed that size of table is > > not playing a role here. > > > > Marc, can you do select timeofday() every 50k recs from linux? Also a > > gprof trace from linux would be helpful. > > > > Here's the timeofday ... i'll do the gprof as soon as I can. > Every 50000 rows... > Were those all in a single transaction? Merlin
> On Tuesday 06 September 2005 19:11, Merlin Moncure wrote: > Here's the timeofday ... i'll do the gprof as soon as I can. > Every 50000 rows... > > Wed Sep 07 13:58:13.860378 2005 CEST > Wed Sep 07 13:58:20.926983 2005 CEST > Wed Sep 07 13:58:27.928385 2005 CEST > Wed Sep 07 13:58:35.472813 2005 CEST > Wed Sep 07 13:58:42.825709 2005 CEST > Wed Sep 07 13:58:50.789486 2005 CEST > Wed Sep 07 13:58:57.553869 2005 CEST > Wed Sep 07 13:59:04.298136 2005 CEST > Wed Sep 07 13:59:11.066059 2005 CEST > Wed Sep 07 13:59:19.368694 2005 CEST ok, I've been in crunching profile profile graphs, and so far have been only been able to draw following conclusions. For bulk, 'in-transaction' insert: 1. win32 is slower than linux. win32 time for each insert grows with # inserts in xact, linux does not (or grows much slower). Win32 starts out about 3x slower and grows to 10x slower after 250k inserts. 2. ran a 50k profile vs. 250k profile. Nothing jumps out as being slower or faster: most time is spent in yyparse on either side. From this my preliminary conclusion is that there is something going on in the win32 api which is not showing in the profile. 3. The mingw gprof cumulative seconds does not show measurable growth in cpu time/insert in 50k/250k profile. I'm now talking suggestions about where to look for performance problems :(. Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > ok, I've been in crunching profile profile graphs, and so far have been > only been able to draw following conclusions. > For bulk, 'in-transaction' insert: > 1. win32 is slower than linux. win32 time for each insert grows with # > inserts in xact, linux does not (or grows much slower). Win32 starts > out about 3x slower and grows to 10x slower after 250k inserts. Just to be clear: what you were testing was BEGIN; INSERT ... VALUES (...); repeat insert many times COMMIT; with each statement issued as a separate PQexec() operation, correct? Was this set up as a psql script, or specialized C code? (If a psql script, I wonder whether it's psql that's chewing the time.) > 2. ran a 50k profile vs. 250k profile. Nothing jumps out as being > slower or faster: most time is spent in yyparse on either side. From > this my preliminary conclusion is that there is something going on in > the win32 api which is not showing in the profile. Hmm. Client/server data transport maybe? It would be interesting to try inserting the same data in other ways: * COPY from client * COPY from disk file * INSERT/SELECT from another table and see whether you see a similar slowdown. regards, tom lane
> I've done the tests with rc1. This is still as slow on windows ... about > 6-10 > times slower thant linux (via Ip socket). (depending on using prepared > queries, etc...) > > By the way, we've tried to insert into the windows database from a linux > psql > client, via the network. In this configuration, inserting is only about 2 > times slower than inserting locally (the linux client had a slower CPU > 1700Mhz agains 3000). > Could it be related to a problem in the windows psql client ? [OK, I'm bringing this back on-list, and bringing it to QingQing's attention, who I secretly hope is the right person to be looking at this problem :)] Just to recap Marc and I have been looking at the performance disparity between windows and linux for a single transaction statement by statement insert on a very narrow table with no keys from a remote client. Marc's observations showed (and I verified) that windows is much slower in this case than it should be. I gprof'ed both the psql client and the server during the insert and didn't see anything seriously out of order...unfortunately QQ's latest win32 performance tweaks haven't helped. Marc's observation that by switching to a linux client drops time down drastically is really intersing! Merlin
> > I've done the tests with rc1. This is still as slow on windows ... > about > > 6-10 > > times slower thant linux (via Ip socket). (depending on > using prepared > > queries, etc...) > > > > By the way, we've tried to insert into the windows database from a > linux > > psql > > client, via the network. In this configuration, inserting is only > about 2 > > times slower than inserting locally (the linux client had a > slower CPU > > 1700Mhz agains 3000). > > Could it be related to a problem in the windows psql client ? > > [OK, I'm bringing this back on-list, and bringing it to > QingQing's attention, who I secretly hope is the right person > to be looking at this problem :)] > > Just to recap Marc and I have been looking at the performance > disparity between windows and linux for a single transaction > statement by statement insert on a very narrow table with no > keys from a remote client. Marc's observations showed (and I > verified) that windows is much slower in this case than it > should be. I gprof'ed both the psql client and the server > during the insert and didn't see anything seriously out of > order...unfortunately QQ's latest win32 performance tweaks > haven't helped. > > Marc's observation that by switching to a linux client drops > time down drastically is really intersing! Could this be a case of the network being slow, as we've seen a couple of times before? And if you run psql on the local box, you get it double. Do you get a speed difference between the local windows box and a remote wnidows box? //Magnus
Le Mercredi 02 Novembre 2005 14:54, Magnus Hagander a écrit : > > > I've done the tests with rc1. This is still as slow on windows ... > > > > about > > > > > 6-10 > > > times slower thant linux (via Ip socket). (depending on > > > > using prepared > > > > > queries, etc...) > > > > > > By the way, we've tried to insert into the windows database from a > > > > linux > > > > > psql > > > client, via the network. In this configuration, inserting is only > > > > about 2 > > > > > times slower than inserting locally (the linux client had a > > > > slower CPU > > > > > 1700Mhz agains 3000). > > > Could it be related to a problem in the windows psql client ? > > > > [OK, I'm bringing this back on-list, and bringing it to > > QingQing's attention, who I secretly hope is the right person > > to be looking at this problem :)] > > > > Just to recap Marc and I have been looking at the performance > > disparity between windows and linux for a single transaction > > statement by statement insert on a very narrow table with no > > keys from a remote client. Marc's observations showed (and I > > verified) that windows is much slower in this case than it > > should be. I gprof'ed both the psql client and the server > > during the insert and didn't see anything seriously out of > > order...unfortunately QQ's latest win32 performance tweaks > > haven't helped. > > > > Marc's observation that by switching to a linux client drops > > time down drastically is really intersing! > > Could this be a case of the network being slow, as we've seen a couple > of times before? And if you run psql on the local box, you get it > double. > > Do you get a speed difference between the local windows box and a remote > wnidows box? > > //Magnus The Windows-Windows test is local (via loopback interface) The Linux (client) - Windows (server) is via network (100Mbits) I can't test with 2 windows box ... I haven't got that much (all machines linux, except the test one...)
"Magnus Hagander" <mha@sollentuna.net> writes: >> Marc's observation that by switching to a linux client drops >> time down drastically is really intersing! > Could this be a case of the network being slow, I'm wondering about nonstandard junk lurking in the TCP stack of the Windows client machine. Also, I seem to recall something about a "QOS patch" that people are supposed to apply, or not apply as the case may be, to get Windows' TCP stack to behave reasonably ... ring a bell? regards, tom lane
On Wed, 2 Nov 2005, Merlin Moncure wrote: > > > > By the way, we've tried to insert into the windows database from a > > linux psql client, via the network. In this configuration, inserting > > is only about 2 times slower than inserting locally (the linux client > > had a slower CPU 1700Mhz agains 3000). Could it be related to a > > problem in the windows psql client ? > > If you put client/server on the same machine, then we don't know how the CPU is splitted. Can you take a look at the approximate number by observing the task manager data while running? If communication code is the suspect, can we measure the difference if we disable the redefinition of recv()/send() etc in port/win32.h (may require change related code a little bit as well). In this way, the socket will not be able to pickup signals, but let see if there is any performance difference first. Regards, Qingqing > > [OK, I'm bringing this back on-list, and bringing it to QingQing's > attention, who I secretly hope is the right person to be looking at this > problem :)] > P.s. You scared me ;-)
> On Wed, 2 Nov 2005, Merlin Moncure wrote: > If you put client/server on the same machine, then we don't know how the > CPU is splitted. Can you take a look at the approximate number by > observing the task manager data while running? ok, I generated a test case which was 250k inserts to simple two column table all in single transaction. Every 50k inserts, time is recorded via timeofday(). Running from remote, Time progression is: First 50k: 20 sec Second : 29 sec [...] final: : 66 sec so, clear upward progression of time/rec. Initial time is 2.5k inserts/sec which is decent but not great for such a narrow table. CPU time on server starts around 50% and drops in exact proportion to insert performance. My earlier gprof test also suggest there is no smoking gun sucking down all the cpu time. cpu time on the client is very volatile but with a clear increase over time starting around 20 and ending perhaps 60. My client box is pretty quick, 3ghz p4. Running the script locally, from the server, cpu time is pegged at 100% and stays...first 50k is 23 sec with a much worse decomposition to almost three minutes for final 50k. Merlin > If communication code is the suspect, can we measure the difference if we > disable the redefinition of recv()/send() etc in port/win32.h (may require > change related code a little bit as well). In this way, the socket will > not be able to pickup signals, but let see if there is any performance > difference first. > > Regards, > Qingqing > > > > > > [OK, I'm bringing this back on-list, and bringing it to QingQing's > > attention, who I secretly hope is the right person to be looking at this > > problem :)] > > > P.s. You scared me ;-)
Both win32 send/recv have pgwin32_poll_signals() in them. This is glorified WaitForSingleObjectEx on global pgwin32_signal_event. This is probably part of the problem. Can we work some of the same magic you put into check interrupts macro? ISTM everything also in win32 functions is either API call, or marginal case. Merlin
> Both win32 send/recv have pgwin32_poll_signals() in them. This is > glorified WaitForSingleObjectEx on global pgwin32_signal_event. This is > probably part of the problem. Can we work some of the same magic you put > into check interrupts macro? Whoop! following a cvs update I see this is already nailed :) Back to the drawing board... Merlin
> Both win32 send/recv have pgwin32_poll_signals() in them. > This is glorified WaitForSingleObjectEx on global > pgwin32_signal_event. This is probably part of the problem. > Can we work some of the same magic you put into check > interrupts macro? > > ISTM everything also in win32 functions is either API call, > or marginal case. Uh, we already do that, don't we? http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/port/win32/ socket.c?rev=1.10 has: static int pgwin32_poll_signals(void) { if (UNBLOCKED_SIGNAL_QUEUE()) { pgwin32_dispatch_queued_signals(); errno = EINTR; return 1; } return 0; } Are you testing this on 8.0.x? Or a pre-RC version of 8.1? //Magnus
> > > Both win32 send/recv have pgwin32_poll_signals() in them. > > > This is glorified WaitForSingleObjectEx on global > > > pgwin32_signal_event. This is probably part of the problem. > > > Can we work some of the same magic you put into check interrupts > > > macro? > > > > > > > Uh, we already do that, don't we? > > > http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/port/win3 > > 2/ > > socket.c?rev=1.10 > > has: > > > > Yeah, we did this. I am thinking of just use simple mechanism > of the win32 sockets, which could not pick up signals, but I > would like to see if there is any difference -- do you think > there is any point to try this? Sorry, I don't follow you here - what do you mean to do? Remove the event completely so we can't wait on it? //Magnus
On Thu, 3 Nov 2005, Magnus Hagander wrote: > > Sorry, I don't follow you here - what do you mean to do? Remove the > event completely so we can't wait on it? > I'd like to use the win32 provided recv(), send() functions instead of redirect them to pgwin32_recv()/pgwin32_send(), just like libpq does. If we do this, we will lose some functionalities, but I'd like to see the performance difference first. -- do you think that will be any difference? Regards, Qingqing
> > Sorry, I don't follow you here - what do you mean to do? Remove the > > event completely so we can't wait on it? > > > > I'd like to use the win32 provided recv(), send() functions > instead of redirect them to pgwin32_recv()/pgwin32_send(), > just like libpq does. If we do this, we will lose some > functionalities, but I'd like to see the performance > difference first. -- do you think that will be any difference? Doesn't work, really. It will no longer be possible to send a signal to an idle backend. The idle backend will be blocking on recv(), that's how it works. So unless we can get around that somehow, it's a non-starter I think. I doubt there will be much performance difference, as you hav eto hit the kernel anyway (in the recv/send call). But that part is just a guess :-) //Magnus
On Thu, 3 Nov 2005, Magnus Hagander wrote: > > > Sorry, I don't follow you here - what do you mean to do? Remove the > > > event completely so we can't wait on it? > > > > > > > I'd like to use the win32 provided recv(), send() functions > > instead of redirect them to pgwin32_recv()/pgwin32_send(), > > just like libpq does. If we do this, we will lose some > > functionalities, but I'd like to see the performance > > difference first. -- do you think that will be any difference? > > Doesn't work, really. It will no longer be possible to send a signal to > an idle backend. The idle backend will be blocking on recv(), that's how > it works. So unless we can get around that somehow, it's a non-starter I > think. Yeah, agreed. An alternative is set tiemout like 100 ms or so. When timeout happens, check the signals. But I guess you will be strongly against it. > > I doubt there will be much performance difference, as you hav eto hit > the kernel anyway (in the recv/send call). But that part is just a guess > :-) I know what you mean ... I will take a look -- if the patch (not including fix signaling problem), if doesn't change much, I will give it a try. Regards, Qingqing
> > > I'd like to use the win32 provided recv(), send() > functions instead > > > of redirect them to pgwin32_recv()/pgwin32_send(), just > like libpq > > > does. If we do this, we will lose some functionalities, > but I'd like > > > to see the performance difference first. -- do you think > that will > > > be any difference? > > > > Doesn't work, really. It will no longer be possible to send > a signal > > to an idle backend. The idle backend will be blocking on recv(), > > that's how it works. So unless we can get around that > somehow, it's a > > non-starter I think. > > Yeah, agreed. An alternative is set tiemout like 100 ms or > so. When timeout happens, check the signals. But I guess you > will be strongly against it. Not on principle, but I don't think it'll give us enough gain for the cost. But if it does, I'm certainly not against it. //Magnus
""Merlin Moncure"" <merlin.moncure@rcsonline.com> wrote > > Running from remote, Time progression is: > First 50k: 20 sec > Second : 29 sec > [...] > final: : 66 sec > This may due to the maintainence cost of a big transaction, I am not sure ... Tom? > so, clear upward progression of time/rec. Initial time is 2.5k > inserts/sec which is decent but not great for such a narrow table. CPU > time on server starts around 50% and drops in exact proportion to insert > performance. My earlier gprof test also suggest there is no smoking gun > sucking down all the cpu time. > Not to 100%, so this means the server is always starving. It is waiting on something -- of couse not lock. That's why I think there is some problem on network communication. Another suspect will be the write - I knwo NTFS system will issue an internal log when extending a file. To remove the second suspect, I will try to hack the source to do a "fake" write ... Regards, Qingqing
"Qingqing Zhou" <zhouqq@cs.toronto.edu> wrote > > Not to 100%, so this means the server is always starving. It is waiting on > something -- of couse not lock. That's why I think there is some problem > on network communication. Another suspect will be the write - I knwo NTFS > system will issue an internal log when extending a file. To remove the > second suspect, I will try to hack the source to do a "fake" write ... > To patch: ------------------------- Here is a quite straight hack to implement "fake" write for both relation and xlog. Now the server becomes pure CPU play. 1. RelationGetBufferForTuple()/hio.c: remove line (if you do not enable cassert, then doesn't matter): - Assert(PageIsNew((PageHeader) pageHeader)); 2. ReadBuffer()/bufmgr.c: remove line - smgrextend(reln->rd_smgr, blockNum, (char *) bufBlock, - reln->rd_istemp); 3. XLogWrite()/xlog.c errno = 0; + goto fake; if (write(openLogFile, from, nbytes) != nbytes) { /* if write didn't set errno, assume no disk space */ ... } + + fake: /* Update state for write */ To use it: ------------------------- 1. have several copies of a correct data; 2. patch the server; 3. when you startup postmaster, use the following parameters: postmaster -c"checkpoint_timeout=3600" -c"bgwriter_all_percent=0" -Ddata Note now the database server is one-shoot usable -- after you shutdown, it won't startup again. Just run begin; many inserts; end; To observe: ------------------------- (1) In this case, what's the remote server CPU usage -- 100%? I don't have several machines to test it. In my single machine, I run 35000 insert commands from psql by cut and paste into it and could observe that: --- 25% kernel time 75% user time 20% postgresql (--enable-debug --enable-cassert) 65% psql (as same above) 10% csrss (system process, manage graphics commands (not sure, just googled it), etc) 5% system (system process) --- (2) In this case, Linux still keeps almost 10 times faster? After this, we may need more observations like comparison of simple "select 1;" to reduce the code space we may want to explore ... Regards, Qingqing
On Thu, 3 Nov 2005, Magnus Hagander wrote: > > Both win32 send/recv have pgwin32_poll_signals() in them. > > This is glorified WaitForSingleObjectEx on global > > pgwin32_signal_event. This is probably part of the problem. > > Can we work some of the same magic you put into check > > interrupts macro? > > > > Uh, we already do that, don't we? > http://developer.postgresql.org/cvsweb.cgi/pgsql/src/backend/port/win32/ > socket.c?rev=1.10 > has: > Yeah, we did this. I am thinking of just use simple mechanism of the win32 sockets, which could not pick up signals, but I would like to see if there is any difference -- do you think there is any point to try this? Regards, Qingqing
> > Sorry, I don't follow you here - what do you mean to do? Remove the > > event completely so we can't wait on it? > > > > I'd like to use the win32 provided recv(), send() functions instead of > redirect them to pgwin32_recv()/pgwin32_send(), just like libpq does. If > we do this, we will lose some functionalities, but I'd like to see the > performance difference first. -- do you think that will be any difference? I personally strongly doubt this will make a diffenrence. Anyways I think we might be looking at the wrong place. Here was my test: 1. drop/create table two fields (id int, f text) no keys 2. begin 3. insert 500k rows. every 50k get time get geometric growth in insert time 4. commit I am doing this via type dump.sql | psql -q mydb I rearrange: every 50k rows get time but also restart transaction. I would ex Guess what...no change. This was a shocker. So I wrap dump.sql with another file that is just \i dump.sql \i dump.sql and get time to insert 50k recs resets after first dump... Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > ok, I generated a test case which was 250k inserts to simple two column > table all in single transaction. Every 50k inserts, time is recorded > via timeofday(). You mean something like the attached? > Running from remote, Time progression is: > First 50k: 20 sec > Second : 29 sec > [...] > final: : 66 sec On Unix I get a dead flat line (within measurement noise), both local loopback and across my LAN. after 50000 30.20 sec after 100000 31.67 sec after 150000 30.98 sec after 200000 29.64 sec after 250000 29.83 sec "top" shows nearly constant CPU usage over the run, too. With a local connection it's pretty well pegged, with LAN connection the server's about 20% idle and the client about 90% (client machine is much faster than server which may affect this, but I'm too lazy to try it in the other direction). I think it's highly likely that you are looking at some strange behavior of the Windows TCP stack. regards, tom lane
Attachment
On Thu, 3 Nov 2005, Tom Lane wrote: > > On Unix I get a dead flat line (within measurement noise), both local > loopback and across my LAN. > > after 50000 30.20 sec > after 100000 31.67 sec > after 150000 30.98 sec > after 200000 29.64 sec > after 250000 29.83 sec > Confirmed in Linux. And on a winxp machine(sp2) with server, client together, with (see almost no performance difference) or without my "fake" write, the observation is still hold for both cases: after 50000 25.21 sec after 100000 26.26 sec after 150000 25.23 sec after 200000 26.25 sec after 250000 26.58 sec In both cases, postgres 67% cpu, psql 15~20%, rest: system process. Kernel time is 40+% -- where from? Regards, Qingqing
> You mean something like the attached? not quite: attached is a file to generate test. to do it: psql yadda \i timeit.sql \t \o dump.sql select make_dump(50000, false); \q cat dump.sql | psql -q yadda and see what pops out. I had to do it that way because redirecting psql to dump file caused psql sit forever waiting on more with cpu load... Merlin
Attachment
> > You mean something like the attached? oh, btw I ran timeit.c and performance is flat and fairly fast. I'm pretty sure psql is the culprit here. Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: >> You mean something like the attached? > not quite: attached is a file to generate test. > cat dump.sql | psql -q yadda Ah. Does your psql have readline support? if so, does adding -n to that command change anything? regards, tom lane
> > not quite: attached is a file to generate test. > > > cat dump.sql | psql -q yadda > > Ah. Does your psql have readline support? if so, does adding -n to > that command change anything? > It doesn't, and it doesn't. :/ Ok, here's where it gets interesting. I removed all the newlines from the test output (dump.sql) and got flat times ;). Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > It doesn't, and it doesn't. :/ Ok, here's where it gets interesting. I > removed all the newlines from the test output (dump.sql) and got flat > times ;). That's bizarre ... I'd have thought a very long line would be more likely to trigger internal performance problems than the original. What happens if you read the file with "psql -f dump.sql" instead of cat/stdin? BTW, I get flat times for your psql test case on Unix, again both with local and remote client. So whatever is going on here, it's Windows-specific. regards, tom lane
> That's bizarre ... I'd have thought a very long line would be more > likely to trigger internal performance problems than the original. > > What happens if you read the file with "psql -f dump.sql" instead > of cat/stdin? non-flat. Also ran via \i and got non flat times. > BTW, I get flat times for your psql test case on Unix, again both with > local and remote client. So whatever is going on here, it's > Windows-specific. yeah. I'm guessing problem is in the mingw flex/bison (which I really, really hope is not the case) or some other win32 specific block of code. I'm snooping around there... Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > yeah. I'm guessing problem is in the mingw flex/bison (which I really, > really hope is not the case) or some other win32 specific block of code. > I'm snooping around there... Maybe I'm confused here, but I thought we had established that the local and remote cases behave differently for you? If so I'd suppose that it must be a networking issue, and there's little point in looking inside psql. If the problem is internal to psql, gprof or similar tool would be helpful ... got anything like that on Windows? regards, tom lane
> > "Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > > yeah. I'm guessing problem is in the mingw flex/bison (which I really, > > really hope is not the case) or some other win32 specific block of code. > > I'm snooping around there... > > Maybe I'm confused here, but I thought we had established that the local > and remote cases behave differently for you? If so I'd suppose that it > must be a networking issue, and there's little point in looking inside > psql. > The local case is *worse*...presumably because psql is competing with the server for cpu time...cpu load is pegged at 100%. On the remote case, I'm getting 50-60% cpu load which is way to high. The problem is definitely in psql. Merlin
ok, here is gprof output from newlines/no newlines [newlines] % cumulative self self total time seconds seconds calls s/call s/call name 19.03 0.67 0.67 1 0.67 3.20 MainLoop 17.61 1.29 0.62 500031 0.00 0.00 yylex 15.63 1.84 0.55 1500094 0.00 0.00 GetVariable 11.08 2.23 0.39 250018 0.00 0.00 SendQuery 4.26 2.38 0.15 750051 0.00 0.00 GetVariableBool 3.41 2.50 0.12 250024 0.00 0.00 SetVariable 2.56 2.59 0.09 250015 0.00 0.00 gets_fromFile 2.27 2.67 0.08 750044 0.00 0.00 yy_switch_to_buffer 2.27 2.75 0.08 500031 0.00 0.00 psql_scan 2.27 2.83 0.08 pg_strcasecmp 1.70 2.89 0.06 4250078 0.00 0.00 emit 1.70 2.95 0.06 500031 0.00 0.00 VariableEquals 1.70 3.01 0.06 250018 0.00 0.00 AcceptResult 1.42 3.06 0.05 250018 0.00 0.00 ResetCancelConn [no newlines] % cumulative self self total time seconds seconds calls s/call s/call name 23.01 0.26 0.26 250019 0.00 0.00 yylex 19.47 0.48 0.22 250018 0.00 0.00 SendQuery 11.50 0.61 0.13 1000070 0.00 0.00 GetVariable 9.73 0.72 0.11 250042 0.00 0.00 pg_strdup 9.73 0.83 0.11 250024 0.00 0.00 SetVariable 6.19 0.90 0.07 500039 0.00 0.00 GetVariableBool 5.31 0.96 0.06 pg_strcasecmp 4.42 1.01 0.05 4250078 0.00 0.00 emit 2.65 1.04 0.03 1 0.03 1.01 MainLoop ok, mingw gprof is claiming MainLoop is a culprit here, along with general efficiency penalty otherwise in several things (twice many calls to yylex, 33%more to getvariable, etc). Just for fun I double checked string len of query input to SendQuery and everything is the right length. Same # calls to SendQuery, but 2.5 times call time in newlines case...anything jump out? Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > ok, mingw gprof is claiming MainLoop is a culprit here, The only thing I can see that would be different for Windows is the SetConsoleCtrlHandler kernel call ... could that be expensive? Why do we have either sigsetjmp or setup_cancel_handler inside the per-line loop, rather than just before it? There is a lot of stuff in MainLoop that doesn't seem like it really needs to be done on every single line, particularly not the repeated fetching of psql variables that couldn't possibly change except inside HandleSlashCmds. But that all ought to be the same on Unix or Windows. regards, tom lane
Nailed it. problem is in mainloop.c -> setup_cancel_handler. Apparently you can have multiple handlers and windows keeps track of them all, even if they do the same thing. Keeping track of so many system handles would naturally slow the whole process down. Commenting that line times are flat as a pancake. I am thinking keeping track of a global flag would be appropriate. Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > Nailed it. > problem is in mainloop.c -> setup_cancel_handler. Apparently you can > have multiple handlers and windows keeps track of them all, even if they > do the same thing. Keeping track of so many system handles would > naturally slow the whole process down. Yipes. So we really want to do that only once. AFAICS it is appropriate to move the sigsetjmp and setup_cancel_handler calls in front of the per-line loop inside MainLoop --- can anyone see a reason not to? I'm inclined to treat this as an outright bug, not just a minor performance issue, because it implies that a sufficiently long psql script would probably crash a Windows machine. regards, tom lane
> "Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > > Nailed it. > > > problem is in mainloop.c -> setup_cancel_handler. Apparently you can > > have multiple handlers and windows keeps track of them all, even if they > > do the same thing. Keeping track of so many system handles would > > naturally slow the whole process down. > > Yipes. So we really want to do that only once. > > AFAICS it is appropriate to move the sigsetjmp and setup_cancel_handler > calls in front of the per-line loop inside MainLoop --- can anyone see > a reason not to? hm. mainloop is re-entrant, right? That means each \i would reset the handler...what is downside to keeping global flag? > I'm inclined to treat this as an outright bug, not just a minor certainly... > performance issue, because it implies that a sufficiently long psql > script would probably crash a Windows machine. actually, it's worse than that, it's more of a dos on the whole system, as windows will eventually stop granting handles, but there is a good chance of side effects on other applications. Merlin
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes: >> AFAICS it is appropriate to move the sigsetjmp and >> setup_cancel_handler >> calls in front of the per-line loop inside MainLoop --- can anyone see >> a reason not to? > hm. mainloop is re-entrant, right? That means each \i would reset the > handler...what is downside to keeping global flag? Ah, right, and in fact I'd missed the comment at line 325 pointing out that we're relying on the sigsetjmp to be re-executed every time through. That could be improved on, likely, but not right before a release. Does the flag need to be global? I'm thinking void setup_cancel_handler(void) { + static bool done = false; + + if (!done) SetConsoleCtrlHandler(consoleHandler, TRUE); + done = true; } regards, tom lane
Tom Lane wrote: > "Merlin Moncure" <merlin.moncure@rcsonline.com> writes: > > Nailed it. > > > problem is in mainloop.c -> setup_cancel_handler. Apparently you can > > have multiple handlers and windows keeps track of them all, even if they > > do the same thing. Keeping track of so many system handles would > > naturally slow the whole process down. > > Yipes. So we really want to do that only once. > > AFAICS it is appropriate to move the sigsetjmp and setup_cancel_handler > calls in front of the per-line loop inside MainLoop --- can anyone see > a reason not to? Nope. > I'm inclined to treat this as an outright bug, not just a minor > performance issue, because it implies that a sufficiently long psql > script would probably crash a Windows machine. Agreed. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001 + If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania 19073
David Fetter <david@fetter.org> writes: > On Fri, Nov 04, 2005 at 01:01:20PM -0500, Tom Lane wrote: >> I'm inclined to treat this as an outright bug, not just a minor >> performance issue, because it implies that a sufficiently long psql >> script would probably crash a Windows machine. > Ouch. In light of this, are we *sure* what we've got a is a candidate > for release? Sure. This problem exists in 8.0.* too. Pre-existing bugs don't disqualify an RC in my mind --- we fix them and move on, same as we would do at any other time. regards, tom lane
> > I'm inclined to treat this as an outright bug, not just a minor > certainly... > > > performance issue, because it implies that a sufficiently long psql > > script would probably crash a Windows machine. > > actually, it's worse than that, it's more of a dos on the > whole system, as windows will eventually stop granting > handles, but there is a good chance of side effects on other > applications. Does it actually use up *handles* there? I don't see anything in the docs that says it should do that - and they usually do document when handles are used. You should be seeing a *huge* increase in system handles very fast if it does, right? That said, I definitly agree with calling it a bug :-) //Magnus
""Merlin Moncure"" <merlin.moncure@rcsonline.com> wrote > ok, here is gprof output from newlines/no newlines > [newlines] > % cumulative self self total > time seconds seconds calls s/call s/call name > 19.03 0.67 0.67 1 0.67 3.20 MainLoop > 17.61 1.29 0.62 500031 0.00 0.00 yylex > 15.63 1.84 0.55 1500094 0.00 0.00 GetVariable > 11.08 2.23 0.39 250018 0.00 0.00 SendQuery > 4.26 2.38 0.15 750051 0.00 0.00 GetVariableBool > 3.41 2.50 0.12 250024 0.00 0.00 SetVariable > 2.56 2.59 0.09 250015 0.00 0.00 gets_fromFile > 2.27 2.67 0.08 750044 0.00 0.00 > yy_switch_to_buffer > 2.27 2.75 0.08 500031 0.00 0.00 psql_scan > 2.27 2.83 0.08 pg_strcasecmp > 1.70 2.89 0.06 4250078 0.00 0.00 emit > 1.70 2.95 0.06 500031 0.00 0.00 VariableEquals > 1.70 3.01 0.06 250018 0.00 0.00 AcceptResult > 1.42 3.06 0.05 250018 0.00 0.00 ResetCancelConn > Maybe I missed some threads .... do you think it is interesting to test the *absoulte* time difference of the same machine on Windows/Linux by using timeit.c? I wonder if windows is slower than Linux ... Regards, Qingqing
""Magnus Hagander"" <mha@sollentuna.net> wrote >> >> I'd like to use the win32 provided recv(), send() functions >> instead of redirect them to pgwin32_recv()/pgwin32_send(), >> just like libpq does. If we do this, we will lose some >> functionalities, but I'd like to see the performance >> difference first. -- do you think that will be any difference? > > I doubt there will be much performance difference, as you hav eto hit > the kernel anyway (in the recv/send call). But that part is just a guess > :-) > On a separate line -- I verified Magnus's doubt -- revert pgwin32_recv() to recv() does not improve performance visiblly. Regards, Qingqing