Re: commit so slow program looks frozen - Mailing list pgsql-performance

From Rob Lemley
Subject Re: commit so slow program looks frozen
Date
Msg-id 4547B8EC.2040309@booksys.com
Whole thread Raw
In response to Re: commit so slow program looks frozen  ("Merlin Moncure" <mmoncure@gmail.com>)
List pgsql-performance
Merlin Moncure wrote:
> On 10/28/06, Simon Riggs <simon@2ndquadrant.com> wrote:
>> On Thu, 2006-10-26 at 11:06 -0400, Merlin Moncure wrote:
>> > On 10/26/06, Carlo Stonebanks <stonec.register@sympatico.ca> wrote:
>> > > This is pretty interesting - where can I read more on this?
>> Windows isn't
>> > > actually hanging, one single command line window is - from its
>> behaviour, it
>> > > looks like the TCL postgresql package is waiting for pg_exec to
>> come back
>> > > from the commit (I believe the commit has actually gone through).
>> > >
>> > > It could even be that there's something wrong with the TCL
>> package, but from
>> > > my understanding it is one of the most complete interfaces out
>> there - which
>> > > is weird, because TCL seems to be the most unpopular language in the
>> > > community.
>> >
>> > when it happens, make sure to query pg_locks and see what is going on
>> > there lock issues are not supposed to manifest on a commit, which
>> > releases locks, but you never know.  There have been reports of
>> > insonsistent lock ups on windows (espeically multi-processor) which
>> > you might be experiencing. Make sure you have the very latest version
>> > of pg 8.1.x.  Also consider checking out 8.2 and see if you can
>> > reproduce the behavior there...this will require compiling postgresql.
>>
>> Merlin,
>>
>> Rumour has it you managed to get a BT from Windows. That sounds like it
>> would be very useful here.

Could it be there is a hangup in communication with the backend via the
libpq library?

I have a situation on Windows where psql seems to be hanging randomly
AFTER completing (or almost completing) a vacuum full analyze verbose.

I'm running the same databases on a single postgres instance on a Dell
4gb RAM 2 processor xeon (hyper-threading turned off) running Debian
GNU/Linux.  The windows system is an IBM 24gb RAM, 4 processor xeon
(hyperthreading turned off).  No problems on the Dell, it runs pgbench
faster than the windows IBM system.  The Dell Linux system zips through
vacuumdb --all --analyze --full --verbose with no problems.  The windows
machine is running 6 instances of postgresql because of problems trying
to load all of the databases into one instance on windows.

The last output from psql is:

INFO:  free space map contains 474 pages in 163 relations
DETAIL:  A total of 2864 page slots are in use (including overhead).
2864 page slots are required to track all free space.
Current limits are:  420000 page slots, 25000 relations, using 4154 KB.

(I've currently restarted postgresql with more reasonable fsm_page_slots
and fsm_relations).

It appears that psql is hung in the call to WS2_32!select.
The psql stack trace looks like this:

ntdll!KiFastSystemCallRet
ntdll!NtWaitForSingleObject+0xc
mswsock!SockWaitForSingleObject+0x19d
mswsock!WSPSelect+0x380
WS2_32!select+0xb9
WARNING: Stack unwind information not available. Following frames may be
wrong.
libpq!PQenv2encoding+0x1fb
libpq!PQenv2encoding+0x3a1
libpq!PQenv2encoding+0x408
libpq!PQgetResult+0x58
libpq!PQgetResult+0x188
psql+0x4c0f
psql+0x954d
psql+0x11e7
psql+0x1238
kernel32!IsProcessorFeaturePresent+0x9e

With more detail:

 # ChildEBP RetAddr  Args to Child
00 0022f768 7c822124 71b23a09 000007a8 00000001
ntdll!KiFastSystemCallRet (FPO: [0,0,0])
01 0022f76c 71b23a09 000007a8 00000001 0022f794
ntdll!NtWaitForSingleObject+0xc (FPO: [3,0,0])
02 0022f7a8 71b23a52 000007a8 00000780 00000000
mswsock!SockWaitForSingleObject+0x19d (FPO: [Non-Fpo])
03 0022f898 71c0470c 00000781 0022fc40 0022fb30 mswsock!WSPSelect+0x380
(FPO: [Non-Fpo])
04 0022f8e8 6310830b 00000781 0022fc40 0022fb30 WS2_32!select+0xb9 (FPO:
[Non-Fpo])
WARNING: Stack unwind information not available. Following frames may be
wrong.
05 0022fd68 631084b1 00000000 ffffffff 0000001d libpq!PQenv2encoding+0x1fb
06 0022fd88 63108518 00000001 00000000 00614e70 libpq!PQenv2encoding+0x3a1
07 0022fda8 631060f8 00000001 00000000 00614e70 libpq!PQenv2encoding+0x408
08 0022fdc8 63106228 00614e70 00613a71 00615188 libpq!PQgetResult+0x58
09 0022fde8 00404c0f 00614e70 00613a71 0041ac7a libpq!PQgetResult+0x188
0a 0022fe98 0040954d 00613a71 00423180 00423185 psql+0x4c0f
0b 0022ff78 004011e7 00000006 00613b08 00612aa8 psql+0x954d
0c 0022ffb0 00401238 00000001 00000009 0022fff0 psql+0x11e7
0d 0022ffc0 77e523e5 00000000 00000000 7ffdc000 psql+0x1238
0e 0022fff0 00000000 00401220 00000000 78746341
kernel32!IsProcessorFeaturePresent+0x9e

the pg_locks table:
-[ RECORD 1 ]-+----------------
locktype      | relation
database      | 19553
relation      | 10342
page          |
tuple         |
transactionid |
classid       |
objid         |
objsubid      |
transaction   | 1998424
pid           | 576
mode          | AccessShareLock
granted       | t
-[ RECORD 2 ]-+----------------
locktype      | transactionid
database      |
relation      |
page          |
tuple         |
transactionid | 1998424
classid       |
objid         |
objsubid      |
transaction   | 1998424
pid           | 576
mode          | ExclusiveLock
granted       | t


The call stack on the postgres.exe process id 576:

ntdll!KiFastSystemCallRet
ntdll!NtWaitForMultipleObjects+0xc
WARNING: Stack unwind information not available. Following frames may be
wrong.
kernel32!ResetEvent+0x45
postgres!pgwin32_waitforsinglesocket+0x89
postgres!pgwin32_recv+0x82
postgres!secure_read+0x7b
postgres!TouchSocketFile+0x93
postgres!pq_getbyte+0x22
postgres!PostgresMain+0x1056
postgres!SubPostmasterMain+0x9ca
postgres!main+0x33f
postgres+0x11e7
postgres+0x1238
kernel32!IsProcessorFeaturePresent+0x9e


These are the parameters:

listen_addresses = '*'
port = 5432
max_connections = 300
shared_buffers = 30000
temp_buffers = 5000
work_mem = 4096
max_fsm_pages = 25000
max_fsm_relations = 500
vacuum_cost_delay = 50
wal_buffers = 32
checkpoint_segments = 16
effective_cache_size = 50000
random_page_cost = 3
default_statistics_target = 300
log_destination = 'stderr'
redirect_stderr = on

(Since there's 24gb RAM on this thing, I've apparently gotten the system
cache up to about 6 gb, according sysinternals.com "System Information"
applet.)

"Cache Data Map Hits %" runs at 100%
Physical Disk Queue Lengths are almost non-existent
Processors are not very busy at all

Seems like once something is in memory, we never have to go back to the
disk again except to write. (hope so with that kind of RAM).


>>
>> Carlo,
>>
>> Many things can happen at commit time. Temp tables dropped, TRUNCATEd
>> old relations unlinked, init files removed, deferred foreign key checks
>> (and subsequent cascading), dropped tables flushed. The assumption that
>> COMMIT is a short request may not be correct according to the wide range
>> of tasks that could occur according to standard SQL:2003 behaviour.
>>
>> Some of those effects take longer on larger systems. Any and all of
>> those things have potential secondary effects, all of which can also
>> conflict with other user tasks and especially with a CHECKPOINT. Then
>> there's various forms of contention caused by misconfiguration.
>>
>> I do think we need some better instrumentation for this kind of thing.
>>
>> --
>>   Simon Riggs
>>   EnterpriseDB   http://www.enterprisedb.com
>
> start here:
> http://beta.linuxports.com/pgsql-hackers-win32/2005-08/msg00051.php
>
> merlin
>
> ---------------------------(end of broadcast)---------------------------
> TIP 5: don't forget to increase your free space map settings
>
>


pgsql-performance by date:

Previous
From: "Worky Workerson"
Date:
Subject: Re: Best COPY Performance
Next
From: Teemu Torma
Date:
Subject: Re: Best COPY Performance