BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram - Mailing list pgsql-bugs
From | Luke Koops |
---|---|
Subject | BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram |
Date | |
Msg-id | 200907311459.n6VExuSa036743@wwwmaster.postgresql.org |
Responses | Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram |
 | Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram |
List | pgsql-bugs |
The following bug has been logged online:

Bug reference:      4958
Logged by:          Luke Koops
Email address:      luke.koops@entrust.com
PostgreSQL version: 8.3.7
Operating system:   Windows 2003 EE SP2
Description:        Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram
Details:

While running a load test with 80 processes connected to postgres, of which 62 were generating load, the stats collector became blocked in WaitForMultipleObjectsEx, which was called with an INFINITE timeout. The stack trace for the hung thread follows:

-- postgres.exe!mainCRTStartup --
ntoskrnl.exe!KiSwapContext+0x26
ntoskrnl.exe!KiSwapThread+0x2e5
ntoskrnl.exe!KeWaitForSingleObject+0x346
ntoskrnl.exe!KiSuspendThread+0x18
ntoskrnl.exe!KiDeliverApc+0x117
ntoskrnl.exe!KiSwapThread+0x300
ntoskrnl.exe!KeWaitForMultipleObjects+0x3d7
ntoskrnl.exe!ObpWaitForMultipleObjects+0x202
ntoskrnl.exe!NtWaitForMultipleObjects+0xe9
ntoskrnl.exe!KiFastCallEntry+0xfc
ntdll.dll!KiFastSystemCallRet
ntdll.dll!NtWaitForMultipleObjects+0xc
kernel32.dll!WaitForMultipleObjectsEx+0x11a
postgres.exe!pgwin32_waitforsinglesocket+0x1ed
postgres.exe!pgwin32_recv+0x90
postgres.exe!PgstatCollectorMain+0x17f
postgres.exe!SubPostmasterMain+0x33a
postgres.exe!main+0x168
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!BaseProcessStart+0x23

Leading up to the hang, the DB was not under high load. The CPU was running at 80%, but less than a fifth of that was due to PostgreSQL. The underlying disk subsystems were routinely 90% idle. The txlogs are on a 4-disk RAID-10 array and the data is on a 5-disk RAID-5 array (15K RPM). This is a performance test environment for a product that will include PostgreSQL as an embedded database.

I have the debug symbols for this build and a process dump of the stats collector. I have not been able to reproduce this problem. It seems to be relatively rare, so I'm lucky to have seen it before any of our customers did. Even if the problem cannot be reproduced reliably, I think there is enough information here to help the community make corrections in the code.

After filtering out everything that is expected output, there was nothing informative in the postgres logs (from 24 hours prior to the hang until 8 hours after). We are accessing the DB over ODBC. All of our processes are running on the same server as PostgreSQL. The DB is 52 GB in size, so I cannot tar/gzip and upload it. I estimate that 4-8 GB of the size is due to the lack of autovacuum after stats collection stopped. I doubt that this issue is due to the schema or contents of the DB.

I have verified, using TDIMon, that the backends are sending the statistics. That was a bit redundant, since the collector is clearly hung.

Here are the settings in my system that differ from the defaults:

logging_collector = on
log_line_prefix = '%m,'
autovacuum_naptime = 10s
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_scale_factor = 0.1
autovacuum_vacuum_cost_limit = 1000
log_autovacuum_min_duration = 0

I have customized the vacuum rules for a few "hot" tables that are limited to one or a few rows but are frequently updated (whenever the system does anything).
Here is a dump of pg_autovacuum:

 vacrelid | enabled | vac_base_thresh | vac_scale_factor | anl_base_thresh | anl_scale_factor | vac_cost_delay | vac_cost_limit | freeze_min_age | freeze_max_age
----------+---------+-----------------+------------------+-----------------+------------------+----------------+----------------+----------------+----------------
    16507 | t       |               1 |               -1 |               1 |               -1 |              0 |         100000 |             -1 |             -1
    16511 | t       |               1 |               -1 |               1 |               -1 |              0 |         100000 |             -1 |             -1
    16541 | t       |               1 |               -1 |               1 |               -1 |              0 |         100000 |             -1 |             -1
(3 rows)

I created a new cluster on the same system, and the stats collector there worked just fine.
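For context on where the hang sits: the user-mode frames in the trace above show pgwin32_recv waiting for socket readability via pgwin32_waitforsinglesocket, which on Windows is built around waiting on an event object. Below is a minimal, hypothetical C sketch of that general WSAEventSelect + WaitForMultipleObjectsEx pattern, not the actual PostgreSQL source; the function name wait_for_socket_readable is invented for illustration. It only shows how a wait with an INFINITE timeout blocks forever if the readiness event is never signalled, which matches the symptom in the stack trace.

#include <winsock2.h>
#include <windows.h>

/*
 * Hypothetical illustration only -- not the pgwin32_waitforsinglesocket
 * source.  Associate the socket with an event object, then block until the
 * socket signals readability.  With dwMilliseconds = INFINITE (as in the
 * reported stack trace), a readiness notification that is never signalled
 * means the wait never returns.
 */
static int
wait_for_socket_readable(SOCKET s, DWORD timeout_ms)
{
    WSAEVENT ev = WSACreateEvent();
    HANDLE   handles[1];
    DWORD    rc;
    int      result = -1;

    if (ev == WSA_INVALID_EVENT)
        return -1;

    /* Ask Winsock to signal the event when data arrives or the peer closes. */
    if (WSAEventSelect(s, ev, FD_READ | FD_CLOSE) == SOCKET_ERROR)
    {
        WSACloseEvent(ev);
        return -1;
    }

    handles[0] = ev;

    /*
     * timeout_ms == INFINITE reproduces the hang scenario: if the event is
     * never signalled, this call blocks forever.  A finite timeout would let
     * the caller wake up and re-check the socket.
     */
    rc = WaitForMultipleObjectsEx(1, handles, FALSE, timeout_ms, FALSE);

    if (rc == WAIT_OBJECT_0)
        result = 1;             /* socket should be readable */
    else if (rc == WAIT_TIMEOUT)
        result = 0;             /* caller may retry */

    /* Cancel the event association and clean up. */
    WSAEventSelect(s, NULL, 0);
    WSACloseEvent(ev);

    return result;
}

Whatever the real cause turns out to be, everything in this pattern hinges on the FD_READ event getting signalled; if that notification is lost or consumed elsewhere, the INFINITE wait in the final kernel32 frame of the trace never completes.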