BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram - Mailing list pgsql-bugs

From Luke Koops
Subject BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram
Date
Msg-id 200907311459.n6VExuSa036743@wwwmaster.postgresql.org
Whole thread Raw
Responses Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram
Re: BUG #4958: Stats collector hung on WaitForMultipleObjectsEx while attempting to recv a datagram
List pgsql-bugs
The following bug has been logged online:

Bug reference:      4958
Logged by:          Luke Koops
Email address:      luke.koops@entrust.com
PostgreSQL version: 8.3.7
Operating system:   Windows 2003 EE SP2
Description:        Stats collector hung on WaitForMultipleObjectsEx while
attempting to recv a datagram
Details:

While running a load test with 80 processes connected to postgres, of which
62 were generating load, the stats collector got blocked on
WaitForMultipleObjectsEx which was called with INFINITE timeout.  The stack
trace for the hung thread follows:

-- postgres.exe!mainCRTStartup --
ntoskrnl.exe!KiSwapContext+0x26
ntoskrnl.exe!KiSwapThread+0x2e5
ntoskrnl.exe!KeWaitForSingleObject+0x346
ntoskrnl.exe!KiSuspendThread+0x18
ntoskrnl.exe!KiDeliverApc+0x117
ntoskrnl.exe!KiSwapThread+0x300
ntoskrnl.exe!KeWaitForMultipleObjects+0x3d7
ntoskrnl.exe!ObpWaitForMultipleObjects+0x202
ntoskrnl.exe!NtWaitForMultipleObjects+0xe9
ntoskrnl.exe!KiFastCallEntry+0xfc
ntdll.dll!KiFastSystemCallRet
ntdll.dll!NtWaitForMultipleObjects+0xc
kernel32.dll!WaitForMultipleObjectsEx+0x11a
postgres.exe!pgwin32_waitforsinglesocket+0x1ed
postgres.exe!pgwin32_recv+0x90
postgres.exe!PgstatCollectorMain+0x17f
postgres.exe!SubPostmasterMain+0x33a
postgres.exe!main+0x168
postgres.exe!__tmainCRTStartup+0x10f
kernel32.dll!BaseProcessStart+0x23

Leading up to the hang, the DB was not under high load.  The CPU was running
at 80%, but less than a fifth of that was due to PostgreSQL.  The underlying
disk subsystems were routinely 90% idle.  The txlogs are on a 4 disk RAID-10
array and the data is on a 5 disk RAID-5 array (15K RPM).  This is a
performance test environment for a product that will include PostgreSQL as
an embedded database.

I have the debug symbols for this build, and a process dump of the stats
collector.

I have have not been able to reproduce this problem.  It seems to be
relatively rare, so I'm lucky to have seen it before any of our customers.

Even if the problem cannot be reproduced reliability, I think there is
enough information that would help the community to make corrections in the
code.

After filtering out everything that is expected output, there was nothing
informative in the postgres logs (from 24 hours prior to the hang until 8
hours after).

We are accessing the DB over ODBC.  All of our processes are running on the
same server as PostgreSQL.

The DB is 52GB in size, so I cannot targzupload it.  I estimate that 4-8 GB
of the size is due to the lack of autovacuum after stats collection stopped.
 I doubt that this issue is due to the schema or contents of the DB.

I have verified, using TDIMon, that the back ends are sending the
statistics.  That was a bit redundant since the collector is clearly hung.

Here are the settings in my system that differ from the defaults:
    logging_collector = on
    log_line_prefix = '%m,'
    autovacuum_naptime = 10s
    autovacuum_analyze_scale_factor = 0.05
    autovacuum_vacuum_scale_factor = 0.1
    autovacuum_vacuum_cost_limit = 1000
    log_autovacuum_min_duration = 0

I have customized the vacuum rules for a few "hot" tables that are limited
to 1 or a few rows, but are frequently updated (whenever the system does
anything).  Here is a dump of pg_autovacuum:

 vacrelid | enabled | vac_base_thresh | vac_scale_factor | anl_base_thresh |
anl_scale_factor | vac_
cost_delay | vac_cost_limit | freeze_min_age | freeze_max_age
----------+---------+-----------------+------------------+-----------------+
------------------+-----
-----------+----------------+----------------+----------------
    16507 | t       |               1 |               -1 |               1 |
              -1 |
         0 |         100000 |             -1 |             -1
    16511 | t       |               1 |               -1 |               1 |
              -1 |
         0 |         100000 |             -1 |             -1
    16541 | t       |               1 |               -1 |               1 |
              -1 |
         0 |         100000 |             -1 |             -1
(3 rows)

I created a new cluster on the same system and the stats collector worked
just fine.

pgsql-bugs by date:

Previous
From: Tom Lane
Date:
Subject: Re: BUG #4956: Array Construct array() returning blank result
Next
From: Ryan Wallace
Date:
Subject: Inconsistent query results after upgrading to Postgresql 8.4.0