Thread: Our poll() based WaitLatch implementation is broken
Build Postgres master, on Linux or another platform that will use the poll() implementation rather than the older select(). Send the Postmaster SIGKILL. Observe that the WAL Writer lives on, representing a denial of service as it stays attached to shared memory, busy waiting (evident from the fact that it quickly leaks memory). WAL writer shouldn't do this. It isn't doing anything stupid like relying on the return value of WaitLatch(), which is documented to only reliably indicate certain wake events but not others. The main event loop calls PostmasterIsAlive(), which is supposed to be totally reliable. If I use the select() based latch implementation, it behaves just fine. I have some doubts about the latch usage within WAL Writer as things stands - it needs to be cleaned up a bit (I think it should use the process latch, because I'm paranoid about timeout invalidation issues now and in the future, plus it doesn't record errno in the handlers). These smaller issues are covered in passing in the group commit patch that Simon Riggs and I are currently working on in advance of the final 9.2 commitfest. In case it matters: [peter@peterlaptop postmaster]$ uname -a Linux peterlaptop 3.1.6-1.fc16.x86_64 #1 SMP Wed Dec 21 22:41:17 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux I'd debug this myself, but I'm a little bit preoccupied with group commit right now. The rationale for introducing the poll()-based implementation where available was that it performed better than a select()-based one. I wonder, how compelling a win is that expected to be? -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 15 January 2012 07:26, Peter Geoghegan <peter@2ndquadrant.com> wrote: > Build Postgres master, on Linux or another platform that will use the > poll() implementation rather than the older select(). Send the > Postmaster SIGKILL. Observe that the WAL Writer lives on, representing > a denial of service as it stays attached to shared memory, busy > waiting (evident from the fact that it quickly leaks memory). Correction: The CPU usage % approaches 100. -- Peter Geoghegan http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training and Services
On 15.01.2012 09:26, Peter Geoghegan wrote: > Build Postgres master, on Linux or another platform that will use the > poll() implementation rather than the older select(). Send the > Postmaster SIGKILL. Observe that the WAL Writer lives on, representing > a denial of service as it stays attached to shared memory, busy > waiting (evident from the fact that it quickly leaks memory). The poll()-based implementation checked for POLLIN on the postmaster-alive-pipe, just like we check for the fd to become readable in the select() implementation. But poll() has a separate POLLHUP event code for that; it does not set POLLIN on the fd but POLLHUP. Fixed, to check POLLHUP. I still kept the check POLLIN, the pipe should never become readable so if it does something is badly wrong. I also threw in a check for POLLNVAL, which means "Invalid request: fd not open". That should definitely not happen, but if it does, it seems good to treat it as postmaster death too. Even if the postmaster isn't dead yet, we could no longer detect when it does die. > The rationale for introducing the poll()-based implementation where > available was that it performed better than a select()-based one. I > wonder, how compelling a win is that expected to be? Ganesh Venkitachalam did some micro-benchmarking after the latch patch was committed: http://archives.postgresql.org/pgsql-hackers/2010-09/msg01609.php. I don't think it make any meaningful difference in real applications, but poll() also doesn't have an arbitrary limit on the range of fd numbers that can be used. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com