Broken system timekeeping breaks the stats collector - Mailing list pgsql-hackers
From: Tom Lane
Subject: Broken system timekeeping breaks the stats collector
Date:
Msg-id: 13738.1339892776@sss.pgh.pa.us
List: pgsql-hackers
Responses:
  Re: Broken system timekeeping breaks the stats collector
  Re: Broken system timekeeping breaks the stats collector
I've had a vague feeling for a while now that the occasional buildfarm failures we see in the stats regression test (where it appears that the stats collector fails to respond to requests for no good reason) might be related to operating-system timekeeping glitches. Today there is finally indisputable evidence of this happening, in this log file:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=frogmouth&dt=2012-06-16%2014%3A30%3A12

Observe the following log excerpt:

    [4fdcaca3.e74:478] LOG:  statement: CREATE INDEX dupindexcols_i ON dupindexcols (f1, id, f1 text_pattern_ops);
    [4fdcaca3.e74:479] LOG:  statement: ANALYZE dupindexcols;
    [4fdcaca3.e74:480] LOG:  statement: EXPLAIN (COSTS OFF) SELECT count(*) FROM dupindexcols WHERE f1 > 'WA' and id < 1000 and f1 ~<~ 'YX';
    [4fdcaca3.e74:481] LOG:  statement: SELECT count(*) FROM dupindexcols WHERE f1 > 'WA' and id < 1000 and f1 ~<~ 'YX';
    [4fdcaca3.e74:482] LOG:  disconnection: session time: 0:00:00.000 user=pgrunner database=regression host=::1 port=1123
    [4fdcab40.e04:1] LOG:  connection received: host=::1 port=1125
    [4fdcab40.e04:2] LOG:  connection authorized: user=pgrunner database=regression
    [4fdcab40.e04:3] LOG:  statement: CREATE TABLE a (aa TEXT);
    [4fdcab40.444:1] LOG:  connection received: host=::1 port=1126
    [4fdcab40.e88:1] LOG:  connection received: host=::1 port=1127
    [4fdcab40.444:2] LOG:  connection authorized: user=pgrunner database=regression
    [4fdcab40.444:3] LOG:  statement: CREATE USER regtest_unpriv_user;
    [4fdcab40.444:4] LOG:  statement: CREATE SCHEMA temp_func_test;
    [4fdcab40.e04:4] LOG:  statement: CREATE TABLE b (bb TEXT) INHERITS (a);

We can tell from the statements being executed that session 4fdcaca3.e74 is running the create_index regression test, while 4fdcab40.444 is running create_function_3, and the other sessions starting concurrently with it are part of the parallel group that runs after create_index. So they certainly didn't start until 4fdcaca3.e74 finished. Also, create_index is not exactly a zero-time test, so the fact that 4fdcaca3.e74 reported a session runtime of 0:00:00.000 should already set off some alarm bells.

But remember that in the %c log_line_prefix escape, the first part is the process's start-time timestamp in hex. So 4fdcaca3 means Sat Jun 16 2012, 11:56:19 EDT, while 4fdcab40 means Sat Jun 16 2012, 11:50:24 EDT, nearly six minutes *earlier*. The reported zero session time is explainable by the fact that TimestampDifference returns zeros if the given timestamps are out of order. All the other process timestamps in the log are likewise consistent with the theory that the system's clock went backwards six-plus minutes while create_index was running.
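For anyone who wants to double-check that arithmetic, a small standalone C program like the following (illustrative only, not part of any proposed patch) decodes the two hex start times quoted above and prints their difference; the constants are copied from the log excerpt, and the human-readable output matches the EDT values cited above only when run in the US/Eastern zone.

```c
/*
 * Illustrative only: decode the hex session-start timestamps embedded in
 * the %c log_line_prefix values quoted above.  The two constants come
 * straight from the log excerpt; the printed wall-clock times depend on
 * the local timezone (US/Eastern reproduces the EDT values cited above).
 */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static void show(const char *hex)
{
    time_t t = (time_t) strtoul(hex, NULL, 16);     /* seconds since the epoch */
    char   buf[64];

    strftime(buf, sizeof(buf), "%a %b %d %Y, %H:%M:%S %Z", localtime(&t));
    printf("%s = %s\n", hex, buf);
}

int main(void)
{
    show("4fdcaca3");           /* the create_index session */
    show("4fdcab40");           /* the "later" sessions */

    printf("difference: %ld seconds\n",
           (long) (strtoul("4fdcaca3", NULL, 16) -
                   strtoul("4fdcab40", NULL, 16)));
    return 0;
}
```

The printed difference is 355 seconds, i.e. the nearly-six-minute backwards step described above.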
Then, when we finally get to the stats regression test, it fails with symptoms indicating that the stats collector never answered the backend's requests for an updated stats file; and when the postmaster finally shuts down and tells the stats collector to shut down too, the final pgstat_write_statsfile call bleats like so:

    [4fdcac54.a0c:1] LOG:  last_statrequest 2012-06-16 11:55:20.813625-04 is later than collector's time 2012-06-16 11:54:12.392375-04

Once you know that the system clock glitched like that, it's fairly obvious what happened inside the stats collector: it wrote out the stats file (probably in response to an autovacuum request) sometime shortly after 11:55:20, and then the system clock went backwards to around 11:50, and so all subsequent inquiry messages had request timestamps older than last_statwrite, causing the collector to believe it needn't emit fresh stats files.

Aside from breaking the stats regression test, a scenario like this one has bad implications for autovacuum: for something like six minutes, autovacuum would have been operating with stale pgstat data, since the stats file would not get rewritten, and backends would see the file timestamp as being ahead of their own clocks, so they wouldn't think it was stale. (So this scenario doesn't explain the occasional "pgstat wait timeout" failures we see; though perhaps someone can think of a variant that fits that symptom?)

I will refrain from disparaging Windows here, and just note that this would be an unsurprising occurrence on any machine not running NTP: the owner might occasionally fix the clock once it had drifted far enough from reality. So I think it might be advisable to install some defenses against the case rather than assuming it can't happen.

The most direct fix would involve checking GetCurrentTimestamp against last_statwrite in the stats collector's main loop, but on a machine with slow gettimeofday calls that would be pretty bad for performance. What I suggest instead is:

(1) In backend_read_statsfile, make an initial attempt to read the stats file, and then read GetCurrentTimestamp after that. If the local clock reading is less than the stats file's timestamp, we know that some sort of clock skew or glitch has happened, so force an inquiry message to be sent with the local timestamp. But then accept the stats file anyway, since the skew might be small and harmless. The reason for the forced inquiry message is to cause (2) to happen at the collector.

(2) In pgstat_recv_inquiry, if the received inquiry_time is older than last_statwrite, we should suspect a clock glitch (though it might just indicate delayed message receipt). In this case, do a fresh GetCurrentTimestamp call, and if the reading is less than last_statwrite, we know that the collector's time went backwards. To recover, reset these variables as we do at startup:

    last_statrequest = GetCurrentTimestamp();
    last_statwrite = last_statrequest - 1;

to force an immediate write to happen with the "new" local time. (A sketch of this recovery logic follows below.)

(1) is basically free in terms of the amount of work done in non-broken cases, though it will require a few more lines of code. (2) means adding some GetCurrentTimestamp calls that did not occur before, but hopefully these will be infrequent, since in the absence of clock glitches they would only happen when a backend's demand for a new stats file is generated before the collector starts to write a new stats file but not received till afterwards.
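To make the shape of (2) concrete, here is a minimal standalone sketch of that recovery logic. The TimestampTz typedef, the settable fake clock, and the recv_inquiry wrapper are stand-ins invented so the logic can be compiled and exercised outside the server; only the last_statrequest/last_statwrite handling mirrors the proposal above, and the real change would of course live in pgstat_recv_inquiry itself.

```c
/*
 * Standalone sketch of the proposed point-(2) recovery.  The types and
 * the fake clock are stand-ins for illustration; variable names follow
 * the description above.
 */
#include <stdio.h>
#include <stdint.h>

typedef int64_t TimestampTz;                /* stand-in: microseconds */

static TimestampTz fake_clock;              /* stand-in clock we can move around */
static TimestampTz last_statrequest;
static TimestampTz last_statwrite;

static TimestampTz GetCurrentTimestamp(void) { return fake_clock; }

/* Handle one inquiry message carrying the requesting backend's clock reading. */
static void recv_inquiry(TimestampTz inquiry_time)
{
    if (inquiry_time < last_statwrite)
    {
        /*
         * The request looks older than our last write.  That might just be
         * a delayed message, so double-check against our own clock before
         * concluding anything.
         */
        TimestampTz now = GetCurrentTimestamp();

        if (now < last_statwrite)
        {
            /*
             * Our own clock went backwards: reset as at startup, so that an
             * immediate write happens with the "new" local time.
             */
            last_statrequest = now;
            last_statwrite = last_statrequest - 1;
            printf("clock glitch detected; resetting write bookkeeping\n");
        }
    }

    if (inquiry_time > last_statrequest)
        last_statrequest = inquiry_time;
}

int main(void)
{
    fake_clock = 1000;                      /* collector wrote a stats file at t = 1000 */
    last_statwrite = fake_clock;
    last_statrequest = last_statwrite - 1;

    fake_clock = 600;                       /* system clock then jumps backwards */
    recv_inquiry(650);                      /* a backend inquires with its (also old) clock */

    /* The collector's main loop writes whenever a newer request is pending. */
    if (last_statrequest > last_statwrite)
        printf("stats file would be rewritten now\n");
    else
        printf("request ignored as already satisfied\n");
    return 0;
}
```

Without the inner reset, the same sequence ends with the request being treated as already satisfied, which is exactly the stuck state seen in the frogmouth log.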
Comments? Anyone see a flaw in this design? Or want to argue that we shouldn't do anything about such cases?

			regards, tom lane