Thread: recovery test failure on morepork with timestamp mystery
Andres drew my attention to this [1] build farm failure.

Looks like a test I wrote resetting pg_stat_replication_slots is
failing:

# Failed test 'Check that reset timestamp is later after resetting
stats for slot 'test_slot' again.'
# at t/006_logical_decoding.pl line 261.
# got: 'f'
# expected: 't'
# Looks like you failed 1 test of 20.
[19:59:58] t/006_logical_decoding.pl ............

This is the test code itself:

is( $node_primary->safe_psql(
        'postgres',
        qq(SELECT stats_reset > '$reset1'::timestamptz FROM pg_stat_replication_slots WHERE slot_name = '$stats_test_slot1')
    ),
    qq(t),
    qq(Check that reset timestamp is later after resetting stats for slot '$stats_test_slot1' again.)
);

This is the relevant SQL statement:

SELECT stats_reset > '$reset1'::timestamptz FROM pg_stat_replication_slots WHERE slot_name = '$stats_test_slot1'

When this statement is executed, reset1 is as shown:

2022-05-12 19:59:58.342 CEST [88829:3] 006_logical_decoding.pl LOG: statement: SELECT stats_reset > '2022-05-12 19:59:58.402808+02'::timestamptz FROM pg_stat_replication_slots WHERE slot_name = 'test_slot'

Note the timestamp of this execution. The stats reset occurred in the
past, and as such *must* have come before '2022-05-12
19:59:58.402808+02'::timestamptz.

The starred line is where `reset1` is fetched:

2022-05-12 19:59:58.305 CEST [86784:2] [unknown] LOG: connection authorized: user=pgbf database=postgres application_name=006_logical_decoding.pl
* 2022-05-12 19:59:58.306 CEST [86784:3] 006_logical_decoding.pl LOG: statement: SELECT stats_reset FROM pg_stat_replication_slots WHERE slot_name = 'test_slot'
2022-05-12 19:59:58.308 CEST [86784:4] 006_logical_decoding.pl LOG: disconnection: session time: 0:00:00.003 user=pgbf database=postgres host=[local]
2022-05-12 19:59:58.315 CEST [18214:1] [unknown] LOG: connection received: host=[local]
2022-05-12 19:59:58.316 CEST [18214:2] [unknown] LOG: connection authorized: user=pgbf database=postgres application_name=006_logical_decoding.pl
2022-05-12 19:59:58.317 CEST [18214:3] 006_logical_decoding.pl LOG: statement: SELECT pg_stat_reset_replication_slot(NULL)
2022-05-12 19:59:58.322 CEST [18214:4] 006_logical_decoding.pl LOG: disconnection: session time: 0:00:00.007 user=pgbf database=postgres host=[local]
2022-05-12 19:59:58.329 CEST [45967:1] [unknown] LOG: connection received: host=[local]
2022-05-12 19:59:58.330 CEST [45967:2] [unknown] LOG: connection authorized: user=pgbf database=postgres application_name=006_logical_decoding.pl
2022-05-12 19:59:58.331 CEST [45967:3] 006_logical_decoding.pl LOG: statement: SELECT stats_reset IS NOT NULL FROM pg_stat_replication_slots WHERE slot_name = 'logical_slot'
2022-05-12 19:59:58.333 CEST [45967:4] 006_logical_decoding.pl LOG: disconnection: session time: 0:00:00.003 user=pgbf database=postgres host=[local]
2022-05-12 19:59:58.341 CEST [88829:1] [unknown] LOG: connection received: host=[local]
2022-05-12 19:59:58.341 CEST [88829:2] [unknown] LOG: connection authorized: user=pgbf database=postgres application_name=006_logical_decoding.pl
2022-05-12 19:59:58.342 CEST [88829:3] 006_logical_decoding.pl LOG: statement: SELECT stats_reset > '2022-05-12 19:59:58.402808+02'::timestamptz FROM pg_stat_replication_slots WHERE slot_name = 'test_slot'
2022-05-12 19:59:58.344 CEST [88829:4] 006_logical_decoding.pl LOG: disconnection: session time: 0:00:00.003 user=pgbf database=postgres host=[local]
2022-05-12 19:59:58.350 CEST [50055:4] LOG: received fast shutdown request
2022-05-12 19:59:58.350 CEST [50055:5] LOG: aborting any active transactions
2022-05-12 19:59:58.352 CEST [50055:6] LOG: background worker "logical replication launcher" (PID 89924) exited with exit code 1
2022-05-12 19:59:58.352 CEST [56213:1] LOG: shutting down
2022-05-12 19:59:58.352 CEST [56213:2] LOG: checkpoint starting: shutdown immediate
2022-05-12 19:59:58.353 CEST [56213:3] LOG: checkpoint complete: wrote 4 buffers (3.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.001 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB
2022-05-12 19:59:58.355 CEST [50055:7] LOG: database system is shut down

stats_reset was set in the past, so `reset1` shouldn't be after
'2022-05-12 19:59:58.306 CEST'.

It looks like the timestamp appearing in the test query would correspond
to a time after the database is shut down.

- melanie

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2022-05-12%2017%3A50%3A47
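To make that ordering concrete, here is a tiny standalone C sketch (not part of the original report; the variable names are invented) that hard-codes the sub-second offsets from the log above, all within the 19:59:58 second, and prints how far ahead the stored reset1 value is of each logged event:

#include <stdio.h>

int
main(void)
{
	/* offsets in microseconds past 19:59:58, taken from the quoted log */
	long	reset1_value = 402808;	/* value returned for stats_reset (reset1) */
	long	reset1_fetch = 306000;	/* log time of the SELECT that read reset1 */
	long	second_reset = 317000;	/* log time of pg_stat_reset_replication_slot(NULL) */
	long	comparison = 342000;	/* log time of the stats_reset > reset1 query */
	long	shutdown = 355000;	/* log time of "database system is shut down" */

	printf("reset1 ahead of its own fetch:    %ld us\n", reset1_value - reset1_fetch);
	printf("reset1 ahead of the second reset: %ld us\n", reset1_value - second_reset);
	printf("reset1 ahead of the comparison:   %ld us\n", reset1_value - comparison);
	printf("reset1 ahead of shutdown:         %ld us\n", reset1_value - shutdown);
	return 0;
}

Every delta is positive, i.e. the value read for reset1 postdates everything else in the log, which would explain why the stats_reset written by the second reset cannot compare as greater and the test reports 'f'.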
On Fri, May 13, 2022 at 11:43 AM Melanie Plageman
<melanieplageman@gmail.com> wrote:
>
> Andres drew my attention to this [1] build farm failure.
>
> Looks like a test I wrote resetting pg_stat_replication_slots is
> failing:
>
> # Failed test 'Check that reset timestamp is later after resetting
> stats for slot 'test_slot' again.'
> # at t/006_logical_decoding.pl line 261.
> # got: 'f'
> # expected: 't'
> # Looks like you failed 1 test of 20.
> [19:59:58] t/006_logical_decoding.pl ............
>
> This is the test code itself:
>
> is( $node_primary->safe_psql(
>     'postgres',
>     qq(SELECT stats_reset > '$reset1'::timestamptz FROM
> pg_stat_replication_slots WHERE slot_name = '$stats_test_slot1')
>     ),
>     qq(t),
>     qq(Check that reset timestamp is later after resetting stats for
> slot '$stats_test_slot1' again.)
> );
>
> This is the relevant SQL statement:
>
> SELECT stats_reset > '$reset1'::timestamptz FROM
> pg_stat_replication_slots WHERE slot_name = '$stats_test_slot1'
>
> When this statement is executed, reset1 is as shown:
>
> 2022-05-12 19:59:58.342 CEST [88829:3] 006_logical_decoding.pl LOG:
> statement: SELECT stats_reset > '2022-05-12
> 19:59:58.402808+02'::timestamptz FROM pg_stat_replication_slots WHERE
> slot_name = 'test_slot'
>

I don't know if this is related, but I noticed that the log timestamp
(19:59:58.342) is reporting the $reset1 value (19:59:58.402808).

I did not understand how a timestamp saved from the past could be
ahead of the timestamp of the log.

------
Kind Regards,
Peter Smith.
Fujitsu Australia
On Fri, May 13, 2022 at 12:01:09PM +1000, Peter Smith wrote:
> I don't know if this is related, but I noticed that the log timestamp
> (19:59:58.342) is reporting the $reset1 value (19:59:58.402808).
>
> I did not understand how a timestamp saved from the past could be
> ahead of the timestamp of the log.

morepork is not completely in the clear in this area.  See the
following thread:
https://www.postgresql.org/message-id/X+r2VUFkZdKcF29A@paquier.xyz
--
Michael
Hi,

On 2022-05-12 21:42:43 -0400, Melanie Plageman wrote:
> Andres drew my attention to this [1] build farm failure.
>
> Looks like a test I wrote resetting pg_stat_replication_slots is
> failing:
>
> # Failed test 'Check that reset timestamp is later after resetting
> stats for slot 'test_slot' again.'
> # at t/006_logical_decoding.pl line 261.
> # got: 'f'
> # expected: 't'
> # Looks like you failed 1 test of 20.
> [19:59:58] t/006_logical_decoding.pl ............
>
> This is the test code itself:
>
> is( $node_primary->safe_psql(
>     'postgres',
>     qq(SELECT stats_reset > '$reset1'::timestamptz FROM
> pg_stat_replication_slots WHERE slot_name = '$stats_test_slot1')
>     ),
>     qq(t),
>     qq(Check that reset timestamp is later after resetting stats for
> slot '$stats_test_slot1' again.)
> );
>
> This is the relevant SQL statement:
>
> SELECT stats_reset > '$reset1'::timestamptz FROM
> pg_stat_replication_slots WHERE slot_name = '$stats_test_slot1'
>
> When this statement is executed, reset1 is as shown:
>
> 2022-05-12 19:59:58.342 CEST [88829:3] 006_logical_decoding.pl LOG:
> statement: SELECT stats_reset > '2022-05-12
> 19:59:58.402808+02'::timestamptz FROM pg_stat_replication_slots WHERE
> slot_name = 'test_slot'
>
> Note the timestamp of this execution. The stats reset occurred in the
> past, and as such *must* have come before '2022-05-12
> 19:59:58.402808+02'::timestamptz.

The timestamp is computed during:

> 2022-05-12 19:59:58.317 CEST [18214:3] 006_logical_decoding.pl LOG:
> statement: SELECT pg_stat_reset_replication_slot(NULL)

One interesting tidbit is that the log timestamps are computed differently
(with elog.c:get_formatted_log_time()) than the reset timestamp (with
GetCurrentTimestamp()). Both use gettimeofday() internally.

I wonder if there's a chance that somehow openbsd ends up with more usecs than
"fit" in a second in the result of gettimeofday()? The elog.c case would
truncate everything above a second away afaics:

		/* 'paste' milliseconds into place... */
		sprintf(msbuf, ".%03d", (int) (saved_timeval.tv_usec / 1000));
		memcpy(formatted_log_time + 19, msbuf, 4);

whereas GetCurrentTimestamp() would add them to the timestamp:

	result = (TimestampTz) tp.tv_sec -
		((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
	result = (result * USECS_PER_SEC) + tp.tv_usec;

Greetings,

Andres Freund
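To illustrate that hypothesis, here is a minimal standalone C sketch (not PostgreSQL code; the timeval contents are made up and the shift to the PostgreSQL epoch is left out) showing how the two consumers quoted above would disagree if gettimeofday() ever returned a non-normalized result with tv_usec >= 1000000:

#include <stdio.h>
#include <sys/time.h>

int
main(void)
{
	/* assumed, non-normalized result: excess microseconds past the second */
	struct timeval tv = {.tv_sec = 1652378398, .tv_usec = 1402808};
	char		msbuf[8];
	long long	ts;

	/*
	 * elog.c-style: the seconds come from tv_sec alone and the milliseconds
	 * are pasted in afterwards; only the first four bytes (".140") would
	 * survive the 4-byte memcpy.
	 */
	sprintf(msbuf, ".%03d", (int) (tv.tv_usec / 1000));
	printf("log-style:        %ld%.4s\n", (long) tv.tv_sec, msbuf);

	/* GetCurrentTimestamp()-style: the excess carries into the stored value */
	ts = (long long) tv.tv_sec * 1000000 + tv.tv_usec;
	printf("timestamp-style:  %lld.%06lld\n", ts / 1000000, ts % 1000000);
	return 0;
}

With these made-up numbers the log line would show ...398.140 while the stored timestamp becomes ...399.402808, so a stored timestamp could end up as much as a second ahead of what the surrounding log lines suggest.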
Hi,

On 2022-05-12 19:14:13 -0700, Andres Freund wrote:
> On 2022-05-12 21:42:43 -0400, Melanie Plageman wrote:
> > Andres drew my attention to this [1] build farm failure.

I just saw that there's another recent timestamp related failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gombessa&dt=2022-05-13%2002%3A58%3A52

It's pretty odd that we have two timestamp related failures in stats code
that hasn't changed in >30 days, both only on openbsd within the last ~10h.
There's not been a similar isolationtest failure before.

Greetings,

Andres Freund
On 2022-05-13 04:14, Andres Freund wrote:

> One interesting tidbit is that the log timestamps are computed differently
> (with elog.c:get_formatted_log_time()) than the reset timestamp (with
> GetCurrentTimestamp()). Both use gettimeofday() internally.
>
> I wonder if there's a chance that somehow openbsd ends up with more usecs than
> "fit" in a second in the result of gettimeofday()? The elog.c case would
> truncate everything above a second away afaics:
> 		/* 'paste' milliseconds into place... */
> 		sprintf(msbuf, ".%03d", (int) (saved_timeval.tv_usec / 1000));
> 		memcpy(formatted_log_time + 19, msbuf, 4);
>
> whereas GetCurrentTimestamp() would add them to the timestamp:
> 	result = (TimestampTz) tp.tv_sec -
> 		((POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE) * SECS_PER_DAY);
> 	result = (result * USECS_PER_SEC) + tp.tv_usec;

Well, I don't know if you remember but there was a thread a while back and a
test program (monotime.c) to test the clock if it could go backwards and
openbsd showed the following result when running the attached testprogram:

openbsd 5.9:

$ ./monotime
1021006 Starting
1017367 Starting
1003415 Starting
1007598 Starting
1021006 Stopped
1007598 Stopped
1017367 Stopped
1003415 Stopped

openbsd 6.9:

$ ./monotime
410310 Starting
547727 Starting
410310 Back 262032.372314102 => 262032.242045208
410310 Stopped
465180 Starting
255646 Starting
547727 Stopped
465180 Stopped
255646 Stopped

could that have something to do with it?

/Mikael
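For anyone without the attachment: monotime.c itself is not reproduced in the thread, but judging from the output it is essentially a multi-threaded check that CLOCK_MONOTONIC never goes backwards. A rough C sketch along those lines (thread count, iteration count, and the use of a plain thread index instead of getthrid() are assumptions) would be:

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define NTHREADS	4
#define ITERATIONS	20000000	/* arbitrary */

static void *
check_clock(void *arg)
{
	long		id = (long) (intptr_t) arg;
	struct timespec prev, cur;

	printf("%ld Starting\n", id);
	clock_gettime(CLOCK_MONOTONIC, &prev);
	for (long i = 0; i < ITERATIONS; i++)
	{
		clock_gettime(CLOCK_MONOTONIC, &cur);
		/* report whenever a later reading is earlier than the previous one */
		if (cur.tv_sec < prev.tv_sec ||
			(cur.tv_sec == prev.tv_sec && cur.tv_nsec < prev.tv_nsec))
			printf("%ld Back %lld.%09ld => %lld.%09ld\n", id,
				   (long long) prev.tv_sec, prev.tv_nsec,
				   (long long) cur.tv_sec, cur.tv_nsec);
		prev = cur;
	}
	printf("%ld Stopped\n", id);
	return NULL;
}

int
main(void)
{
	pthread_t	threads[NTHREADS];

	for (long i = 0; i < NTHREADS; i++)
		pthread_create(&threads[i], NULL, check_clock, (void *) (intptr_t) i);
	for (int i = 0; i < NTHREADS; i++)
		pthread_join(threads[i], NULL);
	return 0;
}

Compile with -lpthread; a "Back" line, as in the 6.9 output above, means a later reading came out smaller than an earlier one.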
Hi,

On 2022-05-13 09:00:20 +0200, Mikael Kjellström wrote:
> Well, I don't know if you remember but there was a thread a while back and a
> test program (monotime.c) to test the clock if it could go backwards and
> openbsd showed the following result when running the attached testprogram:

Nope, didn't remember...

> $ ./monotime
> 410310 Starting
> 547727 Starting
> 410310 Back 262032.372314102 => 262032.242045208
> 410310 Stopped
> 465180 Starting
> 255646 Starting
> 547727 Stopped
> 465180 Stopped
> 255646 Stopped
>
> could that have something to do with it?

Yes!

> 		printf("%d Back %lld.%09lu => %lld.%09lu\n",
> 		       (int)getthrid(), ts0.tv_sec, ts0.tv_nsec, ts1.tv_sec,
> 		       ts1.tv_nsec);
> 		break;

I wonder whether the %09lu potentially is truncating ts1.tv_nsec.

I can't reproduce the problem trivially in an openbsd VM I had around. But
it's 7.1, so maybe that's the reason?

Greetings,

Andres Freund
Hi,

On 2022-05-13 10:22:32 -0700, Andres Freund wrote:
> On 2022-05-13 09:00:20 +0200, Mikael Kjellström wrote:
> > Well, I don't know if you remember but there was a thread a while back and a
> > test program (monotime.c) to test the clock if it could go backwards and
> > openbsd showed the following result when running the attached testprogram:
>
> Nope, didn't remember...
>
> > $ ./monotime
> > 410310 Starting
> > 547727 Starting
> > 410310 Back 262032.372314102 => 262032.242045208
> > 410310 Stopped
> > 465180 Starting
> > 255646 Starting
> > 547727 Stopped
> > 465180 Stopped
> > 255646 Stopped
> >
> > could that have something to do with it?
>
> Yes!
>
> > 		printf("%d Back %lld.%09lu => %lld.%09lu\n",
> > 		       (int)getthrid(), ts0.tv_sec, ts0.tv_nsec, ts1.tv_sec,
> > 		       ts1.tv_nsec);
> > 		break;
>
> I wonder whether the %09lu potentially is truncating ts1.tv_nsec.
>
> I can't reproduce the problem trivially in an openbsd VM I had around. But
> it's 7.1, so maybe that's the reason?

What does
sysctl kern.timecounter
return? Does the problem continue if you switch kern.timecounter.hardware to
something else?

In https://postgr.es/m/32aaeb66-71b2-4af0-91ef-1a992ac4d58b%40mksoft.nu you
said it was using acpitimer0 and that it's a vmware VM. It might also be a
vmware bug, not an openbsd one...

Greetings,

Andres Freund
On 2022-05-13 22:09, Andres Freund wrote:

> What does
> sysctl kern.timecounter
> return? Does the problem continue if you switch kern.timecounter.hardware to
> something else?

openbsd 5.9:

$ sysctl kern.timecounter
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=0
kern.timecounter.hardware=acpihpet0
kern.timecounter.choice=i8254(0) acpihpet0(1000) acpitimer0(1000) dummy(-1000000)

$ ./monotime
1024736 Starting
1013674 Starting
1028468 Starting
1014641 Starting
1013674 Stopped
1024736 Stopped
1014641 Stopped
1028468 Stopped

no problem

openbsd 6.9:

$ sysctl kern.timecounter
kern.timecounter.tick=1
kern.timecounter.timestepwarnings=0
kern.timecounter.hardware=tsc
kern.timecounter.choice=i8254(0) acpihpet0(1000) tsc(2000) acpitimer0(1000)

Hm, here it's using the tsc timer. So that is a difference from 5.9.

$ ./monotime
133998 Starting
408137 Starting
578042 Starting
326139 Starting
133998 Back 310670.000931851668 => 310670.000801582864
133998 Stopped
326139 Stopped
578042 Stopped
408137 Stopped

It's only in 6.9 that the problem with the timer going backwards shows up.

If I switch the timer to acpihpet0, this is the result:

$ ./monotime
101622 Starting
480782 Starting
219318 Starting
316650 Starting
101622 Stopped
480782 Stopped
219318 Stopped
316650 Stopped

so that seems to solve the problem with the timer going backwards.

> In https://postgr.es/m/32aaeb66-71b2-4af0-91ef-1a992ac4d58b%40mksoft.nu you
> said it was using acpitimer0 and that it's a vmware VM. It might also be a
> vmware bug, not an openbsd one...

Might be a bug in vmware, but it's running the latest patch level for 6.7 and
no other VM has this problem, so it seems to only happen in openbsd 6.9 for
some reason. Maybe that is the only VM that is using tsc as a timer source,
though?

/Mikael