Thread: Weird failure in explain.out with OpenBSD
Hi all, Buildfarm member gombessa just had an interesting failure: https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gombessa&dt=2020-12-29%2000%3A16%3A49 Seq Scan on int8_tbl i8 (cost=N.N..N.N rows=N width=N) (actual time=N.N..N.N rows=N loops=N) Planning Time: N.N ms - Execution Time: N.N ms + Execution Time: -N.N ms (3 rows) Not sure what to think about that, as this implies the calculation of a negative execution time. Regards, -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > Buildfarm member gombessa just had an interesting failure: > - Execution Time: N.N ms > + Execution Time: -N.N ms > Not sure what to think about that, as this implies the calculation of > a negative execution time. Yeah, I saw that. I notice that gombessa uses 'HEAD' => [ 'force_parallel_mode = regress' ] so my guess is that something messed up in transmitting or combining a parallel worker's execution time. No idea how to narrow it down more than that. regards, tom lane
I wrote: > ... so my guess is that something messed up in transmitting or combining a > parallel worker's execution time. Hmph, no, a look at explain.c shows that the "Execution Time" is just based on the difference of INSTR_TIME_SET_CURRENT measurements taken within the current process. It's difficult to conclude anything except that the clock went backwards. Which is weird, because according to [1] that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our preferred choice of INSTR_TIME time base; and such clocks are not supposed to go backwards ever. Tis puzzling, but it's hard to avoid the suspicion of a kernel bug. regards, tom lane [1] https://man.openbsd.org/OpenBSD-6.8/clock_gettime
On Tue, Dec 29, 2020 at 04:16:06PM -0500, Tom Lane wrote: > Hmph, no, a look at explain.c shows that the "Execution Time" is just > based on the difference of INSTR_TIME_SET_CURRENT measurements taken > within the current process. It's difficult to conclude anything except > that the clock went backwards. Which is weird, because according to [1] > that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our > preferred choice of INSTR_TIME time base; and such clocks are not > supposed to go backwards ever. I was looking at that, and I agree that this looks like a monotonic clock going backwards. Or could it be possible that it gave 0.0 as result, still a minus sign was appended? That would mean an execution that took less than 1us per the system clock. -- Michael
Attachment
On Wed, Dec 30, 2020 at 8:17 PM Michael Paquier <michael@paquier.xyz> wrote: > On Tue, Dec 29, 2020 at 04:16:06PM -0500, Tom Lane wrote: > > Hmph, no, a look at explain.c shows that the "Execution Time" is just > > based on the difference of INSTR_TIME_SET_CURRENT measurements taken > > within the current process. It's difficult to conclude anything except > > that the clock went backwards. Which is weird, because according to [1] > > that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our > > preferred choice of INSTR_TIME time base; and such clocks are not > > supposed to go backwards ever. > > I was looking at that, and I agree that this looks like a monotonic > clock going backwards. Or could it be possible that it gave 0.0 as > result, still a minus sign was appended? That would mean an execution > that took less than 1us per the system clock. If I'm reading this right, it might be further evidence of CLOCK_MONOTONIC going backwards on OpenBSD, this time by quite a lot (the regular expression doesn't tolerate a leading minus sign, so the test failed): https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32 # Log entry not matching: 1 22 -124769 0 1636578000 152817
On Thu, Nov 11, 2021 at 10:23:06AM +1300, Thomas Munro wrote: > If I'm reading this right, it might be further evidence of > CLOCK_MONOTONIC going backwards on OpenBSD, this time by quite a lot > (the regular expression doesn't tolerate a leading minus sign, so the > test failed): > > https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32 Yeah, I have noticed this failure yesterday. morepork uses 6.9 now, but it has been recently upgraded from 5.4 or a version close to that, right? I am wondering if 7.0 may help regarding this issue. Postgres is not wrong here. -- Michael
Attachment
On Thu, Nov 11, 2021 at 1:15 PM Michael Paquier <michael@paquier.xyz> wrote: > morepork uses 6.9 now, but it has been recently upgraded from 5.4 or > a version close to that, right? I am wondering if 7.0 may help > regarding this issue. Postgres is not wrong here. I dunno. Clocks on virtualised systems and even metal seem to be a minefield of quirks and heuristics. Some discussion, may or may not be relevant: https://marc.info/?l=openbsd-tech&m=161657532610882&w=2 Some starter questions for Mikael would be: could you please check which clock source it's using?, is it under a hypervisor, and if so which?, what is the CPU model?, what are other kernels choosing when running as guests on the same hypervisor (if applicable)? Did anything interesting happen on either the guest or host operating system (if not bare metal) some time around 2021-11-10 21:59 CET? I'm no expert on this stuff but something tells me that a 124ms leap needs a different explanation than the sub-µs difference reported earlier...
Hi, On 2021-11-11 14:15:33 +1300, Thomas Munro wrote: > Some starter questions for Mikael would be: could you please check > which clock source it's using?, is it under a hypervisor, and if so > which?, what is the CPU model?, what are other kernels choosing when > running as guests on the same hypervisor (if applicable)? Did > anything interesting happen on either the guest or host operating > system (if not bare metal) some time around 2021-11-10 21:59 CET? > I'm no expert on this stuff but something tells me that a 124ms leap > needs a different explanation than the sub-µs difference reported > earlier... One quite conceivable way could be a VM migration. Which quite a few hosters can do behind ones back, to reduce downtimes in case of hardware maintenance etc. If openbsd is using the TSC, and something in the stack isn't quite dealing correctly with the necessary tsc offset / speed correction values you'd expect to see something like this. Greetings, Andres Freund
On 2021-11-11 01:15, Michael Paquier wrote: >> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32 > > Yeah, I have noticed this failure yesterday. > > morepork uses 6.9 now, but it has been recently upgraded from 5.4 or > a version close to that, right? I am wondering if 7.0 may help > regarding this issue. Postgres is not wrong here. Just FYI. morepork was replaced with 6.9. It was earlier running 6.5. It wasn't upgraded though it was completely installed from scratched. 6.9 was the latest version available when I installed it a month ago. 7.0 was released Oct 14 2021. /Mikael
On 2021-11-11 02:15, Thomas Munro wrote: > On Thu, Nov 11, 2021 at 1:15 PM Michael Paquier <michael@paquier.xyz> wrote: >> morepork uses 6.9 now, but it has been recently upgraded from 5.4 or >> a version close to that, right? I am wondering if 7.0 may help >> regarding this issue. Postgres is not wrong here. > > I dunno. Clocks on virtualised systems and even metal seem to be a > minefield of quirks and heuristics. Some discussion, may or may not > be relevant: > > https://marc.info/?l=openbsd-tech&m=161657532610882&w=2 Hm, interesting. > Some starter questions for Mikael would be: could you please check > which clock source it's using?, It's disabling TSC: cpu1: disabling user TSC (skew=-4465) so it seems to be using the acpitimer? acpitimer0 at acpi0: 3579545 Hz, 24 bits > is it under a hypervisor, and if so > which?, VMWare ESXI 6.7 latest updates installed. > what is the CPU model?, what are other kernels choosing when > running as guests on the same hypervisor (if applicable)? AMD Ryzen 5 3600X > Did > anything interesting happen on either the guest or host operating > system (if not bare metal) some time around 2021-11-10 21:59 CET? Nope, not that I can see. > I'm no expert on this stuff but something tells me that a 124ms leap > needs a different explanation than the sub-µs difference reported > earlier... It's running ntpd to keep time in sync if that matters? /Mikael
On 2021-11-11 04:35, Andres Freund wrote: > One quite conceivable way could be a VM migration. Which quite a few hosters > can do behind ones back, to reduce downtimes in case of hardware maintenance > etc. If openbsd is using the TSC, and something in the stack isn't quite > dealing correctly with the necessary tsc offset / speed correction values > you'd expect to see something like this. No VM migration as this is a stand alone ESXI host. It's not using TSC as clock source. It's using acpitimer0 at acpi0: 3579545 Hz, 24 bits. /Mikael
On 2021-11-11 02:15, Thomas Munro wrote: > I dunno. Clocks on virtualised systems and even metal seem to be a > minefield of quirks and heuristics. Some discussion, may or may not > be relevant: > > https://marc.info/?l=openbsd-tech&m=161657532610882&w=2 Just for fun I compiled and ran the test program and this was the result: $ ./monotime 589047 Starting 284971 Starting 542819 Starting 315557 Starting 589047 Stopped 542819 Back 2728606.093473837 => 2728605.963205128 542819 Stopped 284971 Stopped 315557 Stopped above was on 6.9 then I tried the same on my older 5.9 animal and this was the result: ./monotime 1003853 Starting 1004437 Starting 1032556 Starting 1001887 Starting 1004437 Stopped 1003853 Stopped 1032556 Stopped 1001887 Stopped it's also running under the same VMWARE 6.7 instance / machine. dmesg on that machine also indicates that it's using: acpitimer0 at acpi0: 3579545 Hz, 24 bits so looks like a kernel bug / regression in recent OpenBSD kernels then? /Mikael