Thread: Weird failure in explain.out with OpenBSD

Weird failure in explain.out with OpenBSD

From
Michael Paquier
Date:
Hi all,

Buildfarm member gombessa just had an interesting failure:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gombessa&dt=2020-12-29%2000%3A16%3A49
  Seq Scan on int8_tbl i8  (cost=N.N..N.N rows=N width=N) (actual
  time=N.N..N.N rows=N loops=N)
  Planning Time: N.N ms
- Execution Time: N.N ms
+ Execution Time: -N.N ms
 (3 rows)

Not sure what to think about that, as this implies the calculation of
a negative execution time.
Regards,
--
Michael

Attachment

Re: Weird failure in explain.out with OpenBSD

From
Tom Lane
Date:
Michael Paquier <michael@paquier.xyz> writes:
> Buildfarm member gombessa just had an interesting failure:
> - Execution Time: N.N ms
> + Execution Time: -N.N ms
> Not sure what to think about that, as this implies the calculation of
> a negative execution time.

Yeah, I saw that.  I notice that gombessa uses

        'HEAD' => [
                    'force_parallel_mode = regress'
                  ]

so my guess is that something messed up in transmitting or combining a
parallel worker's execution time.  No idea how to narrow it down more
than that.

            regards, tom lane



Re: Weird failure in explain.out with OpenBSD

From
Tom Lane
Date:
I wrote:
> ... so my guess is that something messed up in transmitting or combining a
> parallel worker's execution time.

Hmph, no, a look at explain.c shows that the "Execution Time" is just
based on the difference of INSTR_TIME_SET_CURRENT measurements taken
within the current process.  It's difficult to conclude anything except
that the clock went backwards.  Which is weird, because according to [1]
that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our
preferred choice of INSTR_TIME time base; and such clocks are not
supposed to go backwards ever.

Tis puzzling, but it's hard to avoid the suspicion of a kernel bug.

            regards, tom lane

[1] https://man.openbsd.org/OpenBSD-6.8/clock_gettime



Re: Weird failure in explain.out with OpenBSD

From
Michael Paquier
Date:
On Tue, Dec 29, 2020 at 04:16:06PM -0500, Tom Lane wrote:
> Hmph, no, a look at explain.c shows that the "Execution Time" is just
> based on the difference of INSTR_TIME_SET_CURRENT measurements taken
> within the current process.  It's difficult to conclude anything except
> that the clock went backwards.  Which is weird, because according to [1]
> that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our
> preferred choice of INSTR_TIME time base; and such clocks are not
> supposed to go backwards ever.

I was looking at that, and I agree that this looks like a monotonic
clock going backwards.  Or could it be possible that it gave 0.0 as
result, still a minus sign was appended?  That would mean an execution
that took less than 1us per the system clock.
--
Michael

Attachment

Re: Weird failure in explain.out with OpenBSD

From
Thomas Munro
Date:
On Wed, Dec 30, 2020 at 8:17 PM Michael Paquier <michael@paquier.xyz> wrote:
> On Tue, Dec 29, 2020 at 04:16:06PM -0500, Tom Lane wrote:
> > Hmph, no, a look at explain.c shows that the "Execution Time" is just
> > based on the difference of INSTR_TIME_SET_CURRENT measurements taken
> > within the current process.  It's difficult to conclude anything except
> > that the clock went backwards.  Which is weird, because according to [1]
> > that system does have clock_gettime(CLOCK_MONOTONIC), which'd be our
> > preferred choice of INSTR_TIME time base; and such clocks are not
> > supposed to go backwards ever.
>
> I was looking at that, and I agree that this looks like a monotonic
> clock going backwards.  Or could it be possible that it gave 0.0 as
> result, still a minus sign was appended?  That would mean an execution
> that took less than 1us per the system clock.

If I'm reading this right, it might be further evidence of
CLOCK_MONOTONIC going backwards on OpenBSD, this time by quite a lot
(the regular expression doesn't tolerate a leading minus sign, so the
test failed):

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32

# Log entry not matching: 1 22 -124769 0 1636578000 152817



Re: Weird failure in explain.out with OpenBSD

From
Michael Paquier
Date:
On Thu, Nov 11, 2021 at 10:23:06AM +1300, Thomas Munro wrote:
> If I'm reading this right, it might be further evidence of
> CLOCK_MONOTONIC going backwards on OpenBSD, this time by quite a lot
> (the regular expression doesn't tolerate a leading minus sign, so the
> test failed):
>
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32

Yeah, I have noticed this failure yesterday.

morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
a version close to that, right?  I am wondering if 7.0 may help
regarding this issue.  Postgres is not wrong here.
--
Michael

Attachment

Re: Weird failure in explain.out with OpenBSD

From
Thomas Munro
Date:
On Thu, Nov 11, 2021 at 1:15 PM Michael Paquier <michael@paquier.xyz> wrote:
> morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
> a version close to that, right?  I am wondering if 7.0 may help
> regarding this issue.  Postgres is not wrong here.

I dunno.  Clocks on virtualised systems and even metal seem to be a
minefield of quirks and heuristics.  Some discussion, may or may not
be relevant:

https://marc.info/?l=openbsd-tech&m=161657532610882&w=2

Some starter questions for Mikael would be: could you please check
which clock source it's using?, is it under a hypervisor, and if so
which?, what is the CPU model?, what are other kernels choosing when
running as guests on the same hypervisor (if applicable)?  Did
anything interesting happen on either the guest or host operating
system (if not bare metal) some time around 2021-11-10 21:59 CET?

I'm no expert on this stuff but something tells me that a 124ms leap
needs a different explanation than the sub-µs difference reported
earlier...



Re: Weird failure in explain.out with OpenBSD

From
Andres Freund
Date:
Hi,

On 2021-11-11 14:15:33 +1300, Thomas Munro wrote:
> Some starter questions for Mikael would be: could you please check
> which clock source it's using?, is it under a hypervisor, and if so
> which?, what is the CPU model?, what are other kernels choosing when
> running as guests on the same hypervisor (if applicable)?  Did
> anything interesting happen on either the guest or host operating
> system (if not bare metal) some time around 2021-11-10 21:59 CET?

> I'm no expert on this stuff but something tells me that a 124ms leap
> needs a different explanation than the sub-µs difference reported
> earlier...

One quite conceivable way could be a VM migration. Which quite a few hosters
can do behind ones back, to reduce downtimes in case of hardware maintenance
etc. If openbsd is using the TSC, and something in the stack isn't quite
dealing correctly with the necessary tsc offset / speed correction values
you'd expect to see something like this.

Greetings,

Andres Freund



Re: Weird failure in explain.out with OpenBSD

From
Mikael Kjellström
Date:

On 2021-11-11 01:15, Michael Paquier wrote:

>> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=morepork&dt=2021-11-10%2020%3A51%3A32
> 
> Yeah, I have noticed this failure yesterday.
> 
> morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
> a version close to that, right?  I am wondering if 7.0 may help
> regarding this issue.  Postgres is not wrong here.


Just FYI.  morepork was replaced with 6.9.  It was earlier running 6.5. 
  It wasn't upgraded though it was completely installed from scratched.

6.9 was the latest version available when I installed it a month ago. 
7.0 was released Oct 14 2021.

/Mikael





Re: Weird failure in explain.out with OpenBSD

From
Mikael Kjellström
Date:
On 2021-11-11 02:15, Thomas Munro wrote:
> On Thu, Nov 11, 2021 at 1:15 PM Michael Paquier <michael@paquier.xyz> wrote:
>> morepork uses 6.9 now, but it has been recently upgraded from 5.4 or
>> a version close to that, right?  I am wondering if 7.0 may help
>> regarding this issue.  Postgres is not wrong here.
> 
> I dunno.  Clocks on virtualised systems and even metal seem to be a
> minefield of quirks and heuristics.  Some discussion, may or may not
> be relevant:
> 
> https://marc.info/?l=openbsd-tech&m=161657532610882&w=2

Hm, interesting.


> Some starter questions for Mikael would be: could you please check
> which clock source it's using?,

It's disabling TSC:

cpu1: disabling user TSC (skew=-4465)

so it seems to be using the acpitimer?

acpitimer0 at acpi0: 3579545 Hz, 24 bits



> is it under a hypervisor, and if so
> which?,

VMWare ESXI 6.7 latest updates installed.

> what is the CPU model?, what are other kernels choosing when
> running as guests on the same hypervisor (if applicable)? 

AMD Ryzen 5 3600X

> Did
> anything interesting happen on either the guest or host operating
> system (if not bare metal) some time around 2021-11-10 21:59 CET?

Nope, not that I can see.


> I'm no expert on this stuff but something tells me that a 124ms leap
> needs a different explanation than the sub-µs difference reported
> earlier...


It's running ntpd to keep time in sync if that matters?

/Mikael




Re: Weird failure in explain.out with OpenBSD

From
Mikael Kjellström
Date:

On 2021-11-11 04:35, Andres Freund wrote:

> One quite conceivable way could be a VM migration. Which quite a few hosters
> can do behind ones back, to reduce downtimes in case of hardware maintenance
> etc. If openbsd is using the TSC, and something in the stack isn't quite
> dealing correctly with the necessary tsc offset / speed correction values
> you'd expect to see something like this.

No VM migration as this is a stand alone ESXI host.

It's not using TSC as clock source.  It's using acpitimer0 at acpi0: 
3579545 Hz, 24 bits.

/Mikael





Re: Weird failure in explain.out with OpenBSD

From
Mikael Kjellström
Date:
On 2021-11-11 02:15, Thomas Munro wrote:

> I dunno.  Clocks on virtualised systems and even metal seem to be a
> minefield of quirks and heuristics.  Some discussion, may or may not
> be relevant:
> 
> https://marc.info/?l=openbsd-tech&m=161657532610882&w=2

Just for fun I compiled and ran the test program and this was the result:

$ ./monotime
589047 Starting
284971 Starting
542819 Starting
315557 Starting
589047 Stopped
542819 Back 2728606.093473837 => 2728605.963205128
542819 Stopped
284971 Stopped
315557 Stopped

above was on 6.9

then I tried the same on my older 5.9 animal and this was the result:

./monotime
1003853 Starting
1004437 Starting
1032556 Starting
1001887 Starting
1004437 Stopped
1003853 Stopped
1032556 Stopped
1001887 Stopped

it's also running under the same VMWARE 6.7 instance / machine.

dmesg on that machine also indicates that it's using:

acpitimer0 at acpi0: 3579545 Hz, 24 bits

so looks like a kernel bug / regression in recent OpenBSD kernels then?

/Mikael