Hello Tomas,
01.09.2023 16:00, Tomas Vondra wrote:
> Hmmm, I'm not very good at reading the binary code, but here's what
> objdump produced for WaitEventSetWait. Maybe someone will see what the
> issue is.
At first glance, I can't see anything suspicious in the disassembly.
IIUC, waiting = true is represented there as:
805c38: b902ad18 str w24, [x8, #684] // pgstat_report_wait_start(): proc->wait_event_info = wait_event_info;
// end of pgstat_report_wait_start(wait_event_info);
805c3c: b0ffdb09 adrp x9, 0x366000 <dsm_segment_address+0x24>
805c40: b0ffdb0a adrp x10, 0x366000 <dsm_segment_address+0x28>
805c44: f0000eeb adrp x11, 0x9e4000 <PMSignalShmemInit+0x4>
805c48: 52800028 mov w8, #1 // true
805c4c: 52800319 mov w25, #24
805c50: 5280073a mov w26, #57
805c54: fd446128 ldr d8, [x9, #2240]
805c58: 90000d7b adrp x27, 0x9b1000 <ModifyWaitEvent+0xb0>
805c5c: fd415949 ldr d9, [x10, #688]
805c60: f9071d68 str x8, [x11, #3640] // waiting = true (x8 = w8)
So there are two simple mov's and two load operations performed in parallel,
but I don't think it's similar to what we had in that case.
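To illustrate the kind of pattern I was looking for: a store to one location followed by a load from another, which a weakly ordered CPU may reorder unless a full barrier sits between them. Below is a minimal sketch using C11 atomics. The names waiting_flag, latch_is_set, waiter_should_sleep() and setter_must_wake() are hypothetical stand-ins for illustration only; this is not the actual latch.c code and not a proposed patch.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins, not the PostgreSQL variables. */
static atomic_bool waiting_flag;    /* waiter: "I am about to sleep" */
static atomic_bool latch_is_set;    /* setter: "there is work for you" */

/*
 * Waiter side: announce the intent to sleep, then re-check the latch.
 * Without the fence, the load could be satisfied before the store
 * becomes visible to the other side.
 */
static bool
waiter_should_sleep(void)
{
    atomic_store_explicit(&waiting_flag, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    return !atomic_load_explicit(&latch_is_set, memory_order_relaxed);
}

/*
 * Setter side: set the latch, then check whether the waiter needs a
 * wakeup. The same store-then-load ordering is required here.
 */
static bool
setter_must_wake(void)
{
    atomic_store_explicit(&latch_is_set, true, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);  /* full barrier */
    return atomic_load_explicit(&waiting_flag, memory_order_relaxed);
}
```

With both fences in place, the two sides cannot both miss each other's store: either the waiter sees the latch set and skips sleeping, or the setter sees the waiting flag and sends a wakeup. If either fence were missing, a lost wakeup would be possible in principle.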
> I thought about maybe just adding the barrier in the code, but then how
> would we know it's the issue and this fixed it? It happens so rarely we
> can't make any conclusions from a couple runs of tests.
I could probably construct a reproducer for the lockup if I had access to
such a machine for a day or two.
Best regards,
Alexander