Re: Non-reproducible AIO failure - Mailing list pgsql-hackers
From | Matthias van de Meent |
---|---|
Subject | Re: Non-reproducible AIO failure |
Date | |
Msg-id | CAEze2WhTcBjgzOPzD0xYdw-F-HjYzGNKEyQB9=FP9XneFZWESg@mail.gmail.com Whole thread Raw |
In response to | Re: Non-reproducible AIO failure (Alexander Lakhin <exclusion@gmail.com>) |
Responses |
Re: Non-reproducible AIO failure
|
List | pgsql-hackers |
On Thu, 5 Jun 2025 at 21:00, Alexander Lakhin <exclusion@gmail.com> wrote: > > Hello Thomas and Andres, > > 04.06.2025 23:32, Thomas Munro wrote: > > On Thu, Jun 5, 2025 at 8:02 AM Andres Freund <andres@anarazel.de> wrote: > >> On 2025-06-03 08:00:01 +0300, Alexander Lakhin wrote: > >>> 2025-06-03 00:19:09.282 EDT [25175:1] LOG: !!!pgaio_io_before_start| ioh: > >>> 0x104c3e1a0, ioh->op: 1, ioh->state: 1, ioh->result: 0, ioh->num_callbacks: > >>> 2, ioh->generation: 21694 > >> But here it somehow turned to 1 (PGAIO_OP_READV), despite there not being any > >> "normal" reason for that. We know that the IO wasn't actually started as > >> otherwise pgaio_io_start_readv() would have logged that fact. > > A cheap/easy thing to try, maybe: give that enumeration more > > distinctive values like 0xaaaa, 0xbbbb, or whatever, to see if we get > > a wild 1 here? I have a very wild guess that's probably wrong in a weird way, but here goes anyway: Did anyone test if interleaving the enum-typed bitfield fields of PgAioHandle with the uint8 fields might solve the issue? I.e. {state, flags, target, num_callbacks, op} instead of {state, target, op, flags, num_callbacks}? I've looked at some decompiled code generated for pgaio_io_before_start and pgaio_io_start_readv. pgaio_io_start_readv looks rather normal, but in pgaio_io_before_start it's clear that pgaio_io_before_start does some weird things as a response to the bitfields of PgAioHandle: (manually annotated) _pgaio_io_before_start: 00000001002eaba8 stp x20, x19, [sp, #-0x20]! /* save registers */ 00000001002eabac stp x29, x30, [sp, #0x10] /* save registers */ 00000001002eabb0 add x29, sp, #0x10 /* x29 := stack+10 */ 00000001002eabb4 ldrb w8, [x0] /* w8 := ioh->state */ 00000001002eabb8 cmp w8, #0x1 /* w8 ?? PGAIO_HS_HANDED_OUT */ 00000001002eabbc b.ne 0x1002eac38 w8 here will contain the value 0x1. 00000001002eabc0 mov x19, x0 /* x19 := ioh */ 00000001002eabc4 adrp x8, 1461 ; 0x10089f000 /* x8 := 0x10089f000 */ 00000001002eabc8 add x8, x8, #0xd40 /* x8 += 0xd40 (0x10089fd40) */ 00000001002eabcc ldr x8, [x8] /* x8 := pgaio_my_backend */ 00000001002eabd0 ldr x8, [x8, #0x20] /* x8 := pgaio_my_backend->handed_out_io */ 00000001002eabd4 cmp x8, x0 /* x8 (pgaio_my_backend->handed_out_io) ?? x19 (ioh) */ 00000001002eabd8 b.ne 0x1002eac50 00000001002eabdc mov x0, x19 /* x0 := ioh */ 00000001002eabe0 bl _pgaio_io_has_target 00000001002eabe4 tbz w0, #0x0, 0x1002eac68 /* _pgaio_io_has_target(ioh)? */ 00000001002eabe8 ldrb w8, [x19, #0x2] /* w8 := ioh->op */ w8 now contains ioh->op, which is presumed to be 0x0. 00000001002eabec ldrh w9, [x19] /* w9 := ioh->[target | state] */ 00000001002eabf0 orr w8, w9, w8, lsl #16 /* w8 := w9 | (w8 << 16) // [ op | [target | state] ]*/ But here more operations are applied, pushing in the preceding bitfields of PgAioHandle, mixing in various values into w8. 00000001002eabf4 cmp w8, #0x10, lsl #12 /* w8 ?? 0x1'00'00 // op'state'target ?? 0x1'00'00 */ And finally the full set of bitfields is compared against a target value, rather than just ioh->op. This means we do 2 memory accesses for this field. 00000001002eabf8 b.hs 0x1002eac80 /* >= ?? ERROR : continue */ And finally we branch into the setup for _ExceptionalCondition. This shows that b->op is not exactly a stand-alone field, and might be influenced by the other sibling bitfield fields. If we changed the order of the fields to not have continuous bitfields, that might be able to convince the compiler that there are no data dependencies between state/target/op, and remove the two-phase load operation. (checks more decompiled code) Oh, the bitfields are not just read with multiple accesses, but also *written* using multiple accesses. See pgaio_io_set_target: _pgaio_io_set_target: 00000001002eafdc stp x29, x30, [sp, #-0x10]! 00000001002eafe0 mov x29, sp 00000001002eafe4 ldrb w8, [x0, #0x2] 00000001002eafe8 ldrh w9, [x0] 00000001002eafec orr w8, w9, w8, lsl #16 00000001002eaff0 and w9, w8, #0xff 00000001002eaff4 cmp w9, #0x1 00000001002eaff8 b.ne 0x1002eb01c 00000001002eaffc tst w8, #0xff00 00000001002eb000 b.ne 0x1002eb034 00000001002eb004 lsr w9, w8, #16 00000001002eb008 bfi w8, w1, #8, #24 00000001002eb00c strb w9, [x0, #0x2] /* ioh->op = w9 */ 00000001002eb010 strh w8, [x0] /* ioh->[state, target] = w8 */ ... and pgaio_io_update_state: _pgaio_io_update_state: 00000001002e7f08 sub sp, sp, #0x70 00000001002e7f0c stp x24, x23, [sp, #0x30] 00000001002e7f10 stp x22, x21, [sp, #0x40] 00000001002e7f14 stp x20, x19, [sp, #0x50] 00000001002e7f18 stp x29, x30, [sp, #0x60] 00000001002e7f1c add x29, sp, #0x60 00000001002e7f20 mov x19, x1 00000001002e7f24 mov x20, x0 00000001002e7f28 mov w0, #0xa 00000001002e7f2c mov x1, #0x0 00000001002e7f30 bl _errstart 00000001002e7f34 cbz w0, 0x1002e8010 [...] /* skip pgaio_debug_io log handling */ 00000001002e8010 dmb ish 00000001002e8014 ldurh w8, [x20, #0x1] 00000001002e8018 bfi w19, w8, #8, #16 00000001002e801c lsr w8, w8, #8 00000001002e8020 strb w8, [x20, #0x2] /* store ioh->op */ 00000001002e8024 strh w19, [x20] /* store ioh->{state, target} */ I'm sending this now without looking much deeper into this due to $err_date_changed and other time-related constraints, but might this be a candidate for further debugging research? Kind regards, Matthias van de Meent Databricks Inc.
pgsql-hackers by date: