Re: Direct I/O - Mailing list pgsql-hackers

From Thomas Munro
Subject Re: Direct I/O
Msg-id CA+hUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw@mail.gmail.com
In response to Re: Direct I/O (Thomas Munro <thomas.munro@gmail.com>)
On Sun, Apr 9, 2023 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here, btrfs seems to be taking a different path that I can't quite
> make out...  I see no warning/error about a checksum failure like [1],
> and we apparently managed to read something other than a mix of the
> old and new page contents (which, based on your hypothesis, should
> just leave it indeterminate whether the hint bit changes were captured
> or not, and the rest of the page should be stable, right).  It's like
> the page time-travelled or got scrambled in some other way, but it
> didn't tell us?  I'll try to dig further...

I think there are two separate bad phenomena.

1.  A concurrent modification of the user space buffer while a direct
write is in flight breaks the filesystem's checksum, so you can't read
the data back in.  I can reproduce that with a stand-alone program,
attached.  The "verifier" process occasionally reports EIO while
reading, unless you comment out the "scribbler" process's active line.
The system log/dmesg gets some warnings.
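The attached C program isn't reproduced here; as a rough illustration of the same pattern, here is a Python sketch (names and structure are my own, not the attachment's) that forks a "scribbler" which keeps modifying a shared buffer while the parent writes it with O_DIRECT where available.  On a checksumming filesystem like btrfs, reading back such a torn write is where EIO can appear; elsewhere the write just captures an arbitrary mix of old and new bytes.

```python
import mmap
import multiprocessing
import os
import tempfile

BLOCK = 8192  # one PostgreSQL-sized page


def scribbler(buf, stop):
    # Keep flipping the first byte while the writer may be mid-pwrite,
    # mimicking hint-bit updates racing against a direct write.
    while not stop.is_set():
        buf[0] = (buf[0] + 1) & 0xFF


def race_once(path):
    buf = mmap.mmap(-1, BLOCK)  # anonymous mmap: page-aligned, as O_DIRECT requires
    buf[:] = b"x" * BLOCK
    ctx = multiprocessing.get_context("fork")  # fork so the child shares the mmap
    stop = ctx.Event()
    p = ctx.Process(target=scribbler, args=(buf, stop))
    p.start()
    try:
        # O_DIRECT is Linux-specific and unsupported on some filesystems
        # (e.g. tmpfs), so fall back to buffered I/O if the open fails.
        flags = os.O_CREAT | os.O_WRONLY | getattr(os, "O_DIRECT", 0)
        try:
            fd = os.open(path, flags, 0o600)
        except OSError:
            fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o600)
        try:
            for _ in range(1000):
                os.pwrite(fd, buf, 0)  # racing with the scribbler
        finally:
            os.close(fd)
    finally:
        stop.set()
        p.join()
    # On btrfs with data checksums, this read-back can fail with EIO.
    with open(path, "rb") as f:
        return len(f.read())


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tf:
        print(race_once(tf.name))
```

This only sketches the shape of the race; whether you actually see EIO depends on the filesystem and kernel, exactly as described above.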

2.  The crake-style failure doesn't involve any reported checksum
failures or errors, and I'm not sure if another process is even
involved.  I attach a complete syscall trace of a repro session.  (I
tried to get strace to dump 8192-byte strings, but then it doesn't
repro, so we have only the start of the data transferred for each
page.)  Working back from the error message,

ERROR:  invalid page in block 78 of relation base/5/16384,

we have a page at offset 638976, and we can find all system calls that
touched that offset:

[pid 26031] 23:26:48.521123 pwritev(50,
[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=8192}], 1, 638976) = 8192

[pid 26040] 23:26:48.568975 pwrite64(5,
"\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
\0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192

[pid 26040] 23:26:48.593157 pread64(6,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192, 638976) = 8192
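As a quick sanity check on the arithmetic tying the error message to the trace: with PostgreSQL's default 8 kB block size, block 78 starts exactly at the offset seen in all three syscalls.

```python
BLKSZ = 8192            # PostgreSQL's default block size
block = 78              # from "invalid page in block 78"
offset = block * BLKSZ

print(offset)           # 638976, the offset in the pwritev/pwrite64/pread64 calls
print(offset // BLKSZ)  # 78, mapping the traced offset back to the block
```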

In between the write of non-zeros and the read of zeros, nothing
happens that could explain it, as far as I can tell, but perhaps
someone else will see something that I'm missing.  We pretty much just
have the parallel worker scanning the table and writing pages out as
it goes.  This was obtained with:

strace -f --absolute-timestamps=time,us ~/install/bin/postgres -D
pgdata -c io_direct=data -c shared_buffers=256kB -c wal_level=minimal
-c max_wal_senders=0 2>&1 | tee trace.log

The repro is just:

set debug_parallel_query=regress;
drop table if exists t;
create table t as select generate_series(1, 10000);
update t set generate_series = 1;
select count(*) from t;

Occasionally it fails in a different way: after create table t, later
references to t can't find it in the catalogs, but there is no invalid
page error.  Perhaps the freaky zeros are happening one 4k page at a
time, and if you get two in a row the result might look like an empty
catalog page and pass validation.
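That guess is consistent with how PostgreSQL validates pages read from disk: an all-zeros page is accepted as a valid "new" page, so 8 kB of unexpected zeros would pass silently, while a page with any surviving non-zero bytes would be reported as invalid.  A simplified stand-in for that rule (not the real PageIsVerifiedExtended(), which also checks header fields and the checksum):

```python
BLKSZ = 8192


def looks_like_valid_new_page(page: bytes) -> bool:
    # Simplified version of PostgreSQL's zero-page test: a page that is
    # entirely zeros is treated as a valid, never-initialized page.
    return len(page) == BLKSZ and not any(page)


# Two freaky 4k zero chunks in a row form a whole "valid" zero page...
assert looks_like_valid_new_page(b"\x00" * 4096 + b"\x00" * 4096)
# ...but one zeroed 4k chunk next to real data is caught as invalid.
assert not looks_like_valid_new_page(b"\x00" * 4096 + b"\x01" * 4096)
```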

Attachment
