Re: Direct I/O - Mailing list pgsql-hackers
From: Thomas Munro
Subject: Re: Direct I/O
Date:
Msg-id: CA+hUKGKSBaz78Fw3WTF3Q8ArqKCz1GgsTfRFiDPbu-j9OFz-jw@mail.gmail.com
In response to: Re: Direct I/O (Thomas Munro <thomas.munro@gmail.com>)
Responses: Re: Direct I/O, Re: Direct I/O
List: pgsql-hackers
On Sun, Apr 9, 2023 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here, btrfs seems to be taking a different path that I can't quite
> make out... I see no warning/error about a checksum failure like [1],
> and we apparently managed to read something other than a mix of the
> old and new page contents (which, based on your hypothesis, should
> just leave it indeterminate whether the hint bit changes were captured
> or not, and the rest of the page should be stable, right). It's like
> the page time-travelled or got scrambled in some other way, but it
> didn't tell us? I'll try to dig further...

I think there are two separate bad phenomena.

1. A concurrent modification of the user space buffer while writing breaks the checksum, so you can't read the data back in. I can reproduce that with a stand-alone program, attached (a minimal sketch of the idea also appears at the end of this message). The "verifier" process occasionally reports EIO while reading, unless you comment out the "scribbler" process's active line. The system log/dmesg gets some warnings.

2. The crake-style failure doesn't involve any reported checksum failures or errors, and I'm not sure whether another process is even involved. I attach a complete syscall trace of a repro session. (I tried to get strace to dump 8192-byte strings, but then it doesn't repro, so we have only the start of the data transferred for each page.)

Working back from the error message, ERROR: invalid page in block 78 of relation base/5/16384, we have a page at offset 638976 (= 78 × 8192), and we can find all the system calls that touched that offset:

[pid 26031] 23:26:48.521123 pwritev(50, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8192}], 1, 638976) = 8192
[pid 26040] 23:26:48.568975 pwrite64(5, "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4 \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192
[pid 26040] 23:26:48.593157 pread64(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 638976) = 8192

In between the write of non-zeros and the read of zeros, nothing happens that could justify the zeros, as far as I can grok, but perhaps someone else will see something I'm missing. We pretty much just have the parallel worker scanning the table and writing pages out as it goes.

This was obtained with:

strace -f --absolute-timestamps=time,us ~/install/bin/postgres -D pgdata -c io_direct=data -c shared_buffers=256kB -c wal_level=minimal -c max_wal_senders=0 2>&1 | tee trace.log

The repro is just:

set debug_parallel_query=regress;
drop table if exists t;
create table t as select generate_series(1, 10000);
update t set generate_series = 1;
select count(*) from t;

Occasionally it fails in a different way: after create table t, later references to t can't find it in the catalogs, but there is no invalid-page error. Perhaps the freaky zeros are happening one 4k page at a time, and if you get two in a row the result could look like an empty catalog page and pass validation (a sketch of that asymmetry also appears below).
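For reference, here is a minimal sketch of what such a scribbler/verifier program can look like on Linux. This is an illustration under my assumptions, not the attached program itself; the filename scribble.data, the flipped byte offset, and the iteration count are arbitrary choices:

#define _GNU_SOURCE
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLOCKSZ 8192

int
main(void)
{
	/* Shared, page-aligned buffer, visible to both processes. */
	char	   *buf = mmap(NULL, BLOCKSZ, PROT_READ | PROT_WRITE,
						   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	char	   *readbuf;
	pid_t		scribbler;
	int			fd;

	if (buf == MAP_FAILED)
	{
		perror("mmap");
		return 1;
	}
	memset(buf, 'x', BLOCKSZ);

	scribbler = fork();
	if (scribbler == 0)
	{
		/* Scribbler: keep flipping one byte, like a hint bit update. */
		volatile char *p = buf;

		for (;;)
			p[100] ^= 1;		/* comment this out and the EIOs stop */
	}

	fd = open("scribble.data", O_CREAT | O_RDWR | O_DIRECT, 0644);
	if (fd < 0 || posix_memalign((void **) &readbuf, 4096, BLOCKSZ) != 0)
	{
		perror("setup");
		return 1;
	}

	/* Verifier: write the moving target out, then read it back. */
	for (int i = 0; i < 100000; i++)
	{
		if (pwrite(fd, buf, BLOCKSZ, 0) != BLOCKSZ)
			perror("pwrite");
		if (pread(fd, readbuf, BLOCKSZ, 0) != BLOCKSZ)
			perror("pread");	/* occasional EIO expected on btrfs */
	}

	kill(scribbler, SIGKILL);
	return 0;
}

On a filesystem without data checksums (ext4, say) the reads just succeed and you merely don't know which version of byte 100 you got back, which is presumably why only checksumming filesystems like btrfs turn this race into visible errors.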
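And to illustrate the "two freaky zero pages can masquerade as an empty page" theory: page validation treats an all-zeroes page as a valid never-written page, so zeros are only caught when they clobber part of a page. Here's a toy sketch of that asymmetry, loosely modelled on PageIsVerified(); the names and the checksum stub are mine, not the server's code:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define BLCKSZ 8192

/* Stub standing in for the real page checksum verification. */
static bool
checksum_matches(const char *page)
{
	(void) page;
	return false;
}

static bool
page_is_all_zeros(const char *page)
{
	for (size_t i = 0; i < BLCKSZ; i++)
		if (page[i] != 0)
			return false;
	return true;
}

static bool
page_is_verified(const char *page)
{
	if (page_is_all_zeros(page))
		return true;			/* looks like a new, never-written page */
	return checksum_matches(page);	/* anything else must checksum OK */
}

int
main(void)
{
	static char page[BLCKSZ];	/* both 4k halves zeroed: accepted */

	printf("all zeros: %s\n", page_is_verified(page) ? "valid" : "invalid");
	page[4096] = 1;				/* only one 4k half zeroed: caught */
	printf("half zeros: %s\n", page_is_verified(page) ? "valid" : "invalid");
	return 0;
}

So a single freaky-zero 4k page inside otherwise-valid data fails verification, but a pair of them covering a whole 8k catalog page would sail through as "empty".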
Attachment