Re: Direct I/O - Mailing list pgsql-hackers
From | Andres Freund |
---|---|
Subject | Re: Direct I/O |
Date | |
Msg-id | 20230410025741.whvq7w5ev4ficjuk@awork3.anarazel.de |
In response to | Re: Direct I/O (Thomas Munro <thomas.munro@gmail.com>) |
Responses |
Re: Direct I/O
Re: Direct I/O |
List | pgsql-hackers |
Hi,

On 2023-04-10 00:17:12 +1200, Thomas Munro wrote:
> I think there are two separate bad phenomena.
>
> 1. A concurrent modification of the user space buffer while writing
> breaks the checksum so you can't read the data back in, as . I can
> reproduce that with a stand-alone program, attached. The "verifier"
> process occasionally reports EIO while reading, unless you comment out
> the "scribbler" process's active line. The system log/dmesg gets some
> warnings.

I think we really need to think about whether we eventually want to do something to avoid modifying pages while IO is in progress. The only alternative is for filesystems to make copies of everything in the IO path, which is far from free (and obviously prevents using DMA for the whole IO). The copy we do to avoid the same problem when checksums are enabled shows up quite prominently in write-heavy profiles, so there's a "purely postgres" reason to avoid these issues too.

> 2. The crake-style failure doesn't involve any reported checksum
> failures or errors, and I'm not sure if another process is even
> involved. I attach a complete syscall trace of a repro session. (I
> tried to get strace to dump 8192 byte strings, but then it doesn't
> repro, so we have only the start of the data transferred for each
> page.) Working back from the error message,
>
> ERROR: invalid page in block 78 of relation base/5/16384,
>
> we have a page at offset 638976, and we can find all system calls that
> touched that offset:
>
> [pid 26031] 23:26:48.521123 pwritev(50,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8192}], 1, 638976) = 8192
>
> [pid 26040] 23:26:48.568975 pwrite64(5,
> "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
> \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192
>
> [pid 26040] 23:26:48.593157 pread64(6,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 8192, 638976) = 8192
>
> In between the write of non-zeros and the read of zeros, nothing seems
> to happen that could justify that, that I can grok, but perhaps
> someone else will see something that I'm missing. We pretty much just
> have the parallel worker scanning the table, and writing stuff out as
> it does it. This was obtained with:

Have you tried to write a reproducer for this that doesn't involve postgres? It'd certainly be interesting to know the precise conditions for this. E.g., can this also happen without O_DIRECT, if cache pressure is high enough for the page to get evicted soon after (potentially simulated with fadvise or such)?

We should definitely let the btrfs folks know of this issue... It's possible that this bug was recently introduced even. What kernel version did you repro this on, Thomas?

I wonder if we should have a postgres-io-torture program in our tree for some of these things. We've found issues with our assumptions on several operating systems and filesystems, without systematically looking. Or even stressing IO all that hard in our tests.

Greetings,

Andres Freund
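For illustration only, here is a rough sketch of the kind of postgres-free check asked about above: write a non-zero 8192-byte page at the offset for block 78, ask the kernel to drop it from the page cache with fadvise, and read it back. It does not use O_DIRECT and is not the stand-alone program attached to the thread. The helper names (`block_offset`, `write_evict_read`) and the scratch file name are made up for this example, and a single pass like this is not expected to trigger the bug on its own.

```python
import os
import tempfile

BLCKSZ = 8192  # PostgreSQL's default block size, matching the 8192-byte I/Os in the trace


def block_offset(blkno, blcksz=BLCKSZ):
    """Byte offset of a block within a relation segment file."""
    return blkno * blcksz


def write_evict_read(path, blkno, payload):
    """Write one page, try to evict it from the page cache, then read it back.

    Hypothetical helper for illustration; buffered I/O only, no O_DIRECT.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        off = block_offset(blkno)
        written = os.pwrite(fd, payload, off)
        assert written == len(payload)
        os.fsync(fd)  # dirty pages cannot be dropped, so flush first
        try:
            # Advisory only: the kernel may or may not actually evict the page.
            os.posix_fadvise(fd, off, len(payload), os.POSIX_FADV_DONTNEED)
        except (AttributeError, OSError):
            pass  # not available or not supported here; read back anyway
        return os.pread(fd, len(payload), off)
    finally:
        os.close(fd)


if __name__ == "__main__":
    page = b"\x01" * BLCKSZ
    with tempfile.TemporaryDirectory() as tmpdir:
        scratch = os.path.join(tmpdir, "scratch")  # assumed scratch file name
        got = write_evict_read(scratch, 78, page)
        print("offset:", block_offset(78))
        print("match" if got == page else "MISMATCH: page read back differently")
```

To say anything about the reported failure, a check like this would presumably have to run in a loop, with concurrent writers, on the affected filesystem and kernel.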