Thread: Direct I/O
Hi,

Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the AIO patch-set[1].  It adds three new settings, defaulting to off:

  io_data_direct = whether to use O_DIRECT for main data files
  io_wal_direct = ... for WAL
  io_wal_init_direct = ... for WAL-file initialisation

O_DIRECT asks the kernel to avoid caching file data as much as possible.  Here's a fun quote about it[2]:

"The exact semantics of Direct I/O (O_DIRECT) are not well specified. It is not a part of POSIX, or SUS, or any other formal standards specification. The exact meaning of O_DIRECT has historically been negotiated in non-public discussions between powerful enterprise database companies and proprietary Unix systems, and its behaviour has generally been passed down as oral lore rather than as a formal set of requirements and specifications."

It gives the kernel the opportunity to move data directly between PostgreSQL's user space buffers and the storage hardware using DMA, that is, without CPU involvement or copying.  Not all storage stacks can do that, for various reasons, but even if not, the caching policy should ideally still use temporary buffers and avoid polluting the page cache.

These settings currently destroy performance, and are not intended to be used by end-users, yet!  That's why we filed them under DEVELOPER_OPTIONS.  You don't get automatic read-ahead, concurrency, clustering or (of course) buffering from the kernel.  The idea is that later parts of the AIO patch-set will introduce mechanisms to replace what the kernel is doing for us today, and then more, since we ought to be even better at predicting our own future I/O than it is, so that we'll finish up ahead.  Even with all that, you wouldn't want to turn it on by default, because the default shared_buffers would be insufficient for any real system, and there are portability problems.

Examples of slowness:

* every 8KB sequential read or write becomes a full round trip to the storage, one at a time

* data that is written to WAL and then read back in by the WAL sender will incur a full I/O round trip (that's probably not really an AIO problem; it's something we should probably address by using shared memory instead of files, as noted in a TODO item in the source code)

Memory alignment patches:

Direct I/O generally needs to be done to/from VM page-aligned addresses, but only "standard" 4KB pages, even when larger VM pages are in use (if there is an exotic system where that isn't true, it won't work).  We need to deal with buffers on the stack, the heap and in shmem.  For the stack, see patch 0001.  For the heap and shared memory, see patch 0002, but David Rowley is going to propose that part separately, as MemoryContext API adjustments are a specialised enough topic to deserve another thread; here I include a copy as a dependency.  The main direct I/O patch is 0003.

Assorted portability notes:

I expect this to "work" (that is, successfully destroy performance) on typical developer systems running at least Linux, macOS, Windows and FreeBSD.  By work, I mean: not be rejected by PostgreSQL, not be rejected by the kernel, and influence kernel cache behaviour on common filesystems.  It might be rejected with ENOTSUP, EINVAL etc on some more exotic filesystems and OSes.  Of currently supported OSes, only OpenBSD and Solaris don't have O_DIRECT at all, and there we'll reject the GUCs.  For macOS and Windows we internally translate our own PG_O_DIRECT flag to the correct flags/calls (committed a while back[3]).

On Windows, scatter/gather is available only with direct I/O, so a true pwritev would in theory be possible, but that has some more complications and is left for later patches (probably using native interfaces, not disguising them as POSIX).

There may be systems on which 8KB offset alignment will not work at all or not work well, and that's expected.  For example, BTRFS, ZFS, JFS "big file", UFS etc allow larger-than-8KB blocks/records, and an 8KB write will have to trigger a read-before-write.  Note that offset/length alignment requirements (blocks) are independent of buffer alignment requirements (memory pages, 4KB).

The behaviour and cache coherency of files that have open descriptors using both direct and non-direct flags may be complicated and vary between systems.  The patch currently lets you change the GUCs at runtime so backends can disagree: that should probably not be allowed, but is like that for now for experimentation.  More study is required.

If someone has a compiler that we don't know how to do pg_attribute_aligned() for, then we can't make correctly aligned stack buffers, so in that case direct I/O is disabled, but I don't know of such a system (maybe aCC, but we dropped it).  That's why smgr code can only assert that pointers are I/O-aligned if PG_O_DIRECT != 0, and why PG_O_DIRECT is forced to 0 if there is no pg_attribute_aligned() macro, disabling the GUCs.

This seems to be an independent enough piece to get into the tree on its own, with the proviso that it's not actually useful yet other than for experimentation.  Thoughts?

These patches have been hacked on at various times by Andres Freund, David Rowley and me.

[1] https://wiki.postgresql.org/wiki/AIO
[2] https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics
[3] https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BADiyyHe0cun2wfT%2BSVnFVqNYPxoO6J9zcZkVO7%2BNGig%40mail.gmail.com
Attachment
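To make the alignment rules above concrete, here is a minimal stand-alone sketch (not part of the patch; the file name is hypothetical) of what a direct I/O caller must provide on common systems: a 4KB-aligned user buffer, and a block-aligned offset and length.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	void	   *buf;
	int			fd;

	/* 4KB-aligned buffer, 8KB transfer size (one PostgreSQL block) */
	if (posix_memalign(&buf, 4096, 8192) != 0)
		return 1;

	/* may fail with EINVAL/ENOTSUP on filesystems without direct I/O */
	fd = open("datafile", O_CREAT | O_RDWR | O_DIRECT, 0600);
	if (fd < 0)
		return 1;

	/* the offset and length must also be block-aligned */
	if (pwrite(fd, buf, 8192, 0) != 8192)
		perror("pwrite");

	close(fd);
	free(buf);
	return 0;
}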
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
> Hi,
>
> Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the
> AIO patch-set[1].  It adds three new settings, defaulting to off:
>
> io_data_direct = whether to use O_DIRECT for main data files
> io_wal_direct = ... for WAL
> io_wal_init_direct = ... for WAL-file initialisation

You added 3 booleans, but I wonder if it's better to add a string GUC which is parsed for comma separated strings.  (By "better", I mean reducing the number of new GUCs - which is less important for developer GUCs anyway.)

DIO is slower, but not so much that it can't run under CI.  I suggest to add an 099 commit to enable the feature during development.

Note that this fails under linux with fsanitize=align:

../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment

--
Justin
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
> > io_data_direct = whether to use O_DIRECT for main data files
> > io_wal_direct = ... for WAL
> > io_wal_init_direct = ... for WAL-file initialisation
>
> You added 3 booleans, but I wonder if it's better to add a string GUC
> which is parsed for comma separated strings.  (By "better", I mean
> reducing the number of new GUCs - which is less important for developer
> GUCs anyway.)

Interesting idea.  So "direct_io = data, wal, wal_init", or maybe that should be spelled io_direct.  ("Direct I/O" is a common term of art, but we also have some more io_XXX GUCs in later patches, so it's hard to choose...)

> DIO is slower, but not so much that it can't run under CI.  I suggest to
> add an 099 commit to enable the feature during development.

Good idea, will do.

> Note that this fails under linux with fsanitize=align:
> ../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment

Oh, so BufFile is palloc'd and contains one of these.  BufFile is not even using direct I/O, but by these rules it would need to be palloc_io_align'd.  I will think about what to do about that...
Hi,

On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
> On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
> > > io_data_direct = whether to use O_DIRECT for main data files
> > > io_wal_direct = ... for WAL
> > > io_wal_init_direct = ... for WAL-file initialisation
> >
> > You added 3 booleans, but I wonder if it's better to add a string GUC
> > which is parsed for comma separated strings.

In the past more complicated GUCs have not been well received, but it does seem like a nice way to reduce the amount of redundant stuff.  Perhaps we could use the GUC assignment hook to transform the input value into a bitmask?

> > (By "better", I mean reducing the number of new GUCs - which is less
> > important for developer GUCs anyway.)

FWIW, if / once we get to actual AIO, at least some of these would stop being developer-only GUCs.  There are substantial performance benefits in using DIO with AIO.  Buffered IO requires the CPU to copy the data from user space into kernel space, but DIO can use DMA for that, freeing the CPU to do more useful work.  Buffered IO tops out much, much earlier than AIO + DIO, and unfortunately tops out at much lower speeds on server CPUs.

> > DIO is slower, but not so much that it can't run under CI.  I suggest to
> > add an 099 commit to enable the feature during development.
>
> Good idea, will do.

Might it be worth additionally having a short tap test that does some basic stuff with DIO, and leaving that enabled?  I think it'd be good to have check-world exercise DIO on dev machines, to reduce the likelihood of finding problems only in CI, which is somewhat painful.

> > Note that this fails under linux with fsanitize=align:
> > ../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
>
> Oh, so BufFile is palloc'd and contains one of these.  BufFile is not
> even using direct I/O, but by these rules it would need to be
> palloc_io_align'd.  I will think about what to do about that...

It might be worth having two different versions of the struct, so we don't impose unnecessarily high alignment everywhere?

Greetings,

Andres Freund
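A minimal sketch of the comma-list-to-bitmask idea floated here (the flag names are illustrative, not the committed ones; a real GUC check hook would additionally use the GUC error-reporting machinery):

#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define IO_DIRECT_DATA		(1 << 0)
#define IO_DIRECT_WAL		(1 << 1)
#define IO_DIRECT_WAL_INIT	(1 << 2)

/* Parse a setting like "data,wal,wal_init" into a bitmask. */
static bool
parse_io_direct(const char *value, int *mask)
{
	char		buf[64];
	char	   *tok, *save;

	*mask = 0;
	snprintf(buf, sizeof(buf), "%s", value);
	for (tok = strtok_r(buf, ", ", &save); tok; tok = strtok_r(NULL, ", ", &save))
	{
		if (strcmp(tok, "data") == 0)
			*mask |= IO_DIRECT_DATA;
		else if (strcmp(tok, "wal") == 0)
			*mask |= IO_DIRECT_WAL;
		else if (strcmp(tok, "wal_init") == 0)
			*mask |= IO_DIRECT_WAL_INIT;
		else
			return false;		/* unknown option */
	}
	return true;
}

An assign hook could then store the resulting mask in a plain integer variable that the I/O paths test cheaply.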
Hi,

On 2022-11-01 15:54:02 -0700, Andres Freund wrote:
> On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
> > Oh, so BufFile is palloc'd and contains one of these.  BufFile is not
> > even using direct I/O, but by these rules it would need to be
> > palloc_io_align'd.  I will think about what to do about that...
>
> It might be worth having two different versions of the struct, so we don't
> impose unnecessarily high alignment everywhere?

Although it might actually be worth aligning fully everywhere - there's a noticeable performance difference for buffered read IO.

I benchmarked this on my workstation and laptop.  I mmap'ed a buffer with 2 MiB alignment, MAP_ANONYMOUS | MAP_HUGETLB, and then measured the performance of reading 8192 bytes into the buffer at different offsets.  Each time I copied 16 GiB in total.  Within a program invocation I benchmarked each offset 4 times, threw away the worst measurement, and averaged the rest.  Then I used the best of three program invocations.

workstation with dual xeon Gold 5215:

            turbo on    turbo off
  offset    GiB/s       GiB/s
  0         18.358      13.528
  8         15.361      11.472
  9         15.330      11.418
  32        17.583      13.097
  512       17.707      13.229
  513       15.890      11.852
  4096      18.176      13.568
  8192      18.088      13.566
  2MiB      18.658      13.496

laptop with i9-9880H:

            turbo on    turbo off
  offset    GiB/s       GiB/s
  0         33.589      17.160
  8         28.045      14.301
  9         27.582      14.318
  32        31.797      16.711
  512       32.215      16.810
  513       28.864      14.932
  4096      32.503      17.266
  8192      32.871      17.277
  2MiB      32.657      17.262

Seems pretty clear that using 4096 byte alignment is worth it.

Greetings,

Andres Freund
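The measurement loop described above amounts to something like the following sketch (assumptions: 2 MiB huge pages are available and reserved; a buffered read is approximated here by a memcpy into the destination buffer, and the timing is simplified relative to the best-of-three protocol described above):

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

int
main(void)
{
	size_t		offsets[] = {0, 8, 9, 32, 512, 513, 4096, 8192};
	static char src[8192];		/* simulated 8 KiB source block */
	/* 4 MiB huge-page mapping: the returned address is 2 MiB aligned */
	char	   *buf = mmap(NULL, 4 * 1024 * 1024, PROT_READ | PROT_WRITE,
						   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

	if (buf == MAP_FAILED)
		return 1;

	for (size_t i = 0; i < sizeof(offsets) / sizeof(offsets[0]); i++)
	{
		unsigned long long total = 16ULL << 30;		/* 16 GiB per offset */
		struct timespec t0, t1;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (unsigned long long done = 0; done < total; done += sizeof(src))
			memcpy(buf + offsets[i], src, sizeof(src));
		clock_gettime(CLOCK_MONOTONIC, &t1);

		double		secs = (t1.tv_sec - t0.tv_sec) +
						   (t1.tv_nsec - t0.tv_nsec) / 1e9;

		printf("offset %6zu: %.3f GiB/s\n",
			   offsets[i], total / secs / (1 << 30));
	}
	return 0;
}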
On 11/1/22 2:36 AM, Thomas Munro wrote:
> Hi,
>
> Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the

This is exciting to see!  There are two other items to add to the TODO list before this would be ready for production:

1) work_mem.  This is a significant impediment to scaling shared buffers the way you'd want to.

2) Clock sweep.  Specifically, currently the only thing that drives usage_count is individual backends running the clock hand.  On large systems with 75% of memory going to shared_buffers, that becomes a very significant problem, especially when the backend running the clock sweep is doing so in order to perform an operation like a b-tree page split.  I suspect it shouldn't be too hard to deal with this issue by just having bgwriter or another bgworker proactively ensure that some reasonable number of buffers with usage_count=0 exist.

One other thing to be aware of: overflowing an SLRU becomes a massive problem if there isn't a filesystem backing the SLRU.  Obviously only an issue if you try to apply DIO to SLRU files.
Hi,

On 2022-11-04 14:47:31 -0500, Jim Nasby wrote:
> On 11/1/22 2:36 AM, Thomas Munro wrote:
> > Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the
>
> This is exciting to see!  There are two other items to add to the TODO list
> before this would be ready for production:
>
> 1) work_mem.  This is a significant impediment to scaling shared buffers the
> way you'd want to.

I don't really think that's closely enough related to tackle together.  Yes, it'd be easier to set a large s_b if we had better work_mem management, but it's a completely distinct problem, and in a lot of cases you could use DIO without tackling the work_mem issue.

> 2) Clock sweep.  Specifically, currently the only thing that drives
> usage_count is individual backends running the clock hand.  On large systems
> with 75% of memory going to shared_buffers, that becomes a very significant
> problem, especially when the backend running the clock sweep is doing so in
> order to perform an operation like a b-tree page split.

I agree this isn't great, but I don't think the replacement efficiency is that big a problem; replacing the wrong buffers is a bigger issue.  I've run tests with s_b=768GB (IIRC) without it showing up as a major issue.  If you have an extreme replacement rate at such a large s_b you have a lot of other problems.

I don't want to discourage anybody from tackling the clock replacement issues, on the contrary, but AIO+DIO can show significant wins without those changes.  It's already a humongous project...

> One other thing to be aware of: overflowing an SLRU becomes a massive
> problem if there isn't a filesystem backing the SLRU.  Obviously only an
> issue if you try to apply DIO to SLRU files.

Which would be a very bad idea for now...  Thomas does have a patch for moving them into the main buffer pool.

Greetings,

Andres Freund
On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Direct I/O generally needs to be done to/from VM page-aligned
> addresses, but only "standard" 4KB pages, even when larger VM pages
> are in use (if there is an exotic system where that isn't true, it
> won't work). We need to deal with buffers on the stack, the heap and
> in shmem. For the stack, see patch 0001. For the heap and shared
> memory, see patch 0002, but David Rowley is going to propose that part
> separately, as MemoryContext API adjustments are a specialised enough
> topic to deserve another thread; here I include a copy as a
> dependency. The main direct I/O patch is 0003.
One thing to note: Currently, a request to aset above 8kB must go into a dedicated block. Not sure if it's a coincidence that that matches the default PG page size, but if allocating pages on the heap is hot enough, maybe we should consider raising that limit. Although then, aligned-to-4kB requests would result in 16kB chunks requested unless a different allocator was used.
--
John Naylor
EDB: http://www.enterprisedb.com
Hi,

On 2022-11-10 14:26:20 +0700, John Naylor wrote:
> On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > Memory alignment patches:
> >
> > Direct I/O generally needs to be done to/from VM page-aligned
> > addresses, but only "standard" 4KB pages, even when larger VM pages
> > are in use (if there is an exotic system where that isn't true, it
> > won't work).  We need to deal with buffers on the stack, the heap and
> > in shmem.  For the stack, see patch 0001.  For the heap and shared
> > memory, see patch 0002, but David Rowley is going to propose that part
> > separately, as MemoryContext API adjustments are a specialised enough
> > topic to deserve another thread; here I include a copy as a
> > dependency.  The main direct I/O patch is 0003.
>
> One thing to note: Currently, a request to aset above 8kB must go into a
> dedicated block.  Not sure if it's a coincidence that that matches the
> default PG page size, but if allocating pages on the heap is hot enough,
> maybe we should consider raising that limit.  Although then, aligned-to-4kB
> requests would result in 16kB chunks requested unless a different allocator
> was used.

With one exception, there's only a small number of places that allocate pages dynamically, and we only do it for a small number of buffers, so I don't think we should worry too much about this for now.

The one exception to this: GetLocalBufferStorage().  But it already batches memory allocations by increasing sizes, so I think we're good there as well.

Greetings,

Andres Freund
On Wed, Nov 2, 2022 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
> > On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
> > > > io_data_direct = whether to use O_DIRECT for main data files
> > > > io_wal_direct = ... for WAL
> > > > io_wal_init_direct = ... for WAL-file initialisation
> > >
> > > You added 3 booleans, but I wonder if it's better to add a string GUC
> > > which is parsed for comma separated strings.

Done as io_direct=data,wal,wal_init.  Thanks Justin, this is better.  I resisted the urge to invent a meaning for 'on' and 'off', mainly because it's not clear what values 'on' should enable, and it'd be strange to have off without on, so for now an empty string means off.  I suppose the meaning of this string could evolve over time: the names of forks, etc.

> Perhaps we could use the GUC assignment hook to transform the input value into
> a bitmask?

Makes sense.  The only tricky question was where to store the GUC.  I went for fd.c for now, but it doesn't seem quite right...

> > > DIO is slower, but not so much that it can't run under CI.  I suggest to
> > > add an 099 commit to enable the feature during development.
> >
> > Good idea, will do.

Done.  The tests take 2-3x as long, depending on the OS.

> Might it be worth additionally having a short tap test that does some basic
> stuff with DIO, and leaving that enabled?  I think it'd be good to have
> check-world exercise DIO on dev machines, to reduce the likelihood of finding
> problems only in CI, which is somewhat painful.

Done.

> > > Note that this fails under linux with fsanitize=align:
> > > ../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
> >
> > Oh, so BufFile is palloc'd and contains one of these.  BufFile is not
> > even using direct I/O, but by these rules it would need to be
> > palloc_io_align'd.  I will think about what to do about that...
>
> It might be worth having two different versions of the struct, so we don't
> impose unnecessarily high alignment everywhere?

Done.  I now have PGAlignedBlock (unchanged) and PGIOAlignedBlock.  You have to use the latter for SMgr, because I added alignment assertions there.  We might as well use it for any other I/O, such as frontend code, too, for a chance of a small performance boost as you showed.  For now I have not used PGIOAlignedBlock for BufFile, even though it would be a great candidate for a potential speedup, only because I am afraid of adding padding to every BufFile in scenarios where we allocate many (that could be avoided; a subject for separate research).

V2 comprises:

0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on
Attachment
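For reference, the shape of the new type is roughly the following (a sketch based on the description above, not a verbatim copy of the patch); PGAlignedBlock keeps its existing definition, so only I/O paths pay for the stricter alignment:

/* Sketch: an I/O-aligned block for smgr and similar I/O paths. */
typedef union PGIOAlignedBlock
{
#ifdef pg_attribute_aligned
	pg_attribute_aligned(PG_IO_ALIGN_SIZE)
#endif
	char		data[BLCKSZ];
	double		force_align_d;
	int64		force_align_i64;
} PGIOAlignedBlock;

The union members force conventional maximum alignment as a fallback, and pg_attribute_aligned() adds the 4096-byte I/O alignment on compilers that support it.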
On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
> 0002 -- I/O-align almost all buffers used for I/O
> 0003 -- Add the GUCs
> 0004 -- Throwaway hack to make cfbot turn the GUCs on

David pushed the first as commit 439f6175, so here is a rebase of the rest.  I also fixed a couple of thinkos in the handling of systems where we don't know how to do direct I/O.  In one place I had #ifdef PG_O_DIRECT, but that's always defined; it's just that it's 0 on Solaris and OpenBSD, and the check to reject the GUC wasn't quite right on such systems.
Attachment
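The thinko in miniature: PG_O_DIRECT is always defined, so a preprocessor existence test can't distinguish platforms where it is defined as zero.

#ifdef PG_O_DIRECT				/* wrong: taken even on Solaris/OpenBSD */
/* ... direct I/O path ... */
#endif

#if PG_O_DIRECT != 0			/* right: skipped where the flag is 0 */
/* ... direct I/O path ... */
#endif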
On Thu, Dec 22, 2022 at 7:34 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > 0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
> > 0002 -- I/O-align almost all buffers used for I/O
> > 0003 -- Add the GUCs
> > 0004 -- Throwaway hack to make cfbot turn the GUCs on
>
> David pushed the first as commit 439f6175, so here is a rebase of the
> rest.  I also fixed a couple of thinkos in the handling of systems
> where we don't know how to do direct I/O.  In one place I had #ifdef
> PG_O_DIRECT, but that's always defined; it's just that it's 0 on
> Solaris and OpenBSD, and the check to reject the GUC wasn't quite
> right on such systems.

Thanks.  I have some comments on v3-0002-Add-io_direct-setting-developer-only.patch:

1. I think we don't need to overwrite the io_direct_string in check_io_direct, so that show_io_direct can be avoided.

2. check_io_direct can leak the flags memory - when io_direct is not supported, or for an invalid list syntax, or when an invalid option is specified.

I have addressed my review comments as a delta patch on top of v3-0002 and added it here as v1-0001-Review-comments-io_direct-GUC.txt.

Some comments on the tests added:

1. Is there a way to know programmatically whether direct I/O for WAL and data has been picked up?  IOW, can we know if the OS page cache is bypassed?  I know an external extension, pgfincore, which can help here, but nothing in core exists AFAICS.

+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'), "read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'), "read back from local");
+$node->stop('immediate');

2. Can we combine these two append_conf calls into a single statement?

+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O

3. A nitpick: can we split these queries across multiple lines instead of a single line?

+$node->safe_psql('postgres', 'begin; create temporary table t2 as select 1 as i from generate_series(1, 10000); update t2 set i = i; insert into t2count select count(*) from t2; commit;');

4. I don't think we need to stop the node before the test ends, no?

+$node->stop;
+
+done_testing();

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachment
On Wed, Jan 25, 2023 at 8:57 PM Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:
> Thanks.  I have some comments on
> v3-0002-Add-io_direct-setting-developer-only.patch:
>
> 1. I think we don't need to overwrite the io_direct_string in
> check_io_direct, so that show_io_direct can be avoided.

Thanks for looking at this, and sorry for the late response.  Yeah, agreed.

> 2. check_io_direct can leak the flags memory - when io_direct is not
> supported, or for an invalid list syntax, or when an invalid option is
> specified.
>
> I have addressed my review comments as a delta patch on top of v3-0002
> and added it here as v1-0001-Review-comments-io_direct-GUC.txt.

Thanks.  Your way is nicer.  I merged your patch and added you as a co-author.

> Some comments on the tests added:
>
> 1. Is there a way to know programmatically whether direct I/O for WAL and
> data has been picked up?  IOW, can we know if the OS page cache is
> bypassed?  I know an external extension, pgfincore, which can help here,
> but nothing in core exists AFAICS.

Right, that extension can tell you how many pages are in the kernel page cache, which is quite interesting for this.  I also once hacked up something primitive to see *which* pages are in kernel cache, so I could join that against pg_buffercache to measure double buffering, when I was studying the "smile" shape where pgbench TPS goes down and then back up again as you increase shared_buffers if the working set is nearly as big as physical memory (code available in a link from [1]).  Yeah, I agree it might be nice for human investigators to put something like that in contrib/pg_buffercache, but I'm not sure you could rely on it enough for an automated test, 'cause it probably won't work on some file systems, and the tests would probably fail for random transient reasons (for example: some systems won't kick pages out of kernel cache if they were already there, just because we decided to open the file with O_DIRECT).

(I got curious about why mincore() wasn't standardised along with mmap() and all that jazz; it seems the BSD and later Sun people who invented all those interfaces didn't think that one was quite good enough[2], but every (?) Unixoid OS copied it anyway, with variations...  Apparently the Windows equivalent is called VirtualQuery().)

> 2. Can we combine these two append_conf calls into a single statement?
> +$node->append_conf('io_direct', 'data,wal,wal_init');
> +$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O

OK, sure, done.  And also oops, that was completely wrong and not working the way I had it in that version...

> 3. A nitpick: can we split these queries across multiple lines instead of a single line?
> +$node->safe_psql('postgres', 'begin; create temporary table t2 as
> select 1 as i from generate_series(1, 10000); update t2 set i = i;
> insert into t2count select count(*) from t2; commit;');

OK.

> 4. I don't think we need to stop the node before the test ends, no?
> +$node->stop;
> +
> +done_testing();

Sure, but why not?

Otherwise, I rebased, and made a couple more changes.  I found a line of the manual about wal_sync_method that needed to be removed:

- The <literal>open_</literal>* options also use <literal>O_DIRECT</literal> if available.

In fact that sentence didn't correctly document the behaviour in released branches (wal_level=minimal is also required for that, so probably very few people ever used it).  I think we should adjust that misleading sentence in back-branches, separately from this patch set.

I also updated the commit message to highlight the only expected user-visible change from this, namely the loss of the above incorrectly documented obscure special case, which is replaced by the less obscure new setting io_direct=wal, if someone still wants that behaviour.  Also a few minor comment changes.

[1] https://twitter.com/MengTangmu/status/994770040745615361
[2] http://kos.enix.org/pub/gingell8.pdf
Attachment
I did some testing with non-default block sizes, and found a few minor things that needed adjustment.  The short version is that I blocked some configurations that won't work or would break an assertion.

After a bit more copy-editing on docs and comments and a round of automated indenting, I have now pushed this.  I will now watch the build farm.  I tested on quite a few OSes that I have access to, but this is obviously a very OS-sensitive kind of a thing.

The adjustments were:

1. If you set your BLCKSZ or XLOG_BLCKSZ smaller than PG_IO_ALIGN_SIZE, you shouldn't be allowed to turn on direct I/O for the relevant operations, because such undersized direct I/Os will fail on common systems.

FATAL: invalid value for parameter "io_direct": "wal"
DETAIL: io_direct is not supported for WAL because XLOG_BLCKSZ is too small

FATAL: invalid value for parameter "io_direct": "data"
DETAIL: io_direct is not supported for data because BLCKSZ is too small

In fact some systems would be OK with it if the true requirement is 512 rather than 4096, but (1) tiny blocks are a niche build option that doesn't even pass regression tests, (2) it's hard and totally unportable to find out the true requirement at runtime, and (3) the conservative choice of 4096 has additional benefits by matching memory pages.  So I think a conservative compile-time number is a good starting position.

2. Previously I had changed the WAL buffer alignment to be the larger of PG_IO_ALIGN_SIZE and XLOG_BLCKSZ, but in light of the above thinking, I reverted that part (there is no point in aligning the address of the buffer when the size is too small for direct I/O, and now that combination is blocked off at GUC level, so we don't need any change here).

3. I updated the md.c alignment assertions to allow for tiny blocks.  The point of these assertions is to fail if any new code does I/O from badly aligned buffers even with io_direct turned off (ie how most people hack), 'cause that will fail with io_direct turned on.  The change is that I don't make the assertion if you're using BLCKSZ < PG_IO_ALIGN_SIZE.  Such buffers wouldn't work if used for direct I/O, but that's OK: the GUC won't allow it.

4. I made the language explaining where PG_IO_ALIGN_SIZE really comes from a little vaguer, because it's complex.
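Paraphrased, the new GUC-time validation amounts to something like this sketch (flag names as in the earlier patches; not the verbatim committed code):

/* Sketch: reject io_direct settings the build's block sizes can't honour. */
if ((flags & IO_DIRECT_WAL) && XLOG_BLCKSZ < PG_IO_ALIGN_SIZE)
{
	GUC_check_errdetail("io_direct is not supported for WAL because XLOG_BLCKSZ is too small");
	return false;
}
if ((flags & IO_DIRECT_DATA) && BLCKSZ < PG_IO_ALIGN_SIZE)
{
	GUC_check_errdetail("io_direct is not supported for data because BLCKSZ is too small");
	return false;
}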
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> After a bit more copy-editing on docs and comments and a round of
> automated indenting, I have now pushed this.  I will now watch the
> build farm.  I tested on quite a few OSes that I have access to, but
> this is obviously a very OS-sensitive kind of a thing.

Hmm.  I see a strange "invalid page" failure on Andrew's machine crake in 004_io_direct.pl.  Let's see what else comes in.
Thomas Munro <thomas.munro@gmail.com> writes:
> I did some testing with non-default block sizes, and found a few minor
> things that needed adjustment.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2004%3A42%3A04

This seems like another thing that should not have been pushed mere hours before feature freeze.

			regards, tom lane
Hi,

On 2023-04-08 16:59:20 +1200, Thomas Munro wrote:
> On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > After a bit more copy-editing on docs and comments and a round of
> > automated indenting, I have now pushed this.  I will now watch the
> > build farm.  I tested on quite a few OSes that I have access to, but
> > this is obviously a very OS-sensitive kind of a thing.
>
> Hmm.  I see a strange "invalid page" failure on Andrew's machine crake
> in 004_io_direct.pl.  Let's see what else comes in.

There were some failures in CI (e.g. [1]; and perhaps also bf, didn't yet check), about "no unpinned buffers available".  I was worried for a moment that this could actually be related to the bulk extension patch.

But it looks like it's older - and not caused by direct_io support (except by way of the test existing).  I reproduced the issue locally by setting s_b even lower, to 16, and made the ERROR a PANIC.

#4  0x00005624dfe90336 in errfinish (filename=0x5624df6867c0 "../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c", lineno=353, funcname=0x5624df686900 <__func__.6> "StrategyGetBuffer") at ../../../../home/andres/src/postgresql/src/backend/utils/error/elog.c:604
#5  0x00005624dfc71dbe in StrategyGetBuffer (strategy=0x0, buf_state=0x7ffd4182137c, from_ring=0x7ffd4182137b) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c:353
#6  0x00005624dfc6a922 in GetVictimBuffer (strategy=0x0, io_context=IOCONTEXT_NORMAL) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1601
#7  0x00005624dfc6a29f in BufferAlloc (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, strategy=0x0, foundPtr=0x7ffd418214a3, io_context=IOCONTEXT_NORMAL) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1290
#8  0x00005624dfc69c0c in ReadBuffer_common (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0, hit=0x7ffd4182156b) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1056
#9  0x00005624dfc69335 in ReadBufferExtended (reln=0x5624e1ee09f0, forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:776
#10 0x00005624df8eb78a in log_newpage_range (rel=0x5624e1ee09f0, forknum=MAIN_FORKNUM, startblk=0, endblk=45, page_std=false) at ../../../../home/andres/src/postgresql/src/backend/access/transam/xloginsert.c:1290
#11 0x00005624df9567e7 in smgrDoPendingSyncs (isCommit=true, isParallelWorker=false) at ../../../../home/andres/src/postgresql/src/backend/catalog/storage.c:837
#12 0x00005624df8d1dd2 in CommitTransaction () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:2225
#13 0x00005624df8d2da2 in CommitTransactionCommand () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:3060
#14 0x00005624dfcbe0a1 in finish_xact_command () at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:2779
#15 0x00005624dfcbb867 in exec_simple_query (query_string=0x5624e1eacd98 "create table t1 as select 1 as i from generate_series(1, 10000)") at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1299
#16 0x00005624dfcc09c4 in PostgresMain (dbname=0x5624e1ee40e8 "postgres", username=0x5624e1e6c5f8 "andres") at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4623
#17 0x00005624dfbecc03 in BackendRun (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4461
#18 0x00005624dfbec48e in BackendStartup (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4189
#19 0x00005624dfbe8541 in ServerLoop () at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1779
#20 0x00005624dfbe7e56 in PostmasterMain (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1463
#21 0x00005624dfad538b in main (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/main/main.c:200

If you look at log_newpage_range(), it's not surprising that we get this error - it pins up to 32 buffers at once.

Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from c6b92041d385.

It doesn't really seem OK to me to unconditionally pin 32 buffers.  For the relation extension patch I introduced LimitAdditionalPins() to deal with this concern.  Perhaps it needs to be exposed and log_newpage_buffers() should use it?

Do we care about fixing this in the backbranches?  Probably not, given there haven't been user complaints?

Greetings,

Andres Freund

[1] https://cirrus-ci.com/task/4519721039560704
On Sat, Apr 8, 2023 at 4:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > After a bit more copy-editing on docs and comments and a round of
> > automated indenting, I have now pushed this.  I will now watch the
> > build farm.  I tested on quite a few OSes that I have access to, but
> > this is obviously a very OS-sensitive kind of a thing.
>
> Hmm.  I see a strange "invalid page" failure on Andrew's machine crake
> in 004_io_direct.pl.  Let's see what else comes in.

No particular ideas about what happened there yet.  It *looks* like we wrote out a page, and then read it back in very soon afterwards, all via the usual locked bufmgr/smgr pathways, and it failed basic page validation.  The reader was a parallel worker, because of the debug_parallel_query setting on that box.  The page number looks reasonable (I guess it's reading a page created by the UPDATE full of new tuples, but I don't know which process wrote it).  It's also not immediately obvious how this could be connected to the 32-pinned-buffers problem (all of that would have happened in the CREATE TABLE process, which ended before the UPDATE and then the SELECT backends even started).

Andrew, what file system and type of disk is that machine using?
Hi,

On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> There were some failures in CI (e.g. [1]; and perhaps also bf, didn't yet
> check), about "no unpinned buffers available".  I was worried for a moment
> that this could actually be related to the bulk extension patch.
>
> But it looks like it's older - and not caused by direct_io support (except by
> way of the test existing).  I reproduced the issue locally by setting s_b even
> lower, to 16, and made the ERROR a PANIC.
>
> [backtrace]
>
> If you look at log_newpage_range(), it's not surprising that we get this error
> - it pins up to 32 buffers at once.
>
> Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
> c6b92041d385.
>
> It doesn't really seem OK to me to unconditionally pin 32 buffers.  For the
> relation extension patch I introduced LimitAdditionalPins() to deal with this
> concern.  Perhaps it needs to be exposed and log_newpage_buffers() should use
> it?
>
> Do we care about fixing this in the backbranches?  Probably not, given there
> haven't been user complaints?

Here's a quick prototype of this approach.

If we expose LimitAdditionalPins(), we'd probably want to add "Buffer" to the name, and pass it a relation, so that it can hand off to LimitAdditionalLocalPins() when appropriate?  The callsite in question doesn't need it, but ...

Without the limiting of pins, the modified 004_io_direct.pl fails 100% of the time for me.  Presumably the reason it fails only occasionally with 256kB of shared buffers (i.e. NBuffers=32) is that autovacuum or checkpointer briefly pins a single buffer.  As log_newpage_range() thinks it can just pin 32 buffers unconditionally, it fails in that case.

Greetings,

Andres Freund
Attachment
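A rough sketch of the fix's shape (not the attached prototype itself): clamp the number of buffers log_newpage_range() pins at once to what the buffer pool can spare, instead of assuming 32 unpinned buffers always exist.  "rel", "forknum", "blkno" and "endblk" are assumed to come from the surrounding function, and LimitAdditionalPins() is the prototype's helper.

Buffer		bufs[XLR_MAX_BLOCK_ID];
uint32		batch_limit = XLR_MAX_BLOCK_ID;

LimitAdditionalPins(&batch_limit);	/* may shrink the batch on tiny s_b */

while (blkno < endblk)
{
	uint32		nbufs = 0;

	/* pin at most batch_limit buffers before logging and releasing them */
	while (nbufs < batch_limit && blkno < endblk)
		bufs[nbufs++] = ReadBufferExtended(rel, forknum, blkno++,
										   RBM_NORMAL, NULL);

	/* ... XLogRegisterBuffer() each one, XLogInsert(), then unpin ... */
}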
Hi,

Given the frequency of failures on this in the buildfarm, I propose the temporary workaround of using wal_level=replica.  That avoids the use of the over-eager log_newpage_range().

Greetings,

Andres Freund
On Sun, Apr 9, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:
> Given the frequency of failures on this in the buildfarm, I propose the
> temporary workaround of using wal_level=replica.  That avoids the use of the
> over-eager log_newpage_range().

Will do.
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sun, Apr 9, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:
>> Given the frequency of failures on this in the buildfarm, I propose the
>> temporary workaround of using wal_level=replica.  That avoids the use of the
>> over-eager log_newpage_range().

> Will do.

Now crake is doing this:

2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT: select count(*) from t1
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR: invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT: select count(*) from t1
2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG: background worker "parallel worker" (PID 3257646) exited with exit code 1

The fact that the error is happening in a parallel worker seems interesting ...

(BTW, why are the log lines doubly timestamped?)

			regards, tom lane
On Sun, Apr 9, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
> 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384

> The fact that the error is happening in a parallel worker seems
> interesting ...

That's because it's running with debug_parallel_query=regress.  I've been trying to repro that but no luck...

A different kind of failure also showed up, where it counted the wrong number of tuples:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2015%3A52%3A03

A paranoid explanation would be that this system is failing to provide basic I/O coherency: we're writing pages out and not reading them back in.  Or of course there is a dumb bug... but why only here?  It can of course be timing-sensitive, and it's interesting that crake suffers from the "no unpinned buffers available" thing (which should now be gone) with higher frequency; I'm keen to see if the dodgy-read problem continues with a similar frequency now.
Hi,

On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> Now crake is doing this:
>
> 2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
> 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
> 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT: select count(*) from t1
> 2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR: invalid page in block 56 of relation base/5/16384
> 2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT: select count(*) from t1
> 2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG: background worker "parallel worker" (PID 3257646) exited with exit code 1
>
> The fact that the error is happening in a parallel worker seems
> interesting ...

There were a few prior instances of that error.  One that I hadn't seen before is this:

[11:35:07.190](0.001s) # Failed test 'read back from shared'
# at /home/andrew/bf/root/HEAD/pgsql/src/test/modules/test_misc/t/004_io_direct.pl line 43.
[11:35:07.190](0.000s) # got: '10000'
#                 expected: '10098'

For one thing, it points to the arguments to is() being switched around, but that's a sideshow.

> (BTW, why are the log lines doubly timestamped?)

It's odd.  It's also odd that it's just crake having the issue.  It's just a linux host, afaics.

Andrew, is there any chance you can run that test in isolation and see whether it reproduces?  If so, does the problem vanish if you comment out the io_direct= in the test?  Curious whether this is actually an O_DIRECT issue, or whether it's an independent issue exposed by the new test.

I wonder if we should make the test use data checksums - if we continue to see the wrong query results, the corruption is more likely to be in memory.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
>> (BTW, why are the log lines doubly timestamped?)

> It's odd.

Oh, I guess that's intentional, because crake has 'log_line_prefix = \'%m [%s %p:%l] %q%a \''.

> It's also odd that it's just crake having the issue.  It's just a linux host,
> afaics.

Indeed.  I'm guessing from the compiler version that it's Fedora 37 now (the lack of such basic information in the meson configuration output is pretty annoying).  I've been trying to repro it here on an F37 box, with no success, suggesting that it's very timing sensitive.  Or maybe it's inside a VM and that matters?

			regards, tom lane
Hi,

On 2023-04-08 17:31:02 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
> > It's also odd that it's just crake having the issue.  It's just a linux host,
> > afaics.
>
> Indeed.  I'm guessing from the compiler version that it's Fedora 37 now

The 15 branch says:

hostname = neoemma
uname -m = x86_64
uname -r = 6.2.8-100.fc36.x86_64
uname -s = Linux
uname -v = #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023

So at least the kernel claims to be 36...

> (the lack of such basic information in the meson configuration output
> is pretty annoying).

Yea, I was thinking yesterday that we should add uname output to meson's configure (if available).  I'm sure we can figure out a reasonably fast windows command for the version, too.

> I've been trying to repro it here on an F37 box, with no success, suggesting
> that it's very timing sensitive.  Or maybe it's inside a VM and that
> matters?

Could also be filesystem specific?

Greetings,

Andres Freund
> Hi,
>
> On 2023-04-08 17:31:02 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
> > > It's also odd that it's just crake having the issue.  It's just a linux
> > > host, afaics.
> >
> > Indeed.  I'm guessing from the compiler version that it's Fedora 37 now
>
> The 15 branch says:
>
> hostname = neoemma
> uname -m = x86_64
> uname -r = 6.2.8-100.fc36.x86_64
> uname -s = Linux
> uname -v = #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023
>
> So at least the kernel claims to be 36...
>
> > (the lack of such basic information in the meson configuration output
> > is pretty annoying).
>
> Yea, I was thinking yesterday that we should add uname output to meson's
> configure (if available).  I'm sure we can figure out a reasonably fast
> windows command for the version, too.
>
> > I've been trying to repro it here on an F37 box, with no success, suggesting
> > that it's very timing sensitive.  Or maybe it's inside a VM and that
> > matters?
>
> Could also be filesystem specific?
I migrated it in February from a VM to a non-virtual instance. Almost nothing else runs on the machine. The personality info shown on the BF server is correct.
andrew@neoemma:~ $ cat /etc/fedora-release
Fedora release 36 (Thirty Six)
andrew@neoemma:~ $ uname -a
Linux neoemma 6.2.8-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
andrew@neoemma:~ $ gcc --version
gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)
andrew@neoemma:~ $ mount | grep home
/dev/mapper/luks-xxxxxxx on /home type btrfs (rw,relatime,seclabel,compress=zstd:1,ssd,discard=async,space_cache,subvolid=256,subvol=/home)
I guess it could be btrfs-specific. I'll be somewhat annoyed if I have to re-init the machine to use something else.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Sun, Apr 9, 2023 at 10:08 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> btrfs

Aha!
> Hi,
>
> On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > Now crake is doing this:
> >
> > 2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG: statement: select count(*) from t1
> > 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR: invalid page in block 56 of relation base/5/16384
> > 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT: select count(*) from t1
> > 2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR: invalid page in block 56 of relation base/5/16384
> > 2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT: select count(*) from t1
> > 2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG: background worker "parallel worker" (PID 3257646) exited with exit code 1
> >
> > The fact that the error is happening in a parallel worker seems
> > interesting ...
>
> There were a few prior instances of that error.  One that I hadn't seen
> before is this:
>
> [11:35:07.190](0.001s) # Failed test 'read back from shared'
> # at /home/andrew/bf/root/HEAD/pgsql/src/test/modules/test_misc/t/004_io_direct.pl line 43.
> [11:35:07.190](0.000s) # got: '10000'
> #                 expected: '10098'
>
> For one thing, it points to the arguments to is() being switched around,
> but that's a sideshow.
>
> It's also odd that it's just crake having the issue.  It's just a linux
> host, afaics.
>
> Andrew, is there any chance you can run that test in isolation and see
> whether it reproduces?  If so, does the problem vanish if you comment out
> the io_direct= in the test?  Curious whether this is actually an O_DIRECT
> issue, or whether it's an independent issue exposed by the new test.
>
> I wonder if we should make the test use data checksums - if we continue
> to see the wrong query results, the corruption is more likely to be in
> memory.
I can run the test in isolation, and it gets an error reliably.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Sun, Apr 9, 2023 at 10:17 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I can run the test in isolation, and it gets an error reliably.

Random idea: it looks like you have compression enabled.  What if you turn it off in the directory where the test runs?  Something like btrfs property set <file> compression ... according to the intergoogles.  (I have never used btrfs before 6 minutes ago, but I can't seem to repro this with basic settings in a loopback btrfs filesystem.)
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sun, Apr 9, 2023 at 10:08 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> btrfs

> Aha!

Googling finds a lot of suggestions that O_DIRECT doesn't play nice with btrfs, for example

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg92824.html

It's not clear to me how much of that lore is still current, but it's disturbing.

			regards, tom lane
On Sun, Apr 9, 2023 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Googling finds a lot of suggestions that O_DIRECT doesn't play nice
> with btrfs, for example
>
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg92824.html
>
> It's not clear to me how much of that lore is still current,
> but it's disturbing.

I think that particular thing might relate to modifications of the user buffer while a write is in progress (breaking btrfs's internal checksums).  I don't think we should ever do that ourselves (not least because it'd break our own checksums).  We lock the page during the write so no one can do that, and then we sleep in a synchronous syscall.

Here's something recent.  I guess it's probably not relevant (a fault on our buffer that we recently touched sounds pretty unlikely), but who knows... (developer lists for file systems are truly terrifying places to drive through).

https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/

It's odd, though, if it is their bug and not ours: I'd expect our friends in other databases to have hit all that sort of thing years ago, since many comparable systems have a direct I/O knob*.  What are we doing differently?  Are our multiple processes a factor here, breaking some coherency logic?  Unsurprisingly, having compression on as Andrew does actually involves buffering anyway[1] despite our O_DIRECT flag, but maybe that's saying writes are buffered but reads are still direct (?), which sounds like the sort of initial conditions that might produce a coherency bug.  I dunno.

I gather that btrfs is actually Fedora's default file system (or maybe it's just "laptops and desktops"[2]?).  I wonder if any of the several green Fedora systems in the 'farm are using btrfs, and whether they are using different mount options (thinking again of compression).

* Probably a good reason to add a more prominent warning that the feature is developer-only, experimental and not for production use.  I'm thinking of a warning at startup or something.

[1] https://btrfs.readthedocs.io/en/latest/Compression.html
[2] https://fedoraproject.org/wiki/Changes/BtrfsByDefault
Thomas Munro <thomas.munro@gmail.com> writes:
> It's odd, though, if it is their bug and not ours: I'd expect our
> friends in other databases to have hit all that sort of thing years
> ago, since many comparable systems have a direct I/O knob*.

Yeah, it seems moderately likely that it's our own bug ... but this code's all file-system-ignorant, so how?  Maybe we are breaking some POSIX rule that btrfs exploits but others don't?

> I gather that btrfs is actually Fedora's default file system (or maybe
> it's just "laptops and desktops"[2]?).

I have a ton of Fedora images laying about, and I doubt that any of them use btrfs, mainly because that's not the default in the "server spin" which is what I usually install from.  It's hard to guess about the buildfarm, but it wouldn't surprise me if most of them are on xfs.

(If we haven't figured this out pretty shortly, I'm probably going to put together a btrfs-on-bare-metal machine to try to duplicate crake's results.)

			regards, tom lane
Hi,

On 2023-04-09 13:55:33 +1200, Thomas Munro wrote:
> I think that particular thing might relate to modifications of the
> user buffer while a write is in progress (breaking btrfs's internal
> checksums).  I don't think we should ever do that ourselves (not least
> because it'd break our own checksums).  We lock the page during the
> write so no one can do that, and then we sleep in a synchronous
> syscall.

Oh, but we actually *do* modify pages while IO is going on.  I wonder if you hit the jackpot here.

The content lock doesn't prevent hint bit writes.  That's why we copy the page to temporary memory when computing checksums.

I think we should modify the test to enable checksums - if the problem goes away, then it's likely to be related to modifying pages while an O_DIRECT write is ongoing...

Greetings,

Andres Freund
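The copy-to-temporary-memory pattern mentioned here, in simplified form (the real code is PageSetChecksumCopy() in bufpage.c; the function name below and the omission of the checksums-disabled fast path are this sketch's simplifications):

static char pageCopy[BLCKSZ];	/* stable staging buffer */

char *
checksum_copy(char *page, BlockNumber blkno)
{
	/*
	 * Hint-bit setters may scribble on "page" concurrently, so freeze a
	 * snapshot of it before computing the checksum, and hand the copy to
	 * the write path rather than the live shared buffer.
	 */
	memcpy(pageCopy, page, BLCKSZ);
	((PageHeader) pageCopy)->pd_checksum = pg_checksum_page(pageCopy, blkno);
	return pageCopy;
}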
On Sun, Apr 9, 2023 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-04-09 13:55:33 +1200, Thomas Munro wrote:
> > I think that particular thing might relate to modifications of the
> > user buffer while a write is in progress (breaking btrfs's internal
> > checksums).  I don't think we should ever do that ourselves (not least
> > because it'd break our own checksums).  We lock the page during the
> > write so no one can do that, and then we sleep in a synchronous
> > syscall.
>
> Oh, but we actually *do* modify pages while IO is going on.  I wonder if you
> hit the jackpot here.  The content lock doesn't prevent hint bit
> writes.  That's why we copy the page to temporary memory when computing
> checksums.

More like the jackpot hit me.  Woo, I can now reproduce this locally on a loop filesystem.  Previously I had missed a step; the parallel worker seems to be necessary.  More soon.
On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:
> On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> > There were some failures in CI (e.g. [1]; and perhaps also bf, didn't yet
> > check), about "no unpinned buffers available".  I was worried for a moment
> > that this could actually be related to the bulk extension patch.
> >
> > But it looks like it's older - and not caused by direct_io support (except
> > by way of the test existing).  I reproduced the issue locally by setting
> > s_b even lower, to 16, and made the ERROR a PANIC.
> >
> > [backtrace]

I get an ERROR, not a PANIC:

$ git rev-parse HEAD
2e57ffe12f6b5c1498f29cb7c0d9e17c797d9da6
$ git diff -U0
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
index f5bf0b1..8f0241b 100644
--- a/src/test/modules/test_misc/t/004_io_direct.pl
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -25 +25 @@ io_direct = 'data,wal,wal_init'
-shared_buffers = '256kB' # tiny to force I/O
+shared_buffers = 16
$ ./configure -C --enable-debug --enable-cassert --enable-depend --enable-tap-tests --with-tcl --with-python --with-perl
$ make -C src/test/modules/test_misc check PROVE_TESTS=t/004_io_direct.pl
# +++ tap check in src/test/modules/test_misc +++
t/004_io_direct.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
No subtests run

Test Summary Report
-------------------
t/004_io_direct.pl (Wstat: 7424 Tests: 0 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output
Files=1, Tests=0, 1 wallclock secs ( 0.01 usr 0.00 sys + 0.41 cusr 0.14 csys = 0.56 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:460: check] Error 1
$ grep pinned src/test/modules/test_misc/tmp_check/log/*
src/test/modules/test_misc/tmp_check/log/004_io_direct_main.log:2023-04-08 21:12:46.781 PDT [929628] 004_io_direct.pl ERROR: no unpinned buffers available
src/test/modules/test_misc/tmp_check/log/regress_log_004_io_direct:error running SQL: 'psql:<stdin>:1: ERROR: no unpinned buffers available'

No good reason to PANIC there, so the path to PANIC may be fixable.

> > If you look at log_newpage_range(), it's not surprising that we get this
> > error - it pins up to 32 buffers at once.
> >
> > Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is
> > from c6b92041d385.

> > Do we care about fixing this in the backbranches?  Probably not, given
> > there haven't been user complaints?

I would not.  This is only going to come up where the user goes out of the way to use near-minimum shared_buffers.

> Here's a quick prototype of this approach.

This looks fine.  I'm not enthusiastic about incurring post-startup cycles to cater to allocating less than 512k*max_connections of shared buffers, but I expect the cycles in question are negligible here.
Indeed, I can't reproduce this with (our) checksums on. I also can't reproduce it with O_DIRECT off. I also can't reproduce it if I use "mkdir pgdata && chattr +C pgdata && initdb -D pgdata" to have a pgdata directory with copy-on-write and (their) checksums disabled. But it reproduces quite easily with COW on (default behaviour), io_direct=data, debug_parallel_query=regress, and the create table as ...; update ...; select count(*) ... sequence from that test. Unfortunately my mental model of btrfs is extremely limited, basically just "something a bit like ZFS". FWIW I've been casually following along with OpenZFS's ongoing O_DIRECT project, and I know that the plan there is to make a temporary stable copy if checksums and other features are on (a bit like PostgreSQL does for the same reason, as you reminded us). Time will tell how that works out, but it *seems* like all available modes would therefore work correctly for us, with different tradeoffs (ie if you want the fastest zero-copy I/O, don't use checksums, compression, etc). Here, btrfs seems to be taking a different path that I can't quite make out... I see no warning/error about a checksum failure like [1], and we apparently managed to read something other than a mix of the old and new page contents (which, based on your hypothesis, should just leave it indeterminate whether the hint bit changes were captured or not, while the rest of the page should be stable, right?). It's like the page time-travelled or got scrambled in some other way, but it didn't tell us? I'll try to dig further... [1] https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Gotchas.html#Direct_IO_and_CRCs
On Sun, Apr 9, 2023 at 10:17 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I can run the test in isolation, and it gets an error reliably.

Random idea: it looks like you have compression enabled. What if you turn it off in the directory where the test runs? Something like "btrfs property set <file> compression ..." according to the intergoogles. (I have never used btrfs before 6 minutes ago, but I can't seem to repro this with basic settings in a loopback btrfs filesystem.)
Didn't seem to make any difference.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
On Sun, Apr 9, 2023 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here, btrfs seems to be taking a different path that I can't quite make out... I see no warning/error about a checksum failure like [1], and we apparently managed to read something other than a mix of the old and new page contents (which, based on your hypothesis, should just leave it indeterminate whether the hint bit changes were captured or not, and the rest of the page should be stable, right). It's like the page time-travelled or got scrambled in some other way, but it didn't tell us? I'll try to dig further...

I think there are two separate bad phenomena.

1. A concurrent modification of the user space buffer while writing breaks the checksum, so you can't read the data back in. I can reproduce that with a stand-alone program, attached. The "verifier" process occasionally reports EIO while reading, unless you comment out the "scribbler" process's active line. The system log/dmesg gets some warnings.

2. The crake-style failure doesn't involve any reported checksum failures or errors, and I'm not sure if another process is even involved. I attach a complete syscall trace of a repro session. (I tried to get strace to dump 8192 byte strings, but then it doesn't repro, so we have only the start of the data transferred for each page.) Working back from the error message,

ERROR: invalid page in block 78 of relation base/5/16384,

we have a page at offset 638976, and we can find all system calls that touched that offset:

[pid 26031] 23:26:48.521123 pwritev(50, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8192}], 1, 638976) = 8192

[pid 26040] 23:26:48.568975 pwrite64(5, "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4 \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192

[pid 26040] 23:26:48.593157 pread64(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 638976) = 8192

In between the write of non-zeros and the read of zeros, nothing that I can grok seems to happen that could justify that, but perhaps someone else will see something that I'm missing. We pretty much just have the parallel worker scanning the table, and writing stuff out as it does it. This was obtained with:

strace -f --absolute-timestamps=time,us ~/install/bin/postgres -D pgdata -c io_direct=data -c shared_buffers=256kB -c wal_level=minimal -c max_wal_senders=0 2>&1 | tee trace.log

The repro is just:

set debug_parallel_query=regress;
drop table if exists t;
create table t as select generate_series(1, 10000);
update t set generate_series = 1;
select count(*) from t;

Occasionally it fails in a different way: after create table t, later references to t can't find it in the catalogs, but there is no invalid page error. Perhaps the freaky zeros are happening one 4KB page at a time, and if you get two in a row it might look like an empty catalog page and pass validation.
Attachment
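The attachment isn't reproduced here, but a condensed, hypothetical sketch of the scribble-while-writing experiment could look like the following (file name, byte offset and iteration count are invented, and the real program uses separate scribbler and verifier processes):

#define _GNU_SOURCE		/* for O_DIRECT on Linux */
#include <fcntl.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSZ 8192

int
main(void)
{
	/* MAP_SHARED so the child's scribbles race the parent's writes;
	 * mmap also provides the 4KB alignment that O_DIRECT wants. */
	char	   *buf = mmap(NULL, BLKSZ, PROT_READ | PROT_WRITE,
						   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	char	   *rbuf;
	int			fd = open("scribble-test", O_RDWR | O_CREAT | O_DIRECT, 0600);
	pid_t		scribbler;

	if (buf == MAP_FAILED || fd < 0 ||
		posix_memalign((void **) &rbuf, 4096, BLKSZ) != 0)
	{
		perror("setup");
		return 1;
	}
	memset(buf, 'x', BLKSZ);

	scribbler = fork();
	if (scribbler == 0)
	{
		for (;;)
			buf[100]++;	/* modify the buffer during I/O; comment
						 * this out and the errors should stop */
	}

	for (int i = 0; i < 100000; i++)
	{
		if (pwrite(fd, buf, BLKSZ, 0) != BLKSZ)
			perror("pwrite");
		if (pread(fd, rbuf, BLKSZ, 0) != BLKSZ)
			perror("pread");	/* the "verifier" side: expect
								 * occasional EIO on btrfs */
	}

	kill(scribbler, SIGKILL);
	return 0;
}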
On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan <andrew@dunslane.net> wrote: > Didn't seem to make any difference. Thanks for testing. I think it's COW (and I think that implies also checksums?) that needs to be turned off, at least based on experiments here.
> Thanks for testing. I think it's COW (and I think that implies also checksums?) that needs to be turned off, at least based on experiments here.
Googling agrees with you about checksums. But I'm still wondering if we shouldn't disable COW for the build directory etc. It is suggested at [1]:
Recommend to set nodatacow – turn cow off – for the files that require fast IO and tend to get very big and get alot of random writes: such VMDK (vm disks) files and the like.
I'll give it a whirl.
cheers
andrew
[1] <http://www.infotinks.com/btrfs-disabling-cow-file-directory-nodatacow/>
-- Andrew Dunstan EDB: https://www.enterprisedb.com
> Googling agrees with you about checksums. But I'm still wondering if we shouldn't disable COW for the build directory etc. It is suggested at [1]:
>
> Recommend to set nodatacow – turn cow off – for the files that require fast IO and tend to get very big and get alot of random writes: such VMDK (vm disks) files and the like.
>
> I'll give it a whirl.
with COW disabled, I can no longer generate a failure with the test.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes: > we have a page at offset 638976, and we can find all system calls that > touched that offset: > [pid 26031] 23:26:48.521123 pwritev(50, > [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > iov_len=8192}], 1, 638976) = 8192 > [pid 26040] 23:26:48.568975 pwrite64(5, > "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4 > \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192 > [pid 26040] 23:26:48.593157 pread64(6, > "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., > 8192, 638976) = 8192 Boy, it's hard to look at that trace and not call it a filesystem bug. Given the apparent dependency on COW, I wonder if this has something to do with getting confused about which copy is current? Another thing that struck me is that the two calls from pid 26040 are issued on different FDs. I checked the strace log and verified that these do both refer to "base/5/16384". It looks like there was a cache flush at about 23:26:48.575023 that caused 26040 to close and later reopen all its database relation FDs. Maybe that is somehow contributing to the filesystem's confusion? And more to the point, could that explain why other O_DIRECT users aren't up in arms over this bug? Maybe they don't switch FDs as readily as we do. regards, tom lane
Hi,

On 2023-04-08 21:29:54 -0700, Noah Misch wrote:
> On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:
> > On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> > > There were some failures in CI (e.g. [1], and perhaps also bf, didn't yet check), about "no unpinned buffers available". I was worried for a moment that this could actually be related to the bulk extension patch.
> > >
> > > But it looks like it's older - and not caused by direct_io support (except by way of the test existing). I reproduced the issue locally by setting s_b even lower, to 16, and made the ERROR a PANIC.
> > >
> > > [backtrace]
>
> I get an ERROR, not a PANIC:

What I meant is that I changed the code to use PANIC, to make it easier to get a backtrace.

> > > If you look at log_newpage_range(), it's not surprising that we get this error - it pins up to 32 buffers at once.
> > >
> > > Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from c6b92041d385.
> >
> > Do we care about fixing this in the backbranches? Probably not, given there haven't been user complaints?
>
> I would not. This is only going to come up where the user goes out of the way to use near-minimum shared_buffers.

It's not *just* that scenario. With a few concurrent connections you can get into problematic territory even with halfway reasonable shared buffers.

> > Here's a quick prototype of this approach.
>
> This looks fine. I'm not enthusiastic about incurring post-startup cycles to cater to allocating less than 512k*max_connections of shared buffers, but I expect the cycles in question are negligible here.

Yea, I can't imagine it'd matter, compared to the other costs. Arguably it'd allow us to crank up the maximum batch size further, even.

Greetings,

Andres Freund
On Sun, Apr 09, 2023 at 02:45:16PM -0700, Andres Freund wrote:
> On 2023-04-08 21:29:54 -0700, Noah Misch wrote:
> > On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:
> > > On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> > > > If you look at log_newpage_range(), it's not surprising that we get this error - it pins up to 32 buffers at once.
> > > >
> > > > Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from c6b92041d385.
> > >
> > > Do we care about fixing this in the backbranches? Probably not, given there haven't been user complaints?
> >
> > I would not. This is only going to come up where the user goes out of the way to use near-minimum shared_buffers.
>
> It's not *just* that scenario. With a few concurrent connections you can get into problematic territory even with halfway reasonable shared buffers.

I am not familiar with such cases. You could get there with 64MB shared buffers and 256 simultaneous commits of new-relfilenode-creating transactions, but I'd still file that under going out of one's way to use tiny shared buffers relative to the write activity. What combination did you envision?
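(For scale, taking the numbers in that example at face value: 64MB of shared buffers is 64MB / 8KB = 8192 buffers, and log_newpage_range() pins up to 32 buffers at once, so 256 simultaneous callers could hold up to 256 x 32 = 8192 pins, enough to leave no buffer unpinned.)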
On Mon, Apr 10, 2023 at 8:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Boy, it's hard to look at that trace and not call it a filesystem bug. Agreed. > Given the apparent dependency on COW, I wonder if this has something > to do with getting confused about which copy is current? Yeah, I suppose it would require bogus old page versions (or I guess alternatively completely mixed up page offsets) rather than bogus zeroed pages to explain the too-high count observed in one of crake's failed runs: I guess it counted some pre-updated tuples that were supposed to be deleted and then counted the post-updated tuples on later pages (insert joke about the Easter variant of the Halloween problem). It's just that in the runs I've managed to observe and analyse, the previous version always happened to be zeros.
Hi,

On 2023-04-10 00:17:12 +1200, Thomas Munro wrote:
> I think there are two separate bad phenomena.
>
> 1. A concurrent modification of the user space buffer while writing breaks the checksum, so you can't read the data back in. I can reproduce that with a stand-alone program, attached. The "verifier" process occasionally reports EIO while reading, unless you comment out the "scribbler" process's active line. The system log/dmesg gets some warnings.

I think we really need to think about whether we eventually want to do something to avoid modifying pages while IO is in progress. The only alternative is for filesystems to make copies of everything in the IO path, which is far from free (and obviously prevents using DMA for the whole IO). The copy we do to avoid the same problem when checksums are enabled shows up quite prominently in write-heavy profiles, so there's a "purely postgres" reason to avoid these issues too.

> 2. The crake-style failure doesn't involve any reported checksum failures or errors, and I'm not sure if another process is even involved. I attach a complete syscall trace of a repro session. (I tried to get strace to dump 8192 byte strings, but then it doesn't repro, so we have only the start of the data transferred for each page.) Working back from the error message,
>
> ERROR: invalid page in block 78 of relation base/5/16384,
>
> we have a page at offset 638976, and we can find all system calls that touched that offset:
>
> [pid 26031] 23:26:48.521123 pwritev(50, [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=8192}], 1, 638976) = 8192
>
> [pid 26040] 23:26:48.568975 pwrite64(5, "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4 \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192
>
> [pid 26040] 23:26:48.593157 pread64(6, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 8192, 638976) = 8192
>
> In between the write of non-zeros and the read of zeros, nothing that I can grok seems to happen that could justify that, but perhaps someone else will see something that I'm missing. We pretty much just have the parallel worker scanning the table, and writing stuff out as it does it. This was obtained with:

Have you tried to write a reproducer for this that doesn't involve postgres? It'd certainly be interesting to know the precise conditions for this. E.g., can this also happen without O_DIRECT, if cache pressure is high enough for the page to get evicted soon after (potentially simulated with fadvise or such)?

We should definitely let the btrfs folks know of this issue... It's possible that this bug was recently introduced even. What kernel version did you repro this on, Thomas?

I wonder if we should have a postgres-io-torture program in our tree for some of these things. We've found issues with our assumptions on several operating systems and filesystems, without systematically looking, or even stressing IO all that hard in our tests.

Greetings,

Andres Freund
On Mon, Apr 10, 2023 at 04:58, Andres Freund <andres@anarazel.de> wrote:
> We should definitely let the btrfs folks know of this issue... It's possible that this bug was recently introduced even. What kernel version did you repro this on, Thomas?

The btrfs mailing list is currently discussing Direct I/O data corruption. There's no patch at the moment; they are still discussing how to address it:

https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/

Ciao,
Gelma
On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> Have you tried to write a reproducer for this that doesn't involve postgres?

I tried a bit. I'll try harder soon.

> ... What kernel version did you repro this on, Thomas?

Debian's 6.0.10-2 kernel (Debian 12 on a random laptop). Here's how I set up a test btrfs in case someone else wants a head start:

truncate -s2G 2GB.img
sudo losetup --show --find 2GB.img
sudo mkfs -t btrfs /dev/loop0   # the device name shown by losetup
sudo mkdir /mnt/tmp
sudo mount /dev/loop0 /mnt/tmp
sudo chown $(whoami) /mnt/tmp

cd /mnt/tmp
/path/to/initdb -D pgdata
... (see instructions further up for postgres command line + queries to run)
On Mon, Apr 10, 2023 at 7:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Debian's 6.0.10-2 kernel (Debian 12 on a random laptop).

Realising I hadn't updated for a bit, I did so and it still reproduces on:

$ uname -a
Linux x1 6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19) x86_64 GNU/Linux
Hi,

On 2023-04-10 19:27:27 +1200, Thomas Munro wrote:
> On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> > Have you tried to write a reproducer for this that doesn't involve postgres?
>
> I tried a bit. I'll try harder soon.
>
> > ... What kernel version did you repro this on, Thomas?
>
> Debian's 6.0.10-2 kernel (Debian 12 on a random laptop). Here's how I set up a test btrfs in case someone else wants a head start:
>
> truncate -s2G 2GB.img
> sudo losetup --show --find 2GB.img
> sudo mkfs -t btrfs /dev/loop0   # the device name shown by losetup
> sudo mkdir /mnt/tmp
> sudo mount /dev/loop0 /mnt/tmp
> sudo chown $(whoami) /mnt/tmp
>
> cd /mnt/tmp
> /path/to/initdb -D pgdata
> ... (see instructions further up for postgres command line + queries to run)

I initially failed to repro the issue with these instructions. Turns out that the problem does not happen if huge pages are in use - I'd configured huge pages, so the default huge_pages=try succeeded. As soon as I disable huge_pages explicitly, it reproduces.

Another interesting bit is that if checksums are enabled, I also can't reproduce the issue. Together with the huge_page issue, it does suggest that this is somehow related to page faults. Which fits with the thread Andrea referenced...

Greetings,

Andres Freund
Hi,

On 2023-04-10 18:55:26 -0700, Andres Freund wrote:
> On 2023-04-10 19:27:27 +1200, Thomas Munro wrote:
> > On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> > > Have you tried to write a reproducer for this that doesn't involve postgres?
> >
> > I tried a bit. I'll try harder soon.
> >
> > > ... What kernel version did you repro this on, Thomas?
> >
> > Debian's 6.0.10-2 kernel (Debian 12 on a random laptop). Here's how I set up a test btrfs in case someone else wants a head start:
> >
> > truncate -s2G 2GB.img
> > sudo losetup --show --find 2GB.img
> > sudo mkfs -t btrfs /dev/loop0   # the device name shown by losetup
> > sudo mkdir /mnt/tmp
> > sudo mount /dev/loop0 /mnt/tmp
> > sudo chown $(whoami) /mnt/tmp
> >
> > cd /mnt/tmp
> > /path/to/initdb -D pgdata
> > ... (see instructions further up for postgres command line + queries to run)
>
> I initially failed to repro the issue with these instructions. Turns out that the problem does not happen if huge pages are in use - I'd configured huge pages, so the default huge_pages=try succeeded. As soon as I disable huge_pages explicitly, it reproduces.
>
> Another interesting bit is that if checksums are enabled, I also can't reproduce the issue. Together with the huge_page issue, it does suggest that this is somehow related to page faults. Which fits with the thread Andrea referenced...

The last iteration of the fix that I could find is:
https://lore.kernel.org/linux-btrfs/20230328051957.1161316-1-hch@lst.de/T/#m1afdc3fe77e10a97302e7d80fed3efeaa297f0f7

And the fix has been merged into
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next

I think that means it'll have to wait for 6.4 development to open (in a few weeks), and then will be merged into the stable branches from there.

Greetings,

Andres Freund
On Tue, Apr 11, 2023 at 2:15 PM Andres Freund <andres@anarazel.de> wrote:
> And the fix has been merged into
> https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next
>
> I think that means it'll have to wait for 6.4 development to open (in a few weeks), and then will be merged into the stable branches from there.

Great! Let's hope/assume for now that that'll fix phenomenon #2. That still leaves the checksum-vs-concurrent-modification thing that I called phenomenon #1, which we've not actually hit with PostgreSQL yet but is clearly possible and can be seen with the stand-alone repro program I posted upthread. You wrote:

On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> I think we really need to think about whether we eventually want to do something to avoid modifying pages while IO is in progress. The only alternative is for filesystems to make copies of everything in the IO path, which is far from free (and obviously prevents using DMA for the whole IO). The copy we do to avoid the same problem when checksums are enabled shows up quite prominently in write-heavy profiles, so there's a "purely postgres" reason to avoid these issues too.

+1

I wonder what the other file systems that maintain checksums (see list at [1]) do when the data changes underneath a write. ZFS's policy is conservative[2], while BTRFS took the demons-will-fly-out-of-your-nose route. I can see arguments for both approaches (ZFS can only reach the zero-copy optimum by turning off checksums completely, while BTRFS is happy to assume that if you break this programming rule that is not written down anywhere then you must never want to see your data ever again). What about ReFS? CephFS?

I tried to find out what POSIX says about this WRT synchronous pwrite() (as Tom suggested, maybe we're doing something POSIX doesn't allow), but couldn't find it in my first attempt. It *does* say it's undefined for aio_write() (which means that my prototype io_method=posix_aio code that uses that stuff is undefined in the presence of hint bit modifications). I don't really see why it should vary between synchronous and asynchronous interfaces (considering the existence of threads, shared memory etc, the synchronous interface only removes one thread from the list of possible suspects that could flip some bits). But yeah, in any case, it doesn't seem great that we do that.

[1] https://en.wikipedia.org/wiki/Comparison_of_file_systems#Block_capabilities
[2] https://openzfs.topicbox.com/groups/developer/T950b02acdf392290/odirect-semantics-in-zfs
On Tue, Apr 11, 2023 at 2:31 PM Thomas Munro <thomas.munro@gmail.com> wrote: > I tried to find out what POSIX says about this (But of course whatever it might say is of especially limited value when O_DIRECT is in the picture, being completely unstandardised. Really I guess all they meant was "if you *copy* something that's moving, who knows which bits you'll copy"... not "your data might be incinerated with lasers".)
Hi,

On 2023-04-09 16:40:54 -0700, Noah Misch wrote:
> On Sun, Apr 09, 2023 at 02:45:16PM -0700, Andres Freund wrote:
> > It's not *just* that scenario. With a few concurrent connections you can get into problematic territory even with halfway reasonable shared buffers.
>
> I am not familiar with such cases. You could get there with 64MB shared buffers and 256 simultaneous commits of new-relfilenode-creating transactions, but I'd still file that under going out of one's way to use tiny shared buffers relative to the write activity. What combination did you envision?

I'd not say it's common, but it's less crazy than running with 128kB of s_b...

There's also the issue that log_newpage_range() is used in a number of places where we could have a lot of pre-existing buffer pins. So pinning another 64 buffers could tip us over.

Greetings,

Andres Freund
Hi,

I'm hitting a panic in t_004_io_direct. The build is running on overlayfs on tmpfs/ext4 (upper/lower), which is probably a weird combination but has worked well for building everything over the last decade. On Debian unstable:

PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

Bailout called. Further testing stopped: pg_ctl start failed
t/004_io_direct.pl ..............
Dubious, test returned 255 (wstat 65280, 0xff00)
No subtests run

Test Summary Report
-------------------
t/004_io_direct.pl (Wstat: 65280 (exited 255) Tests: 0 Failed: 0)
  Non-zero exit status: 255
  Parse errors: No plan found in TAP output
Files=4, Tests=65, 9 wallclock secs ( 0.03 usr 0.02 sys + 3.78 cusr 1.48 csys = 5.31 CPU)
Result: FAIL

******** build/src/test/modules/test_misc/tmp_check/log/004_io_direct_main.log ********
2023-04-11 23:21:16.431 UTC [25991] LOG: starting PostgreSQL 16devel (Debian 16~~devel-1.pgdg+~20230411.2256.gc03c2ea) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2023-04-11 23:21:16.431 UTC [25991] LOG: listening on Unix socket "/tmp/s0C4hWQq82/.s.PGSQL.54693"
2023-04-11 23:21:16.433 UTC [25994] LOG: database system was shut down at 2023-04-11 23:21:16 UTC
2023-04-11 23:21:16.434 UTC [25994] PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument
2023-04-11 23:21:16.525 UTC [25991] LOG: startup process (PID 25994) was terminated by signal 6: Aborted
2023-04-11 23:21:16.525 UTC [25991] LOG: aborting startup due to startup process failure
2023-04-11 23:21:16.526 UTC [25991] LOG: database system is shut down

******** build/src/test/modules/test_misc/tmp_check/t_004_io_direct_main_data/pgdata/core ********

warning: Can't open file /dev/shm/PostgreSQL.3457641370 during file-backed mapping note processing
warning: Can't open file /dev/shm/PostgreSQL.2391834648 during file-backed mapping note processing
warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing
warning: Can't open file /SYSV00000dea (deleted) during file-backed mapping note processing
[New LWP 25994]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: main: startup '.
Program terminated with signal SIGABRT, Aborted.
#0 0x00007f176c591ccc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#0 0x00007f176c591ccc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#1 0x00007f176c542ef2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#2 0x00007f176c52d472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
No symbol table info available.
#3 0x000055a7ba7978a1 in errfinish (filename=<optimized out>, lineno=<optimized out>, funcname=0x55a7ba810560 <__func__.47> "XLogFileInitInternal") at ./build/../src/backend/utils/error/elog.c:604
        edata = 0x55a7baae3e20 <errordata>
        elevel = 23
        oldcontext = 0x55a7bb471590
        econtext = 0x0
        __func__ = "errfinish"
#4 0x000055a7ba21759c in XLogFileInitInternal (logsegno=1, logtli=logtli@entry=1, added=added@entry=0x7ffebc6c8a3f, path=path@entry=0x7ffebc6c8a40 "pg_wal/00000001", '0' <repeats 15 times>, "1") at ./build/../src/backend/access/transam/xlog.c:2944
        __errno_location = <optimized out>
        tmppath = "0\214l\274\376\177\000\000\321\330~\272\247U\000\000\005Q\223\272\247U\000\000p\214l\274\376\177\000\000`\214l\274\376\177\000\000\212\335~\000\v", '\000' <repeats 31 times>, "\247U\000\000\000\000\000\000\000\177\000\000*O\202\272\247U\000\000\254\206l\274\376\177\000\000\000\000\000\000\v", '\000' <repeats 23 times>, "0\000\000\000\000\000\000\000\247U\000\000\000\000\000\000\000\000\000\000\001Q\223\272\247U\000\000\240\215l\274\376\177\000\000\376\377\377\377\000\000\000\0000\207l\274\376\177\000\000[\326~\272\247U\000\0000\207l\274\376\177\000\000"...
        installed_segno = 0
        max_segno = <optimized out>
        fd = <optimized out>
        save_errno = <optimized out>
        open_flags = 194
        __func__ = "XLogFileInitInternal"
#5 0x000055a7ba35a1d5 in XLogFileInit (logsegno=<optimized out>, logtli=logtli@entry=1) at ./build/../src/backend/access/transam/xlog.c:3099
        ignore_added = false
        path = "pg_wal/00000001", '0' <repeats 15 times>, "1\000\220\312P\273\247U\000\000/\375Yl\027\177\000\000\220\252P\273\247U\000\000\001", '\000' <repeats 15 times>, "\220\252P\273\247U\000\000\300\212l\274\376\177\000\000\002\261{\272\247U\000\000\220\252P\273\247U\000\000\220\252P\273\247U\000\000\001", '\000' <repeats 15 times>, "\340\212l\274\376\177\000\000\021\032|\272\247U\000\000\000\000\000\000\000\000\000\000\240\312P\273\247U\000\0000\213l\274\376\177\000\000\350\262{\272\247U\000\000\001", '\000' <repeats 16 times>, "\256\023i\027\177\000\000"...
        fd = <optimized out>
        __func__ = "XLogFileInit"
#6 0x000055a7ba35bab3 in XLogWrite (WriteRqst=..., tli=tli@entry=1, flexible=flexible@entry=false) at ./build/../src/backend/access/transam/xlog.c:2137
        EndPtr = 21954560
        ispartialpage = true
        last_iteration = <optimized out>
        finishing_seg = <optimized out>
        curridx = 7
        npages = 0
        startidx = 0
        startoffset = 0
        __func__ = "XLogWrite"
#7 0x000055a7ba35c8e0 in XLogFlush (record=21949600) at ./build/../src/backend/access/transam/xlog.c:2638
        insertpos = 21949600
        WriteRqstPtr = 21949600
        WriteRqst = <optimized out>
        insertTLI = 1
        __func__ = "XLogFlush"
#8 0x000055a7ba36118e in XLogReportParameters () at ./build/../src/backend/access/transam/xlog.c:7620
        xlrec = {MaxConnections = 100, max_worker_processes = 8, max_wal_senders = 0, max_prepared_xacts = 0, max_locks_per_xact = 64, wal_level = 1, wal_log_hints = false, track_commit_timestamp = false}
        recptr = <optimized out>
#9 StartupXLOG () at ./build/../src/backend/access/transam/xlog.c:5726
        Insert = <optimized out>
        checkPoint = <optimized out>
        wasShutdown = true
        didCrash = <optimized out>
        haveTblspcMap = false
        haveBackupLabel = false
        EndOfLog = 21949544
        EndOfLogTLI = <optimized out>
        newTLI = 1
        performedWalRecovery = <optimized out>
        endOfRecoveryInfo = <optimized out>
        abortedRecPtr = <optimized out>
        missingContrecPtr = 0
        oldestActiveXID = <optimized out>
        promoted = false
        __func__ = "StartupXLOG"
#10 0x000055a7ba5b4d00 in StartupProcessMain () at ./build/../src/backend/postmaster/startup.c:267
        No locals.
#11 0x000055a7ba5ab0cf in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess) at ./build/../src/backend/postmaster/auxprocess.c:141
        __func__ = "AuxiliaryProcessMain"
#12 0x000055a7ba5b0aa3 in StartChildProcess (type=StartupProcess) at ./build/../src/backend/postmaster/postmaster.c:5369
        pid = <optimized out>
        __func__ = "StartChildProcess"
        save_errno = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
        __errno_location = <optimized out>
#13 0x000055a7ba5b45d6 in PostmasterMain (argc=argc@entry=4, argv=argv@entry=0x55a7bb471450) at ./build/../src/backend/postmaster/postmaster.c:1455
        opt = <optimized out>
        status = <optimized out>
        userDoption = <optimized out>
        listen_addr_saved = <optimized out>
        i = <optimized out>
        output_config_variable = <optimized out>
        __func__ = "PostmasterMain"
#14 0x000055a7ba29fd62 in main (argc=4, argv=0x55a7bb471450) at ./build/../src/backend/main/main.c:200
        do_check_root = <optimized out>

Apologies if this was already reported elsewhere in the thread; I skimmed it but the problems looked different.

Christoph
On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote:
> I'm hitting a panic in t_004_io_direct. The build is running on overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird combination but has worked well for building everything over the last decade. On Debian unstable:
>
> PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument

Hi Christoph,

That's an interesting one. I was half expecting to see that on some unusual systems, which is why I made the test check which OS it is and exclude those that are known to fail with EINVAL or ENOTSUPP on their common/typical file systems. But if it's going to happen on Linux too, that's not going to work. I have a new idea: perhaps it is possible to try to open a file with O_DIRECT from perl, and if it fails like that, skip the test. Looking into that now.
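In C, that pre-flight probe amounts to something like the following (an illustration only; the actual check discussed below is written in Perl, and the scratch file name here is made up):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
#ifdef O_DIRECT
	int			fd = open("direct-probe.tmp",
						  O_RDWR | O_CREAT | O_DIRECT, 0600);

	if (fd < 0)
	{
		/* EINVAL here, as on tmpfs, would mean: skip the test */
		perror("open with O_DIRECT");
		return 1;
	}
	close(fd);
	unlink("direct-probe.tmp");
	printf("O_DIRECT accepted\n");
	return 0;
#else
	printf("no O_DIRECT on this platform\n");
	return 1;
#endif
}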
On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote: > > I'm hitting a panic in t_004_io_direct. The build is running on > > overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird > > combination but has worked well for building everything over the last > > decade. On Debian unstable: > > > > PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument > ... I have a new idea: perhaps it is possible to try > to open a file with O_DIRECT from perl, and if it fails like that, > skip the test. Looking into that now. I think I have that working OK. Any Perl hackers want to comment on my use of IO::File (copied from examples on the internet that showed how to use O_DIRECT)? I am not much of a perl hacker but according to my package manager, IO/File.pm came with perl itself. And the Fcntl eval trick that I copied from File::stat, and the perl-critic suppression that requires? I tested this on OpenBSD, which has no O_DIRECT, so that tests the first reason to skip. Does it skip OK on your system, for the second reason? Should we be more specific about the errno? As far as I know, the only systems on the build farm that should be affected by this change are the illumos boxen (they have O_DIRECT, unlike Solaris, but perl's $^O couldn't tell the difference between Solaris and illumos, so they didn't previously run this test). One thing I resisted the urge to do is invent PG_TEST_SKIP, a sort of anti-PG_TEST_EXTRA. I think I'd rather hear about it if there is a system out there that passes the pre-flight check, but fails later on, because we'd better investigate why. That's basically the point of shipping this "developer only" feature long before serious use of it.
Attachment
On Wed, Apr 12, 2023 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote:
> > > I'm hitting a panic in t_004_io_direct. The build is running on overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird combination but has worked well for building everything over the last decade. On Debian unstable:

After trying a couple of things and doing some googling, it looks like it's tmpfs that rejects it, not overlayfs, so I'd adjust that commit message slightly. Of course it's a completely reasonable thing to expect the tests to pass (or in this case be skipped) in a tmpfs, eg /tmp on some distributions. (It's strange to contemplate what O_DIRECT means for tmpfs, considering that it *is* the page cache, kinda, and I see people have been arguing about that for a couple of decades since O_DIRECT was added to Linux; doesn't seem that helpful to me that it rejects it, but 🤷.)
> I think I have that working OK. Any Perl hackers want to comment on my use of IO::File (copied from examples on the internet that showed how to use O_DIRECT)? I am not much of a perl hacker but according to my package manager, IO/File.pm came with perl itself. And the Fcntl eval trick that I copied from File::stat, and the perl-critic suppression that requires?
I think you can probably replace a lot of the magic here by simply saying
if (Fcntl->can("O_DIRECT")) ...
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Thomas Munro <thomas.munro@gmail.com> writes:
> I think I have that working OK. Any Perl hackers want to comment on my use of IO::File (copied from examples on the internet that showed how to use O_DIRECT)? I am not much of a perl hacker but according to my package manager, IO/File.pm came with perl itself.

Indeed, and it has been since perl 5.003_07, released in 1996. And Fcntl has known about O_DIRECT since perl 5.6.0, released in 2000, so we can safely use both.

> And the Fcntl eval trick that I copied from File::stat, and the perl-critic suppression that requires?
[…]
> +	no strict 'refs';    ## no critic (ProhibitNoStrict)
> +	my $val = eval { &{'Fcntl::O_DIRECT'} };
> +	if (defined $val)

This trick is only needed in File::stat because it's constructing the symbol name dynamically. And because Fcntl by default exports all the O_* and F_* constants it knows about, we can simply do:

if (defined &O_DIRECT)

> +	{
> +		use Fcntl qw(O_DIRECT);

The `use Fcntl;` above will already have imported this, so this is redundant.

- ilmari
Andrew Dunstan <andrew@dunslane.net> writes: > On 2023-04-12 We 01:48, Thomas Munro wrote: >> On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com> wrote: >>> On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org> wrote: >>>> I'm hitting a panic in t_004_io_direct. The build is running on >>>> overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird >>>> combination but has worked well for building everything over the last >>>> decade. On Debian unstable: >>>> >>>> PANIC: could not open file "pg_wal/000000010000000000000001": Invalid argument >>> ... I have a new idea: perhaps it is possible to try >>> to open a file with O_DIRECT from perl, and if it fails like that, >>> skip the test. Looking into that now. >> I think I have that working OK. Any Perl hackers want to comment on >> my use of IO::File (copied from examples on the internet that showed >> how to use O_DIRECT)? I am not much of a perl hacker but according to >> my package manager, IO/File.pm came with perl itself. And the Fcntl >> eval trick that I copied from File::stat, and the perl-critic >> suppression that requires? > > > I think you can probably replace a lot of the magic here by simply saying > > > if (Fcntl->can("O_DIRECT")) ... Fcntl->can() is true for all constants that Fcntl knows about, whether or not they are defined for your OS. `defined &O_DIRECT` is the simplest check, see my other reply to Thomas. > cheers > > > andrew - ilmari
> Fcntl->can() is true for all constants that Fcntl knows about, whether or not they are defined for your OS. `defined &O_DIRECT` is the simplest check, see my other reply to Thomas.
My understanding was that Fcntl only exported constants known to the OS. That's certainly what its docco suggests, e.g.:
By default your system's F_* and O_* constants (eg, F_DUPFD and O_CREAT)
and the FD_CLOEXEC constant are exported into your namespace.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-04-12 We 10:23, Dagfinn Ilmari Mannsåker wrote:
>> My understanding was that Fcntl only exported constants known to the OS. That's certainly what its docco suggests, e.g.:
>>
>>     By default your system's F_* and O_* constants (eg, F_DUPFD and O_CREAT)
>>     and the FD_CLOEXEC constant are exported into your namespace.

It's a bit more magical than that (this is Perl after all). They are all exported (which implicitly creates stubs visible to `->can()`, similarly to forward declarations like `sub O_FOO;`), but only the defined ones (`#ifdef O_FOO` is true) are defined (`defined &O_FOO` is true). The rest fall through to an AUTOLOAD¹ function that throws an exception for undefined ones.

Here's an example (Fcntl knows O_RAW, but Linux does not define it):

$ perl -E '
    use strict;
    use Fcntl;
    say "can", main->can("O_RAW") ? "" : "not";
    say defined &O_RAW ? "" : "not ", "defined";
    say O_RAW;'
can
not defined
Your vendor has not defined Fcntl macro O_RAW, used at -e line 4

While O_DIRECT is defined:

$ perl -E '
    use strict;
    use Fcntl;
    say "can", main->can("O_DIRECT") ? "" : "not";
    say defined &O_DIRECT ? "" : "not ", "defined";
    say O_DIRECT;'
can
defined
16384

And O_FOO is unknown to Fcntl (the parens on `O_FOO()` are to make it not a bareword, which would be a compile error under `use strict;` when the sub doesn't exist at all):

$ perl -E '
    use strict;
    use Fcntl;
    say "can", main->can("O_FOO") ? "" : "not";
    say defined &O_FOO ? "" : "not ", "defined";
    say O_FOO();'
cannot
not defined
Undefined subroutine &main::O_FOO called at -e line 4.

- ilmari

[1] https://perldoc.perl.org/perlsub#Autoloading
> It's a bit more magical than that (this is Perl after all). They are all exported (which implicitly creates stubs visible to `->can()`, similarly to forward declarations like `sub O_FOO;`), but only the defined ones (`#ifdef O_FOO` is true) are defined (`defined &O_FOO` is true). The rest fall through to an AUTOLOAD function that throws an exception for undefined ones.
*grumble* a bit too magical for my taste. Thanks for the correction.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Thanks both for looking, and thanks for the explanation Ilmari. Pushed with your improvements. The 4 CI systems run the tests (Windows and Mac by special always-expected-to-work case, Linux and FreeBSD by successful pre-flight perl test of O_DIRECT), and I also tested three unusual systems, two that skip for different reasons and one that will henceforth run this test on the build farm so I wanted to confirm locally first:

Linux/tmpfs: 1..0 # SKIP pre-flight test if we can open a file with O_DIRECT failed: Invalid argument
OpenBSD: t/004_io_direct.pl .............. skipped: no O_DIRECT
illumos: t/004_io_direct.pl .............. ok

(Format different because those last two are autoconf, no meson on my collection of Vagrant images yet...)
Re: Thomas Munro > Linux/tmpfs: 1..0 # SKIP pre-flight test if we can open a file with > O_DIRECT failed: Invalid argument I confirm it's working now: t/004_io_direct.pl .............. skipped: pre-flight test if we can open a file with O_DIRECT failed: Invalid argument All tests successful. Thanks, Christoph
Since the direct I/O commit went in, buildfarm animals curculio and morepork have been issuing warnings like

hashpage.c: In function '_hash_expandtable':
hashpage.c:995: warning: ignoring alignment for stack allocated 'zerobuf'

in places where there's a local variable of type PGIOAlignedBlock or PGAlignedXLogBlock. I'm not sure why only those two animals are unhappy, but I think they have a point: typical ABIs don't guarantee alignment of function stack frames to better than 16 bytes or so. In principle the compiler could support a 4K alignment request anyway by doing the equivalent of alloca(3), but I do not think we can count on that to happen.

regards, tom lane
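For context, the declarations involved are unions along these lines. This is a simplified sketch of the pattern in c.h, with the alignment attribute spelled directly rather than via pg_attribute_aligned(), and the exact members may differ:

#include <stdint.h>

#define BLCKSZ 8192
#define PG_IO_ALIGN_SIZE 4096

typedef union PGIOAlignedBlock
{
	char		data[BLCKSZ];
	double		force_align_d;
	int64_t		force_align_i64;
} __attribute__((aligned(PG_IO_ALIGN_SIZE))) PGIOAlignedBlock;

void
some_function(void)
{
	PGIOAlignedBlock zerobuf;	/* old gcc warns here: "ignoring
								 * alignment for stack allocated
								 * 'zerobuf'" */

	(void) zerobuf;
}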
Hi, On 2023-04-14 13:21:33 -0400, Tom Lane wrote: > Since the direct I/O commit went in, buildfarm animals > curculio and morepork have been issuing warnings like > > hashpage.c: In function '_hash_expandtable': > hashpage.c:995: warning: ignoring alignment for stack allocated 'zerobuf' > > in places where there's a local variable of type PGIOAlignedBlock > or PGAlignedXLogBlock. I'm not sure why only those two animals > are unhappy, but I think they have a point: typical ABIs don't > guarantee alignment of function stack frames to better than > 16 bytes or so. In principle the compiler could support a 4K > alignment request anyway by doing the equivalent of alloca(3), > but I do not think we can count on that to happen. Hm. New-ish compilers seem to be ok with it. Perhaps we should have a configure check whether the compiler is OK with that, and disable direct IO support if not? Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2023-04-14 13:21:33 -0400, Tom Lane wrote: >> ... I'm not sure why only those two animals >> are unhappy, but I think they have a point: typical ABIs don't >> guarantee alignment of function stack frames to better than >> 16 bytes or so. In principle the compiler could support a 4K >> alignment request anyway by doing the equivalent of alloca(3), >> but I do not think we can count on that to happen. > Hm. New-ish compilers seem to be ok with it. Oh! I was misled by the buildfarm label on morepork, which claims it's running clang 10.0.1. But actually, per its configure report, it's running configure: using compiler=gcc (GCC) 4.2.1 20070719 which is the same as curculio. So that explains why nothing else is complaining. I agree we needn't let 15-year-old compilers force us into the mess that would be entailed by not treating these variables as simple locals. > Perhaps we should have a > configure check whether the compiler is OK with that, and disable direct IO > support if not? +1 for that, though. (Also, the fact that these animals aren't actually failing suggests that 004_io_direct.pl needs expansion.) regards, tom lane
Hi, On 2023-04-14 15:21:18 -0400, Tom Lane wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-04-14 13:21:33 -0400, Tom Lane wrote: > >> ... I'm not sure why only those two animals > >> are unhappy, but I think they have a point: typical ABIs don't > >> guarantee alignment of function stack frames to better than > >> 16 bytes or so. In principle the compiler could support a 4K > >> alignment request anyway by doing the equivalent of alloca(3), > >> but I do not think we can count on that to happen. > > > Hm. New-ish compilers seem to be ok with it. > > Oh! I was misled by the buildfarm label on morepork, which claims > it's running clang 10.0.1. But actually, per its configure report, > it's running > > configure: using compiler=gcc (GCC) 4.2.1 20070719 Huh. I wonder if that was an accident in the BF setup. > > Perhaps we should have a > > configure check whether the compiler is OK with that, and disable direct IO > > support if not? > > +1 for that, though. (Also, the fact that these animals aren't > actually failing suggests that 004_io_direct.pl needs expansion.) It's skipped, due to lack of O_DIRECT: [20:50:22] t/004_io_direct.pl .............. skipped: no O_DIRECT So perhaps we don't even need a configure test, just a bit of ifdef'ery? It's a bit annoying structurally, because the PG*Aligned structs are defined in c.h, but the different ways of spelling O_DIRECT are dealt with in fd.h. I wonder if we should try to move those structs to fd.h as well... Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2023-04-14 15:21:18 -0400, Tom Lane wrote: >> +1 for that, though. (Also, the fact that these animals aren't >> actually failing suggests that 004_io_direct.pl needs expansion.) > It's skipped, due to lack of O_DIRECT: > [20:50:22] t/004_io_direct.pl .............. skipped: no O_DIRECT Hmm, I'd say that might be just luck. Whether the compiler honors weird alignment of locals seems independent of whether the OS has O_DIRECT. > So perhaps we don't even need a configure test, just a bit of ifdef'ery? It's > a bit annoying structurally, because the PG*Aligned structs are defined in > c.h, but the different ways of spelling O_DIRECT are dealt with in fd.h. > I wonder if we should try to move those structs to fd.h as well... I doubt they belong in c.h, so that could be plausible; except I'm not convinced that testing O_DIRECT is sufficient. regards, tom lane
On 2023-04-14 21:33, Andres Freund wrote:
>> Oh! I was misled by the buildfarm label on morepork, which claims
>> it's running clang 10.0.1. But actually, per its configure report,
>> it's running
>>
>> configure: using compiler=gcc (GCC) 4.2.1 20070719
>
> Huh. I wonder if that was an accident in the BF setup.

It might have been when I reinstalled it a while ago. I have the following gcc and clang installed:

openbsd_6_9-pgbf$ gcc --version
gcc (GCC) 4.2.1 20070719
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

openbsd_6_9-pgbf$ clang --version
OpenBSD clang version 10.0.1
Target: amd64-unknown-openbsd6.9
Thread model: posix
InstalledDir: /usr/bin

Want me to switch to clang instead?

/Mikael
On Sat, Apr 15, 2023 at 7:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Andres Freund <andres@anarazel.de> writes: > > On 2023-04-14 15:21:18 -0400, Tom Lane wrote: > >> +1 for that, though. (Also, the fact that these animals aren't > >> actually failing suggests that 004_io_direct.pl needs expansion.) > > > It's skipped, due to lack of O_DIRECT: > > [20:50:22] t/004_io_direct.pl .............. skipped: no O_DIRECT > > Hmm, I'd say that might be just luck. Whether the compiler honors weird > alignment of locals seems independent of whether the OS has O_DIRECT. > > > So perhaps we don't even need a configure test, just a bit of ifdef'ery? It's > > a bit annoying structurally, because the PG*Aligned structs are defined in > > c.h, but the different ways of spelling O_DIRECT are dealt with in fd.h. > > > I wonder if we should try to move those structs to fd.h as well... > > I doubt they belong in c.h, so that could be plausible; except > I'm not convinced that testing O_DIRECT is sufficient. As far as I can tell, the failure to honour large alignment attributes even though the compiler passes our configure check that you can do that was considered to be approximately a bug[1] or at least a thing to be improved in fairly old GCC times but the fix wasn't back-patched that far. Unfortunately the projects that were allergic to the GPL3 change but wanted to ship a compiler (or some motivation related to that) got stuck on 4.2 for a while before they flipped to Clang (as OpenBSD has now done). It seems hard to get excited about doing anything about that on our side, and those systems are also spewing other warnings. But if we're going to do it, it looks like the right place would indeed be a new compiler check that the attribute exists *and* generates no warnings with alignment > 16, something like that? https://gcc.gnu.org/bugzilla/show_bug.cgi?id=16660
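One way such a probe could look: compile something like the following with warnings promoted to errors, and only treat large alignment as available if it compiles cleanly. This is a sketch of the idea, not a committed configure test:

typedef union
{
	char		data[8192];
} __attribute__((aligned(4096))) probe_t;

int
main(void)
{
	probe_t		buf;

	buf.data[0] = 0;
	return buf.data[0];	/* use the variable so it isn't optimised away */
}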
On Sat, Apr 15, 2023 at 7:50 AM Mikael Kjellström <mikael.kjellstrom@gmail.com> wrote: > want me to switch to clang instead? I vote +1, that's the system compiler in modern OpenBSD. https://www.cambus.net/the-state-of-toolchains-in-openbsd/ As for curculio, I don't understand the motivation for maintaining that machine. I'd rather know if OpenBSD 7.3 works.
Thomas Munro <thomas.munro@gmail.com> writes: > On Sat, Apr 15, 2023 at 7:50 AM Mikael Kjellström > <mikael.kjellstrom@gmail.com> wrote: >> want me to switch to clang instead? > I vote +1, that's the system compiler in modern OpenBSD. Ditto, we need coverage of that. > As for curculio, I don't understand the motivation for maintaining > that machine. I'd rather know if OpenBSD 7.3 works. Those aren't necessarily mutually exclusive :-). But I do agree that recent OpenBSD is more important to cover than ancient OpenBSD. regards, tom lane
On 2023-04-15 05:22, Tom Lane wrote: > Thomas Munro <thomas.munro@gmail.com> writes: >> On Sat, Apr 15, 2023 at 7:50 AM Mikael Kjellström >> <mikael.kjellstrom@gmail.com> wrote: >>> want me to switch to clang instead? > >> I vote +1, that's the system compiler in modern OpenBSD. > > Ditto, we need coverage of that. OK. I switched to clang on morepork now. >> As for curculio, I don't understand the motivation for maintaining >> that machine. I'd rather know if OpenBSD 7.3 works. > > Those aren't necessarily mutually exclusive :-). But I do agree > that recent OpenBSD is more important to cover than ancient OpenBSD. So do you want me to switch that machine to OpenBSD 7.3 instead? /Mikael
Thomas Munro <thomas.munro@gmail.com> writes: > As far as I can tell, the failure to honour large alignment attributes > even though the compiler passes our configure check that you can do > that was considered to be approximately a bug[1] or at least a thing > to be improved in fairly old GCC times but the fix wasn't back-patched > that far. Unfortunately the projects that were allergic to the GPL3 > change but wanted to ship a compiler (or some motivation related to > that) got stuck on 4.2 for a while before they flipped to Clang (as > OpenBSD has now done). It seems hard to get excited about doing > anything about that on our side, and those systems are also spewing > other warnings. But if we're going to do it, it looks like the right > place would indeed be a new compiler check that the attribute exists > *and* generates no warnings with alignment > 16, something like that? I tested this by building gcc 4.2.1 from source on modern Linux (which was a bit more painful than it ought to be, perhaps) and building PG with that. It generates no warnings, but nonetheless gets an assertion failure in CREATE DATABASE:

#2  0x0000000000b64522 in ExceptionalCondition (
    conditionName=0xd4fca0 "(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)",
    fileName=0xd4fbe0 "md.c", lineNumber=468) at assert.c:66
#3  0x00000000009a6b53 in mdextend (reln=0x1dcaf68, forknum=MAIN_FORKNUM,
    blocknum=18, buffer=0x7ffcaf8e1af0, skipFsync=true) at md.c:468
#4  0x00000000009a9075 in smgrextend (reln=0x1dcaf68, forknum=MAIN_FORKNUM,
    blocknum=18, buffer=0x7ffcaf8e1af0, skipFsync=true) at smgr.c:500
#5  0x000000000096739c in RelationCopyStorageUsingBuffer (srclocator=...,
    dstlocator=..., forkNum=MAIN_FORKNUM, permanent=true) at bufmgr.c:4286
#6  0x0000000000967584 in CreateAndCopyRelationData (src_rlocator=...,
    dst_rlocator=..., permanent=true) at bufmgr.c:4361
#7  0x000000000068898e in CreateDatabaseUsingWalLog (src_dboid=1,
    dst_dboid=24576, src_tsid=1663, dst_tsid=1663) at dbcommands.c:217
#8  0x000000000068b594 in createdb (pstate=0x1d4a6a8, stmt=0x1d20ec8)
    at dbcommands.c:1441

Sure enough, that buffer is a stack local in RelationCopyStorageUsingBuffer, and it's visibly got a not-very-well-aligned address. So apparently, the fact that you even get a warning about the alignment not being honored is something OpenBSD patched in after-the-fact; it's not there in genuine vintage gcc. I get the impression that we are going to need an actual runtime test if we want to defend against this. Not entirely convinced it's worth the trouble. Who, other than our deliberately rear-guard buildfarm animals, is going to be building modern PG with such old compilers? (And more especially to the point, on platforms new enough to have working O_DIRECT?) At this point I agree with Andres that it'd be good enough to silence the warning by getting rid of these alignment pragmas when the platform lacks O_DIRECT. regards, tom lane PS: I don't quite understand how it managed to get through initdb when CREATE DATABASE doesn't work. Maybe there is a different code path taken in standalone mode?
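Such a runtime test could be quite small. As a sketch of the idea (illustrative only, not from any patch), it would just have to check whether the attribute was really honoured for a local before direct I/O is allowed:

    #include <stdint.h>
    #include <stdio.h>

    #define PG_IO_ALIGN_SIZE 4096

    int
    main(void)
    {
        /* ask for an I/O-aligned stack buffer, then verify we got one */
        __attribute__((aligned(PG_IO_ALIGN_SIZE))) char probe[PG_IO_ALIGN_SIZE];

        if ((uintptr_t) probe % PG_IO_ALIGN_SIZE != 0)
        {
            puts("compiler accepted the attribute but did not honour it");
            return 1;
        }
        puts("stack alignment ok");
        return 0;
    }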
On Sun, Apr 16, 2023 at 6:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > So apparently, the fact that you even get a warning about the > alignment not being honored is something OpenBSD patched in > after-the-fact; it's not there in genuine vintage gcc. Ah, that is an interesting discovery, and indeed kills the configure check idea. > At this point I agree with Andres that it'd be good enough to > silence the warning by getting rid of these alignment pragmas > when the platform lacks O_DIRECT. Hmm. My preferred choice would be: accept Mikael's kind offer to upgrade curculio to a live version, forget about GCC 4.2.1 forever, and do nothing. It is a dead parrot. But if we really want to do something about this, my next preferred option would be to modify c.h's test to add more conditions, here: /* GCC, Sunpro and XLC support aligned, packed and noreturn */ #if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__) #define pg_attribute_aligned(a) __attribute__((aligned(a))) ... Full GCC support including stack objects actually began in 4.6, it seems. It might require a bit of research because the GCC-workalikes including Clang also claim to be certain versions of GCC (for example I think Clang 7 would be excluded if you excluded GCC 4.2, even though this particular thing apparently worked fine in Clang 7). That's my best idea, ie to actually model the feature history accurately, if we are suspending disbelief and pretending that it is a reasonable target.
Thomas Munro <thomas.munro@gmail.com> writes: > Full GCC support including stack objects actually began in 4.6, it > seems. Hmm. The oldest gcc versions remaining in the buildfarm seem to be

curculio   | configure: using compiler=gcc (GCC) 4.2.1 20070719
frogfish   | configure: using compiler=gcc (Debian 4.6.3-14) 4.6.3
lapwing    | configure: using compiler=gcc (Debian 4.7.2-5) 4.7.2
skate      | configure: using compiler=gcc-4.7 (Debian 4.7.2-5) 4.7.2
snapper    | configure: using compiler=gcc-4.7 (Debian 4.7.2-5) 4.7.2
buri       | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
chub       | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
dhole      | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
mantid     | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
prion      | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
rhinoceros | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
siskin     | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
shelduck   | configure: using compiler=gcc (SUSE Linux) 4.8.5
topminnow  | configure: using compiler=gcc (Debian 4.9.2-10+deb8u1) 4.9.2

so curculio should be the only one that's at risk here. Maybe just upgrading it is the right answer. regards, tom lane
On Sat, Apr 15, 2023 at 02:19:35PM -0400, Tom Lane wrote: > PS: I don't quite understand how it managed to get through initdb > when CREATE DATABASE doesn't work. Maybe there is a different > code path taken in standalone mode? ad43a413c4f7f5d024a5b2f51e00d280a22f1874 initdb: When running CREATE DATABASE, use STRATEGY = FILE_COPY.
On 2023-04-16 00:10, Tom Lane wrote: > so curculio should be the only one that's at risk here. > Maybe just upgrading it is the right answer. Just let me know if I should switch curculio to OpenBSD 7.3. I already have a new server setup so only need to switch the "animal" and "secret" and enable the cron job to get it running. /Mikael
Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes: > On 2023-04-16 00:10, Tom Lane wrote: >> so curculio should be the only one that's at risk here. >> Maybe just upgrading it is the right answer. > Just let me know if I should switch curculio to OpenBSD 7.3. Yes please. > I already have a new server setup so only need to switch the "animal" > and "secret" and enable the cron job to get it running. Actually, as long as it's still OpenBSD I think you can keep using the same animal name ... Andrew, what's the policy on that? regards, tom lane
On 2023-04-16 16:18, Tom Lane wrote: > Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes: >> On 2023-04-16 00:10, Tom Lane wrote: >>> so curculio should be the only one that's at risk here. >>> Maybe just upgrading it is the right answer. > >> Just let me know if I should switch curculio to OpenBSD 7.3. > > Yes please. Ok. >> I already have a new server setup so only need to switch the "animal" >> and "secret" and enable the cron job to get it running. > > Actually, as long as it's still OpenBSD I think you can keep using > the same animal name ... Andrew, what's the policy on that? That is what I meant above. I just use the same animal name and secret and then run "update_personality.pl". That should be enough I think? /Mikael
On 2023-04-16 Su 10:18, Tom Lane wrote:
> Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
>> On 2023-04-16 00:10, Tom Lane wrote:
>>> so curculio should be the only one that's at risk here.
>>> Maybe just upgrading it is the right answer.
>> Just let me know if I should switch curculio to OpenBSD 7.3.
> Yes please.
>> I already have a new server setup so only need to switch the "animal"
>> and "secret" and enable the cron job to get it running.
> Actually, as long as it's still OpenBSD I think you can keep using
> the same animal name ... Andrew, what's the policy on that?
update_personality.pl lets you update the OS version / compiler version / owner-name / owner-email
I am in fact about to perform this exact operation for prion.
cheers
andrew
-- Andrew Dunstan EDB: https://www.enterprisedb.com
Andrew Dunstan <andrew@dunslane.net> writes: > On 2023-04-16 Su 10:18, Tom Lane wrote: >> Actually, as long as it's still OpenBSD I think you can keep using >> the same animal name ... Andrew, what's the policy on that? > update_personality.pl lets you update the OS version / compiler version > / owner-name / owner-email Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang in OpenBSD 7.3, doesn't it? That isn't something update_personality will handle; you need a new animal if the compiler product is changing. regards, tom lane
> On Apr 16, 2023, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Andrew Dunstan <andrew@dunslane.net> writes: >>> On 2023-04-16 Su 10:18, Tom Lane wrote: >>> Actually, as long as it's still OpenBSD I think you can keep using >>> the same animal name ... Andrew, what's the policy on that? > >> update_personality.pl lets you update the OS version / compiler version >> / owner-name / owner-email > > Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang > in OpenBSD 7.3, doesn't it? That isn't something update_personality > will handle; you need a new animal if the compiler product is changing. > > Correct. Cheers Andrew
On 2023-04-16 19:59, Andrew Dunstan wrote: >> On Apr 16, 2023, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: >> >> Andrew Dunstan <andrew@dunslane.net> writes: >>>> On 2023-04-16 Su 10:18, Tom Lane wrote: >>>> Actually, as long as it's still OpenBSD I think you can keep using >>>> the same animal name ... Andrew, what's the policy on that? >> >>> update_personality.pl lets you update the OS version / compiler version >>> / owner-name / owner-email >> >> Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang >> in OpenBSD 7.3, doesn't it? That isn't something update_personality >> will handle; you need a new animal if the compiler product is changing. >> >> > > Correct. OK. I registered a new animal for this then. So if someone could look at that and give me an animal name + secret I can set this up. /Mikael
On Sun, Apr 16, 2023 at 04:51:04PM +0200, Mikael Kjellström wrote: > That is what I meant above. > > I just use the same animal name and secret and then run > "update_personality.pl". > > That should be enough I think? Yes, that should be enough as far as I recall. This was mentioned a couple of weeks ago here: https://www.postgresql.org/message-id/CA+hUKGK0jJ+G+bxLUZqpBsxpvEg7Lvt1v8LBxFkZbrvtFTSghw@mail.gmail.com I have also used setnotes.pl to reflect my animals' CFLAGS on the website. -- Michael
On 2023-04-16 20:05, Mikael Kjellström wrote: >>> Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang >>> in OpenBSD 7.3, doesn't it? That isn't something update_personality >>> will handle; you need a new animal if the compiler product is changing. >>> >> >> Correct. > > OK. I registered a new animal for this then. > > So if someone could look at that and give me an animal name + secret I > can set this up. I have set up a new animal "schnauzer" (thanks andrew!). That should report in a little while. /Mikael
On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > I get the impression that we are going to need an actual runtime > test if we want to defend against this. Not entirely convinced > it's worth the trouble. Who, other than our deliberately rear-guard > buildfarm animals, is going to be building modern PG with such old > compilers? (And more especially to the point, on platforms new > enough to have working O_DIRECT?) I don't think that I fully understand everything under discussion here, but I would just like to throw in a vote for trying to make failures as comprehensible as we reasonably can. It makes me a bit nervous to rely on things like "anybody who has O_DIRECT will also have working alignment pragmas," because there's no relationship between those things other than when we think they got implemented on the platforms that are popular today. If somebody ships me a brand new Deathstation 9000 that has O_DIRECT but NOT alignment pragmas, how badly are things going to break and how hard is it going to be for me to understand why it's not working? I understand that nobody (including me) wants the code cluttered with a bunch of useless cruft that caters only to hypothetical systems, and I don't want us to spend a lot of effort building untestable infrastructure that caters only to such machines. I just don't want us to do things that are more magical than they need to be. If and when something fails, it's real nice if you can easily understand why it failed. -- Robert Haas EDB: http://www.enterprisedb.com
Robert Haas <robertmhaas@gmail.com> writes: > On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> I get the impression that we are going to need an actual runtime >> test if we want to defend against this. Not entirely convinced >> it's worth the trouble. Who, other than our deliberately rear-guard >> buildfarm animals, is going to be building modern PG with such old >> compilers? (And more especially to the point, on platforms new >> enough to have working O_DIRECT?) > I don't think that I fully understand everything under discussion > here, but I would just like to throw in a vote for trying to make > failures as comprehensible as we reasonably can. I'm not hugely concerned about this yet. I think the reason for slipping this into v16 as developer-only code is exactly that we need to get a feeling for where the portability dragons live. When (and if) we try to make O_DIRECT mainstream, yes we'd better be sure that any known failure cases are reported well. But we need the data about which those are, first. regards, tom lane
On Tue, Apr 18, 2023 at 4:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I get the impression that we are going to need an actual runtime > >> test if we want to defend against this. Not entirely convinced > >> it's worth the trouble. Who, other than our deliberately rear-guard > >> buildfarm animals, is going to be building modern PG with such old > >> compilers? (And more especially to the point, on platforms new > >> enough to have working O_DIRECT?) > > > I don't think that I fully understand everything under discussion > > here, but I would just like to throw in a vote for trying to make > > failures as comprehensible as we reasonably can. > > I'm not hugely concerned about this yet. I think the reason for > slipping this into v16 as developer-only code is exactly that we need > to get a feeling for where the portability dragons live. When (and > if) we try to make O_DIRECT mainstream, yes we'd better be sure that > any known failure cases are reported well. But we need the data > about which those are, first. +1 A couple more things I wanted to note: * We have no plans to turn this on by default even when the later asynchronous machinery is proposed, and direct I/O starts to make more economic sense (think: your stream of small reads and writes will be converted to larger preadv/pwritev or moral equivalent and performed ahead of time in the background). Reasons: (1) There will always be a few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we learned in this thread; it fails with EINVAL at open() time), and (2) without a page cache, you really need to size your shared_buffers adequately and we can't do that automatically. It's something you'd opt into for a dedicated database server along with other carefully considered settings. It seems acceptable to me that if you set io_direct to a non-default setting on an unusual-for-a-database-server filesystem you might get errors screaming about inability to open files -- you'll just have to turn it back off again if it doesn't work for you. * For the alignment part, C11 has "alignas(x)" in <stdalign.h>, so I very much doubt that a hypothetical new Deathstation C compiler would not know how to align stack objects arbitrarily, even though for now as a C99 program we have to use the non-standard incantations defined in our c.h. I assume we'll eventually switch to that. In the meantime, if someone manages to build PostgreSQL on a hypothetical C compiler that our c.h doesn't recognise, we just won't let you turn the io_direct GUC on (because we set PG_O_DIRECT to 0 if we don't have an alignment macro, see commit faeedbce's message for rationale). If the alignment trick from c.h appears to be available but is actually broken (GCC 4.2.1), then those assertions I added into smgrread() et al will fail as Tom showed (yay! they did their job), or in a non-assert build you'll probably get EINVAL when you try to read or write from your badly aligned buffers depending on how picky your OS is, but that's just an old bug in a defunct compiler that we have by now written more about than they ever did in their bug tracker. * I guess it's unlikely at this point that POSIX will ever standardise O_DIRECT if they didn't already in the 90s (I didn't find any discussion of it in their issue tracker). There is really only one OS on our target list that truly can't do direct I/O at all: OpenBSD.
It seems a reasonable bet that if they or a hypothetical totally new Unixoid system ever implemented it, they'd spell it the same way IRIX did, for practical reasons, but if not we just won't use it until someone writes a patch *shrug*. There is also one system that's been rocking direct I/O since the 90s for Oracle etc, but PostgreSQL still doesn't know how to turn it on: Solaris has a directio() system call. I posted a (trivial) patch for that once in the thread where I added Apple F_NOCACHE, but there is probably nobody on this list who can test it successfully (as Tom discovered, wrasse's host is not configured right for it; you'd need an admin/root to help set up a UFS file system, or perhaps modern (closed) ZFS can do it, but that system is old and unpatched), and I have no desire to commit a "blind" patch for an untested niche setup; I really only considered it because I realised I was so close to covering the complete set of OSes. That's cool, we just won't let you turn the GUC on if we don't know how, and the error message is clear about that if you try.
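To make the two alignment spellings concrete, here is a sketch (the constants mirror PostgreSQL's, but this is illustrative code, not the tree's actual definitions):

    #include <stdalign.h>           /* C11 */

    #define BLCKSZ 8192
    #define PG_IO_ALIGN_SIZE 4096

    /* C99-era spelling, via a compiler-specific attribute */
    typedef union
    {
        __attribute__((aligned(PG_IO_ALIGN_SIZE))) char data[BLCKSZ];
    } BlockC99;

    /* C11 spelling, portable to any conforming C11 compiler that
     * supports this much over-alignment */
    typedef union
    {
        alignas(PG_IO_ALIGN_SIZE) char data[BLCKSZ];
    } BlockC11;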
On Mon, 17 Apr 2023 at 17:45, Thomas Munro <thomas.munro@gmail.com> wrote: > > Reasons: (1) There will always be a > few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we > learned in this thread; if fails with EINVAL at open() time), and So why wouldn't we just automatically turn it off (globally or for that tablespace) and keep operating without it afterward? > (2) without a page cache, you really need to size your shared_buffers > adequately and we can't do that automatically. Well.... I'm more optimistic... That may not always be impossible. We've already added the ability to add more shared memory after startup. We could implement the ability to add or remove shared buffer segments after startup. And it wouldn't be crazy to imagine a kernel interface that lets us judge whether the kernel memory pressure makes it reasonable for us to take more shared buffers or makes it necessary to release shared memory to the kernel. You could hack something together using /proc/meminfo today but I imagine an interface intended for this kind of thing would be better. > It's something you'd > opt into for a dedicated database server along with other carefully > considered settings. It seems acceptable to me that if you set > io_direct to a non-default setting on an unusual-for-a-database-server > filesystem you might get errors screaming about inability to open > files -- you'll just have to turn it back off again if it doesn't work > for you. If the only solution is to turn it off perhaps the server should just turn it off? I guess the problem is that the shared_buffers might be set assuming it would be on? -- greg
On Tue, Apr 18, 2023 at 3:35 PM Greg Stark <stark@mit.edu> wrote: > Well.... I'm more optimistic... That may not always be impossible. > We've already added the ability to add more shared memory after > startup. We could implement the ability to add or remove shared buffer > segments after startup. And it wouldn't be crazy to imagine a kernel > interface that lets us judge whether the kernel memory pressure makes > it reasonable for us to take more shared buffers or makes it necessary > to release shared memory to the kernel. On this point specifically, one fairly large problem that we have currently is that our buffer replacement algorithm is terrible. In workloads I've examined, either almost all buffers end up with a usage count of 5 or almost all buffers end up with a usage count of 0 or 1. Either way, we lose all or nearly all information about which buffers are actually hot, and we are not especially unlikely to evict some extremely hot buffer. This is quite bad for performance as it is, and it would be a lot worse if recovering from a bad eviction decision routinely required rereading from disk instead of only rereading from the OS buffer cache. I've sometimes wondered whether our current algorithm is just a more expensive version of random eviction. I suspect that's a bit too pessimistic, but I don't really know. I'm not saying that it isn't possible to fix this. I bet it is, and I hope someone does. I'm just making the point that even if we knew the amount of kernel memory pressure and even if we also had the ability to add and remove shared_buffers at will, it probably wouldn't help much as things stand today, because we're not in a good position to judge how large the cache would need to be in order to be useful, or what we ought to be storing in it. -- Robert Haas EDB: http://www.enterprisedb.com
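For readers following along, here is a much-simplified sketch of the scheme being described (ignoring pins, locking and the freelist). The point is that usage counts rise on access but only decay when the sweep runs, i.e. under eviction pressure, which is what produces the convergence described above:

    #define NBUFFERS 1024
    #define BM_MAX_USAGE_COUNT 5

    static int usage_count[NBUFFERS];
    static int next_victim = 0;

    /* every buffer access bumps the count, capped at 5 */
    static void
    access_buffer(int buf)
    {
        if (usage_count[buf] < BM_MAX_USAGE_COUNT)
            usage_count[buf]++;
    }

    /* the clock sweep decrements counts until it finds a zero */
    static int
    evict_one(void)
    {
        for (;;)
        {
            int buf = next_victim;

            next_victim = (next_victim + 1) % NBUFFERS;
            if (usage_count[buf] == 0)
                return buf;         /* victim found */
            usage_count[buf]--;     /* decay happens only here */
        }
    }

With no eviction, access_buffer() alone drives every count to 5; with an eviction per access, evict_one()'s decrements dominate and counts pool at 0 or 1.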
On 4/19/23 10:11, Robert Haas wrote: > On Tue, Apr 18, 2023 at 3:35 PM Greg Stark <stark@mit.edu> wrote: >> Well.... I'm more optimistic... That may not always be impossible. >> We've already added the ability to add more shared memory after >> startup. We could implement the ability to add or remove shared buffer >> segments after startup. And it wouldn't be crazy to imagine a kernel >> interface that lets us judge whether the kernel memory pressure makes >> it reasonable for us to take more shared buffers or makes it necessary >> to release shared memory to the kernel. > > On this point specifically, one fairly large problem that we have > currently is that our buffer replacement algorithm is terrible. In > workloads I've examined, either almost all buffers end up with a usage > count of 5 or almost all buffers end up with a usage count of 0 or 1. > Either way, we lose all or nearly all information about which buffers > are actually hot, and we are not especially unlikely to evict some > extremely hot buffer. That has been my experience as well, although admittedly I have not looked in quite a while. > I'm not saying that it isn't possible to fix this. I bet it is, and I > hope someone does. I keep looking at this blog post about Transparent Memory Offloading and thinking that we could learn from it: https://engineering.fb.com/2022/06/20/data-infrastructure/transparent-memory-offloading-more-memory-at-a-fraction-of-the-cost-and-power/ Unfortunately, it is very Linux specific and requires a really up to date OS -- cgroup v2, kernel >= 5.19 > I'm just making the point that even if we knew the amount of kernel > memory pressure and even if we also had the ability to add and remove > shared_buffers at will, it probably wouldn't help much as things > stand today, because we're not in a good position to judge how large > the cache would need to be in order to be useful, or what we ought to > be storing in it. The tactic TMO uses is basically to tune the available memory to get a target memory pressure. That seems like it could work. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Hi, On 2023-04-19 10:11:32 -0400, Robert Haas wrote: > On this point specifically, one fairly large problem that we have > currently is that our buffer replacement algorithm is terrible. In > workloads I've examined, either almost all buffers end up with a usage > count of 5 or almost all buffers end up with a usage count of 0 or 1. > Either way, we lose all or nearly all information about which buffers > are actually hot, and we are not especially unlikely to evict some > extremely hot buffer. This is quite bad for performance as it is, and > it would be a lot worse if recovering from a bad eviction decision > routinely required rereading from disk instead of only rereading from > the OS buffer cache. Interestingly, I haven't seen that as much in more recent benchmarks as it used to. Partially I think because common s_b settings have gotten bigger, I guess. But I wonder if we also accidentally improved something else, e.g. by pin/unpin-ing the same buffer a bit less frequently. > I've sometimes wondered whether our current algorithm is just a more > expensive version of random eviction. I suspect that's a bit too > pessimistic, but I don't really know. I am quite certain that it's better than that. If you e.g. have pkey lookup workload >> RAM you can actually end up seeing inner index pages staying reliably in s_b. But clearly we can do better. I do think we likely should (as IIRC Peter Geoghegan suggested) provide more information to the buffer replacement layer: Independent of the concrete buffer replacement algorithm, the recency information we do provide is somewhat lacking. In some places we do repeated ReadBuffer() calls for the same buffer, leading to over-inflating usagecount. We should seriously consider taking the cost of the IO into account, basically making it more likely that s_b is increased when we need to synchronously wait for IO. The cost of a miss is much lower for e.g. a sequential scan or a bitmap heap scan, because both can do some form of prefetching. Whereas index pages and the heap fetch for plain index scans aren't prefetchable (which could be improved some, but not generally). > I'm not saying that it isn't possible to fix this. I bet it is, and I > hope someone does. I'm just making the point that even if we knew the > amount of kernel memory pressure and even if we also had the ability > to add and remove shared_buffers at will, it probably wouldn't help > much as things stand today, because we're not in a good position to > judge how large the cache would need to be in order to be useful, or > what we ought to be storing in it. FWIW, my experience is that linux' page replacement doesn't work very well either. Partially because we "hide" a lot of the recency information from it. But also just because it doesn't scale all that well to large amounts of memory (there's ongoing work on that though). So I am not really convinced by this argument - for plenty of workloads just caching in PG will be far better than caching both in the kernel and in PG, as long as some adaptiveness to memory pressure avoids running into OOMs. Some forms of adaptive s_b sizing aren't particularly hard, I think. Instead of actually changing the s_b shmem allocation - which would be very hard in a process-based model - we can tell the kernel that some parts of that memory aren't currently in use with madvise(MADV_REMOVE). It's not quite as trivial as it sounds, because we'd have to free in multiples of huge_page_size. Greetings, Andres Freund
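A sketch of that idea (Linux-specific; it assumes the buffer pool lives in one large MAP_SHARED mapping, that huge_page_size is a power of two, and the helper name is made up for illustration):

    #include <stddef.h>
    #include <sys/mman.h>

    /*
     * Tell the kernel an idle range of the buffer pool is unused.
     * Only whole huge pages fully inside the range can be released,
     * hence the inward rounding.
     */
    static int
    release_idle_buffers(char *base, size_t offset, size_t len,
                         size_t huge_page_size)
    {
        size_t start = (offset + huge_page_size - 1) & ~(huge_page_size - 1);
        size_t end = (offset + len) & ~(huge_page_size - 1);

        if (end <= start)
            return 0;   /* nothing freeable at this granularity */

        /* MADV_REMOVE punches a hole in the backing object */
        return madvise(base + start, end - start, MADV_REMOVE);
    }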
Hi, On 2023-04-18 09:44:10 +1200, Thomas Munro wrote: > * We have no plans to turn this on by default even when the later > asynchronous machinery is proposed, and direct I/O starts to make more > economic sense (think: your stream of small reads and writes will be > converted to larger preadv/pwritev or moral equivalent and performed > ahead of time in the background). Reasons: (1) There will always be a > few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we > learned in this thread; it fails with EINVAL at open() time), and (2) > without a page cache, you really need to size your shared_buffers > adequately and we can't do that automatically. It's something you'd > opt into for a dedicated database server along with other carefully > considered settings. It seems acceptable to me that if you set > io_direct to a non-default setting on an unusual-for-a-database-server > filesystem you might get errors screaming about inability to open > files -- you'll just have to turn it back off again if it doesn't work > for you. FWIW, *long* term I think it might make sense to turn DIO on automatically for a small subset of operations, if supported. Examples: 1) Once we have the ability to "feed" walsenders from wal_buffers, instead of going to disk, automatically using DIO for WAL might be beneficial. The increase in IO concurrency and reduction in latency one can get is substantial. 2) If we make base backups use s_b if pages are in s_b, and do locking via s_b for non-existing pages, it might be worth automatically using DIO for the reads of the non-resident data, to avoid swamping the kernel page cache with data that won't be read again soon (and to utilize DMA etc). 3) When writing back dirty data that we don't expect to be dirtied again soon, e.g. from vacuum ringbuffers or potentially even checkpoints, it could make sense to use DIO, to avoid the kernel keeping such pages in the page cache. But for the main s_b, I agree, I can't foresee us turning on DIO by default. Unless somebody has tuned s_b at least some for the workload, that's not going to go well. And even if somebody has, it's quite reasonable to use the same host also for other programs (including other PG instances), in which case it's likely desirable to be adaptive to the current load when deciding what to cache - which the kernel is in the best position to do. > If the alignment trick from c.h appears to be available but is actually > broken (GCC 4.2.1), then those assertions I added into smgrread() et > al will fail as Tom showed (yay! they did their job), or in a > non-assert build you'll probably get EINVAL when you try to read or > write from your badly aligned buffers depending on how picky your OS > is, but that's just an old bug in a defunct compiler that we have by > now written more about than they ever did in their bug tracker. Agreed. If we ever find such issues in a non-primordial compiler, we'll just need to beef up our configure test to detect that it doesn't actually fully support specifying alignment. Greetings, Andres Freund
Hi, On 2023-04-18 15:35:09 -0400, Greg Stark wrote: > On Mon, 17 Apr 2023 at 17:45, Thomas Munro <thomas.munro@gmail.com> wrote: > > It's something you'd > > opt into for a dedicated database server along with other carefully > > considered settings. It seems acceptable to me that if you set > > io_direct to a non-default setting on an unusual-for-a-database-server > > filesystem you might get errors screaming about inability to open > > files -- you'll just have to turn it back off again if it doesn't work > > for you. > > If the only solution is to turn it off perhaps the server should just > turn it off? I guess the problem is that the shared_buffers might be > set assuming it would be on? I am quite strongly opposed to that - silently (or with a log message, which practically is the same as silently) disabling performance-relevant options like DIO is much more likely to cause problems, due to the drastically different performance characteristics you get. I can see us making it configurable to try using DIO though, but I am not convinced it's worth bothering with that. But we'll see. Greetings, Andres Freund
On Wed, Apr 19, 2023 at 12:54 PM Andres Freund <andres@anarazel.de> wrote: > Interestingly, I haven't seen that as much in more recent benchmarks as it > used to. Partially I think because common s_b settings have gotten bigger, I > guess. But I wonder if we also accidentally improved something else, e.g. by > pin/unpin-ing the same buffer a bit less frequently. I think the problem with the algorithm is pretty fundamental. The rate of usage count increase is tied to how often we access buffers, and the rate of usage count decrease is tied to buffer eviction. But a given workload can have no eviction at all (in which case the usage counts must converge to 5) or it can evict on every buffer access (in which case the usage counts must mostly converge to 0, because we'll be decreasing usage counts at least once per buffer and generally more). ISTM that the only way that you can end up with a good mix of usage counts is if you have a workload that falls into some kind of a sweet spot where the rate of usage count bumps and the rate of usage count de-bumps are close enough together that things don't skew all the way to one end or the other. Bigger s_b might make that more likely to happen in practice, but it seems like bad algorithm design on a theoretical level. We should be comparing the frequency of access of buffers to the frequency of access of other buffers, not to the frequency of buffer eviction. Or to put the same thing another way, the average value of the usage count shouldn't suddenly change from 5 to 1 when the server evicts 1 buffer. > I do think we likely should (as IIRC Peter Geoghegan suggested) provide more > information to the buffer replacement layer: > > Independent of the concrete buffer replacement algorithm, the recency > information we do provide is somewhat lacking. In some places we do repeated > ReadBuffer() calls for the same buffer, leading to over-inflating usagecount. Yeah, that would be good to fix. I don't think it solves the whole problem by itself, but it seems like a good step. > We should seriously consider taking the cost of the IO into account, basically > making it more likely that s_b is increased when we need to synchronously wait > for IO. The cost of a miss is much lower for e.g. a sequential scan or a > bitmap heap scan, because both can do some form of prefetching. Whereas index > pages and the heap fetch for plain index scans aren't prefetchable (which > could be improved some, but not generally). I guess that's reasonable if we can pass the information around well enough, but I still think the algorithm ought to get some fundamental improvement first. > FWIW, my experience is that linux' page replacement doesn't work very well > either. Partially because we "hide" a lot of the recency information from > it. But also just because it doesn't scale all that well to large amounts of > memory (there's ongoing work on that though). So I am not really convinced by > this argument - for plenty of workloads just caching in PG will be far better > than caching both in the kernel and in PG, as long as some adaptiveness to > memory pressure avoids running into OOMs. Even if the Linux algorithm is bad, and it may well be, the Linux cache is often a lot bigger than our cache. Which can cover a multitude of problems. -- Robert Haas EDB: http://www.enterprisedb.com
Hi, On 2023-04-19 13:16:54 -0400, Robert Haas wrote: > On Wed, Apr 19, 2023 at 12:54 PM Andres Freund <andres@anarazel.de> wrote: > > Interestingly, I haven't seen that as much in more recent benchmarks as it > > used to. Partially I think because common s_b settings have gotten bigger, I > > guess. But I wonder if we also accidentally improved something else, e.g. by > > pin/unpin-ing the same buffer a bit less frequently. > > I think the problem with the algorithm is pretty fundamental. The rate > of usage count increase is tied to how often we access buffers, and > the rate of usage count decrease is tied to buffer eviction. But a > given workload can have no eviction at all (in which case the usage > counts must converge to 5) or it can evict on every buffer access (in > which case the usage counts must mostly converge to 0, because we'll > be decreasing usage counts at least once per buffer and generally > more). I don't think the "evict on every buffer access" works quite that way - unless you have a completely even access pattern, buffer access frequency will increase usage count much more frequently on some buffers than others. And if you have a completely even access pattern, it's hard to come up with a good cache replacement algorithm... > ISTM that the only way that you can end up with a good mix of > usage counts is if you have a workload that falls into some kind of a > sweet spot where the rate of usage count bumps and the rate of usage > count de-bumps are close enough together that things don't skew all > the way to one end or the other. Bigger s_b might make that more > likely to happen in practice, but it seems like bad algorithm design > on a theoretical level. We should be comparing the frequency of access > of buffers to the frequency of access of other buffers, not to the > frequency of buffer eviction. Or to put the same thing another way, > the average value of the usage count shouldn't suddenly change from 5 > to 1 when the server evicts 1 buffer. I agree that there are fundamental issues with the algorithm. But practically I think the effect of the over-saturation of s_b isn't as severe as one might think: If your miss rate is very low, the occasional bad victim buffer selection won't matter that much. If the miss rate is a bit higher, the likelihood of the usagecount being increased again after being decreased is higher if a buffer is accessed frequently. This is also why I think that larger s_b makes the issues less likely - with larger s_b, it is more likely that frequently accessed buffers are accessed again after the first of the 5 clock sweeps necessary to reduce the usage count. Clearly, with a small-ish s_b and a high replacement rate, that's not going to happen for sufficiently many buffers. But once you have a few GB of s_b, multiple complete sweeps take a while. Most, if not all, buffer replacement algorithms I have seen don't deal well with "small s_b with a huge replacement rate". Most of the fancier algorithms track recency information for buffers that have recently been evicted - but you obviously can't track that to an unlimited degree, IIRC most papers propose that the shadow map be roughly equal to the buffer pool size. You IMO pretty much need a policy decision on a higher level to improve upon that (e.g. by just deciding that some buffers are sticky, perhaps because they were used first) - but it doesn't matter much, because the miss rate is high enough that the total amount of reads is barely affected by the buffer replacement decisions.
Greetings, Andres Freund
On Mon, Apr 17, 2023 at 12:06:23PM -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > >> I get the impression that we are going to need an actual runtime > >> test if we want to defend against this. Not entirely convinced > >> it's worth the trouble. Who, other than our deliberately rear-guard > >> buildfarm animals, is going to be building modern PG with such old > >> compilers? (And more especially to the point, on platforms new > >> enough to have working O_DIRECT?) > > > I don't think that I fully understand everything under discussion > > here, but I would just like to throw in a vote for trying to make > > failures as comprehensible as we reasonably can. > > I'm not hugely concerned about this yet. I think the reason for > slipping this into v16 as developer-only code is exactly that we need > to get a feeling for where the portability dragons live. Speaking of the developer-only status, I find the io_direct name more enticing than force_parallel_mode, which PostgreSQL renamed due to overuse from people expecting non-developer benefits. Should this have a name starting with debug_?
On Sun, Apr 30, 2023 at 4:11 PM Noah Misch <noah@leadboat.com> wrote: > Speaking of the developer-only status, I find the io_direct name more enticing > than force_parallel_mode, which PostgreSQL renamed due to overuse from people > expecting non-developer benefits. Should this have a name starting with > debug_? Hmm, yeah I think people coming from other databases would be tempted by it. But, unlike the please-jam-a-gather-node-on-top-of-the-plan-so-I-can-debug-the-parallel-executor switch, I think of this thing more like an experimental feature that is just waiting for more features to make it useful. What about a warning message about that at startup if it's on?
On Sun, Apr 30, 2023 at 6:35 PM Thomas Munro <thomas.munro@gmail.com> wrote: > On Sun, Apr 30, 2023 at 4:11 PM Noah Misch <noah@leadboat.com> wrote: > > Speaking of the developer-only status, I find the io_direct name more enticing > > than force_parallel_mode, which PostgreSQL renamed due to overuse from people > > expecting non-developer benefits. Should this have a name starting with > > debug_? > > Hmm, yeah I think people coming from other databases would be tempted > by it. But, unlike the > please-jam-a-gather-node-on-top-of-the-plan-so-I-can-debug-the-parallel-executor > switch, I think of this thing more like an experimental feature that > is just waiting for more features to make it useful. What about a > warning message about that at startup if it's on? Something like this? Better words welcome.

$ ~/install//bin/postgres -D pgdata -c io_direct=data
2023-05-01 09:44:37.460 NZST [99675] LOG: starting PostgreSQL 16devel on x86_64-unknown-freebsd13.2, compiled by FreeBSD clang version 14.0.5 (https://github.com/llvm/llvm-project.git llvmorg-14.0.5-0-gc12386ae247c), 64-bit
2023-05-01 09:44:37.460 NZST [99675] LOG: listening on IPv6 address "::1", port 5432
2023-05-01 09:44:37.460 NZST [99675] LOG: listening on IPv4 address "127.0.0.1", port 5432
2023-05-01 09:44:37.461 NZST [99675] LOG: listening on Unix socket "/tmp/.s.PGSQL.5432"
2023-05-01 09:44:37.463 NZST [99675] WARNING: io_direct is an experimental setting for developer testing only
2023-05-01 09:44:37.463 NZST [99675] HINT: File I/O may be inefficient or not work on some file systems.
2023-05-01 09:44:37.465 NZST [99678] LOG: database system was shut down at 2023-05-01 09:43:51 NZST
2023-05-01 09:44:37.468 NZST [99675] LOG: database system is ready to accept connections
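For reference, such a warning would presumably be raised with the usual backend ereport idiom, roughly like this sketch (not necessarily the patch's exact code):

    ereport(WARNING,
            (errmsg("io_direct is an experimental setting for developer testing only"),
             errhint("File I/O may be inefficient or not work on some file systems.")));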
On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote: > On Sun, Apr 30, 2023 at 4:11 PM Noah Misch <noah@leadboat.com> wrote: > > Speaking of the developer-only status, I find the io_direct name more enticing > > than force_parallel_mode, which PostgreSQL renamed due to overuse from people > > expecting non-developer benefits. Should this have a name starting with > > debug_? > > Hmm, yeah I think people coming from other databases would be tempted > by it. But, unlike the > please-jam-a-gather-node-on-top-of-the-plan-so-I-can-debug-the-parallel-executor > switch, I think of this thing more like an experimental feature that > is just waiting for more features to make it useful. What about a > warning message about that at startup if it's on? Such a warning wouldn't be particularly likely to be seen by someone who already didn't read/understand the docs for the not-feature that they turned on. Since this is -currently- a developer-only feature, it seems reasonable to rename the GUC to debug_direct_io, and (at such time as it's considered to be helpful to users) later rename it to direct_io. That avoids the issue that random advice to enable direct_io=x under v17+ is applied by people running v16. +0.8 to do so. Maybe in the future, it should be added to GUC_EXPLAIN, too? -- Justin
Justin Pryzby <pryzby@telsasoft.com> writes: > On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote: >> What about a >> warning message about that at startup if it's on? > Such a warning wouldn't be particularly likely to be seen by someone who > already didn't read/understand the docs for the not-feature that they > turned on. Yeah, I doubt that that would be helpful at all. > Since this is -currently- a developer-only feature, it seems reasonable > to rename the GUC to debug_direct_io, and (at such time as it's > considered to be helpful to users) later rename it to direct_io. +1 regards, tom lane
On Mon, May 1, 2023 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > Justin Pryzby <pryzby@telsasoft.com> writes: > > On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote: > >> What about a > >> warning message about that at startup if it's on? > > > Such a warning wouldn't be particularly likely to be seen by someone who > > already didn't read/understand the docs for the not-feature that they > > turned on. > > Yeah, I doubt that that would be helpful at all. Fair enough. > > Since this is -currently- a developer-only feature, it seems reasonable > > to rename the GUC to debug_direct_io, and (at such time as it's > > considered to be helpful to users) later rename it to direct_io. > > +1 Yeah, the future cross-version confusion thing is compelling. OK, here's a rename patch. I left all the variable names and related symbols as they were, so it's just the GUC that gains the prefix. I moved the documentation hunk up to be sorted alphabetically like nearby entries, because that seemed to look nicer, even though the list isn't globally sorted.
On Wed, Apr 19, 2023 at 7:35 AM Greg Stark <stark@mit.edu> wrote: > On Mon, 17 Apr 2023 at 17:45, Thomas Munro <thomas.munro@gmail.com> wrote: > > (2) without a page cache, you really need to size your shared_buffers > > adequately and we can't do that automatically. > > Well.... I'm more optimistic... That may not always be impossible. > We've already added the ability to add more shared memory after > startup. We could implement the ability to add or remove shared buffer > segments after startup. Yeah, there are examples of systems going back decades with multiple buffer pools. In some you can add more space later, and in some you can also configure pools with different block sizes (imagine if you could set your extremely OLTP tables to use 4KB blocks for reduced write amplification and then perhaps even also promise that your storage doesn't need FPIs for that size because you know it's perfectly safe™, and imagine if you could set some big write-only history tables to use 32KB blocks because some compression scheme works better, etc), and you might also want different cache replacement algorithms in different pools. Complex and advanced stuff no doubt and I'm not suggesting that's anywhere near a reasonable thing for us to think about now (as a matter of fact in another thread you can find me arguing for fully unifying our existing segregated SLRU buffer pools with the one true buffer pool), but since we're talking pie-in-the-sky ideas around the water cooler...
On Mon, May 01, 2023 at 02:47:57PM +1200, Thomas Munro wrote: > On Mon, May 1, 2023 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Justin Pryzby <pryzby@telsasoft.com> writes: > > > Since this is -currently- a developer-only feature, it seems reasonable > > > to rename the GUC to debug_direct_io, and (at such time as it's > > > considered to be helpful to users) later rename it to direct_io. > > > > +1 > > Yeah, the future cross-version confusion thing is compelling. OK, > here's a rename patch. This looks reasonable.
On Mon, May 15, 2023 at 9:09 AM Noah Misch <noah@leadboat.com> wrote: > This looks reasonable. Pushed with a small change: a couple of GUC_check_errdetail strings needed s/io_direct/debug_io_direct/. Thanks.
On 01.05.23 04:47, Thomas Munro wrote: > On Mon, May 1, 2023 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote: >> Justin Pryzby <pryzby@telsasoft.com> writes: >>> On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote: >>>> What about a >>>> warning message about that at startup if it's on? >> >>> Such a warning wouldn't be particularly likely to be seen by someone who >>> already didn't read/understand the docs for the not-feature that they >>> turned on. >> >> Yeah, I doubt that that would be helpful at all. > > Fair enough. > >>> Since this is -currently- a developer-only feature, it seems reasonable >>> to rename the GUC to debug_direct_io, and (at such time as it's >>> considered to be helpful to users) later rename it to direct_io. >> >> +1 > > Yeah, the future cross-version confusion thing is compelling. OK, > here's a rename patch. I left all the variable names and related > symbols as they were, so it's just the GUC that gains the prefix. I > moved the documentation hunk up to be sorted alphabetically like > nearby entries, because that seemed to look nicer, even though the > list isn't globally sorted. I suggest also renaming the hook functions (check and assign), as in the attached patch. Mainly because utils/guc_hooks.h says to order the functions by GUC variable name, which was already wrong under the old name, but it would be pretty confusing to sort the functions by their GUC name that doesn't match the function names.
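Concretely, the renamed declarations in utils/guc_hooks.h would look roughly like this (a sketch matching the string-GUC hook signatures; only the names change):

    /* kept in order by GUC variable name, per the header's convention */
    extern bool check_debug_io_direct(char **newval, void **extra,
                                      GucSource source);
    extern void assign_debug_io_direct(const char *newval, void *extra);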
On Wed, Apr 19, 2023 at 10:43 AM Andres Freund <andres@anarazel.de> wrote: > I don't think the "evict on every buffer access" works quite that way - unless > you have a completely even access pattern, buffer access frequency will > increase usage count much more frequently on some buffers than others. And if > you have a completely even access pattern, it's hard to come up with a good > cache replacement algorithm... My guess is that the most immediate problem in this area is the problem of "correlated references" (to use the term from the LRU-K paper). I gave an example of that here: https://postgr.es/m/CAH2-Wzk7T9K3d9_NY+jEXp2qQGMYoP=gZMoR8q1Cv57SxAw1OA@mail.gmail.com In other words, I think that the most immediate problem may in fact be the tendency of usage_count to get incremented multiple times in response to what is (for all intents and purposes) the same logical page access. Even if it's not as important as I imagine, it still seems likely that verifying that our input information isn't garbage is the logical place to begin work in this general area. It's difficult to be sure about that because it's so hard to look at just one problem in isolation. I suspect that you were right to point out that a larger shared buffers tends to look quite different to a smaller shared buffers. That same factor is going to complicate any analysis of the specific problem that I've highlighted (to say nothing of the way that contention complicates the picture). There is an interesting paper that compared the hit rates seen for TPC-C to TPC-E on relatively modern hardware: https://www.cs.cmu.edu/~chensm/papers/TPCE-sigmod-record10.pdf It concludes that the buffer misses for each workload look rather similar, past a certain point (past a certain buffer pool size): both workloads have cache misses that seem totally random. The access patterns may be very different, but that doesn't necessarily have any visible effect on buffer misses. At least provided that you make certain modest assumptions about buffer pool size, relative to working set size. The most sophisticated cache management algorithms (like ARC) work by maintaining metadata about recently evicted buffers, which is used to decide whether to favor recency over frequency. If you work backwards then it follows that having cache misses that look completely random is desirable, and perhaps even something to work towards. What you really don't want is a situation where the same small minority of pages keep getting ping-ponged into and out of the buffer pool, without ever settling, even though the buffer cache is large enough that that's possible in principle. That pathological profile is the furthest possible thing from random. With smaller shared_buffers, it's perhaps inevitable that buffer cache misses are random, and so I'd expect that managing the problem of contention will tend to matter most. With larger shared_buffers it isn't inevitable at all, so the quality of the cache eviction scheme is likely to matter quite a bit more. -- Peter Geoghegan
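To make "correlated references" concrete, here is one toy way to damp them, loosely in the spirit of the LRU-K paper's correlated-reference period (entirely hypothetical, not from any patch in this thread): remember which backend last bumped a buffer and treat an immediate re-bump by the same backend as part of the same logical access.

    #define BM_MAX_USAGE_COUNT 5

    typedef struct
    {
        int usage_count;
        int last_bumper;    /* backend that performed the last bump */
    } BufferUsageSketch;

    static void
    bump_usage(BufferUsageSketch *buf, int my_backend_id)
    {
        /* repeated touches by one backend count as one logical access */
        if (buf->last_bumper == my_backend_id)
            return;

        buf->last_bumper = my_backend_id;
        if (buf->usage_count < BM_MAX_USAGE_COUNT)
            buf->usage_count++;
    }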
On Wed, Aug 23, 2023 at 12:15 AM Peter Eisentraut <peter@eisentraut.org> wrote: > I suggest to also rename the hook functions (check and assign), like in > the attached patch. Mainly because utils/guc_hooks.h says to order the > functions by GUC variable name, which was already wrong under the old > name, but it would be pretty confusing to sort the functions by their > GUC name that doesn't match the function names. OK. I'll push this tomorrow unless you do it while I'm asleep. Thanks!