Thread: Direct I/O

Direct I/O

From
Thomas Munro
Date:
Hi,

Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the
AIO patch-set[1].  It adds three new settings, defaulting to off:

  io_data_direct = whether to use O_DIRECT for main data files
  io_wal_direct = ... for WAL
  io_wal_init_direct = ... for WAL-file initialisation

O_DIRECT asks the kernel to avoid caching file data as much as
possible.  Here's a fun quote about it[2]:

"The exact semantics of Direct I/O (O_DIRECT) are not well specified.
It is not a part of POSIX, or SUS, or any other formal standards
specification. The exact meaning of O_DIRECT has historically been
negotiated in non-public discussions between powerful enterprise
database companies and proprietary Unix systems, and its behaviour has
generally been passed down as oral lore rather than as a formal set of
requirements and specifications."

It gives the kernel the opportunity to move data directly between
PostgreSQL's user space buffers and the storage hardware using DMA,
that is, without CPU involvement or copying.  Not all storage stacks
can do that, for various reasons, but even if not, the caching policy
should ideally still use temporary buffers and avoid polluting the
page cache.
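
For illustration, here's roughly what a direct read looks like at the
syscall level (a minimal standalone sketch, not from the patch-set;
assumes Linux, a 4KB alignment requirement and a made-up file name,
with error handling elided):

    #define _GNU_SOURCE             /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int
    main(void)
    {
        void   *buf;
        int     fd;

        /* the user space buffer must be (at least) 4KB-aligned */
        if (posix_memalign(&buf, 4096, 8192) != 0)
            return 1;

        /* ask the kernel to bypass its page cache for this descriptor */
        fd = open("datafile", O_RDWR | O_DIRECT);

        /* the offset and length must also be suitably aligned */
        pread(fd, buf, 8192, 0);

        close(fd);
        free(buf);
        return 0;
    }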

These settings currently destroy performance, and are not intended to
be used by end users yet!  That's why we filed them under
DEVELOPER_OPTIONS.  You don't get automatic read-ahead, concurrency,
clustering or (of course) buffering from the kernel.  The idea is that
later parts of the AIO patch-set will introduce mechanisms to replace
what the kernel does for us today, and then more, since we ought to be
even better at predicting our own future I/O than the kernel can be,
so we'll finish up ahead.  Even with all that, you wouldn't want to
turn it on by default, because the default shared_buffers would be
insufficient for any real system, and there are portability problems.

Examples of slowness:

* every 8KB sequential read or write becomes a full round trip to the
storage, one at a time

* data that is written to WAL and then read back by a WAL sender will
incur a full I/O round trip (that's not really an AIO problem; it's
something we should probably address by using shared memory instead of
files, as noted in a TODO item in the source code)

Memory alignment patches:

Direct I/O generally needs to be done to/from VM page-aligned
addresses, but only "standard" 4KB page alignment is required, even
when larger VM pages are in use (if there is an exotic system where
that isn't true, it won't work).  We need to deal with buffers on the
stack, the heap and
in shmem.  For the stack, see patch 0001.  For the heap and shared
memory, see patch 0002, but David Rowley is going to propose that part
separately, as MemoryContext API adjustments are a specialised enough
topic to deserve another thread; here I include a copy as a
dependency.  The main direct I/O patch is 0003.
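
To illustrate the stack case, here's a sketch in the spirit of patch
0001, spelling out the raw GCC/Clang attribute that a
pg_attribute_aligned() macro would hide (the function and its job are
made up for the example):

    #include <string.h>
    #include <unistd.h>

    /* hypothetical example: write one zeroed block from a stack buffer */
    static void
    write_zeroed_block(int fd, long blockno)
    {
        /* 4KB VM page alignment, suitable for direct I/O */
        __attribute__((aligned(4096))) char page[8192];

        memset(page, 0, sizeof(page));
        pwrite(fd, page, sizeof(page), blockno * 8192L);
    }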

Assorted portability notes:

I expect this to "work" (that is, successfully destroy performance) on
typical developer systems running at least Linux, macOS, Windows and
FreeBSD.  By work, I mean: not be rejected by PostgreSQL, not be
rejected by the kernel, and influence kernel cache behaviour on common
filesystems.  It might be rejected with EOPNOTSUPP, EINVAL etc on some
more exotic filesystems and OSes.  Of currently supported OSes, only
OpenBSD and Solaris don't have O_DIRECT at all, and we'll reject the
GUCs.  For macOS and Windows we internally translate our own
PG_O_DIRECT flag to the correct flags/calls (committed a while
back[3]).

On Windows, scatter/gather is available only with direct I/O, so a
true pwritev would in theory be possible, but that has some more
complications and is left for later patches (probably using native
interfaces, rather than disguising them as POSIX).

There may be systems on which 8KB offset alignment will not work at
all or not work well, and that's expected.  For example, BTRFS, ZFS,
JFS "big file", UFS etc allow larger-than-8KB blocks/records, and an
8KB write will have to trigger a read-before-write.  Note that
offset/length alignment requirements (blocks) are independent of
buffer alignment requirements (memory pages, 4KB).

The behaviour and cache coherency of files that have open descriptors
using both direct and non-direct flags may be complicated and vary
between systems.  The patch currently lets you change the GUCs at
runtime so backends can disagree: that should probably not be allowed,
but is like that now for experimentation.  More study is required.

If someone has a compiler that we don't know how to do
pg_attribute_aligned() for, then we can't make correctly aligned stack
buffers, and in that case direct I/O is disabled.  I don't know of
such a system (maybe aCC, but we dropped it).  That's why smgr code
can only assert that pointers are IO-aligned if PG_O_DIRECT != 0, and
why PG_O_DIRECT is forced to 0 if there is no pg_attribute_aligned()
macro, disabling the GUCs.

This seems to be an independent enough piece to get into the tree on
its own, with the proviso that it's not actually useful yet other than
for experimentation.  Thoughts?

These patches have been hacked on at various times by Andres Freund,
David Rowley and me.

[1] https://wiki.postgresql.org/wiki/AIO
[2] https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics
[3]
https://www.postgresql.org/message-id/flat/CA%2BhUKG%2BADiyyHe0cun2wfT%2BSVnFVqNYPxoO6J9zcZkVO7%2BNGig%40mail.gmail.com

Attachment

Re: Direct I/O

From
Justin Pryzby
Date:
On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
> Hi,
> 
> Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the
> AIO patch-set[1].  It adds three new settings, defaulting to off:
> 
>   io_data_direct = whether to use O_DIRECT for main data files
>   io_wal_direct = ... for WAL
>   io_wal_init_direct = ... for WAL-file initialisation

You added 3 booleans, but I wonder if it's better to add a string GUC
which is parsed for comma separated strings.  (By "better", I mean
reducing the number of new GUCs - which is less important for developer
GUCs anyway.)

DIO is slower, but not so much that it can't run under CI.  I suggest to
add an 099 commit to enable the feature during development.

Note that this fails under linux with fsanitize=align:
../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment

-- 
Justin



Re: Direct I/O

From
Thomas Munro
Date:
On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
> >   io_data_direct = whether to use O_DIRECT for main data files
> >   io_wal_direct = ... for WAL
> >   io_wal_init_direct = ... for WAL-file initialisation
>
> You added 3 booleans, but I wonder if it's better to add a string GUC
> which is parsed for comma separated strings.  (By "better", I mean
> reducing the number of new GUCs - which is less important for developer
> GUCs anyway.)

Interesting idea.  So "direct_io = data, wal, wal_init", or maybe that
should be spelled io_direct.  ("Direct I/O" is a common term of art,
but we also have some more io_XXX GUCs in later patches, so it's hard
to choose...)

> DIO is slower, but not so much that it can't run under CI.  I suggest to
> add an 099 commit to enable the feature during development.

Good idea, will do.

> Note that this fails under linux with fsanitize=align:
> ../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment

Oh, so BufFile is palloc'd and contains one of these.  BufFile is not
even using direct I/O, but by these rules it would need to be
palloc_io_align'd.  I will think about what to do about that...



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
> On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
> > >   io_data_direct = whether to use O_DIRECT for main data files
> > >   io_wal_direct = ... for WAL
> > >   io_wal_init_direct = ... for WAL-file initialisation
> >
> > You added 3 booleans, but I wonder if it's better to add a string GUC
> > which is parsed for comma separated strings.

In the past more complicated GUCs have not been well received, but it does
seem like a nice way to reduce the amount of redundant stuff.

Perhaps we could use the GUC assignment hook to transform the input value into
a bitmask?
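
Roughly like this, perhaps (a rough sketch only - the flag names,
values and messages are invented here, not taken from the patch):

    #include "postgres.h"

    #include "nodes/pg_list.h"
    #include "utils/guc.h"
    #include "utils/varlena.h"

    /* hypothetical bit values for the parsed setting */
    #define IO_DIRECT_DATA      0x01
    #define IO_DIRECT_WAL       0x02
    #define IO_DIRECT_WAL_INIT  0x04

    int  io_direct_flags;       /* bitmask consulted by the I/O code */

    bool
    check_io_direct(char **newval, void **extra, GucSource source)
    {
        int        *flags = (int *) guc_malloc(ERROR, sizeof(int));
        char       *rawstring = pstrdup(*newval);
        List       *elemlist;
        ListCell   *l;

        *flags = 0;
        if (!SplitGUCList(rawstring, ',', &elemlist))
        {
            GUC_check_errdetail("Invalid list syntax in parameter \"io_direct\".");
            pfree(rawstring);
            list_free(elemlist);
            guc_free(flags);
            return false;
        }

        foreach(l, elemlist)
        {
            char       *item = (char *) lfirst(l);

            if (pg_strcasecmp(item, "data") == 0)
                *flags |= IO_DIRECT_DATA;
            else if (pg_strcasecmp(item, "wal") == 0)
                *flags |= IO_DIRECT_WAL;
            else if (pg_strcasecmp(item, "wal_init") == 0)
                *flags |= IO_DIRECT_WAL_INIT;
            else
            {
                GUC_check_errdetail("Invalid option \"%s\".", item);
                pfree(rawstring);
                list_free(elemlist);
                guc_free(flags);
                return false;
            }
        }

        pfree(rawstring);
        list_free(elemlist);

        *extra = flags;         /* GUC machinery hands this to assign */
        return true;
    }

    void
    assign_io_direct(const char *newval, void *extra)
    {
        io_direct_flags = *((int *) extra);     /* install the bitmask */
    }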


> > (By "better", I mean reducing the number of new GUCs - which is less
> > important for developer GUCs anyway.)

FWIW, if / once we get to actual AIO, at least some of these would stop being
developer-only GUCs. There are substantial performance benefits in using DIO
with AIO. Buffered IO requires the CPU to copy the data from userspace into
the kernel, but DIO can use DMA for that, freeing the CPU to do more useful
work.  Buffered IO tops out much, much earlier than AIO + DIO, and
unfortunately tops out at much lower speeds on server CPUs.


> > DIO is slower, but not so much that it can't run under CI.  I suggest to
> > add an 099 commit to enable the feature during development.
> 
> Good idea, will do.

Might be worth additionally having a short TAP test that does some basic
stuff with DIO, and leaving that enabled? I think it'd be good to have
check-world exercise DIO on dev machines, to reduce the likelihood of finding
problems only in CI, which is somewhat painful.


> > Note that this fails under linux with fsanitize=align:
> > ../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
> 
> Oh, so BufFile is palloc'd and contains one of these.  BufFile is not
> even using direct I/O, but by these rules it would need to be
> palloc_io_align'd.  I will think about what to do about that...

It might be worth having two different versions of the struct, so we don't
impose unnecessarily high alignment everywhere?

Greetings,

Andres Freund



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2022-11-01 15:54:02 -0700, Andres Freund wrote:
> On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
> > Oh, so BufFile is palloc'd and contains one of these.  BufFile is not
> > even using direct I/O, but by these rules it would need to be
> > palloc_io_align'd.  I will think about what to do about that...
>
> It might be worth having two different versions of the struct, so we don't
> impose unnecessarily high alignment everywhere?

Although it might actually be worth aligning fully everywhere - there's a
noticeable performance difference for buffered read IO.

I benchmarked this on my workstation and laptop.

I mmap'ed a buffer with 2 MiB alignment, MAP_ANONYMOUS | MAP_HUGETLB, and then
measured performance of reading 8192 bytes into the buffer at different
offsets. Each time I copied 16GiB in total.  Within a program invocation I
benchmarked each offset 4 times, threw away the worst measurement, and
averaged the rest. Then used the best of three program invocations.
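
(A simplified sketch of that measurement under the stated assumptions
- Linux, a huge-page mapping for 2MiB alignment, repeated 8192 byte
buffered reads at a chosen buffer offset - with error handling and the
repetition/averaging logic elided; not the actual program used:)

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <time.h>
    #include <unistd.h>

    #define CHUNK 8192
    #define TOTAL (16L * 1024 * 1024 * 1024)    /* 16 GiB per measurement */

    int
    main(int argc, char **argv)
    {
        long        offset = (argc > 1) ? atol(argv[1]) : 0;
        char       *map;
        int         fd;
        long        done;
        double      secs;
        struct timespec t0, t1;

        /* 2MiB-aligned buffer backed by a huge page */
        map = mmap(NULL, 4 * 1024 * 1024, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        fd = open("testfile", O_RDONLY);    /* buffered, i.e. no O_DIRECT */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (done = 0; done < TOTAL; done += CHUNK)
            pread(fd, map + offset, CHUNK, 0);  /* hot in kernel page cache */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        printf("offset %ld: %.3f GiB/s\n", offset,
               TOTAL / secs / (1024.0 * 1024 * 1024));
        return 0;
    }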

workstation with dual Xeon Gold 5215:

         turbo on       turbo off
offset   GiB/s          GiB/s
0        18.358         13.528
8        15.361         11.472
9        15.330         11.418
32       17.583         13.097
512      17.707         13.229
513      15.890         11.852
4096     18.176         13.568
8192     18.088         13.566
2MiB     18.658         13.496


laptop with i9-9880H:

         turbo on       turbo off
offset   GiB/s          GiB/s
0        33.589         17.160
8        28.045         14.301
9        27.582         14.318
32       31.797         16.711
512      32.215         16.810
513      28.864         14.932
4096     32.503         17.266
8192     32.871         17.277
2MiB     32.657         17.262


Seems pretty clear that using 4096 byte alignment is worth it.

Greetings,

Andres Freund



Re: Direct I/O

From
Jim Nasby
Date:
On 11/1/22 2:36 AM, Thomas Munro wrote:

> Hi,
>
> Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the

This is exciting to see! There are two other items to add to the TODO list
before this would be ready for production:

1) work_mem. This is a significant impediment to scaling shared buffers 
the way you'd want to.

2) Clock sweep. Specifically, currently the only thing that drives 
usage_count is individual backends running the clock hand. On large 
systems with 75% of memory going to shared_buffers, that becomes a very 
significant problem, especially when the backend running the clock sweep 
is doing so in order to perform an operation like a b-tree page split. I 
suspect it shouldn't be too hard to deal with this issue by just having 
bgwriter or another bgworker proactively ensuring some reasonable number 
of buffers with usage_count=0 exist.


One other thing to be aware of: overflowing an SLRU becomes a massive
problem if there isn't a filesystem backing the SLRU. Obviously only an
issue if you try to apply DIO to SLRU files.




Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2022-11-04 14:47:31 -0500, Jim Nasby wrote:
> On 11/1/22 2:36 AM, Thomas Munro wrote:
> > Here is a patch to allow PostgreSQL to use $SUBJECT.  It is from the
> 
> This is exciting to see! There are two other items to add to the TODO list
> before this would be ready for production:
> 
> 1) work_mem. This is a significant impediment to scaling shared buffers the
> way you'd want to.

I don't really think that's closely enough related to tackle together. Yes,
it'd be easier to set a large s_b if we had better work_mem management, but
it's a completely distinct problem, and in a lot of cases you could use DIO
without tackling the work_mem issue.


> 2) Clock sweep. Specifically, currently the only thing that drives
> usage_count is individual backends running the clock hand. On large systems
> with 75% of memory going to shared_buffers, that becomes a very significant
> problem, especially when the backend running the clock sweep is doing so in
> order to perform an operation like a b-tree page split. I suspect it
> shouldn't be too hard to deal with this issue by just having bgwriter or
> another bgworker proactively ensuring some reasonable number of buffers with
> usage_count=0 exist.

I agree this isn't great, but I don't think the replacement efficiency is that
big a problem. Replacing the wrong buffers is a bigger issue.

I've run tests with s_b=768GB (IIRC) without it showing up as a major
issue. If you have an extreme replacement rate at such a large s_b you have a
lot of other problems.

I don't want to discourage anybody from tackling the clock replacement issues,
quite the contrary, but AIO+DIO can show significant wins without those
changes. It's already a humongous project...


> One other thing to be aware of: overflowing an SLRU becomes a massive
> problem if there isn't a filesystem backing the SLRU. Obviously only an
> issue if you try to apply DIO to SLRU files.

Which would be a very bad idea for now.... Thomas does have a patch for moving
them into the main buffer pool.

Greetings,

Andres Freund



Re: Direct I/O

From
John Naylor
Date:

On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:

> Memory alignment patches:
>
> Direct I/O generally needs to be done to/from VM page-aligned
> addresses, but only "standard" 4KB page alignment is required, even
> when larger VM pages are in use (if there is an exotic system where
> that isn't true, it won't work).  We need to deal with buffers on the
> stack, the heap and
> in shmem.  For the stack, see patch 0001.  For the heap and shared
> memory, see patch 0002, but David Rowley is going to propose that part
> separately, as MemoryContext API adjustments are a specialised enough
> topic to deserve another thread; here I include a copy as a
> dependency.  The main direct I/O patch is 0003.

One thing to note: Currently, a request to aset above 8kB must go into a dedicated block. Not sure if it's a coincidence that that matches the default PG page size, but if allocating pages on the heap is hot enough, maybe we should consider raising that limit. Although then, aligned-to-4kB requests would result in 16kB chunks requested unless a different allocator was used.
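
For concreteness, assuming the palloc_aligned() API from the
separately proposed patch (a hypothetical illustration, not code from
the patch):

    /*
     * An 8kB request aligned to 4kB may need up to 8192 + 4096 bytes of
     * underlying allocation to guarantee the alignment, which is over
     * aset.c's limit for ordinary chunks, so it lands in a dedicated block.
     */
    char   *buf = palloc_aligned(BLCKSZ, 4096, 0);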

--
John Naylor
EDB: http://www.enterprisedb.com

Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2022-11-10 14:26:20 +0700, John Naylor wrote:
> On Tue, Nov 1, 2022 at 2:37 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 
> > Memory alignment patches:
> >
> > Direct I/O generally needs to be done to/from VM page-aligned
> > addresses, but only "standard" 4KB page alignment is required, even
> > when larger VM pages are in use (if there is an exotic system where
> > that isn't true, it won't work).  We need to deal with buffers on the
> > stack, the heap and
> > in shmem.  For the stack, see patch 0001.  For the heap and shared
> > memory, see patch 0002, but David Rowley is going to propose that part
> > separately, as MemoryContext API adjustments are a specialised enough
> > topic to deserve another thread; here I include a copy as a
> > dependency.  The main direct I/O patch is 0003.
> 
> One thing to note: Currently, a request to aset above 8kB must go into a
> dedicated block. Not sure if it's a coincidence that that matches the
> default PG page size, but if allocating pages on the heap is hot enough,
> maybe we should consider raising that limit. Although then, aligned-to-4kB
> requests would result in 16kB chunks requested unless a different allocator
> was used.

With one exception, there are only a small number of places that allocate
pages dynamically, and we only do it for a small number of buffers. So I
don't think we should worry too much about this for now.

The one exception to this: GetLocalBufferStorage(). But it already batches
memory allocations by increasing sizes, so I think we're good as well.

Greetings,

Andres Freund



Re: Direct I/O

From
Thomas Munro
Date:
On Wed, Nov 2, 2022 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
> On 2022-11-02 09:44:30 +1300, Thomas Munro wrote:
> > On Wed, Nov 2, 2022 at 2:33 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > On Tue, Nov 01, 2022 at 08:36:18PM +1300, Thomas Munro wrote:
> > > >   io_data_direct = whether to use O_DIRECT for main data files
> > > >   io_wal_direct = ... for WAL
> > > >   io_wal_init_direct = ... for WAL-file initialisation
> > >
> > > You added 3 booleans, but I wonder if it's better to add a string GUC
> > > which is parsed for comma separated strings.

Done as io_direct=data,wal,wal_init.  Thanks Justin, this is better.
I resisted the urge to invent a meaning for 'on' and 'off', mainly
because it's not clear what values 'on' should enable and it'd be
strange to have off without on, so for now an empty string means off.
I suppose the meaning of this string could evolve over time: the names
of forks, etc.

> Perhaps we could use the GUC assignment hook to transform the input value into
> a bitmask?

Makes sense.  The only tricky question was where to store the GUC.  I
went for fd.c for now, but it doesn't seem quite right...

> > > DIO is slower, but not so much that it can't run under CI.  I suggest to
> > > add an 099 commit to enable the feature during development.
> >
> > Good idea, will do.

Done.  The tests take 2-3x as long depending on the OS.

> Might be worth additionally having a short TAP test that does some basic
> stuff with DIO, and leaving that enabled? I think it'd be good to have
> check-world exercise DIO on dev machines, to reduce the likelihood of finding
> problems only in CI, which is somewhat painful.

Done.

> > > Note that this fails under linux with fsanitize=align:
> > > ../src/backend/storage/file/buffile.c:117:17: runtime error: member access within misaligned address 0x561a4a8e40f8 for type 'struct BufFile', which requires 4096 byte alignment
> >
> > Oh, so BufFile is palloc'd and contains one of these.  BufFile is not
> > even using direct I/O, but by these rules it would need to be
> > palloc_io_align'd.  I will think about what to do about that...
>
> It might be worth having two different versions of the struct, so we don't
> impose unnecessarily high alignment everywhere?

Done.  I now have PGAlignedBlock (unchanged) and PGIOAlignedBlock.
You have to use the latter for SMgr, because I added alignment
assertions there.  We might as well use it for any other I/O, such as
frontend code, too, for a chance of a small performance boost as you
showed.  For now I have not used PGIOAlignedBlock for BufFile, even
though it would be a great candidate for a potential speedup, only
because I am afraid of adding padding to every BufFile in scenarios
where we allocate many (that could be avoided; a subject for separate
research).
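
Roughly like this (a sketch of the shape of the two types; the real
definitions have a few more members and use the BLCKSZ and
PG_IO_ALIGN_SIZE symbols):

    /* ordinary block buffer: basic alignment only, no padding overhead */
    typedef union PGAlignedBlock
    {
        char        data[8192];     /* BLCKSZ */
        double      force_align_d;
    } PGAlignedBlock;

    /* I/O-capable block buffer: aligned to the 4KB I/O boundary */
    typedef union PGIOAlignedBlock
    {
        pg_attribute_aligned(4096)  /* PG_IO_ALIGN_SIZE */
        char        data[8192];
        double      force_align_d;
    } PGIOAlignedBlock;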

V2 comprises:

0001 -- David's palloc_aligned() patch
https://commitfest.postgresql.org/41/3999/
0002 -- I/O-align almost all buffers used for I/O
0003 -- Add the GUCs
0004 -- Throwaway hack to make cfbot turn the GUCs on

Attachment

Re: Direct I/O

From
Thomas Munro
Date:
On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> 0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
> 0002 -- I/O-align almost all buffers used for I/O
> 0003 -- Add the GUCs
> 0004 -- Throwaway hack to make cfbot turn the GUCs on

David pushed the first as commit 439f6175, so here is a rebase of the
rest.  I also fixed a couple of thinkos in the handling of systems
where we don't know how to do direct I/O.  In one place I had #ifdef
PG_O_DIRECT, but that's always defined, it's just that it's 0 on
Solaris and OpenBSD, and the check to reject the GUC wasn't quite
right on such systems.

Attachment

Re: Direct I/O

From
Bharath Rupireddy
Date:
On Thu, Dec 22, 2022 at 7:34 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Wed, Dec 14, 2022 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > 0001 -- David's palloc_aligned() patch https://commitfest.postgresql.org/41/3999/
> > 0002 -- I/O-align almost all buffers used for I/O
> > 0003 -- Add the GUCs
> > 0004 -- Throwaway hack to make cfbot turn the GUCs on
>
> David pushed the first as commit 439f6175, so here is a rebase of the
> rest.  I also fixed a couple of thinkos in the handling of systems
> where we don't know how to do direct I/O.  In one place I had #ifdef
> PG_O_DIRECT, but that's always defined, it's just that it's 0 on
> Solaris and OpenBSD, and the check to reject the GUC wasn't quite
> right on such systems.

Thanks. I have some comments on
v3-0002-Add-io_direct-setting-developer-only.patch:

1. I think we don't need to overwrite the io_direct_string in
check_io_direct so that show_io_direct can be avoided.
2. check_io_direct can leak the flags memory - when io_direct is not
supported, for invalid list syntax, or when an invalid option is
specified.

I have addressed my review comments as a delta patch on top of v3-0002
and added it here as v1-0001-Review-comments-io_direct-GUC.txt.

Some comments on the tests added:

1. Is there a way to know if Direct IO for WAL and data has been
picked up programmatically? IOW, can we know if the OS page cache is
bypassed? I know an external extension pgfincore which can help here,
but nothing in the core exists AFAICS.
+is('10000', $node->safe_psql('postgres', 'select count(*) from t1'),
"read back from shared");
+is('10000', $node->safe_psql('postgres', 'select * from t2count'),
"read back from local");
+$node->stop('immediate');

2. Can we combine these two append_conf to a single statement?
+$node->append_conf('io_direct', 'data,wal,wal_init');
+$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O

3. A nitpick: Can we split these queries multi-line instead of in a single line?
+$node->safe_psql('postgres', 'begin; create temporary table t2 as
select 1 as i from generate_series(1, 10000); update t2 set i = i;
insert into t2count select count(*) from t2; commit;');

4. I don't think we need to stop the node before the test ends, no?
+$node->stop;
+
+done_testing();

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachment

Re: Direct I/O

From
Thomas Munro
Date:
On Wed, Jan 25, 2023 at 8:57 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
> Thanks. I have some comments on
> v3-0002-Add-io_direct-setting-developer-only.patch:
>
> 1. I think we don't need to overwrite the io_direct_string in
> check_io_direct so that show_io_direct can be avoided.

Thanks for looking at this, and sorry for the late response.  Yeah, agreed.

> 2. check_io_direct can leak the flags memory - when io_direct is not
> supported, for invalid list syntax, or when an invalid option is
> specified.
>
> I have addressed my review comments as a delta patch on top of v3-0002
> and added it here as v1-0001-Review-comments-io_direct-GUC.txt.

Thanks.  Your way is nicer.  I merged your patch and added you as a co-author.

> Some comments on the tests added:
>
> 1. Is there a way to know if Direct IO for WAL and data has been
> picked up programmatically? IOW, can we know if the OS page cache is
> bypassed? I know an external extension pgfincore which can help here,
> but nothing in the core exists AFAICS.

Right, that extension can tell you how many pages are in the kernel
page cache, which is quite interesting for this.  I also once hacked up
something primitive to see *which* pages are in kernel cache, so I
could join that against pg_buffercache to measure double buffering,
when I was studying the "smile" shape where pgbench TPS goes down and
then back up again as you increase shared_buffers if the working set
is nearly as big as physical memory (code available in a link from
[1]).

Yeah, I agree it might be nice for human investigators to put
something like that in contrib/pg_buffercache, but I'm not sure you
could rely on it enough for an automated test, though, 'cause it
probably won't work on some file systems and the tests would probably
fail for random transient reasons (for example: some systems won't
kick pages out of kernel cache if they were already there, just
because we decided to open the file with O_DIRECT).  (I got curious
about why mincore() wasn't standardised along with mmap() and all that
jazz; it seems the BSD and later Sun people who invented all those
interfaces didn't think that one was quite good enough[2], but every
(?) Unixoid OS copied it anyway, with variations...  Apparently the
Windows thing is called VirtualQuery()).

> 2. Can we combine these two append_conf to a single statement?
> +$node->append_conf('io_direct', 'data,wal,wal_init');
> +$node->append_conf('shared_buffers', '64kB'); # tiny to force I/O

OK, sure, done.  And also oops, that was completely wrong and not
working the way I had it in that version...

> 3. A nitpick: Can we split these queries multi-line instead of in a single line?
> +$node->safe_psql('postgres', 'begin; create temporary table t2 as
> select 1 as i from generate_series(1, 10000); update t2 set i = i;
> insert into t2count select count(*) from t2; commit;');

OK.

> 4. I don't think we need to stop the node before the test ends, no?
> +$node->stop;
> +
> +done_testing();

Sure, but why not?

Otherwise, I rebased, and made a couple more changes:

I found a line of the manual about wal_sync_method that needed to be removed:

-        The <literal>open_</literal>* options also use <literal>O_DIRECT</literal> if available.

In fact that sentence didn't correctly document the behaviour in
released branches (wal_level=minimal is also required for that, so
probably very few people ever used it).  I think we should adjust that
misleading sentence in back-branches, separately from this patch set.

I also updated the commit message to highlight the only expected
user-visible change for this, namely the loss of the above incorrectly
documented obscure special case, which is replaced by the less obscure
new setting io_direct=wal, if someone still wants that behaviour.

Also a few minor comment changes.

[1] https://twitter.com/MengTangmu/status/994770040745615361
[2] http://kos.enix.org/pub/gingell8.pdf

Attachment

Re: Direct I/O

From
Thomas Munro
Date:
I did some testing with non-default block sizes, and found a few minor
things that needed adjustment.  The short version is that I blocked
some configurations that won't work or would break an assertion.
After a bit more copy-editing on docs and comments and a round of
automated indenting, I have now pushed this.  I will now watch the
build farm.  I tested on quite a few OSes that I have access to, but
this is obviously a very OS-sensitive kind of a thing.

The adjustments were:

1.  If you set your BLCKSZ or XLOG_BLCKSZ smaller than
PG_IO_ALIGN_SIZE, you shouldn't be allowed to turn on direct I/O for
the relevant operations, because such undersized direct I/Os will fail
on common systems.

FATAL:  invalid value for parameter "io_direct": "wal"
DETAIL:  io_direct is not supported for WAL because XLOG_BLCKSZ is too small

FATAL:  invalid value for parameter "io_direct": "data"
DETAIL:  io_direct is not supported for data because BLCKSZ is too small

In fact some systems would be OK with it if the true requirement is
512 not 4096, but (1) tiny blocks are a niche build option that
doesn't even pass regression tests and (2) it's hard and totally
unportable to find out the true requirement at runtime, and (3) the
conservative choice of 4096 has additional benefits by matching memory
pages.  So I think a conservative compile-time number is a good
starting position.

2.  Previously I had changed the WAL buffer alignment to be the larger
of PG_IO_ALIGN_SIZE and XLOG_BLCKSZ, but in light of the above
thinking, I reverted that part (no point in aligning the address of
the buffer when the size is too small for direct I/O, but now that
combination is blocked off at GUC level so we don't need any change
here).

3.  I updated the md.c alignment assertions to allow for tiny blocks.
The point of these assertions is to fail if any new code does I/O from
badly aligned buffers even with io_direct turned off (ie how most
people hack), 'cause that will fail with io_direct turned on; a sketch
of the assertion's shape follows below.  The change is that I don't
make the assertion if you're using BLCKSZ < PG_IO_ALIGN_SIZE.  Such
buffers wouldn't work if used for direct I/O, but that's OK, the GUC
won't allow it.

4.  I made the language explaining where PG_IO_ALIGN_SIZE really comes
from a little vaguer, because it's complex.


Re: Direct I/O

From
Thomas Munro
Date:
On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> After a bit more copy-editing on docs and comments and a round of
> automated indenting, I have now pushed this.  I will now watch the
> build farm.  I tested on quite a few OSes that I have access to, but
> this is obviously a very OS-sensitive kind of a thing.

Hmm.  I see a strange "invalid page" failure on Andrew's machine crake
in 004_io_direct.pl.  Let's see what else comes in.



Re: Direct I/O

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> I did some testing with non-default block sizes, and found a few minor
> things that needed adjustment.

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2004%3A42%3A04

This seems like another thing that should not have been pushed mere
hours before feature freeze.

            regards, tom lane



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-08 16:59:20 +1200, Thomas Munro wrote:
> On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > After a bit more copy-editing on docs and comments and a round of
> > automated indenting, I have now pushed this.  I will now watch the
> > build farm.  I tested on quite a few OSes that I have access to, but
> > this is obviously a very OS-sensitive kind of a thing.
> 
> Hmm.  I see a strange "invalid page" failure on Andrew's machine crake
> in 004_io_direct.pl.  Let's see what else comes in.

There were some failures in CI (e.g. [1], and perhaps also in the
buildfarm, I didn't check yet) about "no unpinned buffers available".
I was worried for a moment that this could actually be related to the
bulk extension patch.

But it looks like it's older - and not caused by direct I/O support
(except by way of the test existing). I reproduced the issue locally
by setting s_b even lower, to 16, and made the ERROR a PANIC.

#4  0x00005624dfe90336 in errfinish (filename=0x5624df6867c0 "../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c", lineno=353, funcname=0x5624df686900 <__func__.6> "StrategyGetBuffer") at ../../../../home/andres/src/postgresql/src/backend/utils/error/elog.c:604
#5  0x00005624dfc71dbe in StrategyGetBuffer (strategy=0x0, buf_state=0x7ffd4182137c, from_ring=0x7ffd4182137b) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/freelist.c:353
#6  0x00005624dfc6a922 in GetVictimBuffer (strategy=0x0, io_context=IOCONTEXT_NORMAL) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1601
#7  0x00005624dfc6a29f in BufferAlloc (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, strategy=0x0, foundPtr=0x7ffd418214a3, io_context=IOCONTEXT_NORMAL) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1290
#8  0x00005624dfc69c0c in ReadBuffer_common (smgr=0x5624e1ff27f8, relpersistence=112 'p', forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0, hit=0x7ffd4182156b) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:1056
#9  0x00005624dfc69335 in ReadBufferExtended (reln=0x5624e1ee09f0, forkNum=MAIN_FORKNUM, blockNum=16, mode=RBM_NORMAL, strategy=0x0) at ../../../../home/andres/src/postgresql/src/backend/storage/buffer/bufmgr.c:776
#10 0x00005624df8eb78a in log_newpage_range (rel=0x5624e1ee09f0, forknum=MAIN_FORKNUM, startblk=0, endblk=45, page_std=false) at ../../../../home/andres/src/postgresql/src/backend/access/transam/xloginsert.c:1290
#11 0x00005624df9567e7 in smgrDoPendingSyncs (isCommit=true, isParallelWorker=false) at ../../../../home/andres/src/postgresql/src/backend/catalog/storage.c:837
#12 0x00005624df8d1dd2 in CommitTransaction () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:2225
#13 0x00005624df8d2da2 in CommitTransactionCommand () at ../../../../home/andres/src/postgresql/src/backend/access/transam/xact.c:3060
#14 0x00005624dfcbe0a1 in finish_xact_command () at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:2779
#15 0x00005624dfcbb867 in exec_simple_query (query_string=0x5624e1eacd98 "create table t1 as select 1 as i from generate_series(1,10000)") at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:1299
#16 0x00005624dfcc09c4 in PostgresMain (dbname=0x5624e1ee40e8 "postgres", username=0x5624e1e6c5f8 "andres") at ../../../../home/andres/src/postgresql/src/backend/tcop/postgres.c:4623
#17 0x00005624dfbecc03 in BackendRun (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4461
#18 0x00005624dfbec48e in BackendStartup (port=0x5624e1ed8250) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:4189
#19 0x00005624dfbe8541 in ServerLoop () at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1779
#20 0x00005624dfbe7e56 in PostmasterMain (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/postmaster/postmaster.c:1463
#21 0x00005624dfad538b in main (argc=4, argv=0x5624e1e6a520) at ../../../../home/andres/src/postgresql/src/backend/main/main.c:200


If you look at log_newpage_range(), it's not surprising that we get this error
- it pins up to 32 buffers at once.

Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
c6b92041d385.


It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
relation extension patch I introduced LimitAdditionalPins() to deal with this
concern. Perhaps it needs to be exposed and log_newpage_range() should use
it?
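
Something along these lines, perhaps (a sketch assuming an exported
LimitAdditionalPins(uint32 *); illustrative only, not an actual patch):

    static void
    log_newpage_range_sketch(Relation rel, ForkNumber forknum,
                             BlockNumber startblk, BlockNumber endblk)
    {
        BlockNumber blkno = startblk;

        while (blkno < endblk)
        {
            Buffer      bufpack[XLR_MAX_BLOCK_ID];  /* at most 32 */
            uint32      nbufs = XLR_MAX_BLOCK_ID;
            uint32      nread = 0;
            uint32      i;

            /* don't try to pin more buffers than the pool can spare */
            LimitAdditionalPins(&nbufs);

            while (nread < nbufs && blkno < endblk)
                bufpack[nread++] = ReadBufferExtended(rel, forknum, blkno++,
                                                      RBM_NORMAL, NULL);

            /* lock, register and WAL-log the batch here, as today */

            for (i = 0; i < nread; i++)
                ReleaseBuffer(bufpack[i]);
        }
    }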


Do we care about fixing this in the backbranches? Probably not, given there
haven't been user complaints?


Greetings,

Andres Freund

[1] https://cirrus-ci.com/task/4519721039560704



Re: Direct I/O

From
Thomas Munro
Date:
On Sat, Apr 8, 2023 at 4:59 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sat, Apr 8, 2023 at 4:47 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > After a bit more copy-editing on docs and comments and a round of
> > automated indenting, I have now pushed this.  I will now watch the
> > build farm.  I tested on quite a few OSes that I have access to, but
> > this is obviously a very OS-sensitive kind of a thing.
>
> Hmm.  I see a strange "invalid page" failure on Andrew's machine crake
> in 004_io_direct.pl.  Let's see what else comes in.

No particular ideas about what happened there yet.  It *looks* like we
wrote out a page, and then read it back in very soon afterwards, all
via the usual locked bufmgr/smgr pathways, and it failed basic page
validation.  The reader was a parallel worker, because of the
debug_parallel_query setting on that box.  The page number looks
reasonable (I guess it's reading a page created by the UPDATE full of
new tuples, but I don't know which process wrote it).  It's also not
immediately obvious how this could be connected to the 32-pinned-buffer
problem (all of that would have happened in the CREATE TABLE process,
which had already ended before the UPDATE and SELECT backends even
started).

Andrew, what file system and type of disk is that machine using?



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> There were some failures in CI (e.g. [1], and perhaps also in the
> buildfarm, I didn't check yet) about "no unpinned buffers available".
> I was worried for a moment that this could actually be related to the
> bulk extension patch.
> 
> But it looks like it's older - and not caused by direct I/O support
> (except by way of the test existing). I reproduced the issue locally
> by setting s_b even lower, to 16, and made the ERROR a PANIC.
>
> [backtrace]
> 
> If you look at log_newpage_range(), it's not surprising that we get this error
> - it pins up to 32 buffers at once.
> 
> Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
> c6b92041d385.
> 
> 
> It doesn't really seem OK to me to unconditionally pin 32 buffers. For the
> relation extension patch I introduced LimitAdditionalPins() to deal with this
> concern. Perhaps it needs to be exposed and log_newpage_range() should use
> it?
> 
> 
> Do we care about fixing this in the backbranches? Probably not, given there
> haven't been user complaints?

Here's a quick prototype of this approach. If we expose LimitAdditionalPins(),
we'd probably want to add "Buffer" to the name, and pass it a relation, so
that it can hand off LimitAdditionalLocalPins() when appropriate? The callsite
in question doesn't need it, but ...

Without the limiting of pins the modified 004_io_direct.pl fails 100% of the
time for me.

Presumably the reason it fails occasionally with 256kB of shared buffers
(i.e. NBuffers=32) is that autovacuum or checkpointer briefly pins a single
buffer. As log_newpage_range() thinks it can just pin 32 buffers
unconditionally, it fails in that case.

Greetings,

Andres Freund

Attachment

Re: Direct I/O

From
Andres Freund
Date:
Hi,

Given the frequency of failures on this in the buildfarm, I propose the
temporary workaround of using wal_level=replica. That avoids the use of the
over-eager log_newpage_range().

Greetings,

Andres Freund



Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 9, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:
> Given the frequency of failures on this in the buildfarm, I propose the
> temporary workaround of using wal_level=replica. That avoids the use of the
> over-eager log_newpage_range().

Will do.



Re: Direct I/O

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sun, Apr 9, 2023 at 6:55 AM Andres Freund <andres@anarazel.de> wrote:
>> Given the frequency of failures on this in the buildfarm, I propose the
>> temporary workaround of using wal_level=replica. That avoids the use of the
>> over-eager log_newpage_range().

> Will do.

Now crake is doing this:

2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG:  statement: select count(*) from t1
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR:  invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT:  select count(*) from t1
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR:  invalid page in block 56 of relation base/5/16384
2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT:  select count(*) from t1
2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG:  background worker "parallel worker" (PID 3257646) exited with exit code 1

The fact that the error is happening in a parallel worker seems
interesting ...

(BTW, why are the log lines doubly timestamped?)

            regards, tom lane



Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 9, 2023 at 9:10 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> 2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG:  statement: select count(*) from t1
> 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR:  invalid page in block 56 of relation base/5/16384

> The fact that the error is happening in a parallel worker seems
> interesting ...

That's because it's running with debug_parallel_query=regress.  I've
been trying to repro that but no luck...  A different kind of failure
also showed up, where it counted the wrong number of tuples:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=crake&dt=2023-04-08%2015%3A52%3A03

A paranoid explanation would be that this system is failing to provide
basic I/O coherency: we're writing pages out and not reading them back
in.  Or of course there is a dumb bug... but why only here?  It can of
course be timing-sensitive, and it's interesting that crake suffers
from the "no unpinned buffers available" thing (which should now be
gone) with higher frequency; I'm keen to see if the dodgy-read problem
continues with a similar frequency now.



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
> Now crake is doing this:
> 
> 2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG:  statement: select count(*) from t1
> 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR:  invalid page in block 56 of relation base/5/16384
> 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT:  select count(*) from t1
> 2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR:  invalid page in block 56 of relation base/5/16384
> 2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT:  select count(*) from t1
> 2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG:  background worker "parallel worker" (PID 3257646) exited with exit code 1
> 
> The fact that the error is happening in a parallel worker seems
> interesting ...

There were a few prior instances of that error. One that I hadn't seen before
is this:

[11:35:07.190](0.001s) #   Failed test 'read back from shared'
#   at /home/andrew/bf/root/HEAD/pgsql/src/test/modules/test_misc/t/004_io_direct.pl line 43.
[11:35:07.190](0.000s) #          got: '10000'
#     expected: '10098'

For one, it points to the arguments to is() being switched around, but that's
a sideshow.


> (BTW, why are the log lines doubly timestamped?)

It's odd.

It's also odd that it's just crake having the issue. It's just a linux host,
afaics. Andrew, is there any chance you can run that test in isolation and see
whether it reproduces? If so, does the problem vanish, if you comment out the
io_direct= in the test? Curious whether this is actually an O_DIRECT issue, or
whether it's an independent issue exposed by the new test.


I wonder if we should make the test use data checksums - if we continue to see
the wrong query results, the corruption is more likely to be in memory.

Greetings,

Andres Freund



Re: Direct I/O

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
>> (BTW, why are the log lines doubly timestamped?)

> It's odd.

Oh, I guess that's intentional, because crake has

                 'log_line_prefix = \'%m [%s %p:%l] %q%a \'',

> It's also odd that it's just crake having the issue. It's just a linux host,
> afaics.

Indeed.  I'm guessing from the compiler version that it's Fedora 37 now
(the lack of such basic information in the meson configuration output
is pretty annoying).  I've been trying to repro it here on an F37 box,
with no success, suggesting that it's very timing sensitive.  Or maybe
it's inside a VM and that matters?

            regards, tom lane



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-08 17:31:02 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
> > It's also odd that it's just crake having the issue. It's just a linux host,
> > afaics.
> 
> Indeed.  I'm guessing from the compiler version that it's Fedora 37 now

The 15 branch says:

hostname = neoemma
uname -m = x86_64
uname -r = 6.2.8-100.fc36.x86_64
uname -s = Linux
uname -v = #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023

So at least the kernel claims to be 36...


> (the lack of such basic information in the meson configuration output
> is pretty annoying).

Yea, I was thinking yesterday that we should add uname output to meson's
configure (if available). I'm sure we can figure out a reasonably fast windows
command for the version, too.


> I've been trying to repro it here on an F37 box, with no success, suggesting
> that it's very timing sensitive.  Or maybe it's inside a VM and that
> matters?

Could also be filesystem specific?

Greetings,

Andres Freund



Re: Direct I/O

From
Andrew Dunstan
Date:


On 2023-04-08 Sa 17:42, Andres Freund wrote:
> Hi,
> 
> On 2023-04-08 17:31:02 -0400, Tom Lane wrote:
> > Andres Freund <andres@anarazel.de> writes:
> > > On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
> > > It's also odd that it's just crake having the issue. It's just a linux host,
> > > afaics.
> > Indeed.  I'm guessing from the compiler version that it's Fedora 37 now
> 
> The 15 branch says:
> 
> hostname = neoemma
> uname -m = x86_64
> uname -r = 6.2.8-100.fc36.x86_64
> uname -s = Linux
> uname -v = #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023
> 
> So at least the kernel claims to be 36...
> 
> > (the lack of such basic information in the meson configuration output
> > is pretty annoying).
> 
> Yea, I was thinking yesterday that we should add uname output to meson's
> configure (if available). I'm sure we can figure out a reasonably fast windows
> command for the version, too.
> 
> > I've been trying to repro it here on an F37 box, with no success, suggesting
> > that it's very timing sensitive.  Or maybe it's inside a VM and that
> > matters?
> 
> Could also be filesystem specific?


I migrated it in February from a VM to a non-virtual instance. Almost nothing else runs on the machine. The personality info shown on the BF server is correct.

andrew@neoemma:~ $ cat /etc/fedora-release
Fedora release 36 (Thirty Six)
andrew@neoemma:~ $ uname -a
Linux neoemma 6.2.8-100.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Mar 22 19:14:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
andrew@neoemma:~ $ gcc --version
gcc (GCC) 12.2.1 20221121 (Red Hat 12.2.1-4)
andrew@neoemma:~ $ mount | grep home
/dev/mapper/luks-xxxxxxx on /home type btrfs (rw,relatime,seclabel,compress=zstd:1,ssd,discard=async,space_cache,subvolid=256,subvol=/home)


I guess it could be btrfs-specific. I'll be somewhat annoyed if I have to re-init the machine to use something else.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 9, 2023 at 10:08 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> btrfs

Aha!



Re: Direct I/O

From
Andrew Dunstan
Date:


On 2023-04-08 Sa 17:23, Andres Freund wrote:
> Hi,
> 
> On 2023-04-08 17:10:19 -0400, Tom Lane wrote:
> > Thomas Munro <thomas.munro@gmail.com> writes:
> > Now crake is doing this:
> > 
> > 2023-04-08 16:50:03.177 EDT [2023-04-08 16:50:03 EDT 3257645:3] 004_io_direct.pl LOG:  statement: select count(*) from t1
> > 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:1] ERROR:  invalid page in block 56 of relation base/5/16384
> > 2023-04-08 16:50:03.316 EDT [2023-04-08 16:50:03 EDT 3257646:2] STATEMENT:  select count(*) from t1
> > 2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:4] 004_io_direct.pl ERROR:  invalid page in block 56 of relation base/5/16384
> > 2023-04-08 16:50:03.317 EDT [2023-04-08 16:50:03 EDT 3257645:5] 004_io_direct.pl STATEMENT:  select count(*) from t1
> > 2023-04-08 16:50:03.319 EDT [2023-04-08 16:50:02 EDT 3257591:4] LOG:  background worker "parallel worker" (PID 3257646) exited with exit code 1
> > 
> > The fact that the error is happening in a parallel worker seems
> > interesting ...
> 
> There were a few prior instances of that error. One that I hadn't seen before
> is this:
> 
> [11:35:07.190](0.001s) #   Failed test 'read back from shared'
> #   at /home/andrew/bf/root/HEAD/pgsql/src/test/modules/test_misc/t/004_io_direct.pl line 43.
> [11:35:07.190](0.000s) #          got: '10000'
> #     expected: '10098'
> 
> For one, it points to the arguments to is() being switched around, but that's
> a sideshow.
> 
> It's also odd that it's just crake having the issue. It's just a linux host,
> afaics. Andrew, is there any chance you can run that test in isolation and see
> whether it reproduces? If so, does the problem vanish, if you comment out the
> io_direct= in the test? Curious whether this is actually an O_DIRECT issue, or
> whether it's an independent issue exposed by the new test.
> 
> I wonder if we should make the test use data checksums - if we continue to see
> the wrong query results, the corruption is more likely to be in memory.


I can run the test in isolation, and it gets an error reliably.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 9, 2023 at 10:17 AM Andrew Dunstan <andrew@dunslane.net> wrote:
> I can run the test in isolation, and it's get an error reliably.

Random idea: it looks like you have compression enabled.  What if you
turn it off in the directory where the test runs?  Something like
btrfs property set <file> compression ... according to the
intergoogles.  (I have never used btrfs before 6 minutes ago but I
can't seem to repro this with basic settings in a loopback btrfs
filesystems).



Re: Direct I/O

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sun, Apr 9, 2023 at 10:08 AM Andrew Dunstan <andrew@dunslane.net> wrote:
>> btrfs

> Aha!

Googling finds a lot of suggestions that O_DIRECT doesn't play nice
with btrfs, for example

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg92824.html

It's not clear to me how much of that lore is still current,
but it's disturbing.

            regards, tom lane



Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 9, 2023 at 11:05 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Googling finds a lot of suggestions that O_DIRECT doesn't play nice
> with btrfs, for example
>
> https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg92824.html
>
> It's not clear to me how much of that lore is still current,
> but it's disturbing.

I think that particular thing might relate to modifications of the
user buffer while a write is in progress (breaking btrfs's internal
checksums).  I don't think we should ever do that ourselves (not least
because it'd break our own checksums).  We lock the page during the
write so no one can do that, and then we sleep in a synchronous
syscall.

Here's something recent.  I guess it's probably not relevant (a fault
on our buffer that we recently touched sounds pretty unlikely), but
who knows...  (developer lists for file systems are truly terrifying
places to drive through).

https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/

It's odd, though, if it is their bug and not ours: I'd expect our
friends in other databases to have hit all that sort of thing years
ago, since many comparable systems have a direct I/O knob*.  What are
we doing differently?  Are our multiple processes a factor here,
breaking some coherency logic?  Unsurprisingly, having compression on
as Andrew does actually involves buffering anyway[1] despite our
O_DIRECT flag, but maybe that's saying writes are buffered but reads
are still direct (?), which sounds like the sort of initial conditions
that might produce a coherency bug.  I dunno.

I gather that btrfs is actually Fedora's default file system (or maybe
it's just "laptops and desktops"[2]?).  I wonder if any of the several
green Fedora systems in the 'farm are using btrfs.  I wonder if they
are using different mount options (thinking again of compression).

*Probably a good reason to add a more prominent warning that the
feature is developer-only, experimental and not for production use.
I'm thinking a warning at startup or something.

[1] https://btrfs.readthedocs.io/en/latest/Compression.html
[2] https://fedoraproject.org/wiki/Changes/BtrfsByDefault



Re: Direct I/O

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> It's odd, though, if it is their bug and not ours: I'd expect our
> friends in other databases to have hit all that sort of thing years
> ago, since many comparable systems have a direct I/O knob*.

Yeah, it seems moderately likely that it's our own bug ... but this
code's all file-system-ignorant, so how?  Maybe we are breaking some
POSIX rule that btrfs exploits but others don't?

> I gather that btrfs is actually Fedora's default file system (or maybe
> it's just "laptops and desktops"[2]?).

I have a ton of Fedora images laying about, and I doubt that any of them
use btrfs, mainly because that's not the default in the "server spin"
which is what I usually install from.  It's hard to guess about the
buildfarm, but it wouldn't surprise me that most of them are on xfs.
(If we haven't figured this out pretty shortly, I'm probably going to
put together a btrfs-on-bare-metal machine to try to duplicate crake's
results.)

            regards, tom lane



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-09 13:55:33 +1200, Thomas Munro wrote:
> I think that particular thing might relate to modifications of the
> user buffer while a write is in progress (breaking btrfs's internal
> checksums).  I don't think we should ever do that ourselves (not least
> because it'd break our own checksums).  We lock the page during the
> write so no one can do that, and then we sleep in a synchronous
> syscall.

Oh, but we actually *do* modify pages while IO is going on. I wonder if you
hit the jackpot here. The content lock doesn't prevent hint bit
writes. That's why we copy the page to temporary memory when computing
checksums.
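
(A simplified sketch of that trick, in the shape of bufpage.c's
PageSetChecksumCopy(); the real code manages the static buffer
differently, and the function name here is made up:)

    /* I/O-aligned scratch page, reused across calls in one backend */
    static PGIOAlignedBlock copied_page;

    static char *
    checksum_copy_sketch(char *shared_page, BlockNumber blkno)
    {
        /*
         * Hint bits may be set in the shared buffer under a mere shared
         * content lock while we are writing it out, so snapshot the page
         * and checksum (and write) the private, stable copy instead.
         */
        memcpy(copied_page.data, shared_page, BLCKSZ);
        PageSetChecksumInplace((Page) copied_page.data, blkno);
        return copied_page.data;
    }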

I think we should modify the test to enable checksums - if the problem goes
away, then it's likely to be related to modifying pages while an O_DIRECT
write is ongoing...

Greetings,

Andres Freund



Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 9, 2023 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
> On 2023-04-09 13:55:33 +1200, Thomas Munro wrote:
> > I think that particular thing might relate to modifications of the
> > user buffer while a write is in progress (breaking btrfs's internal
> > checksums).  I don't think we should ever do that ourselves (not least
> > because it'd break our own checksums).  We lock the page during the
> > write so no one can do that, and then we sleep in a synchronous
> > syscall.
>
> Oh, but we actually *do* modify pages while IO is going on. I wonder if you
> hit the jackpot here. The content lock doesn't prevent hint bit
> writes. That's why we copy the page to temporary memory when computing
> checksums.

More like the jackpot hit me.

Woo, I can now reproduce this locally on a loop filesystem.
Previously I had missed a step: the parallel worker seems to be
necessary.  More soon.



Re: Direct I/O

From
Noah Misch
Date:
On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:
> On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> > There were some failures in CI (e.g. [1] (and perhaps also bf, didn't yet
> > check), about "no unpinned buffers available".  I was worried for a moment
> > that this could actually be related to the bulk extension patch.
> > 
> > But it looks like it's older - and not caused by direct_io support (except by
> > way of the test existing). I reproduced the issue locally by setting s_b even
> > lower, to 16 and made the ERROR a PANIC.
> >
> > [backtrace]

I get an ERROR, not a PANIC:

$ git rev-parse HEAD
2e57ffe12f6b5c1498f29cb7c0d9e17c797d9da6
$ git diff -U0
diff --git a/src/test/modules/test_misc/t/004_io_direct.pl b/src/test/modules/test_misc/t/004_io_direct.pl
index f5bf0b1..8f0241b 100644
--- a/src/test/modules/test_misc/t/004_io_direct.pl
+++ b/src/test/modules/test_misc/t/004_io_direct.pl
@@ -25 +25 @@ io_direct = 'data,wal,wal_init'
-shared_buffers = '256kB' # tiny to force I/O
+shared_buffers = 16
$ ./configure -C --enable-debug --enable-cassert --enable-depend --enable-tap-tests --with-tcl --with-python --with-perl
$ make -C src/test/modules/test_misc check PROVE_TESTS=t/004_io_direct.pl
# +++ tap check in src/test/modules/test_misc +++
t/004_io_direct.pl .. Dubious, test returned 29 (wstat 7424, 0x1d00)
No subtests run 

Test Summary Report
-------------------
t/004_io_direct.pl (Wstat: 7424 Tests: 0 Failed: 0)
  Non-zero exit status: 29
  Parse errors: No plan found in TAP output
Files=1, Tests=0,  1 wallclock secs ( 0.01 usr  0.00 sys +  0.41 cusr  0.14 csys =  0.56 CPU)
Result: FAIL
make: *** [../../../../src/makefiles/pgxs.mk:460: check] Error 1
$ grep pinned src/test/modules/test_misc/tmp_check/log/*
src/test/modules/test_misc/tmp_check/log/004_io_direct_main.log:2023-04-08 21:12:46.781 PDT [929628] 004_io_direct.pl ERROR:  no unpinned buffers available
src/test/modules/test_misc/tmp_check/log/regress_log_004_io_direct:error running SQL: 'psql:<stdin>:1: ERROR:  no unpinned buffers available'

No good reason to PANIC there, so the path to PANIC may be fixable.

> > If you look at log_newpage_range(), it's not surprising that we get this error
> > - it pins up to 32 buffers at once.
> > 
> > Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
> > c6b92041d385.

> > Do we care about fixing this in the backbranches? Probably not, given there
> > haven't been user complaints?

I would not.  This is only going to come up where the user goes out of the way
to use near-minimum shared_buffers.

> Here's a quick prototype of this approach.

This looks fine.  I'm not enthusiastic about incurring post-startup cycles to
cater to allocating less than 512k*max_connections of shared buffers, but I
expect the cycles in question are negligible here.
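
(For illustration, the general shape of such a clamp -- a purely
hypothetical sketch with made-up ratios, not the actual prototype
attached upthread: cap how many pages a log_newpage_range()-style
loop pins per batch, so tiny shared_buffers settings survive.)

    #include <stdio.h>

    #define XLR_MAX_BLOCK_ID 32     /* a WAL record references at most this many blocks */

    static int
    clamp_batch_size(int shared_buffers, int requested)
    {
        int     budget = shared_buffers / 4;    /* leave most of the pool unpinned */

        if (budget < 1)
            budget = 1;
        if (requested > budget)
            requested = budget;
        if (requested > XLR_MAX_BLOCK_ID)
            requested = XLR_MAX_BLOCK_ID;
        return requested;
    }

    int
    main(void)
    {
        /* 16 buffers (the repro above) vs. a more realistic pool */
        printf("%d %d\n", clamp_batch_size(16, 32), clamp_batch_size(16384, 32));
        return 0;
    }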



Re: Direct I/O

From
Thomas Munro
Date:
Indeed, I can't reproduce this with (our) checksums on.  I also can't
reproduce it with O_DIRECT off.  I also can't reproduce it if I use
"mkdir pgdata && chattr +C pgdata && initdb -D pgdata" to have a
pgdata directory with copy-on-write and (their) checksums disabled.
But it reproduces quite easily with COW on (default behaviour) with
io_direct=data, debug_parallel_query=regress, and the create table as
...; update ...; select count(*) ... sequence from that test.

Unfortunately my mental model of btrfs is extremely limited, basically
just "something a bit like ZFS".  FWIW I've been casually following
along with OpenZFS's ongoing O_DIRECT project, and I know that the
plan there is to make a temporary stable copy if checksums and other
features are on (a bit like PostgreSQL does for the same reason, as
you reminded us).  Time will tell how that works out but it *seems*
like all available modes would therefore work correctly for us, with
different tradeoffs (ie if you want the fastest zero-copy I/O, don't
use checksums, compression, etc).

Here, btrfs seems to be taking a different path that I can't quite
make out...  I see no warning/error about a checksum failure like [1],
and we apparently managed to read something other than a mix of the
old and new page contents (which, based on your hypothesis, should
just leave it indeterminate whether the hint bit changes were captured
or not, and the rest of the page should be stable, right?).  It's like
the page time-travelled or got scrambled in some other way, but it
didn't tell us?  I'll try to dig further...

[1] https://archive.kernel.org/oldwiki/btrfs.wiki.kernel.org/index.php/Gotchas.html#Direct_IO_and_CRCs



Re: Direct I/O

From
Andrew Dunstan
Date:


On 2023-04-08 Sa 18:50, Thomas Munro wrote:
On Sun, Apr 9, 2023 at 10:17 AM Andrew Dunstan <andrew@dunslane.net> wrote:
I can run the test in isolation, and it gets an error reliably.
Random idea: it looks like you have compression enabled.  What if you
turn it off in the directory where the test runs?  Something like
btrfs property set <file> compression ... according to the
intergoogles.  (I have never used btrfs before 6 minutes ago but I
can't seem to repro this with basic settings in a loopback btrfs
filesystem).


Didn't seem to make any difference.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 9, 2023 at 4:52 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here, btrfs seems to be taking a different path that I can't quite
> make out...  I see no warning/error about a checksum failure like [1],
> and we apparently managed to read something other than a mix of the
> old and new page contents (which, based on your hypothesis, should
> just leave it indeterminate whether the hint bit changes were captured
> or not, and the rest of the page should be stable, right).  It's like
> the page time-travelled or got scrambled in some other way, but it
> didn't tell us?  I'll try to dig further...

I think there are two separate bad phenomena.

1.  A concurrent modification of the user space buffer while writing
breaks the checksum, so you can't read the data back in.  I can
reproduce that with a stand-alone program, attached (a sketch in the
same spirit appears at the end of this message).  The "verifier"
process occasionally reports EIO while reading, unless you comment out
the "scribbler" process's active line.  The system log/dmesg gets some
warnings.

2.  The crake-style failure doesn't involve any reported checksum
failures or errors, and I'm not sure if another process is even
involved.  I attach a complete syscall trace of a repro session.  (I
tried to get strace to dump 8192 byte strings, but then it doesn't
repro, so we have only the start of the data transferred for each
page.)  Working back from the error message,

ERROR:  invalid page in block 78 of relation base/5/16384,

we have a page at offset 638976, and we can find all system calls that
touched that offset:

[pid 26031] 23:26:48.521123 pwritev(50,
[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
iov_len=8192}], 1, 638976) = 8192

[pid 26040] 23:26:48.568975 pwrite64(5,
"\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
\0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192

[pid 26040] 23:26:48.593157 pread64(6,
"\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
8192, 638976) = 8192

In between the write of non-zeros and the read of zeros, nothing that
I can grok seems to happen that could justify that, but perhaps
someone else will see something that I'm missing.  We pretty much just
have the parallel worker scanning the table, and writing stuff out as
it does it.  This was obtained with:

strace -f --absolute-timestamps=time,us ~/install/bin/postgres -D
pgdata -c io_direct=data -c shared_buffers=256kB -c wal_level=minimal
-c max_wal_senders=0 2>&1 | tee trace.log

The repro is just:

set debug_parallel_query=regress;
drop table if exists t;
create table t as select generate_series(1, 10000);
update t set generate_series = 1;
select count(*) from t;

Occasionally it fails in a different way: after create table t, later
references to t can't find it in the catalogs, but there is no invalid
page error.  Perhaps the freaky zeros are happening one 4KB page at a
time, and if you get two in a row the result might look like an empty
catalog page and pass validation.
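
For anyone without the attachment, here's a sketch in the same spirit
as the scribbler/verifier program (my own reconstruction with made-up
structure; the attached original may differ -- assumes Linux and a
btrfs file with checksums on, interrupt with ^C):

    /* scribble.c: an O_DIRECT writer keeps writing a buffer that a
     * "scribbler" child concurrently modifies, while a "verifier"
     * child reads the file back and (on btrfs with checksums)
     * occasionally gets EIO. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    int
    main(void)
    {
        /* Page-aligned buffer shared by the writer and the scribbler. */
        unsigned char *buf = mmap(NULL, BLCKSZ, PROT_READ | PROT_WRITE,
                                  MAP_SHARED | MAP_ANONYMOUS, -1, 0);
        int         wfd = open("scribble.dat",
                               O_RDWR | O_CREAT | O_DIRECT, 0600);

        if (buf == MAP_FAILED || wfd < 0)
        {
            perror("setup");
            return 1;
        }
        memset(buf, 'x', BLCKSZ);

        if (fork() == 0)
        {
            /* Scribbler: comment this loop out and the EIOs stop. */
            for (;;)
                buf[rand() % BLCKSZ]++;
        }
        if (fork() == 0)
        {
            /* Verifier: buffered reads of what the writer wrote. */
            int         rfd = open("scribble.dat", O_RDONLY);
            unsigned char tmp[BLCKSZ];

            for (;;)
                if (pread(rfd, tmp, BLCKSZ, 0) < 0)
                    fprintf(stderr, "verifier: %s\n", strerror(errno));
        }
        for (;;)
        {
            /* Writer: direct writes from the still-mutating buffer. */
            if (pwrite(wfd, buf, BLCKSZ, 0) < 0)
                perror("writer");
        }
    }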


Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
> Didn't seem to make any difference.

Thanks for testing.  I think it's COW (and I think that implies also
checksums?) that needs to be turned off, at least based on experiments
here.



Re: Direct I/O

From
Andrew Dunstan
Date:


On 2023-04-09 Su 08:39, Thomas Munro wrote:
On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
Didn't seem to make any difference.
Thanks for testing.  I think it's COW (and I think that implies also
checksums?) that needs to be turned off, at least based on experiments
here.



Googling agrees with you about checksums.  But I'm still wondering if we shouldn't disable COW for the build directory etc. It is suggested at [1]:


Recommend to set nodatacow – turn cow off – for the files that require fast IO and tend to get very big and get alot of random writes: such VMDK (vm disks) files and the like.


I'll give it a whirl.


cheers


andrew


[1] <http://www.infotinks.com/btrfs-disabling-cow-file-directory-nodatacow/>

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Direct I/O

From
Andrew Dunstan
Date:


On 2023-04-09 Su 09:14, Andrew Dunstan wrote:


On 2023-04-09 Su 08:39, Thomas Munro wrote:
On Sun, Apr 9, 2023 at 11:25 PM Andrew Dunstan <andrew@dunslane.net> wrote:
Didn't seem to make any difference.
Thanks for testing.  I think it's COW (and I think that implies also
checksums?) that needs to be turned off, at least based on experiments
here.



Googling agrees with you about checksums.  But I'm still wondering if we shouldn't disable COW for the build directory etc. It is suggested at [1]:


Recommend to set nodatacow – turn cow off – for the files that require fast IO and tend to get very big and get alot of random writes: such VMDK (vm disks) files and the like.


I'll give it a whirl.



With COW disabled, I can no longer generate a failure with the test.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Direct I/O

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> we have a page at offset 638976, and we can find all system calls that
> touched that offset:

> [pid 26031] 23:26:48.521123 pwritev(50,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8192}], 1, 638976) = 8192

> [pid 26040] 23:26:48.568975 pwrite64(5,
> "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
> \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192

> [pid 26040] 23:26:48.593157 pread64(6,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 8192, 638976) = 8192

Boy, it's hard to look at that trace and not call it a filesystem bug.
Given the apparent dependency on COW, I wonder if this has something
to do with getting confused about which copy is current?

Another thing that struck me is that the two calls from pid 26040
are issued on different FDs.  I checked the strace log and verified
that these do both refer to "base/5/16384".  It looks like there was
a cache flush at about 23:26:48.575023 that caused 26040 to close
and later reopen all its database relation FDs.  Maybe that is
somehow contributing to the filesystem's confusion?  And more to the
point, could that explain why other O_DIRECT users aren't up in arms
over this bug?  Maybe they don't switch FDs as readily as we do.

            regards, tom lane



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-08 21:29:54 -0700, Noah Misch wrote:
> On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:
> > On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> > > There were some failures in CI (e.g. [1] (and perhaps also bf, didn't yet
> > > check), about "no unpinned buffers available".  I was worried for a moment
> > > that this could actually be related to the bulk extension patch.
> > > 
> > > But it looks like it's older - and not caused by direct_io support (except by
> > > way of the test existing). I reproduced the issue locally by setting s_b even
> > > lower, to 16 and made the ERROR a PANIC.
> > >
> > > [backtrace]
> 
> I get an ERROR, not a PANIC:

What I meant is that I changed the code to use PANIC, to make it easier to get
a backtrace.


> > > If you look at log_newpage_range(), it's not surprising that we get this error
> > > - it pins up to 32 buffers at once.
> > > 
> > > Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
> > > c6b92041d385.
> 
> > > Do we care about fixing this in the backbranches? Probably not, given there
> > > haven't been user complaints?
> 
> I would not.  This is only going to come up where the user goes out of the way
> to use near-minimum shared_buffers.

It's not *just* that scenario. With a few concurrent connections you can get
into problematic territory even with halfway reasonable shared buffers.


> > Here's a quick prototype of this approach.
> 
> This looks fine.  I'm not enthusiastic about incurring post-startup cycles to
> cater to allocating less than 512k*max_connections of shared buffers, but I
> expect the cycles in question are negligible here.

Yea, I can't imagine it'd matter, compared to the other costs. Arguably it'd
allow us to crank up the maximum batch size further, even.

Greetings,

Andres Freund



Re: Direct I/O

From
Noah Misch
Date:
On Sun, Apr 09, 2023 at 02:45:16PM -0700, Andres Freund wrote:
> On 2023-04-08 21:29:54 -0700, Noah Misch wrote:
> > On Sat, Apr 08, 2023 at 11:08:16AM -0700, Andres Freund wrote:
> > > On 2023-04-07 23:04:08 -0700, Andres Freund wrote:
> > > > If you look at log_newpage_range(), it's not surprising that we get this error
> > > > - it pins up to 32 buffers at once.
> > > > 
> > > > Afaics log_newpage_range() originates in 9155580fd5fc, but this caller is from
> > > > c6b92041d385.
> > 
> > > > Do we care about fixing this in the backbranches? Probably not, given there
> > > > haven't been user complaints?
> > 
> > I would not.  This is only going to come up where the user goes out of the way
> > to use near-minimum shared_buffers.
> 
> It's not *just* that scenario. With a few concurrent connections you can get
> into problematic territory even with halfway reasonable shared buffers.

I am not familiar with such cases.  You could get there with 64MB shared
buffers and 256 simultaneous commits of new-refilenode-creating transactions,
but I'd still file that under going out of one's way to use tiny shared
buffers relative to the write activity.  What combination did you envision?



Re: Direct I/O

From
Thomas Munro
Date:
On Mon, Apr 10, 2023 at 8:43 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Boy, it's hard to look at that trace and not call it a filesystem bug.

Agreed.

> Given the apparent dependency on COW, I wonder if this has something
> to do with getting confused about which copy is current?

Yeah, I suppose it would require bogus old page versions (or I guess
alternatively completely mixed up page offsets) rather than bogus
zeroed pages to explain the too-high count observed in one of crake's
failed runs: I guess it counted some pre-updated tuples that were
supposed to be deleted and then counted the post-updated tuples on
later pages (insert joke about the Easter variant of the Halloween
problem).  It's just that in the runs I've managed to observe and
analyse, the previous version always happened to be zeros.



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-10 00:17:12 +1200, Thomas Munro wrote:
> I think there are two separate bad phenomena.
> 
> 1.  A concurrent modification of the user space buffer while writing
> breaks the checksum so you can't read the data back in, as .  I can
> reproduce that with a stand-alone program, attached.  The "verifier"
> process occasionally reports EIO while reading, unless you comment out
> the "scribbler" process's active line.  The system log/dmesg gets some
> warnings.

I think we really need to think about whether we eventually want to do
something to avoid modifying pages while IO is in progress. The only
alternative is for filesystems to make copies of everything in the IO path,
which is far from free (and obviously prevents using DMA for the whole
IO). The copy we do to avoid the same problem when checksums are enabled
shows up quite prominently in write-heavy profiles, so there's a "purely
postgres" reason to avoid these issues too.


> 2.  The crake-style failure doesn't involve any reported checksum
> failures or errors, and I'm not sure if another process is even
> involved.  I attach a complete syscall trace of a repro session.  (I
> tried to get strace to dump 8192 byte strings, but then it doesn't
> repro, so we have only the start of the data transferred for each
> page.)  Working back from the error message,
> 
> ERROR:  invalid page in block 78 of relation base/5/16384,
> 
> we have a page at offset 638976, and we can find all system calls that
> touched that offset:
> 
> [pid 26031] 23:26:48.521123 pwritev(50,
> [{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> iov_len=8192}], 1, 638976) = 8192
> 
> [pid 26040] 23:26:48.568975 pwrite64(5,
> "\0\0\0\0\0Nj\1\0\0\0\0\240\3\300\3\0 \4
> \0\0\0\0\340\2378\0\300\2378\0"..., 8192, 638976) = 8192
> 
> [pid 26040] 23:26:48.593157 pread64(6,
> "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
> 8192, 638976) = 8192
> 
> In between the write of non-zeros and the read of zeros, nothing seems
> to happen that could justify that, that I can grok, but perhaps
> someone else will see something that I'm missing.  We pretty much just
> have the parallel worker scanning the table, and writing stuff out as
> it does it.  This was obtained with:

Have you tried to write a reproducer for this that doesn't involve postgres?
It'd certainly be interesting to know the precise conditions for this. E.g.,
can this also happen without O_DIRECT, if cache pressure is high enough for
the page to get evicted soon after (potentially simulated with fadvise or
such)?

We should definitely let the btrfs folks know of this issue... It's possible
that this bug was recently introduced even. What kernel version did you repro
this on, Thomas?

I wonder if we should have a postgres-io-torture program in our tree for some
of these things. We've found issues with our assumptions on several operating
systems and filesystems, without systematically looking. Or even stressing IO
all that hard in our tests.

Greetings,

Andres Freund



Re: Direct I/O

From
Andrea Gelmini
Date:
On Mon, Apr 10, 2023 at 04:58 Andres Freund
<andres@anarazel.de> wrote:
> We should definitely let the btrfs folks know of this issue... It's possible
> that this bug was recently introduced even. What kernel version did you repro
> this on, Thomas?

These days on the BTRFS mailing list they are discussing Direct I/O data
corruption. There is no patch at the moment; they are still discussing how
to address it:
https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/

Ciao,
Gelma



Re: Direct I/O

From
Thomas Munro
Date:
On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> Have you tried to write a reproducer for this that doesn't involve postgres?

I tried a bit.  I'll try harder soon.

> ... What kernel version did you repro
> this on, Thomas?

Debian's 6.0.10-2 kernel (Debian 12 on a random laptop).  Here's how I
set up a test btrfs in case someone else wants a head start:

truncate -s2G 2GB.img
sudo losetup --show --find 2GB.img
sudo mkfs -t btrfs /dev/loop0 # the device name shown by losetup
sudo mkdir /mnt/tmp
sudo mount /dev/loop0 /mnt/tmp
sudo chown $(whoami) /mnt/tmp

cd /mnt/tmp
/path/to/initdb -D pgdata
... (see instructions further up for postgres command line + queries to run)



Re: Direct I/O

From
Thomas Munro
Date:
On Mon, Apr 10, 2023 at 7:27 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Debian's 6.0.10-2 kernel (Debian 12 on a random laptop).

Realising I hadn't updated for a bit, I did so and it still reproduces on:

$ uname -a
Linux x1 6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1
(2023-03-19) x86_64 GNU/Linux



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-10 19:27:27 +1200, Thomas Munro wrote:
> On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> > Have you tried to write a reproducer for this that doesn't involve postgres?
> 
> I tried a bit.  I'll try harder soon.
> 
> > ... What kernel version did you repro
> > this on, Thomas?
> 
> Debian's 6.0.10-2 kernel (Debian 12 on a random laptop).  Here's how I
> set up a test btrfs in case someone else wants a head start:
> 
> truncate -s2G 2GB.img
> sudo losetup --show --find 2GB.img
> sudo mkfs -t btrfs /dev/loop0 # the device name shown by losetup
> sudo mkdir /mnt/tmp
> sudo mount /dev/loop0 /mnt/tmp
> sudo chown $(whoami) /mnt/tmp
> 
> cd /mnt/tmp
> /path/to/initdb -D pgdata
> ... (see instructions further up for postgres command line + queries to run)

I initially failed to repro the issue with these instructions. Turns out that
the problem does not happen if huge pages are in use - I'd configured huge
pages, so the default huge_pages=try succeeded. As soon as I disable
huge_pages explicitly, it reproduces.

Another interesting bit is that if checksums are enabled, I also can't
reproduce the issue. Together with the huge_page issue, it does suggest that
this is somehow related to page faults. Which fits with the thread Andrea
referenced...

Greetings,

Andres Freund



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-10 18:55:26 -0700, Andres Freund wrote:
> On 2023-04-10 19:27:27 +1200, Thomas Munro wrote:
> > On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> > > Have you tried to write a reproducer for this that doesn't involve postgres?
> > 
> > I tried a bit.  I'll try harder soon.
> > 
> > > ... What kernel version did you repro
> > > this on, Thomas?
> > 
> > Debian's 6.0.10-2 kernel (Debian 12 on a random laptop).  Here's how I
> > set up a test btrfs in case someone else wants a head start:
> > 
> > truncate -s2G 2GB.img
> > sudo losetup --show --find 2GB.img
> > sudo mkfs -t btrfs /dev/loop0 # the device name shown by losetup
> > sudo mkdir /mnt/tmp
> > sudo mount /dev/loop0 /mnt/tmp
> > sudo chown $(whoami) /mnt/tmp
> > 
> > cd /mnt/tmp
> > /path/to/initdb -D pgdata
> > ... (see instructions further up for postgres command line + queries to run)
> 
> I initially failed to repro the issue with these instructions. Turns out that
> the problem does not happen if huge pages are in use - I'd configured huge
> pages, so the default huge_pages=try succeeded. As soon as I disable
> huge_pages explicitly, it reproduces.
> 
> Another interesting bit is that if checksums are enabled, I also can't
> reproduce the issue. Together with the huge_page issue, it does suggest that
> this is somehow related to page faults. Which fits with the thread Andrea
> referenced...

The last iteration of the fix that I could find is:
https://lore.kernel.org/linux-btrfs/20230328051957.1161316-1-hch@lst.de/T/#m1afdc3fe77e10a97302e7d80fed3efeaa297f0f7

And the fix has been merged into
https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next

I think that means it'll have to wait for 6.4 development to open (in a few
weeks), and then will be merged into the stable branches from there.

Greetings,

Andres Freund



Re: Direct I/O

From
Thomas Munro
Date:
On Tue, Apr 11, 2023 at 2:15 PM Andres Freund <andres@anarazel.de> wrote:
> And the fix has been merged into
> https://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux.git/log/?h=for-next
>
> I think that means it'll have to wait for 6.4 development to open (in a few
> weeks), and then will be merged into the stable branches from there.

Great!  Let's hope/assume for now that that'll fix phenomenon #2.
That still leaves the checksum-vs-concurrent-modification thing that I
called phenomenon #1, which we've not actually hit with PostgreSQL yet
but is clearly possible and can be seen with the stand-alone
repro-program I posted upthread.  You wrote:

On Mon, Apr 10, 2023 at 2:57 PM Andres Freund <andres@anarazel.de> wrote:
> I think we really need to think about whether we eventually want to do
> something to avoid modifying pages while IO is in progress. The only
> alternative is for filesystems to make copies of everything in the IO path,
> which is far from free (and obviously prevents using DMA for the whole
> IO). The copy we do to avoid the same problem when checksums are enabled
> shows up quite prominently in write-heavy profiles, so there's a "purely
> postgres" reason to avoid these issues too.

+1

I wonder what the other file systems that maintain checksums (see list
at [1]) do when the data changes underneath a write.  ZFS's policy is
conservative[2], while BTRFS took the demons-will-fly-out-of-your-nose
route.  I can see arguments for both approaches (ZFS can only reach
zero-copy optimum by turning off checksums completely, while BTRFS is
happy to assume that if you break this programming rule that is not
written down anywhere then you must never want to see your data ever
again).  What about ReFS?  CephFS?

I tried to find out what POSIX says about this WRT synchronous
pwrite() (as Tom suggested, maybe we're doing something POSIX doesn't
allow), but couldn't find it in my first attempt.  It *does* say it's
undefined for aio_write() (which means that my prototype
io_method=posix_aio code that uses that stuff is undefined in the
presence of hint bit modifications).  I don't really see why it should
vary between synchronous and asynchronous interfaces (considering the
existence of threads, shared memory etc, the synchronous interface
only removes one thread from the list of possible suspects that could
flip some bits).

But yeah, in any case, it doesn't seem great that we do that.

[1] https://en.wikipedia.org/wiki/Comparison_of_file_systems#Block_capabilities
[2] https://openzfs.topicbox.com/groups/developer/T950b02acdf392290/odirect-semantics-in-zfs



Re: Direct I/O

From
Thomas Munro
Date:
On Tue, Apr 11, 2023 at 2:31 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> I tried to find out what POSIX says about this

(But of course whatever it might say is of especially limited value
when O_DIRECT is in the picture, being completely unstandardised.
Really I guess all they meant was "if you *copy* something that's
moving, who knows which bits you'll copy"... not "your data might be
incinerated with lasers".)



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-09 16:40:54 -0700, Noah Misch wrote:
> On Sun, Apr 09, 2023 at 02:45:16PM -0700, Andres Freund wrote:
> > It's not *just* that scenario. With a few concurrent connections you can get
> > into problematic territory even with halfway reasonable shared buffers.
>
> I am not familiar with such cases.  You could get there with 64MB shared
> buffers and 256 simultaneous commits of new-refilenode-creating transactions,
> but I'd still file that under going out of one's way to use tiny shared
> buffers relative to the write activity.  What combination did you envision?

I'd not say it's common, but it's less crazy than running with 128kB of s_b...

There's also the issue that log_newpage_range() is used in a number of places
where we could have a lot of pre-existing buffer pins. So pinning another 64
buffers could tip us over.

Greetings,

Andres Freund



Re: Direct I/O

From
Christoph Berg
Date:
Hi,

I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC:  could not open file "pg_wal/000000010000000000000001": Invalid argument

16:21:16 Bailout called.  Further testing stopped:  pg_ctl start failed
16:21:16 t/004_io_direct.pl ..............
16:21:16 Dubious, test returned 255 (wstat 65280, 0xff00)
16:21:16 No subtests run
16:21:16
16:21:16 Test Summary Report
16:21:16 -------------------
16:21:16 t/004_io_direct.pl            (Wstat: 65280 (exited 255) Tests: 0 Failed: 0)
16:21:16   Non-zero exit status: 255
16:21:16   Parse errors: No plan found in TAP output
16:21:16 Files=4, Tests=65,  9 wallclock secs ( 0.03 usr  0.02 sys +  3.78 cusr  1.48 csys =  5.31 CPU)
16:21:16 Result: FAIL

16:21:16 ******** build/src/test/modules/test_misc/tmp_check/log/004_io_direct_main.log ********
16:21:16 2023-04-11 23:21:16.431 UTC [25991] LOG:  starting PostgreSQL 16devel (Debian 16~~devel-1.pgdg+~20230411.2256.gc03c2ea) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
16:21:16 2023-04-11 23:21:16.431 UTC [25991] LOG:  listening on Unix socket "/tmp/s0C4hWQq82/.s.PGSQL.54693"
16:21:16 2023-04-11 23:21:16.433 UTC [25994] LOG:  database system was shut down at 2023-04-11 23:21:16 UTC
16:21:16 2023-04-11 23:21:16.434 UTC [25994] PANIC:  could not open file "pg_wal/000000010000000000000001": Invalid argument
16:21:16 2023-04-11 23:21:16.525 UTC [25991] LOG:  startup process (PID 25994) was terminated by signal 6: Aborted
16:21:16 2023-04-11 23:21:16.525 UTC [25991] LOG:  aborting startup due to startup process failure
16:21:16 2023-04-11 23:21:16.526 UTC [25991] LOG:  database system is shut down

16:21:16 ******** build/src/test/modules/test_misc/tmp_check/t_004_io_direct_main_data/pgdata/core ********
16:21:17
16:21:17 warning: Can't open file /dev/shm/PostgreSQL.3457641370 during file-backed mapping note processing
16:21:17
16:21:17 warning: Can't open file /dev/shm/PostgreSQL.2391834648 during file-backed mapping note processing
16:21:17
16:21:17 warning: Can't open file /dev/zero (deleted) during file-backed mapping note processing
16:21:17
16:21:17 warning: Can't open file /SYSV00000dea (deleted) during file-backed mapping note processing
16:21:17 [New LWP 25994]
16:21:17 [Thread debugging using libthread_db enabled]
16:21:17 Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
16:21:17 Core was generated by `postgres: main: startup                                                       '.
16:21:17 Program terminated with signal SIGABRT, Aborted.
16:21:17 #0  0x00007f176c591ccc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
16:21:17 #0  0x00007f176c591ccc in ?? () from /lib/x86_64-linux-gnu/libc.so.6
16:21:17 No symbol table info available.
16:21:17 #1  0x00007f176c542ef2 in raise () from /lib/x86_64-linux-gnu/libc.so.6
16:21:17 No symbol table info available.
16:21:17 #2  0x00007f176c52d472 in abort () from /lib/x86_64-linux-gnu/libc.so.6
16:21:17 No symbol table info available.
16:21:17 #3  0x000055a7ba7978a1 in errfinish (filename=<optimized out>, lineno=<optimized out>, funcname=0x55a7ba810560 <__func__.47> "XLogFileInitInternal") at ./build/../src/backend/utils/error/elog.c:604
16:21:17         edata = 0x55a7baae3e20 <errordata>
16:21:17         elevel = 23
16:21:17         oldcontext = 0x55a7bb471590
16:21:17         econtext = 0x0
16:21:17         __func__ = "errfinish"
16:21:17 #4  0x000055a7ba21759c in XLogFileInitInternal (logsegno=1, logtli=logtli@entry=1, added=added@entry=0x7ffebc6c8a3f, path=path@entry=0x7ffebc6c8a40 "pg_wal/00000001", '0' <repeats 15 times>, "1") at ./build/../src/backend/access/transam/xlog.c:2944
16:21:17         __errno_location = <optimized out>
16:21:17         tmppath =
"0\214l\274\376\177\000\000\321\330~\272\247U\000\000\005Q\223\272\247U\000\000p\214l\274\376\177\000\000`\214l\274\376\177\000\000\212\335~\000\v",
'\000'<repeats 31 times>,
"\247U\000\000\000\000\000\000\000\177\000\000*O\202\272\247U\000\000\254\206l\274\376\177\000\000\000\000\000\000\v",
'\000'<repeats 23 times>,
"0\000\000\000\000\000\000\000\247U\000\000\000\000\000\000\000\000\000\000\001Q\223\272\247U\000\000\240\215l\274\376\177\000\000\376\377\377\377\000\000\000\0000\207l\274\376\177\000\000[\326~\272\247U\000\0000\207l\274\376\177\000\000"...
16:21:17         installed_segno = 0
16:21:17         max_segno = <optimized out>
16:21:17         fd = <optimized out>
16:21:17         save_errno = <optimized out>
16:21:17         open_flags = 194
16:21:17         __func__ = "XLogFileInitInternal"
16:21:17 #5  0x000055a7ba35a1d5 in XLogFileInit (logsegno=<optimized out>, logtli=logtli@entry=1) at
./build/../src/backend/access/transam/xlog.c:3099
16:21:17         ignore_added = false
16:21:17         path = "pg_wal/00000001", '0' <repeats 15 times>,
"1\000\220\312P\273\247U\000\000/\375Yl\027\177\000\000\220\252P\273\247U\000\000\001",'\000' <repeats 15 times>,
"\220\252P\273\247U\000\000\300\212l\274\376\177\000\000\002\261{\272\247U\000\000\220\252P\273\247U\000\000\220\252P\273\247U\000\000\001",
'\000'<repeats 15 times>,
"\340\212l\274\376\177\000\000\021\032|\272\247U\000\000\000\000\000\000\000\000\000\000\240\312P\273\247U\000\0000\213l\274\376\177\000\000\350\262{\272\247U\000\000\001",
'\000'<repeats 16 times>, "\256\023i\027\177\000\000"...
 
16:21:17         fd = <optimized out>
16:21:17         __func__ = "XLogFileInit"
16:21:17 #6  0x000055a7ba35bab3 in XLogWrite (WriteRqst=..., tli=tli@entry=1, flexible=flexible@entry=false) at
./build/../src/backend/access/transam/xlog.c:2137
16:21:17         EndPtr = 21954560
16:21:17         ispartialpage = true
16:21:17         last_iteration = <optimized out>
16:21:17         finishing_seg = <optimized out>
16:21:17         curridx = 7
16:21:17         npages = 0
16:21:17         startidx = 0
16:21:17         startoffset = 0
16:21:17         __func__ = "XLogWrite"
16:21:17 #7  0x000055a7ba35c8e0 in XLogFlush (record=21949600) at ./build/../src/backend/access/transam/xlog.c:2638
16:21:17         insertpos = 21949600
16:21:17         WriteRqstPtr = 21949600
16:21:17         WriteRqst = <optimized out>
16:21:17         insertTLI = 1
16:21:17         __func__ = "XLogFlush"
16:21:17 #8  0x000055a7ba36118e in XLogReportParameters () at ./build/../src/backend/access/transam/xlog.c:7620
16:21:17         xlrec = {MaxConnections = 100, max_worker_processes = 8, max_wal_senders = 0, max_prepared_xacts = 0, max_locks_per_xact = 64, wal_level = 1, wal_log_hints = false, track_commit_timestamp = false}
16:21:17         recptr = <optimized out>
16:21:17 #9  StartupXLOG () at ./build/../src/backend/access/transam/xlog.c:5726
16:21:17         Insert = <optimized out>
16:21:17         checkPoint = <optimized out>
16:21:17         wasShutdown = true
16:21:17         didCrash = <optimized out>
16:21:17         haveTblspcMap = false
16:21:17         haveBackupLabel = false
16:21:17         EndOfLog = 21949544
16:21:17         EndOfLogTLI = <optimized out>
16:21:17         newTLI = 1
16:21:17         performedWalRecovery = <optimized out>
16:21:17         endOfRecoveryInfo = <optimized out>
16:21:17         abortedRecPtr = <optimized out>
16:21:17         missingContrecPtr = 0
16:21:17         oldestActiveXID = <optimized out>
16:21:17         promoted = false
16:21:17         __func__ = "StartupXLOG"
16:21:17 #10 0x000055a7ba5b4d00 in StartupProcessMain () at ./build/../src/backend/postmaster/startup.c:267
16:21:17 No locals.
16:21:17 #11 0x000055a7ba5ab0cf in AuxiliaryProcessMain (auxtype=auxtype@entry=StartupProcess) at
./build/../src/backend/postmaster/auxprocess.c:141
16:21:17         __func__ = "AuxiliaryProcessMain"
16:21:17 #12 0x000055a7ba5b0aa3 in StartChildProcess (type=StartupProcess) at
./build/../src/backend/postmaster/postmaster.c:5369
16:21:17         pid = <optimized out>
16:21:17         __func__ = "StartChildProcess"
16:21:17         save_errno = <optimized out>
16:21:17         __errno_location = <optimized out>
16:21:17         __errno_location = <optimized out>
16:21:17         __errno_location = <optimized out>
16:21:17         __errno_location = <optimized out>
16:21:17         __errno_location = <optimized out>
16:21:17         __errno_location = <optimized out>
16:21:17         __errno_location = <optimized out>
16:21:17 #13 0x000055a7ba5b45d6 in PostmasterMain (argc=argc@entry=4, argv=argv@entry=0x55a7bb471450) at
./build/../src/backend/postmaster/postmaster.c:1455
16:21:17         opt = <optimized out>
16:21:17         status = <optimized out>
16:21:17         userDoption = <optimized out>
16:21:17         listen_addr_saved = <optimized out>
16:21:17         i = <optimized out>
16:21:17         output_config_variable = <optimized out>
16:21:17         __func__ = "PostmasterMain"
16:21:17 #14 0x000055a7ba29fd62 in main (argc=4, argv=0x55a7bb471450) at ./build/../src/backend/main/main.c:200
16:21:17         do_check_root = <optimized out>

Apologies if this was already reported elsewhere in the thread; I
skimmed it, but the problems looked different.

Christoph



Re: Direct I/O

From
Thomas Munro
Date:
On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote:
> I'm hitting a panic in t_004_io_direct. The build is running on
> overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
> combination but has worked well for building everything over the last
> decade. On Debian unstable:
>
> PANIC:  could not open file "pg_wal/000000010000000000000001": Invalid argument

Hi Christoph,

That's an interesting one.  I was half expecting to see that on some
unusual systems, which is why I made the test check which OS it is and
exclude those that are known to fail with EINVAL or ENOTSUPP on their
common/typical file systems.  But if it's going to be Linux, that's
not going to work.  I have a new idea:  perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test.  Looking into that now.
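
(For illustration, the pre-flight idea expressed in C -- the test
itself will do this from Perl, but at the system-call level the check
is just this; the file name is made up:)

    /* Sketch: can this filesystem open a file with O_DIRECT at all?
     * On tmpfs, expect the EINVAL shown above. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int
    main(void)
    {
    #ifdef O_DIRECT
        int         fd = open("o_direct_probe.tmp",
                              O_RDWR | O_CREAT | O_DIRECT, 0600);

        if (fd < 0)
        {
            printf("skip: open with O_DIRECT failed: %s\n",
                   strerror(errno));
            return 1;
        }
        close(fd);
        unlink("o_direct_probe.tmp");
        printf("O_DIRECT looks usable here\n");
        return 0;
    #else
        printf("skip: no O_DIRECT on this platform\n");
        return 0;
    #endif
    }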



Re: Direct I/O

From
Thomas Munro
Date:
On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote:
> > I'm hitting a panic in t_004_io_direct. The build is running on
> > overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
> > combination but has worked well for building everything over the last
> > decade. On Debian unstable:
> >
> > PANIC:  could not open file "pg_wal/000000010000000000000001": Invalid argument

> ... I have a new idea:  perhaps it is possible to try
> to open a file with O_DIRECT from perl, and if it fails like that,
> skip the test.  Looking into that now.

I think I have that working OK.  Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)?  I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself.  And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?

I tested this on OpenBSD, which has no O_DIRECT, so that tests the
first reason to skip.

Does it skip OK on your system, for the second reason?  Should we be
more specific about the errno?

As far as I know, the only systems on the build farm that should be
affected by this change are the illumos boxen (they have O_DIRECT,
unlike Solaris, but perl's $^O couldn't tell the difference between
Solaris and illumos, so they didn't previously run this test).

One thing I resisted the urge to do is invent PG_TEST_SKIP, a sort of
anti-PG_TEST_EXTRA.  I think I'd rather hear about it if there is a
system out there that passes the pre-flight check, but fails later on,
because we'd better investigate why.  That's basically the point of
shipping this "developer only" feature long before serious use of it.


Re: Direct I/O

From
Thomas Munro
Date:
On Wed, Apr 12, 2023 at 5:48 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> > On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote:
> > > I'm hitting a panic in t_004_io_direct. The build is running on
> > > overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
> > > combination but has worked well for building everything over the last
> > > decade. On Debian unstable:

After trying a couple of things and doing some googling, it looks like
it's tmpfs that rejects it, not overlayfs, so I'd adjust that commit
message slightly.  Of course it's a completely reasonable thing to
expect the tests to pass (or in this case be skipped) in a tmpfs, eg
/tmp on some distributions.  (It's strange to contemplate what
O_DIRECT means for tmpfs, considering that it *is* the page cache,
kinda, and I see people have been arguing about that for a couple of
decades since O_DIRECT was added to Linux; doesn't seem that helpful
to me that it rejects it, but 🤷).



Re: Direct I/O

From
Andrew Dunstan
Date:


On 2023-04-12 We 01:48, Thomas Munro wrote:
On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro <thomas.munro@gmail.com> wrote:
On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg <myon@debian.org> wrote:
I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC:  could not open file "pg_wal/000000010000000000000001": Invalid argument
... I have a new idea:  perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test.  Looking into that now.
I think I have that working OK.  Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)?  I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself.  And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?


I think you can probably replace a lot of the magic here by simply saying


if (Fcntl->can("O_DIRECT")) ...


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Direct I/O

From
Dagfinn Ilmari Mannsåker
Date:
Thomas Munro <thomas.munro@gmail.com> writes:

> I think I have that working OK.  Any Perl hackers want to comment on
> my use of IO::File (copied from examples on the internet that showed
> how to use O_DIRECT)?  I am not much of a perl hacker but according to
> my package manager, IO/File.pm came with perl itself.

Indeed, and it has been since perl 5.003_07, released in 1996.  And Fcntl
has known about O_DIRECT since perl 5.6.0, released in 2000, so we can
safely use both.

> And the Fcntl eval trick that I copied from File::stat, and the
> perl-critic suppression that requires?
[…]
> +    no strict 'refs';    ## no critic (ProhibitNoStrict)
> +    my $val = eval { &{'Fcntl::O_DIRECT'} };
> +    if (defined $val)

This trick is only needed in File::stat because it's constructing the
symbol name dynamically.  And because Fcntl by default exports all the
O_* and F_* constants it knows about, we can simply do:

       if (defined &O_DIRECT)
> +    {
> +        use Fcntl qw(O_DIRECT);

The `use Fcntl;` above will already have imported this, so this is
redundant.

- ilmari



Re: Direct I/O

From
Dagfinn Ilmari Mannsåker
Date:
Andrew Dunstan <andrew@dunslane.net> writes:

> On 2023-04-12 We 01:48, Thomas Munro wrote:
>> On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com>  wrote:
>>> On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org>  wrote:
>>>> I'm hitting a panic in t_004_io_direct. The build is running on
>>>> overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
>>>> combination but has worked well for building everything over the last
>>>> decade. On Debian unstable:
>>>>
>>>> PANIC:  could not open file "pg_wal/000000010000000000000001": Invalid argument
>>> ... I have a new idea:  perhaps it is possible to try
>>> to open a file with O_DIRECT from perl, and if it fails like that,
>>> skip the test.  Looking into that now.
>> I think I have that working OK.  Any Perl hackers want to comment on
>> my use of IO::File (copied from examples on the internet that showed
>> how to use O_DIRECT)?  I am not much of a perl hacker but according to
>> my package manager, IO/File.pm came with perl itself.  And the Fcntl
>> eval trick that I copied from File::stat, and the perl-critic
>> suppression that requires?
>
>
> I think you can probably replace a lot of the magic here by simply saying
>
>
> if (Fcntl->can("O_DIRECT")) ...

Fcntl->can() is true for all constants that Fcntl knows about, whether
or not they are defined for your OS. `defined &O_DIRECT` is the simplest
check, see my other reply to Thomas.

> cheers
>
>
> andrew

- ilmari



Re: Direct I/O

From
Andrew Dunstan
Date:


On 2023-04-12 We 10:23, Dagfinn Ilmari Mannsåker wrote:
Andrew Dunstan <andrew@dunslane.net> writes:

On 2023-04-12 We 01:48, Thomas Munro wrote:
On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com>  wrote:
On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org>  wrote:
I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC:  could not open file "pg_wal/000000010000000000000001": Invalid argument
... I have a new idea:  perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test.  Looking into that now.
I think I have that working OK.  Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)?  I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself.  And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?

I think you can probably replace a lot of the magic here by simply saying


if (Fcntl->can("O_DIRECT")) ...
Fcntl->can() is true for all constants that Fcntl knows about, whether
or not they are defined for your OS. `defined &O_DIRECT` is the simplest
check, see my other reply to Thomas.



My understanding was that Fcntl only exported constants known to the OS. That's certainly what its docco suggests, e.g.:

    By default your system's F_* and O_* constants (eg, F_DUPFD and O_CREAT)
    and the FD_CLOEXEC constant are exported into your namespace.


cheers


andrew




--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Direct I/O

From
Dagfinn Ilmari Mannsåker
Date:
Andrew Dunstan <andrew@dunslane.net> writes:

> On 2023-04-12 We 10:23, Dagfinn Ilmari Mannsåker wrote:
>> Andrew Dunstan<andrew@dunslane.net>  writes:
>>
>>> On 2023-04-12 We 01:48, Thomas Munro wrote:
>>>> On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com>   wrote:
>>>>> On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org>   wrote:
>>>>>> I'm hitting a panic in t_004_io_direct. The build is running on
>>>>>> overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
>>>>>> combination but has worked well for building everything over the last
>>>>>> decade. On Debian unstable:
>>>>>>
>>>>>> PANIC:  could not open file "pg_wal/000000010000000000000001": Invalid argument
>>>>> ... I have a new idea:  perhaps it is possible to try
>>>>> to open a file with O_DIRECT from perl, and if it fails like that,
>>>>> skip the test.  Looking into that now.
>>>> I think I have that working OK.  Any Perl hackers want to comment on
>>>> my use of IO::File (copied from examples on the internet that showed
>>>> how to use O_DIRECT)?  I am not much of a perl hacker but according to
>>>> my package manager, IO/File.pm came with perl itself.  And the Fcntl
>>>> eval trick that I copied from File::stat, and the perl-critic
>>>> suppression that requires?
>>>
>>> I think you can probably replace a lot of the magic here by simply saying
>>>
>>>
>>> if (Fcntl->can("O_DIRECT")) ...
>> Fcntl->can() is true for all constants that Fcntl knows about, whether
>> or not they are defined for your OS. `defined &O_DIRECT` is the simplest
>> check, see my other reply to Thomas.
>>
>>
>
> My understanding was that Fcntl only exported constants known to the
> OS. That's certainly what its docco suggests, e.g.:
>
>     By default your system's F_* and O_* constants (eg, F_DUPFD and
> O_CREAT)
>     and the FD_CLOEXEC constant are exported into your namespace.

It's a bit more magical than that (this is Perl after all).  They are
all exported (which implicitly creates stubs visible to `->can()`,
similarly to forward declarations like `sub O_FOO;`), but only the
defined ones (`#ifdef O_FOO` is true) are defined (`defined &O_FOO` is
true).  The rest fall through to an AUTOLOAD¹ function that throws an
exception for undefined ones.

Here's an example (Fcntl knows O_RAW, but Linux does not define it):

    $ perl -E '
        use strict; use Fcntl;
        say "can", main->can("O_RAW") ? "" : "not";
        say defined &O_RAW ? "" : "not ", "defined";
        say O_RAW;'
    can
    not defined
    Your vendor has not defined Fcntl macro O_RAW, used at -e line 4

While O_DIRECT is defined:

    $ perl -E '
        use strict; use Fcntl;
        say "can", main->can("O_DIRECT") ? "" : "not";
        say defined &O_DIRECT ? "" : "not ", "defined";
        say O_DIRECT;'
    can
    defined
    16384

And O_FOO is unknown to Fcntl (the parens on `O_FOO()` are to make it
not a bareword, which would be a compile error under `use strict;` when
the sub doesn't exist at all):

    $ perl -E '
        use strict; use Fcntl;
        say "can", main->can("O_FOO") ? "" : "not";
        say defined &O_FOO ? "" : "not ", "defined";
        say O_FOO();'
    cannot
    not defined
    Undefined subroutine &main::O_FOO called at -e line 4.

> cheers
>
>
> andrew

- ilmari

[1] https://perldoc.perl.org/perlsub#Autoloading



Re: Direct I/O

From
Andrew Dunstan
Date:


On 2023-04-12 We 12:38, Dagfinn Ilmari Mannsåker wrote:
Andrew Dunstan <andrew@dunslane.net> writes:

On 2023-04-12 We 10:23, Dagfinn Ilmari Mannsåker wrote:
Andrew Dunstan<andrew@dunslane.net>  writes:

On 2023-04-12 We 01:48, Thomas Munro wrote:
On Wed, Apr 12, 2023 at 3:04 PM Thomas Munro<thomas.munro@gmail.com>   wrote:
On Wed, Apr 12, 2023 at 2:56 PM Christoph Berg<myon@debian.org>   wrote:
I'm hitting a panic in t_004_io_direct. The build is running on
overlayfs on tmpfs/ext4 (upper/lower) which is probably a weird
combination but has worked well for building everything over the last
decade. On Debian unstable:

PANIC:  could not open file "pg_wal/000000010000000000000001": Invalid argument
... I have a new idea:  perhaps it is possible to try
to open a file with O_DIRECT from perl, and if it fails like that,
skip the test.  Looking into that now.
I think I have that working OK.  Any Perl hackers want to comment on
my use of IO::File (copied from examples on the internet that showed
how to use O_DIRECT)?  I am not much of a perl hacker but according to
my package manager, IO/File.pm came with perl itself.  And the Fcntl
eval trick that I copied from File::stat, and the perl-critic
suppression that requires?
I think you can probably replace a lot of the magic here by simply saying


if (Fcntl->can("O_DIRECT")) ...
Fcntl->can() is true for all constants that Fcntl knows about, whether
or not they are defined for your OS. `defined &O_DIRECT` is the simplest
check, see my other reply to Thomas.


My understanding was that Fcntl only exported constants known to the
OS. That's certainly what its docco suggests, e.g.:

    By default your system's F_* and O_* constants (eg, F_DUPFD and
O_CREAT)
    and the FD_CLOEXEC constant are exported into your namespace.
It's a bit more magical than that (this is Perl after all).  They are
all exported (which implicitly creates stubs visible to `->can()`,
similarly to forward declarations like `sub O_FOO;`), but only the
defined ones (`#ifdef O_FOO` is true) are defined (`defined &O_FOO` is
true).  The rest fall through to an AUTOLOAD¹ function that throws an
exception for undefined ones.

Here's an example (Fcntl knows O_RAW, but Linux does not define it):

    $ perl -E '
        use strict; use Fcntl;
        say "can", main->can("O_RAW") ? "" : "not";
        say defined &O_RAW ? "" : "not ", "defined";
        say O_RAW;'
    can
    not defined
    Your vendor has not defined Fcntl macro O_RAW, used at -e line 4

While O_DIRECT is defined:

    $ perl -E '
        use strict; use Fcntl;
        say "can", main->can("O_DIRECT") ? "" : "not";
        say defined &O_DIRECT ? "" : "not ", "defined";
        say O_DIRECT;'
    can
    defined
    16384

And O_FOO is unknown to Fcntl (the parens on `O_FOO()` are to make it
not a bareword, which would be a compile error under `use strict;` when
the sub doesn't exist at all):

    $ perl -E '
        use strict; use Fcntl;
        say "can", main->can("O_FOO") ? "" : "not";
        say defined &O_FOO ? "" : "not ", "defined";
        say O_FOO();'
    cannot
    not defined
    Undefined subroutine &main::O_FOO called at -e line 4.



*grumble* a bit too magical for my taste. Thanks for the correction.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Direct I/O

From
Thomas Munro
Date:
Thanks both for looking, and thanks for the explanation Ilmari.
Pushed with your improvements.  The 4 CI systems run the tests
(Windows and Mac by special always-expected-to-work case, Linux and
FreeBSD by successful pre-flight perl test of O_DIRECT), and I also
tested three unusual systems, two that skip for different reasons and
one that will henceforth run this test on the build farm so I wanted
to confirm locally first:

Linux/tmpfs: 1..0 # SKIP pre-flight test if we can open a file with
O_DIRECT failed: Invalid argument
OpenBSD:     t/004_io_direct.pl .............. skipped: no O_DIRECT
illumos:     t/004_io_direct.pl .............. ok

(Format different because those last two are autoconf, no meson on my
collection of Vagrant images yet...)



Re: Direct I/O

From
Christoph Berg
Date:
Re: Thomas Munro
> Linux/tmpfs: 1..0 # SKIP pre-flight test if we can open a file with
> O_DIRECT failed: Invalid argument

I confirm it's working now:

t/004_io_direct.pl .............. skipped: pre-flight test if we can open a file with O_DIRECT failed: Invalid argument
All tests successful.

Thanks,
Christoph



Re: Direct I/O

From
Tom Lane
Date:
Since the direct I/O commit went in, buildfarm animals
curculio and morepork have been issuing warnings like

hashpage.c: In function '_hash_expandtable':
hashpage.c:995: warning: ignoring alignment for stack allocated 'zerobuf'

in places where there's a local variable of type PGIOAlignedBlock
or PGAlignedXLogBlock.  I'm not sure why only those two animals
are unhappy, but I think they have a point: typical ABIs don't
guarantee alignment of function stack frames to better than
16 bytes or so.  In principle the compiler could support a 4K
alignment request anyway by doing the equivalent of alloca(3),
but I do not think we can count on that to happen.
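
For reference, the type in question looks roughly like this (from
memory, abbreviated -- see c.h for the real definition; int64 is the
usual 64-bit int typedef):

    typedef union PGIOAlignedBlock
    {
    #ifdef pg_attribute_aligned
        pg_attribute_aligned(PG_IO_ALIGN_SIZE)
    #endif
        char    data[BLCKSZ];
        double  force_align_d;
        int64   force_align_i64;
    } PGIOAlignedBlock;

The warning is saying that the compiler will silently ignore that
alignment request for stack-allocated variables of the type.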

            regards, tom lane



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-14 13:21:33 -0400, Tom Lane wrote:
> Since the direct I/O commit went in, buildfarm animals
> curculio and morepork have been issuing warnings like
> 
> hashpage.c: In function '_hash_expandtable':
> hashpage.c:995: warning: ignoring alignment for stack allocated 'zerobuf'
> 
> in places where there's a local variable of type PGIOAlignedBlock
> or PGAlignedXLogBlock.  I'm not sure why only those two animals
> are unhappy, but I think they have a point: typical ABIs don't
> guarantee alignment of function stack frames to better than
> 16 bytes or so.  In principle the compiler could support a 4K
> alignment request anyway by doing the equivalent of alloca(3),
> but I do not think we can count on that to happen.

Hm. New-ish compilers seem to be ok with it.  Perhaps we should have a
configure check whether the compiler is OK with that, and disable direct IO
support if not?

Greetings,

Andres Freund



Re: Direct I/O

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-04-14 13:21:33 -0400, Tom Lane wrote:
>> ...  I'm not sure why only those two animals
>> are unhappy, but I think they have a point: typical ABIs don't
>> guarantee alignment of function stack frames to better than
>> 16 bytes or so.  In principle the compiler could support a 4K
>> alignment request anyway by doing the equivalent of alloca(3),
>> but I do not think we can count on that to happen.

> Hm. New-ish compilers seem to be ok with it.

Oh!  I was misled by the buildfarm label on morepork, which claims
it's running clang 10.0.1.  But actually, per its configure report,
it's running

    configure: using compiler=gcc (GCC) 4.2.1 20070719 

which is the same as curculio.  So that explains why nothing else is
complaining.  I agree we needn't let 15-year-old compilers force us
into the mess that would be entailed by not treating these variables
as simple locals.

> Perhaps we should have a
> configure check whether the compiler is OK with that, and disable direct IO
> support if not?

+1 for that, though.  (Also, the fact that these animals aren't
actually failing suggests that 004_io_direct.pl needs expansion.)

            regards, tom lane



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-14 15:21:18 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-04-14 13:21:33 -0400, Tom Lane wrote:
> >> ...  I'm not sure why only those two animals
> >> are unhappy, but I think they have a point: typical ABIs don't
> >> guarantee alignment of function stack frames to better than
> >> 16 bytes or so.  In principle the compiler could support a 4K
> >> alignment request anyway by doing the equivalent of alloca(3),
> >> but I do not think we can count on that to happen.
> 
> > Hm. New-ish compilers seem to be ok with it.
> 
> Oh!  I was misled by the buildfarm label on morepork, which claims
> it's running clang 10.0.1.  But actually, per its configure report,
> it's running
> 
>     configure: using compiler=gcc (GCC) 4.2.1 20070719 

Huh. I wonder if that was an accident in the BF setup.


> > Perhaps we should have a
> > configure check whether the compiler is OK with that, and disable direct IO
> > support if not?
> 
> +1 for that, though.  (Also, the fact that these animals aren't
> actually failing suggests that 004_io_direct.pl needs expansion.)

It's skipped, due to lack of O_DIRECT:
[20:50:22] t/004_io_direct.pl .............. skipped: no O_DIRECT

So perhaps we don't even need a configure test, just a bit of ifdef'ery? It's
a bit annoying structurally, because the PG*Aligned structs are defined in
c.h, but the different ways of spelling O_DIRECT are dealt with in fd.h.

I wonder if we should try to move those structs to fd.h as well...

Greetings,

Andres Freund



Re: Direct I/O

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2023-04-14 15:21:18 -0400, Tom Lane wrote:
>> +1 for that, though.  (Also, the fact that these animals aren't
>> actually failing suggests that 004_io_direct.pl needs expansion.)

> It's skipped, due to lack of O_DIRECT:
> [20:50:22] t/004_io_direct.pl .............. skipped: no O_DIRECT

Hmm, I'd say that might be just luck.  Whether the compiler honors weird
alignment of locals seems independent of whether the OS has O_DIRECT.

> So perhaps we don't even need a configure test, just a bit of ifdef'ery? It's
> a bit annoying structurally, because the PG*Aligned structs are defined in
> c.h, but the different ways of spelling O_DIRECT are dealt with in fd.h.

> I wonder if we should try to move those structs to fd.h as well...

I doubt they belong in c.h, so that could be plausible; except
I'm not convinced that testing O_DIRECT is sufficient.

            regards, tom lane



Re: Direct I/O

From
Mikael Kjellström
Date:

On 2023-04-14 21:33, Andres Freund wrote:

>> Oh!  I was misled by the buildfarm label on morepork, which claims
>> it's running clang 10.0.1.  But actually, per its configure report,
>> it's running
>>
>>     configure: using compiler=gcc (GCC) 4.2.1 20070719
> 
> Huh. I wonder if that was an accident in the BF setup.

It might have been when I reinstalled it a while ago.

I have the following gcc and clang installed:

openbsd_6_9-pgbf$ gcc --version
gcc (GCC) 4.2.1 20070719
Copyright (C) 2007 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

openbsd_6_9-pgbf$ clang --version
OpenBSD clang version 10.0.1
Target: amd64-unknown-openbsd6.9
Thread model: posix
InstalledDir: /usr/bin

want me to switch to clang instead?

/Mikael



Re: Direct I/O

From
Thomas Munro
Date:
On Sat, Apr 15, 2023 at 7:38 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2023-04-14 15:21:18 -0400, Tom Lane wrote:
> >> +1 for that, though.  (Also, the fact that these animals aren't
> >> actually failing suggests that 004_io_direct.pl needs expansion.)
>
> > It's skipped, due to lack of O_DIRECT:
> > [20:50:22] t/004_io_direct.pl .............. skipped: no O_DIRECT
>
> Hmm, I'd say that might be just luck.  Whether the compiler honors weird
> alignment of locals seems independent of whether the OS has O_DIRECT.
>
> > So perhaps we don't even need a configure test, just a bit of ifdef'ery? It's
> > a bit annoying structurally, because the PG*Aligned structs are defined in
> > c.h, but the different ways of spelling O_DIRECT are dealt with in fd.h.
>
> > I wonder if we should try to move those structs to fd.h as well...
>
> I doubt they belong in c.h, so that could be plausible; except
> I'm not convinced that testing O_DIRECT is sufficient.

As far as I can tell, the failure to honour large alignment attributes
(even though the compiler passes our configure check saying you can do
that) was considered approximately a bug[1], or at least a thing to be
improved, in fairly old GCC times, but the fix wasn't back-patched
that far.  Unfortunately the projects that were allergic to the GPL3
change but wanted to ship a compiler (or some motivation related to
that) got stuck on 4.2 for a while before they flipped to Clang (as
OpenBSD has now done).  It seems hard to get excited about doing
anything about that on our side, and those systems are also spewing
other warnings.  But if we're going to do it, it looks like the right
place would indeed be a new compiler check that the attribute exists
*and* generates no warnings with alignment > 16, something like that?
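
For illustration, the probe program such a check might compile with
warnings promoted to errors could look something like this (an
untested sketch, names mine):

typedef union
{
    char    data[8192];
    double  force_align;
} probe_buffer __attribute__((aligned(4096)));

int
main(void)
{
    probe_buffer b;

    b.data[0] = 0;
    return b.data[0];
}

If that produces "ignoring alignment for stack allocated" (or any
other warning), the check fails and we could disable direct I/O
support.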

[1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=16660



Re: Direct I/O

From
Thomas Munro
Date:
On Sat, Apr 15, 2023 at 7:50 AM Mikael Kjellström
<mikael.kjellstrom@gmail.com> wrote:
> want me to switch to clang instead?

I vote +1, that's the system compiler in modern OpenBSD.

https://www.cambus.net/the-state-of-toolchains-in-openbsd/

As for curculio, I don't understand the motivation for maintaining
that machine.  I'd rather know if OpenBSD 7.3 works.



Re: Direct I/O

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> On Sat, Apr 15, 2023 at 7:50 AM Mikael Kjellström
> <mikael.kjellstrom@gmail.com> wrote:
>> want me to switch to clang instead?

> I vote +1, that's the system compiler in modern OpenBSD.

Ditto, we need coverage of that.

> As for curculio, I don't understand the motivation for maintaining
> that machine.  I'd rather know if OpenBSD 7.3 works.

Those aren't necessarily mutually exclusive :-).  But I do agree
that recent OpenBSD is more important to cover than ancient OpenBSD.

            regards, tom lane



Re: Direct I/O

From
Mikael Kjellström
Date:

On 2023-04-15 05:22, Tom Lane wrote:
> Thomas Munro <thomas.munro@gmail.com> writes:
>> On Sat, Apr 15, 2023 at 7:50 AM Mikael Kjellström
>> <mikael.kjellstrom@gmail.com> wrote:
>>> want me to switch to clang instead?
> 
>> I vote +1, that's the system compiler in modern OpenBSD.
> 
> Ditto, we need coverage of that.

OK. I switched to clang on morepork now.


>> As for curculio, I don't understand the motivation for maintaining
>> that machine.  I'd rather know if OpenBSD 7.3 works.
> 
> Those aren't necessarily mutually exclusive :-).  But I do agree
> that recent OpenBSD is more important to cover than ancient OpenBSD.

So do you want me to switch that machine to OpenBSD 7.3 instead?

/Mikael





Re: Direct I/O

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> As far as I can tell, the failure to honour large alignment attributes
> (even though the compiler passes our configure check saying you can do
> that) was considered approximately a bug[1], or at least a thing to be
> improved, in fairly old GCC times, but the fix wasn't back-patched
> that far.  Unfortunately the projects that were allergic to the GPL3
> change but wanted to ship a compiler (or some motivation related to
> that) got stuck on 4.2 for a while before they flipped to Clang (as
> OpenBSD has now done).  It seems hard to get excited about doing
> anything about that on our side, and those systems are also spewing
> other warnings.  But if we're going to do it, it looks like the right
> place would indeed be a new compiler check that the attribute exists
> *and* generates no warnings with alignment > 16, something like that?

I tested this by building gcc 4.2.1 from source on modern Linux
(which was a bit more painful than it ought to be, perhaps)
and building PG with that.  It generates no warnings, but nonetheless
gets an exception in CREATE DATABASE:

#2  0x0000000000b64522 in ExceptionalCondition (
    conditionName=0xd4fca0 "(uintptr_t) buffer == TYPEALIGN(PG_IO_ALIGN_SIZE, buffer)",
    fileName=0xd4fbe0 "md.c", lineNumber=468) at assert.c:66
#3  0x00000000009a6b53 in mdextend (reln=0x1dcaf68, forknum=MAIN_FORKNUM,
    blocknum=18, buffer=0x7ffcaf8e1af0, skipFsync=true) at md.c:468
#4  0x00000000009a9075 in smgrextend (reln=0x1dcaf68, forknum=MAIN_FORKNUM,
    blocknum=18, buffer=0x7ffcaf8e1af0, skipFsync=true) at smgr.c:500
#5  0x000000000096739c in RelationCopyStorageUsingBuffer (srclocator=...,
    dstlocator=..., forkNum=MAIN_FORKNUM, permanent=true) at bufmgr.c:4286
#6  0x0000000000967584 in CreateAndCopyRelationData (src_rlocator=...,
    dst_rlocator=..., permanent=true) at bufmgr.c:4361
#7  0x000000000068898e in CreateDatabaseUsingWalLog (src_dboid=1,
    dst_dboid=24576, src_tsid=1663, dst_tsid=1663) at dbcommands.c:217
#8  0x000000000068b594 in createdb (pstate=0x1d4a6a8, stmt=0x1d20ec8)
    at dbcommands.c:1441

Sure enough, that buffer is a stack local in
RelationCopyStorageUsingBuffer, and it's visibly got a
not-very-well-aligned address.

So apparently, the fact that you even get a warning about the
alignment not being honored is something OpenBSD patched in
after-the-fact; it's not there in genuine vintage gcc.

I get the impression that we are going to need an actual runtime
test if we want to defend against this.  Not entirely convinced
it's worth the trouble.  Who, other than our deliberately rear-guard
buildfarm animals, is going to be building modern PG with such old
compilers?  (And more especially to the point, on platforms new
enough to have working O_DIRECT?)

At this point I agree with Andres that it'd be good enough to
silence the warning by getting rid of these alignment pragmas
when the platform lacks O_DIRECT.

            regards, tom lane

PS: I don't quite understand how it managed to get through initdb
when CREATE DATABASE doesn't work.  Maybe there is a different
code path taken in standalone mode?



Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 16, 2023 at 6:19 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> So apparently, the fact that you even get a warning about the
> alignment not being honored is something OpenBSD patched in
> after-the-fact; it's not there in genuine vintage gcc.

Ah, that is an interesting discovery, and indeed kills the configure check idea.

> At this point I agree with Andres that it'd be good enough to
> silence the warning by getting rid of these alignment pragmas
> when the platform lacks O_DIRECT.

Hmm.  My preferred choice would be: accept Mikael's kind offer to
upgrade curculio to a live version, forget about GCC 4.2.1 forever,
and do nothing.  It is a dead parrot.

But if we really want to do something about this, my next preferred
option would be to modify c.h's test to add more conditions, here:

/* GCC, Sunpro and XLC support aligned, packed and noreturn */
#if defined(__GNUC__) || defined(__SUNPRO_C) || defined(__IBMC__)
#define pg_attribute_aligned(a) __attribute__((aligned(a)))
...

Full GCC support including stack objects actually began in 4.6, it
seems.  It might require a bit of research because the GCC-workalikes
including Clang also claim to be certain versions of GCC (for example
I think Clang 7 would be excluded if you excluded GCC 4.2, even though
this particular thing apparently worked fine in Clang 7).  That's my
best idea, i.e. to actually model the feature history accurately, if we
are suspending disbelief and pretending that it is a reasonable
target.
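
To illustrate, such a condition might look something like this (an
untested sketch; the version cutoff is per the research above, and
Clang has to be tested first because it masquerades as GCC 4.2):

/* Only define pg_attribute_aligned() where stack objects are known to
 * honour it: Clang, GCC >= 4.6, Sunpro, XLC. */
#if defined(__clang__) || \
    (defined(__GNUC__) && (__GNUC__ > 4 || \
                           (__GNUC__ == 4 && __GNUC_MINOR__ >= 6))) || \
    defined(__SUNPRO_C) || defined(__IBMC__)
#define pg_attribute_aligned(a) __attribute__((aligned(a)))
#endif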



Re: Direct I/O

From
Tom Lane
Date:
Thomas Munro <thomas.munro@gmail.com> writes:
> Full GCC support including stack objects actually began in 4.6, it
> seems.

Hmm.  The oldest gcc versions remaining in the buildfarm seem to be

 curculio      | configure: using compiler=gcc (GCC) 4.2.1 20070719
 frogfish      | configure: using compiler=gcc (Debian 4.6.3-14) 4.6.3
 lapwing       | configure: using compiler=gcc (Debian 4.7.2-5) 4.7.2
 skate         | configure: using compiler=gcc-4.7 (Debian 4.7.2-5) 4.7.2
 snapper       | configure: using compiler=gcc-4.7 (Debian 4.7.2-5) 4.7.2
 buri          | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
 chub          | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
 dhole         | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
 mantid        | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
 prion         | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-28)
 rhinoceros    | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
 siskin        | configure: using compiler=gcc (GCC) 4.8.5 20150623 (Red Hat 4.8.5-44)
 shelduck      | configure: using compiler=gcc (SUSE Linux) 4.8.5
 topminnow     | configure: using compiler=gcc (Debian 4.9.2-10+deb8u1) 4.9.2

so curculio should be the only one that's at risk here.
Maybe just upgrading it is the right answer.

            regards, tom lane



Re: Direct I/O

From
Justin Pryzby
Date:
On Sat, Apr 15, 2023 at 02:19:35PM -0400, Tom Lane wrote:
> PS: I don't quite understand how it managed to get through initdb
> when CREATE DATABASE doesn't work.  Maybe there is a different
> code path taken in standalone mode?

ad43a413c4f7f5d024a5b2f51e00d280a22f1874
    initdb: When running CREATE DATABASE, use STRATEGY = WAL_COPY.



Re: Direct I/O

From
Mikael Kjellström
Date:

On 2023-04-16 00:10, Tom Lane wrote:

> so curculio should be the only one that's at risk here.
> Maybe just upgrading it is the right answer.

Just let me know if I should switch curculio to OpenBSD 7.3.

I already have a new server setup so only need to switch the "animal" 
and "secret" and enable the cron job to get it running.

/Mikael



Re: Direct I/O

From
Tom Lane
Date:
Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
> On 2023-04-16 00:10, Tom Lane wrote:
>> so curculio should be the only one that's at risk here.
>> Maybe just upgrading it is the right answer.

> Just let me know if I should switch curculio to OpenBSD 7.3.

Yes please.

> I already have a new server setup so only need to switch the "animal" 
> and "secret" and enable the cron job to get it running.

Actually, as long as it's still OpenBSD I think you can keep using
the same animal name ... Andrew, what's the policy on that?

            regards, tom lane



Re: Direct I/O

From
Mikael Kjellström
Date:
On 2023-04-16 16:18, Tom Lane wrote:
> Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
>> On 2023-04-16 00:10, Tom Lane wrote:
>>> so curculio should be the only one that's at risk here.
>>> Maybe just upgrading it is the right answer.
> 
>> Just let me know if I should switch curculio to OpenBSD 7.3.
> 
> Yes please.

Ok.


>> I already have a new server setup so only need to switch the "animal"
>> and "secret" and enable the cron job to get it running.
> 
> Actually, as long as it's still OpenBSD I think you can keep using
> the same animal name ... Andrew, what's the policy on that?

That is what I meant above.

I just use the same animal name and secret and then run 
"update_personality.pl".

That should be enough I think?

/Mikael




Re: Direct I/O

From
Andrew Dunstan
Date:


On 2023-04-16 Su 10:18, Tom Lane wrote:
> Mikael Kjellström <mikael.kjellstrom@mksoft.nu> writes:
>> On 2023-04-16 00:10, Tom Lane wrote:
>>> so curculio should be the only one that's at risk here.
>>> Maybe just upgrading it is the right answer.
>> Just let me know if I should switch curculio to OpenBSD 7.3.
> Yes please.
>
>> I already have a new server setup so only need to switch the "animal"
>> and "secret" and enable the cron job to get it running.
> Actually, as long as it's still OpenBSD I think you can keep using
> the same animal name ... Andrew, what's the policy on that?


update_personality.pl lets you update the OS version / compiler version / owner-name / owner-email


I am in fact about to perform this exact operation for prion.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com

Re: Direct I/O

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> On 2023-04-16 Su 10:18, Tom Lane wrote:
>> Actually, as long as it's still OpenBSD I think you can keep using
>> the same animal name ... Andrew, what's the policy on that?

> update_personality.pl lets you update the OS version / compiler version 
> / owner-name / owner-email

Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang
in OpenBSD 7.3, doesn't it?  That isn't something update_personality
will handle; you need a new animal if the compiler product is changing.

            regards, tom lane



Re: Direct I/O

From
Andrew Dunstan
Date:

> On Apr 16, 2023, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Andrew Dunstan <andrew@dunslane.net> writes:
>>> On 2023-04-16 Su 10:18, Tom Lane wrote:
>>> Actually, as long as it's still OpenBSD I think you can keep using
>>> the same animal name ... Andrew, what's the policy on that?
>
>> update_personality.pl lets you update the OS version / compiler version
>> / owner-name / owner-email
>
> Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang
> in OpenBSD 7.3, doesn't it?  That isn't something update_personality
> will handle; you need a new animal if the compiler product is changing.
>
>

Correct.

Cheers

Andrew


Re: Direct I/O

From
Mikael Kjellström
Date:

On 2023-04-16 19:59, Andrew Dunstan wrote:

>> On Apr 16, 2023, at 12:16 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> Andrew Dunstan <andrew@dunslane.net> writes:
>>>> On 2023-04-16 Su 10:18, Tom Lane wrote:
>>>> Actually, as long as it's still OpenBSD I think you can keep using
>>>> the same animal name ... Andrew, what's the policy on that?
>>
>>> update_personality.pl lets you update the OS version / compiler version
>>> / owner-name / owner-email
>>
>> Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang
>> in OpenBSD 7.3, doesn't it?  That isn't something update_personality
>> will handle; you need a new animal if the compiler product is changing.
>>
>>   
> 
> Correct.

OK. I registered a new animal for this then.

So if someone could look at that and give me an animal name + secret I
can set this up.

/Mikael




Re: Direct I/O

From
Michael Paquier
Date:
On Sun, Apr 16, 2023 at 04:51:04PM +0200, Mikael Kjellström wrote:
> That is what I meant with above.
>
> I just use the same animal name and secret and then run
> "update_personality.pl".
>
> That should be enough I think?

Yes, that should be enough as far as I recall.  This has been
mentioned a couple of weeks ago here:
https://www.postgresql.org/message-id/CA+hUKGK0jJ+G+bxLUZqpBsxpvEg7Lvt1v8LBxFkZbrvtFTSghw@mail.gmail.com

I have also used setnotes.pl to reflect my animals' CFLAGS on the
website.
--
Michael


Re: Direct I/O

From
Mikael Kjellström
Date:

On 2023-04-16 20:05, Mikael Kjellström wrote:

>>> Oh wait ... this involves a switch from gcc in OpenBSD 5.9 to clang
>>> in OpenBSD 7.3, doesn't it?  That isn't something update_personality
>>> will handle; you need a new animal if the compiler product is changing.
>>>
>>
>> Correct.
> 
> OK. I registered a new animal for this then.
> 
> So if someone could look at that and give me an animal name + secret I
> can set this up.

I have set up a new animal "schnauzer" (thanks, Andrew!).

That should report in a little while.

/Mikael




Re: Direct I/O

From
Robert Haas
Date:
On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I get the impression that we are going to need an actual runtime
> test if we want to defend against this.  Not entirely convinced
> it's worth the trouble.  Who, other than our deliberately rear-guard
> buildfarm animals, is going to be building modern PG with such old
> compilers?  (And more especially to the point, on platforms new
> enough to have working O_DIRECT?)

I don't think that I fully understand everything under discussion
here, but I would just like to throw in a vote for trying to make
failures as comprehensible as we reasonably can. It makes me a bit
nervous to rely on things like "anybody who has O_DIRECT will also
have working alignment pragmas," because there's no relationship
between those things other than when we think they got implemented on
the platforms that are popular today. If somebody ships me a brand new
Deathstation 9000 that has O_DIRECT but NOT alignment pragmas, how
badly are things going to break and how hard is it going to be for me
to understand why it's not working?

I understand that nobody (including me) wants the code cluttered with
a bunch of useless cruft that caters only to hypothetical systems, and
I don't want us to spend a lot of effort building untestable
infrastructure that caters only to such machines. I just don't want us
to do things that are more magical than they need to be. If and when
something fails, it's real nice if you can easily understand why it
failed.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Direct I/O

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I get the impression that we are going to need an actual runtime
>> test if we want to defend against this.  Not entirely convinced
>> it's worth the trouble.  Who, other than our deliberately rear-guard
>> buildfarm animals, is going to be building modern PG with such old
>> compilers?  (And more especially to the point, on platforms new
>> enough to have working O_DIRECT?)

> I don't think that I fully understand everything under discussion
> here, but I would just like to throw in a vote for trying to make
> failures as comprehensible as we reasonably can.

I'm not hugely concerned about this yet.  I think the reason for
slipping this into v16 as developer-only code is exactly that we need
to get a feeling for where the portability dragons live.  When (and
if) we try to make O_DIRECT mainstream, yes we'd better be sure that
any known failure cases are reported well.  But we need the data
about which those are, first.

            regards, tom lane



Re: Direct I/O

From
Thomas Munro
Date:
On Tue, Apr 18, 2023 at 4:06 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I get the impression that we are going to need an actual runtime
> >> test if we want to defend against this.  Not entirely convinced
> >> it's worth the trouble.  Who, other than our deliberately rear-guard
> >> buildfarm animals, is going to be building modern PG with such old
> >> compilers?  (And more especially to the point, on platforms new
> >> enough to have working O_DIRECT?)
>
> > I don't think that I fully understand everything under discussion
> > here, but I would just like to throw in a vote for trying to make
> > failures as comprehensible as we reasonably can.
>
> I'm not hugely concerned about this yet.  I think the reason for
> slipping this into v16 as developer-only code is exactly that we need
> to get a feeling for where the portability dragons live.  When (and
> if) we try to make O_DIRECT mainstream, yes we'd better be sure that
> any known failure cases are reported well.  But we need the data
> about which those are, first.

+1

A couple more things I wanted to note:

* We have no plans to turn this on by default even when the later
asynchronous machinery is proposed, and direct I/O starts to make more
economic sense (think: your stream of small reads and writes will be
converted to larger preadv/pwritev or moral equivalent and performed
ahead of time in the background).  Reasons: (1) There will always be a
few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we
learned in this thread; it fails with EINVAL at open() time), and (2)
without a page cache, you really need to size your shared_buffers
adequately and we can't do that automatically.  It's something you'd
opt into for a dedicated database server along with other carefully
considered settings.  It seems acceptable to me that if you set
io_direct to a non-default setting on an unusual-for-a-database-server
filesystem you might get errors screaming about inability to open
files -- you'll just have to turn it back off again if it doesn't work
for you.

* For the alignment part, C11 has "alignas(x)" in <stdalign.h>, so I
very much doubt that a hypothetical new Deathstation C compiler would
not know how to align stack objects arbitrarily, even though for now
as a C99 program we have to use the non-standard incantations defined
in our c.h.  I assume we'll eventually switch to that.  In the
meantime, if someone manages to build PostgreSQL on a hypothetical C
compiler that our c.h doesn't recognise, we just won't let you turn
the io_direct GUC on (because we set PG_O_DIRECT to 0 if we don't have
an alignment macro, see commit faeedbce's message for rationale).  If
the alignment trick from c.h appears to be available but is actually
broken (GCC 4.2.1), then those assertions I added into smgrread() et
al will fail as Tom showed (yay! they did their job), or in a
non-assert build you'll probably get EINVAL when you try to read or
write from your badly aligned buffers depending on how picky your OS
is, but that's just an old bug in a defunct compiler that we have by
now written more about than they ever did in their bug tracker.

* I guess it's unlikely at this point that POSIX will ever standardise
O_DIRECT if they didn't already in the 90s (I didn't find any
discussion of it in their issue tracker).  There is really only one OS
on our target list that truly can't do direct I/O at all: OpenBSD.  It
seems a reasonable bet that if they or a hypothetical totally new
Unixoid system ever implemented it they'd spell it the same IRIX way
for practical reasons, but if not we just won't use it until someone
writes a patch *shrug*.  There is also one system that's been rocking
direct I/O since the 90s for Oracle etc, but PostgreSQL still doesn't
know how to turn it on: Solaris has a directio() system call.  I
posted a (trivial) patch for that once in the thread where I added
Apple F_NOCACHE, but there is probably nobody on this list who can
test it successfully (as Tom discovered, wrasse's host is not
configured right for it, you'd need an admin/root to help set up a UFS
file system, or perhaps modern (closed) ZFS can do it but that system
is old and unpatched), and I have no desire to commit a "blind" patch
for an untested niche setup; I really only considered it because I
realised I was so close to covering the complete set of OSes.  That's
cool, we just won't let you turn the GUC on if we don't know how and
the error message is clear about that if you try.
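
To make those last two points concrete, here are two illustrative
sketches (all names and details mine, not from any patch in this
thread).  The Solaris route enables direct I/O per file descriptor
after open(), rather than via an open() flag:

#include <fcntl.h>
#include <sys/types.h>
#include <sys/fcntl.h>      /* directio(), DIRECTIO_ON on Solaris */

int
open_direct(const char *path)
{
    int     fd = open(path, O_RDWR);

    /* best effort: can fail e.g. on non-UFS file systems */
    if (fd >= 0 && directio(fd, DIRECTIO_ON) != 0)
    {
        /* leave direct I/O off, or complain, as policy dictates */
    }
    return fd;
}

And the C11 spelling of the alignment request, which could eventually
replace the compiler-specific incantation in c.h (4096 standing in
for PG_IO_ALIGN_SIZE):

#include <stdalign.h>

typedef union
{
    alignas(4096) char data[8192];
    double  force_align_d;
} C11IOBlockSketch;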



Re: Direct I/O

From
Greg Stark
Date:
On Mon, 17 Apr 2023 at 17:45, Thomas Munro <thomas.munro@gmail.com> wrote:
>
> Reasons: (1) There will always be a
> few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we
> learned in this thread; it fails with EINVAL at open() time), and

So why wouldn't we just automatically turn it off (globally or for
that tablespace) and keep operating without it afterward?

> (2) without a page cache, you really need to size your shared_buffers
> adequately and we can't do that automatically.

Well.... I'm more optimistic... That may not always be impossible.
We've already added the ability to add more shared memory after
startup. We could implement the ability to add or remove shared buffer
segments after startup. And it wouldn't be crazy to imagine a kernel
interface that lets us judge whether the kernel memory pressure makes
it reasonable for us to take more shared buffers or makes it necessary
to release shared memory to the kernel. You could hack something
together using /proc/meminfo today but I imagine an interface intended
for this kind of thing would be better.
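
For what it's worth, a rough sketch of that /proc/meminfo hack (Linux
only, all policy omitted):

#include <stdio.h>

/* Hypothetical helper: the kernel's estimate of allocatable memory,
 * in kB, or -1 on failure.  MemAvailable exists since Linux 3.14. */
long
mem_available_kb(void)
{
    FILE   *f = fopen("/proc/meminfo", "r");
    char    line[128];
    long    kb = -1;

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f))
        if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1)
            break;
    fclose(f);
    return kb;
}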

>  It's something you'd
> opt into for a dedicated database server along with other carefully
> considered settings.  It seems acceptable to me that if you set
> io_direct to a non-default setting on an unusual-for-a-database-server
> filesystem you might get errors screaming about inability to open
> files -- you'll just have to turn it back off again if it doesn't work
> for you.

If the only solution is to turn it off perhaps the server should just
turn it off? I guess the problem is that the shared_buffers might be
set assuming it would be on?



-- 
greg



Re: Direct I/O

From
Robert Haas
Date:
On Tue, Apr 18, 2023 at 3:35 PM Greg Stark <stark@mit.edu> wrote:
> Well.... I'm more optimistic... That may not always be impossible.
> We've already added the ability to add more shared memory after
> startup. We could implement the ability to add or remove shared buffer
> segments after startup. And it wouldn't be crazy to imagine a kernel
> interface that lets us judge whether the kernel memory pressure makes
> it reasonable for us to take more shared buffers or makes it necessary
> to release shared memory to the kernel.

On this point specifically, one fairly large problem that we have
currently is that our buffer replacement algorithm is terrible. In
workloads I've examined, either almost all buffers end up with a usage
count of 5 or almost all buffers end up with a usage count of 0 or 1.
Either way, we lose all or nearly all information about which buffers
are actually hot, and we are not especially unlikely to evict some
extremely hot buffer. This is quite bad for performance as it is, and
it would be a lot worse if recovering from a bad eviction decision
routinely required rereading from disk instead of only rereading from
the OS buffer cache.

I've sometimes wondered whether our current algorithm is just a more
expensive version of random eviction. I suspect that's a bit too
pessimistic, but I don't really know.

I'm not saying that it isn't possible to fix this. I bet it is, and I
hope someone does. I'm just making the point that even if we knew the
amount of kernel memory pressure and even if we also had the ability
to add and remove shared_buffers at will, it probably wouldn't help
much as things stand today, because we're not in a good position to
judge how large the cache would need to be in order to be useful, or
what we ought to be storing in it.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Direct I/O

From
Joe Conway
Date:
On 4/19/23 10:11, Robert Haas wrote:
> On Tue, Apr 18, 2023 at 3:35 PM Greg Stark <stark@mit.edu> wrote:
>> Well.... I'm more optimistic... That may not always be impossible.
>> We've already added the ability to add more shared memory after
>> startup. We could implement the ability to add or remove shared buffer
>> segments after startup. And it wouldn't be crazy to imagine a kernel
>> interface that lets us judge whether the kernel memory pressure makes
>> it reasonable for us to take more shared buffers or makes it necessary
>> to release shared memory to the kernel.
> 
> On this point specifically, one fairly large problem that we have
> currently is that our buffer replacement algorithm is terrible. In
> workloads I've examined, either almost all buffers end up with a usage
> count of 5 or almost all buffers end up with a usage count of 0 or 1.
> Either way, we lose all or nearly all information about which buffers
> are actually hot, and we are not especially unlikely to evict some
> extremely hot buffer.

That has been my experience as well, although admittedly I have not 
looked in quite a while.


> I'm not saying that it isn't possible to fix this. I bet it is, and I
> hope someone does.

I keep looking at this blog post about Transparent Memory Offloading and 
thinking that we could learn from it:


https://engineering.fb.com/2022/06/20/data-infrastructure/transparent-memory-offloading-more-memory-at-a-fraction-of-the-cost-and-power/

Unfortunately, it is very Linux-specific and requires a really
up-to-date OS -- cgroup v2, kernel >= 5.19.

> I'm just making the point that even if we knew the amount of kernel
> memory pressure and even if we also had the ability to add and remove
> shared_buffers at will, it probably wouldn't help much as things
> stand today, because we're not in a good position to judge how large
> the cache would need to be in order to be useful, or what we ought to
> be storing in it.

The tactic TMO uses is basically to tune the available memory to get a 
target memory pressure. That seems like it could work.

-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-19 10:11:32 -0400, Robert Haas wrote:
> On this point specifically, one fairly large problem that we have
> currently is that our buffer replacement algorithm is terrible. In
> workloads I've examined, either almost all buffers end up with a usage
> count of 5 or almost all buffers end up with a usage count of 0 or 1.
> Either way, we lose all or nearly all information about which buffers
> are actually hot, and we are not especially unlikely to evict some
> extremely hot buffer. This is quite bad for performance as it is, and
> it would be a lot worse if recovering from a bad eviction decision
> routinely required rereading from disk instead of only rereading from
> the OS buffer cache.

Interestingly, I haven't seen that as much in more recent benchmarks as it
used to. Partially I think because common s_b settings have gotten bigger, I
guess. But I wonder if we also accidentally improved something else, e.g. by
pin/unpin-ing the same buffer a bit less frequently.


> I've sometimes wondered whether our current algorithm is just a more
> expensive version of random eviction. I suspect that's a bit too
> pessimistic, but I don't really know.

I am quite certain that it's better than that. If you e.g. have pkey lookup
workload >> RAM you can actually end up seeing inner index pages staying
reliably in s_b. But clearly we can do better.


I do think we likely should (as IIRC Peter Geoghegan suggested) provide more
information to the buffer replacement layer:

Independent of the concrete buffer replacement algorithm, the recency
information we do provide is somewhat lacking. In some places we do repeated
ReadBuffer() calls for the same buffer, leading to over-inflating usagecount.

We should seriously consider taking the cost of the IO into account, basically
making it more likely that s_b is increased when we need to synchronously wait
for IO. The cost of a miss is much lower for e.g. a sequential scan or a
bitmap heap scan, because both can do some form of prefetching. Whereas index
pages and the heap fetch for plain index scans aren't prefetchable (which
could be improved some, but not generally).


> I'm not saying that it isn't possible to fix this. I bet it is, and I
> hope someone does. I'm just making the point that even if we knew the
> amount of kernel memory pressure and even if we also had the ability
> to add and remove shared_buffers at will, it probably wouldn't help
> much as things stand today, because we're not in a good position to
> judge how large the cache would need to be in order to be useful, or
> what we ought to be storing in it.

FWIW, my experience is that linux' page replacement doesn't work very well
either. Partially because we "hide" a lot of the recency information from
it. But also just because it doesn't scale all that well to large amounts of
memory (there's ongoing work on that though).  So I am not really convinced by
this argument - for plenty workloads just caching in PG will be far better
than caching both in the kernel and in PG, as long as some adaptiveness to
memory pressure avoids running into OOMs.

Some forms of adaptive s_b sizing aren't particularly hard, I think. Instead
of actually changing the s_b shmem allocation - which would be very hard in a
process based model - we can tell the kernel that some parts of that memory
aren't currently in use with madvise(MADV_REMOVE).  It's not quite as trivial
as it sounds, because we'd have to free in multiples of huge_page_size.
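
A minimal sketch of that idea (Linux-specific; the helper name and all
details are hypothetical, with the rounding left to the caller):

#include <stddef.h>
#include <sys/mman.h>

/* Give an idle slice of the buffer pool back to the kernel.  start
 * and len must already be multiples of the (huge) page size backing
 * the mapping, per the caveat above. */
static int
release_buffer_range(void *start, size_t len)
{
    return madvise(start, len, MADV_REMOVE);
}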

Greetings,

Andres Freund



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-18 09:44:10 +1200, Thomas Munro wrote:
> * We have no plans to turn this on by default even when the later
> asynchronous machinery is proposed, and direct I/O starts to make more
> economic sense (think: your stream of small reads and writes will be
> converted to larger preadv/pwritev or moral equivalent and performed
> ahead of time in the background).  Reasons: (1) There will always be a
> few file systems that refuse O_DIRECT (Linux tmpfs is one such, as we
> learned in this thread; it fails with EINVAL at open() time), and (2)
> without a page cache, you really need to size your shared_buffers
> adequately and we can't do that automatically.  It's something you'd
> opt into for a dedicated database server along with other carefully
> considered settings.  It seems acceptable to me that if you set
> io_direct to a non-default setting on an unusual-for-a-database-server
> filesystem you might get errors screaming about inability to open
> files -- you'll just have to turn it back off again if it doesn't work
> for you.

FWIW, *long* term I think it might make sense to turn DIO on automatically for a
small subset of operations, if supported. Examples:

1) Once we have the ability to "feed" walsenders from wal_buffers, instead of
   going to disk, automatically using DIO for WAL might be beneficial. The
   increase in IO concurrency and reduction in latency one can get is
   substantial.

2) If we make base backups use s_b if pages are in s_b, and do locking via s_b
   for non-existing pages, it might be worth automatically using DIO for the
   reads of the non-resident data, to avoid swamping the kernel page cache
   with data that won't be read again soon (and to utilize DMA etc).

3) When writing back dirty data that we don't expect to be dirtied again soon,
   e.g. from vacuum ringbuffers or potentially even checkpoints, it could make
   sense to use DIO, to avoid the kernel keeping such pages in the page cache.


But for the main s_b, I agree, I can't foresee us turning on DIO by
default. Unless somebody has tuned s_b at least some for the workload, that's
not going to go well. And even if somebody has, it's quite reasonable to use
the same host also for other programs (including other PG instances), in which
case it's likely desirable to be adaptive to the current load when deciding
what to cache - which the kernel is in the best position to do.



> If the alignment trick from c.h appears to be available but is actually
> broken (GCC 4.2.1), then those assertions I added into smgrread() et
> al will fail as Tom showed (yay! they did their job), or in a
> non-assert build you'll probably get EINVAL when you try to read or
> write from your badly aligned buffers depending on how picky your OS
> is, but that's just an old bug in a defunct compiler that we have by
> now written more about than they ever did in their bug tracker.

Agreed. If we ever find such issues in some future compiler, we'll just need
to beef up our configure test to detect that it doesn't actually fully support
specifying alignment.

Greetings,

Andres Freund



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-18 15:35:09 -0400, Greg Stark wrote:
> On Mon, 17 Apr 2023 at 17:45, Thomas Munro <thomas.munro@gmail.com> wrote:
> >  It's something you'd
> > opt into for a dedicated database server along with other carefully
> > considered settings.  It seems acceptable to me that if you set
> > io_direct to a non-default setting on an unusual-for-a-database-server
> > filesystem you might get errors screaming about inability to open
> > files -- you'll just have to turn it back off again if it doesn't work
> > for you.
> 
> If the only solution is to turn it off perhaps the server should just
> turn it off? I guess the problem is that the shared_buffers might be
> set assuming it would be on?

I am quite strongly opposed to that - silently (or with a log message, which
practically is the same as silently) disabling performance relevant options
like DIO is much more likely to cause problems, due to the drastically
different performance characteristics you get. I can see us making it
configurable to try using DIO though, but I am not convinced it's worth
bothering with that. But we'll see.

Greetings,

Andres Freund



Re: Direct I/O

From
Robert Haas
Date:
On Wed, Apr 19, 2023 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:
> Interestingly, I haven't seen that as much in more recent benchmarks as it
> used to. Partially I think because common s_b settings have gotten bigger, I
> guess. But I wonder if we also accidentally improved something else, e.g. by
> pin/unpin-ing the same buffer a bit less frequently.

I think the problem with the algorithm is pretty fundamental. The rate
of usage count increase is tied to how often we access buffers, and
the rate of usage count decrease is tied to buffer eviction. But a
given workload can have no eviction at all (in which case the usage
counts must converge to 5) or it can evict on every buffer access (in
which case the usage counts must mostly converge to 0, because we'll
be decreasing usage counts at least once per buffer and generally
more). ISTM that the only way that you can end up with a good mix of
usage counts is if you have a workload that falls into some kind of a
sweet spot where the rate of usage count bumps and the rate of usage
count de-bumps are close enough together that things don't skew all
the way to one end or the other. Bigger s_b might make that more
likely to happen in practice, but it seems like bad algorithm design
on a theoretical level. We should be comparing the frequency of access
of buffers to the frequency of access of other buffers, not to the
frequency of buffer eviction. Or to put the same thing another way,
the average value of the usage count shouldn't suddenly change from 5
to 1 when the server evicts 1 buffer.
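
A toy model of those two feedback loops (hypothetical names and sizes,
not the real bufmgr code) to make the asymmetry concrete:

#define NBUFFERS    16
#define MAX_USAGE   5

static int  usage[NBUFFERS];
static int  sweep_hand;

static void
access_buffer(int buf)
{
    if (usage[buf] < MAX_USAGE)
        usage[buf]++;           /* rate tied to buffer accesses */
}

static int
evict_one(void)
{
    for (;;)
    {
        if (usage[sweep_hand] == 0)
        {
            int     victim = sweep_hand;

            sweep_hand = (sweep_hand + 1) % NBUFFERS;
            return victim;
        }
        usage[sweep_hand]--;    /* rate tied to evictions only */
        sweep_hand = (sweep_hand + 1) % NBUFFERS;
    }
}

With zero evictions every counter saturates at 5; with eviction on
every access the sweep drives almost everything to 0 or 1.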

> I do think we likely should (as IIRC Peter Geoghegan suggested) provide more
> information to the buffer replacement layer:
>
> Independent of the concrete buffer replacement algorithm, the recency
> information we do provide is somewhat lacking. In some places we do repeated
> ReadBuffer() calls for the same buffer, leading to over-inflating usagecount.

Yeah, that would be good to fix. I don't think it solves the whole
problem by itself, but it seems like a good step.

> We should seriously consider taking the cost of the IO into account, basically
> making it more likely that s_b is increased when we need to synchronously wait
> for IO. The cost of a miss is much lower for e.g. a sequential scan or a
> bitmap heap scan, because both can do some form of prefetching. Whereas index
> pages and the heap fetch for plain index scans aren't prefetchable (which
> could be improved some, but not generally).

I guess that's reasonable if we can pass the information around well
enough, but I still think the algorithm ought to get some fundamental
improvement first.

> FWIW, my experience is that linux' page replacement doesn't work very well
> either. Partially because we "hide" a lot of the recency information from
> it. But also just because it doesn't scale all that well to large amounts of
> memory (there's ongoing work on that though).  So I am not really convinced by
> this argument - for plenty workloads just caching in PG will be far better
> than caching both in the kernel and in PG, as long as some adaptiveness to
> memory pressure avoids running into OOMs.

Even if the Linux algorithm is bad, and it may well be, the Linux
cache is often a lot bigger than our cache. Which can cover a
multitude of problems.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Direct I/O

From
Andres Freund
Date:
Hi,

On 2023-04-19 13:16:54 -0400, Robert Haas wrote:
> On Wed, Apr 19, 2023 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:
> > Interestingly, I haven't seen that as much in more recent benchmarks as it
> > used to. Partially I think because common s_b settings have gotten bigger, I
> > guess. But I wonder if we also accidentally improved something else, e.g. by
> > pin/unpin-ing the same buffer a bit less frequently.
> 
> I think the problem with the algorithm is pretty fundamental. The rate
> of usage count increase is tied to how often we access buffers, and
> the rate of usage count decrease is tied to buffer eviction. But a
> given workload can have no eviction at all (in which case the usage
> counts must converge to 5) or it can evict on every buffer access (in
> which case the usage counts must mostly converget to 0, because we'll
> be decreasing usage counts at least once per buffer and generally
> more).

I don't think the "evict on every buffer access" works quite that way - unless
you have a completely even access pattern, buffer access frequency will
increase usage count much more frequently on some buffers than others. And if
you have a completely even access pattern, it's hard to come up with a good
cache replacement algorithm...


> ISTM that the only way that you can end up with a good mix of
> usage counts is if you have a workload that falls into some kind of a
> sweet spot where the rate of usage count bumps and the rate of usage
> count de-bumps are close enough together that things don't skew all
> the way to one end or the other. Bigger s_b might make that more
> likely to happen in practice, but it seems like bad algorithm design
> on a theoretical level. We should be comparing the frequency of access
> of buffers to the frequency of access of other buffers, not to the
> frequency of buffer eviction. Or to put the same thing another way,
> the average value of the usage count shouldn't suddenly change from 5
> to 1 when the server evicts 1 buffer.

I agree that there are fundamental issues with the algorithm. But practically
I think the effect of the over-saturation of s_b isn't as severe as one might
think:

If your miss rate is very low, the occasional bad victim buffer selection
won't matter that much. If the miss rate is a bit higher, the likelihood of
the usagecount being increased again after being decreased is higher if a
buffer is accessed frequently.

This is also why I think that larger s_b makes the issues less likely - with
larger s_b, it is more likely that frequently accessed buffers are accessed
again after the first of the 5 clock sweeps necessary to reduce the usage
count.  Clearly, with a small-ish s_b and a high replacement rate, that's not
going to happen for sufficiently many buffers. But once you have a few GB of
s_b, multiple complete sweeps take a while.


Most, if not all, buffer replacement algorithms I have seen don't deal well
with "small SB with a huge replacement rate". Most of the fancier algorithms
track recency information for buffers that have recently been evicted - but
you obviously can't track that to an unlimited degree, IIRC most papers
propose that the shadow map be roughly equal to the buffer pool size.

You IMO pretty much need a policy decision on a higher level to improve upon
that (e.g. by just deciding that some buffers are sticky, perhaps because they
were used first) - but it doesn't matter much, because the miss rate is high
enough that the total amount of reads is barely affected by the buffer
replacement decisions.

Greetings,

Andres Freund



Re: Direct I/O

From
Noah Misch
Date:
On Mon, Apr 17, 2023 at 12:06:23PM -0400, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > On Sat, Apr 15, 2023 at 2:19 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> >> I get the impression that we are going to need an actual runtime
> >> test if we want to defend against this.  Not entirely convinced
> >> it's worth the trouble.  Who, other than our deliberately rear-guard
> >> buildfarm animals, is going to be building modern PG with such old
> >> compilers?  (And more especially to the point, on platforms new
> >> enough to have working O_DIRECT?)
> 
> > I don't think that I fully understand everything under discussion
> > here, but I would just like to throw in a vote for trying to make
> > failures as comprehensible as we reasonably can.
> 
> I'm not hugely concerned about this yet.  I think the reason for
> slipping this into v16 as developer-only code is exactly that we need
> to get a feeling for where the portability dragons live.

Speaking of the developer-only status, I find the io_direct name more enticing
than force_parallel_mode, which PostgreSQL renamed due to overuse from people
expecting non-developer benefits.  Should this have a name starting with
debug_?



Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 30, 2023 at 4:11 PM Noah Misch <noah@leadboat.com> wrote:
> Speaking of the developer-only status, I find the io_direct name more enticing
> than force_parallel_mode, which PostgreSQL renamed due to overuse from people
> expecting non-developer benefits.  Should this have a name starting with
> debug_?

Hmm, yeah I think people coming from other databases would be tempted
by it.  But, unlike the
please-jam-a-gather-node-on-top-of-the-plan-so-I-can-debug-the-parallel-executor
switch, I think of this thing more like an experimental feature that
is just waiting for more features to make it useful.  What about a
warning message about that at startup if it's on?



Re: Direct I/O

From
Thomas Munro
Date:
On Sun, Apr 30, 2023 at 6:35 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> On Sun, Apr 30, 2023 at 4:11 PM Noah Misch <noah@leadboat.com> wrote:
> > Speaking of the developer-only status, I find the io_direct name more enticing
> > than force_parallel_mode, which PostgreSQL renamed due to overuse from people
> > expecting non-developer benefits.  Should this have a name starting with
> > debug_?
>
> Hmm, yeah I think people coming from other databases would be tempted
> by it.  But, unlike the
> please-jam-a-gather-node-on-top-of-the-plan-so-I-can-debug-the-parallel-executor
> switch, I think of this thing more like an experimental feature that
> is just waiting for more features to make it useful.  What about a
> warning message about that at startup if it's on?

Something like this?  Better words welcome.

$ ~/install//bin/postgres -D pgdata -c io_direct=data
2023-05-01 09:44:37.460 NZST [99675] LOG:  starting PostgreSQL 16devel
on x86_64-unknown-freebsd13.2, compiled by FreeBSD clang version
14.0.5 (https://github.com/llvm/llvm-project.git
llvmorg-14.0.5-0-gc12386ae247c), 64-bit
2023-05-01 09:44:37.460 NZST [99675] LOG:  listening on IPv6 address
"::1", port 5432
2023-05-01 09:44:37.460 NZST [99675] LOG:  listening on IPv4 address
"127.0.0.1", port 5432
2023-05-01 09:44:37.461 NZST [99675] LOG:  listening on Unix socket
"/tmp/.s.PGSQL.5432"
2023-05-01 09:44:37.463 NZST [99675] WARNING:  io_direct is an
experimental setting for developer testing only
2023-05-01 09:44:37.463 NZST [99675] HINT:  File I/O may be
inefficient or not work on some file systems.
2023-05-01 09:44:37.465 NZST [99678] LOG:  database system was shut
down at 2023-05-01 09:43:51 NZST
2023-05-01 09:44:37.468 NZST [99675] LOG:  database system is ready to
accept connections


Re: Direct I/O

From
Justin Pryzby
Date:
On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote:
> On Sun, Apr 30, 2023 at 4:11 PM Noah Misch <noah@leadboat.com> wrote:
> > Speaking of the developer-only status, I find the io_direct name more enticing
> > than force_parallel_mode, which PostgreSQL renamed due to overuse from people
> > expecting non-developer benefits.  Should this have a name starting with
> > debug_?
> 
> Hmm, yeah I think people coming from other databases would be tempted
> by it.  But, unlike the
> please-jam-a-gather-node-on-top-of-the-plan-so-I-can-debug-the-parallel-executor
> switch, I think of this thing more like an experimental feature that
> is just waiting for more features to make it useful.  What about a
> warning message about that at startup if it's on?

Such a warning wouldn't be particularly likely to be seen by someone who
already didn't read/understand the docs for the not-feature that they
turned on.

Since this is -currently- a developer-only feature, it seems reasonable
to rename the GUC to debug_direct_io, and (at such time as it's
considered to be helpful to users) later rename it to direct_io.  
That avoids the issue that random advice to enable direct_io=x under
v17+ is applied by people running v16.  +0.8 to do so.

Maybe in the future, it should be added to GUC_EXPLAIN, too ?

-- 
Justin



Re: Direct I/O

From
Tom Lane
Date:
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote:
>> What about a
>> warning message about that at startup if it's on?

> Such a warning wouldn't be particularly likely to be seen by someone who
> already didn't read/understand the docs for the not-feature that they
> turned on.

Yeah, I doubt that that would be helpful at all.

> Since this is -currently- a developer-only feature, it seems reasonable
> to rename the GUC to debug_direct_io, and (at such time as it's
> considered to be helpful to users) later rename it to direct_io.

+1

            regards, tom lane



Re: Direct I/O

From
Thomas Munro
Date:
On Mon, May 1, 2023 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Justin Pryzby <pryzby@telsasoft.com> writes:
> > On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote:
> >> What about a
> >> warning message about that at startup if it's on?
>
> > Such a warning wouldn't be particularly likely to be seen by someone who
> > already didn't read/understand the docs for the not-feature that they
> > turned on.
>
> Yeah, I doubt that that would be helpful at all.

Fair enough.

> > Since this is -currently- a developer-only feature, it seems reasonable
> > to rename the GUC to debug_direct_io, and (at such time as it's
> > considered to be helpful to users) later rename it to direct_io.
>
> +1

Yeah, the future cross-version confusion thing is compelling.  OK,
here's a rename patch.  I left all the variable names and related
symbols as they were, so it's just the GUC that gains the prefix.  I
moved the documentation hunk up to be sorted alphabetically like
nearby entries, because that seemed to look nicer, even though the
list isn't globally sorted.


Re: Direct I/O

From
Thomas Munro
Date:
On Wed, Apr 19, 2023 at 7:35 AM Greg Stark <stark@mit.edu> wrote:
> On Mon, 17 Apr 2023 at 17:45, Thomas Munro <thomas.munro@gmail.com> wrote:
> > (2) without a page cache, you really need to size your shared_buffers
> > adequately and we can't do that automatically.
>
> Well.... I'm more optimistic... That may not always be impossible.
> We've already added the ability to add more shared memory after
> startup. We could implement the ability to add or remove shared buffer
> segments after startup.

Yeah, there are examples of systems going back decades with multiple
buffer pools.  In some you can add more space later, and in some you
can also configure pools with different block sizes: imagine being
able to set your extremely OLTP tables to use 4KB blocks for reduced
write amplification, and perhaps even to promise that your storage
doesn't need FPIs at that size because you know it's perfectly safe™,
while some big write-only history tables use 32KB blocks because some
compression scheme works better, etc.  You might also want different
cache replacement algorithms in different pools.  That's complex and
advanced stuff, no doubt, and I'm not suggesting it's anywhere near a
reasonable thing for us to think about now (as a matter of fact, in
another thread you can find me arguing for fully unifying our existing
segregated SLRU buffer pools with the one true buffer pool), but since
we're talking pie-in-the-sky ideas around the water cooler...



Re: Direct I/O

From
Noah Misch
Date:
On Mon, May 01, 2023 at 02:47:57PM +1200, Thomas Munro wrote:
> On Mon, May 1, 2023 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Justin Pryzby <pryzby@telsasoft.com> writes:
> > > Since this is -currently- a developer-only feature, it seems reasonable
> > > to rename the GUC to debug_direct_io, and (at such time as it's
> > > considered to be helpful to users) later rename it to direct_io.
> >
> > +1
> 
> Yeah, the future cross-version confusion thing is compelling.  OK,
> here's a rename patch.

This looks reasonable.



Re: Direct I/O

From
Thomas Munro
Date:
On Mon, May 15, 2023 at 9:09 AM Noah Misch <noah@leadboat.com> wrote:
> This looks reasonable.

Pushed with a small change: a couple of GUC_check_errdetail strings
needed s/io_direct/debug_io_direct/.  Thanks.
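
PS: for anyone following along, those were strings of roughly this
shape (wording from memory, illustrative only):

-        GUC_check_errdetail("Invalid list syntax in parameter \"%s\"",
-                            "io_direct");
+        GUC_check_errdetail("Invalid list syntax in parameter \"%s\"",
+                            "debug_io_direct");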



Re: Direct I/O

From
Peter Eisentraut
Date:
On 01.05.23 04:47, Thomas Munro wrote:
> On Mon, May 1, 2023 at 12:00 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Justin Pryzby <pryzby@telsasoft.com> writes:
>>> On Sun, Apr 30, 2023 at 06:35:30PM +1200, Thomas Munro wrote:
>>>> What about a
>>>> warning message about that at startup if it's on?
>>
>>> Such a warning wouldn't be particularly likely to be seen by someone who
>>> already didn't read/understand the docs for the not-feature that they
>>> turned on.
>>
>> Yeah, I doubt that that would be helpful at all.
> 
> Fair enough.
> 
>>> Since this is -currently- a developer-only feature, it seems reasonable
>>> to rename the GUC to debug_direct_io, and (at such time as it's
>>> considered to be helpful to users) later rename it to direct_io.
>>
>> +1
> 
> Yeah, the future cross-version confusion thing is compelling.  OK,
> here's a rename patch.  I left all the variable names and related
> symbols as they were, so it's just the GUC that gains the prefix.  I
> moved the documentation hunk up to be sorted alphabetically like
> nearby entries, because that seemed to look nicer, even though the
> list isn't globally sorted.

I suggest also renaming the hook functions (check and assign), as in 
the attached patch.  Mainly because utils/guc_hooks.h says to order the 
functions by GUC variable name, which was already wrong under the old 
name, and it would be pretty confusing to sort the functions by a GUC 
name that doesn't match the function names.
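
For reference, the guc_hooks.h declarations then become something like
this (a sketch, using the standard string-GUC hook signatures), so
that the file's GUC-name ordering and the function names agree:

    extern bool check_debug_io_direct(char **newval, void **extra,
                                      GucSource source);
    extern void assign_debug_io_direct(const char *newval, void *extra);
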
Attachment

Re: Direct I/O

From
Peter Geoghegan
Date:
On Wed, Apr 19, 2023 at 10:43 AM Andres Freund <andres@anarazel.de> wrote:
> I don't think the "evict on every buffer access" works quite that way - unless
> you have a completely even access pattern, buffer access frequency will
> increase usage count much more frequently on some buffers than others. And if
> you have a completely even access pattern, it's hard to come up with a good
> cache replacement algorithm...

My guess is that the most immediate problem in this area is that of
"correlated references" (to use the term from the LRU-K paper). I
gave an example of that here:

https://postgr.es/m/CAH2-Wzk7T9K3d9_NY+jEXp2qQGMYoP=gZMoR8q1Cv57SxAw1OA@mail.gmail.com

In other words, I think that the most immediate problem may in fact be
the tendency of usage_count to get incremented multiple times in
response to what is (for all intents and purposes) the same logical
page access. Even if it's not as important as I imagine, it still
seems likely that verifying that our input information isn't garbage
is the logical place to begin work in this general area. It's
difficult to be sure about that, because it's so hard to look at just
one problem in isolation. I suspect that you were right to point out
that a larger shared_buffers setting tends to behave quite differently
from a smaller one. That same factor is going to complicate any
analysis of the specific problem that I've highlighted (to say nothing
of the way that contention complicates the picture).
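
To make the mechanism concrete, here's a minimal self-contained sketch
of a clock sweep with usage counts (loosely modelled on ours; buffer
identifiers and pool size are invented, though the small cap is real).
Two back-to-back accesses to the same page bump the count twice, even
though they amount to one logical access:

    #include <stdio.h>

    #define NBUFFERS  8
    #define MAX_USAGE 5         /* BM_MAX_USAGE_COUNT is also 5 */

    static int usage_count[NBUFFERS];
    static int hand;

    /* Each buffer access bumps its usage count, up to the cap. */
    static void
    access_buffer(int buf)
    {
        if (usage_count[buf] < MAX_USAGE)
            usage_count[buf]++;
    }

    /* The sweep hand decrements counts, evicting the first zero found. */
    static int
    evict_one(void)
    {
        for (;;)
        {
            if (usage_count[hand] == 0)
            {
                int victim = hand;

                hand = (hand + 1) % NBUFFERS;
                return victim;
            }
            usage_count[hand]--;
            hand = (hand + 1) % NBUFFERS;
        }
    }

    int
    main(void)
    {
        /* "Correlated references": one logical access, counted twice. */
        access_buffer(3);
        access_buffer(3);
        printf("buffer 3 usage_count = %d after one logical access\n",
               usage_count[3]);
        printf("victim chosen by sweep: buffer %d\n", evict_one());
        return 0;
    }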

There is an interesting paper that compared the hit rates seen for
TPC-C to TPC-E on relatively modern hardware:

https://www.cs.cmu.edu/~chensm/papers/TPCE-sigmod-record10.pdf

It concludes that the buffer misses for each workload look rather
similar past a certain buffer pool size: both workloads have cache
misses that seem totally random. The access patterns may be very
different, but that doesn't necessarily have any visible effect on
buffer misses, at least provided that you make certain modest
assumptions about buffer pool size relative to working set size.

The most sophisticated cache management algorithms (like ARC) work by
maintaining metadata about recently evicted buffers, which is used to
decide whether to favor recency over frequency. If you work backwards
then it follows that having cache misses that look completely random
is desirable, and perhaps even something to work towards. What you
really don't want is a situation where the same small minority of
pages keeps getting ping-ponged into and out of the buffer pool,
without ever settling, even though the buffer cache is large enough
for them to settle in principle. That pathological profile is the
furthest possible thing from random.
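
As a toy illustration of that "remember what you recently evicted"
idea (this is not ARC itself; names and sizes are invented): a miss
that finds its page in the ghost list is evidence that the cache
evicted something it should have kept, which an adaptive policy can
use to shift the balance between recency and frequency:

    #include <stdbool.h>
    #include <stdio.h>

    #define GHOST_SIZE 4

    static int ghost[GHOST_SIZE];   /* pages evicted most recently */
    static int ghost_next;
    static int ghost_hits;          /* misses we evicted too eagerly */

    static void
    remember_eviction(int page)
    {
        ghost[ghost_next] = page;
        ghost_next = (ghost_next + 1) % GHOST_SIZE;
    }

    static bool
    ghost_contains(int page)
    {
        for (int i = 0; i < GHOST_SIZE; i++)
        {
            if (ghost[i] == page)
                return true;
        }
        return false;
    }

    /* On a miss, a ghost hit is the signal an adaptive policy uses. */
    static void
    on_cache_miss(int page)
    {
        if (ghost_contains(page))
            ghost_hits++;       /* real ARC adjusts its target here */
    }

    int
    main(void)
    {
        for (int i = 0; i < GHOST_SIZE; i++)
            ghost[i] = -1;

        remember_eviction(42);  /* page 42 is evicted... */
        on_cache_miss(42);      /* ...and requested again soon after */
        printf("ghost hits: %d\n", ghost_hits);
        return 0;
    }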

With smaller shared_buffers, it's perhaps inevitable that buffer cache
misses are random, and so I'd expect that managing the problem of
contention will tend to matter most. With larger shared_buffers it
isn't inevitable at all, so the quality of the cache eviction scheme
is likely to matter quite a bit more.

--
Peter Geoghegan



Re: Direct I/O

From
Thomas Munro
Date:
On Wed, Aug 23, 2023 at 12:15 AM Peter Eisentraut <peter@eisentraut.org> wrote:
> I suggest to also rename the hook functions (check and assign), like in
> the attached patch.  Mainly because utils/guc_hooks.h says to order the
> functions by GUC variable name, which was already wrong under the old
> name, but it would be pretty confusing to sort the functions by their
> GUC name that doesn't match the function names.

OK.  I'll push this tomorrow unless you do it while I'm asleep.  Thanks!