Thread: macOS prefetching support
It seems to me that we could implement prefetching support (USE_PREFETCH) on macOS using the fcntl() command F_RDADVISE. The man page description is a bit terse: F_RDADVISE Issue an advisory read async with no copy to user. But it seems to be the right idea. Was this looked into before? I couldn't find anything in the archives. Attached is a patch to implement this. It seems to work, but of course it's kind of hard to tell whether it actually does anything useful. (Even if the performance effects were negligible, this would be useful to get the prefetch code some exercise on this platform.) Thoughts?
Attachment
On Wed, Aug 14, 2024 at 7:04 PM Peter Eisentraut <peter@eisentraut.org> wrote: > Attached is a patch to implement this. It seems to work, but of course > it's kind of hard to tell whether it actually does anything useful. Header order problem: pg_config_os.h defines __darwin__, but pg_config_manual.h is included first, and tests __darwin__. I hacked my way around that, and then made a table of 40,000,000 integers in a 2GB buffer pool. I used "select count(pg_buffercache_evict(buffered)) from pg_buffer_cache", and "sudo purge", to clear the two layers of cache for each test, and then measured: maintenance_io_concurrency=0, ANALYZE: 2311ms maintenance_io_concurrency=10, ANALYZE: 652ms maintenance_io_concurrency=25, ANALYZE: 389ms It works!
On 14.08.24 14:36, Thomas Munro wrote: > On Wed, Aug 14, 2024 at 7:04 PM Peter Eisentraut <peter@eisentraut.org> wrote: >> Attached is a patch to implement this. It seems to work, but of course >> it's kind of hard to tell whether it actually does anything useful. > > Header order problem: pg_config_os.h defines __darwin__, but > pg_config_manual.h is included first, and tests __darwin__. I hacked > my way around that, and then made a table of 40,000,000 integers in a > 2GB buffer pool. I used "select count(pg_buffercache_evict(buffered)) > from pg_buffer_cache", and "sudo purge", to clear the two layers of > cache for each test, and then measured: > > maintenance_io_concurrency=0, ANALYZE: 2311ms > maintenance_io_concurrency=10, ANALYZE: 652ms > maintenance_io_concurrency=25, ANALYZE: 389ms > > It works! Cool! I'll work on a more polished patch.
On Sat, Aug 17, 2024 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote: > What to do about the order of the symbols and include files. I threw > something into src/include/port/darwin.h, but I'm not sure if that's > good. Alternatively, we could not use __darwin__ but instead the more > standard and predefined defined(__APPLE__) && defined(__MACH__). Hmm. fd.h and fd.c test for F_NOCACHE, which is pretty closely related. Now I'm wondering why we actually need this in pg_config_manual.h at all. Who would turn it off at compile time, and why would they not be satisfied with setting relevant GUCs to 0? Can we just teach fd.h to define USE_PREFETCH if defined(POSIX_FADV_WILLNEED) || defined(F_RDADVISE)? (I have also thought multiple times about removing the configure probes for F_FULLFSYNC, and just doing #ifdef. Oh, that's in my patch for CF #4453.) > How to document it. The current documentation makes references mainly > to the availability of posix_fadvise(). That seems quite low-level. > How could a user of a prepared package even find out about that? Should > we just say "requires OS support" (kind of like I did here) and you can > query the effective state by looking at the *_io_concurrency settings? > Or do we need a read-only parameter that shows whether prefetch support > exists (kind of along the lines of huge pages)? I think that's fine. I don't really like the word "prefetch", could mean many different things. What about "requires OS support for issuing read-ahead advice", which uses a word that appears in both of those interfaces? And yeah the GUCs being nailed to zero means you don't have it. > (There is also the possibility that we provide an implementation of > posix_fadvise() for macOS that wraps the platform-specific code in this > patch. And then Apple could just take that. ;-) ) Yeah might be simpler... as long as we make sure that we test for having the function AND the relevant POSIX_FADV_xxx in places, which I see that we do.
Thomas Munro <thomas.munro@gmail.com> writes: > Hmm. fd.h and fd.c test for F_NOCACHE, which is pretty closely > related. Now I'm wondering why we actually need this in > pg_config_manual.h at all. Who would turn it off at compile time, and > why would they not be satisfied with setting relevant GUCs to 0? +1 for not bothering with a pg_config_manual.h knob. regards, tom lane
On Sat, Aug 17, 2024 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote: > solaris fake I'm half tempted to suggest that we take this exception out. If it's there, we call it. It doesn't do anything right now, but it's a cheap empty user space function, and I heard they are thinking about adding a real syscall. (In a burst of enthusiasm for standards and stuff, I talked to people involved in all the OSes using ZFS, to suggest hooking that up; they added some other stuff I asked about so I think it's a real threat.) But I haven't noticed any users on this list in many years, to opine either way. It doesn't do anything on Cygwin either. Actually, it's worse than nothing, it looks like it makes two useless system calls adjusting the "sequential" flag on the file handle. But really, of the 3 ways to build on Windows, only MSVC has real users, so it makes no useless system calls at all, so I don't propose to change anything due to this observation. > win32 no I guess you could implement this in two ways: * CreateFileMapping(), MapViewOfFile(), PrefetchVirtualMemory(), UnmapViewOfFile(), Closehandle(). That's a lot system calls and maybe expensive VM stuff, who knows. * ReadFileEx() in OVERLAPPED mode (async), into a dummy buffer, with a completion callback that frees the buffer and OVERLAPPED object. That's a lot of extra work allocating memory, copying to user space, scheduling a callback, and freeing memory for nothing. Both are terrible, but likely still better than an I/O stall, I dunno. I think the VMS, NT, Unix-hater view would be: why would you want such a stupid programming interface anyway, when you could use async I/O properly? I love Unix, but they'd have a point.
On 17.08.24 00:01, Thomas Munro wrote: > On Sat, Aug 17, 2024 at 6:58 AM Peter Eisentraut <peter@eisentraut.org> wrote: >> What to do about the order of the symbols and include files. I threw >> something into src/include/port/darwin.h, but I'm not sure if that's >> good. Alternatively, we could not use __darwin__ but instead the more >> standard and predefined defined(__APPLE__) && defined(__MACH__). > > Hmm. fd.h and fd.c test for F_NOCACHE, which is pretty closely > related. Now I'm wondering why we actually need this in > pg_config_manual.h at all. Who would turn it off at compile time, and > why would they not be satisfied with setting relevant GUCs to 0? Can > we just teach fd.h to define USE_PREFETCH if > defined(POSIX_FADV_WILLNEED) || defined(F_RDADVISE)? I thought USE_PREFETCH existed so that we don't have the run-time overhead for all the bookkeeping code if we don't have any OS-level prefetch support at the end. But it looks like most of that bookkeeping code is skipped anyway if the *_io_concurrency settings are at 0. So yes, getting rid of USE_PREFETCH globally would be useful. > (I have also thought multiple times about removing the configure > probes for F_FULLFSYNC, and just doing #ifdef. Oh, that's in my patch > for CF #4453.) Understandable, but we should be careful here that we don't create setups that can cause bugs like <https://www.postgresql.org/message-id/48da4a1f-ccd9-4988-9622-24f37b1de2b4@eisentraut.org>. > I think that's fine. I don't really like the word "prefetch", could > mean many different things. What about "requires OS support for > issuing read-ahead advice", which uses a word that appears in both of > those interfaces? I like that term.
On Sat, Aug 24, 2024 at 12:28 AM Peter Eisentraut <peter@eisentraut.org> wrote: > In terms of $subject, this patch seems sufficient for now. WFM. I noticed you don't have an EINTR retry loop, but the man page doesn't say you need one, so overall this patch LGTM. + * posix_fadvise() is the simplest standardized interface that accomplishes + * this. We could add an implementation using libaio in the future; but note + * that this API is inappropriate for libaio, which wants to have a buffer + * provided to read into. I would consider just dropping that comment about libaio, if touching this paragraph. Proposals exist for AIO of course, but not with libaio, and better predictions with useful discussion of the topic seem unlikely to fit in this margin.
On 26.08.24 07:54, Thomas Munro wrote: > On Sat, Aug 24, 2024 at 12:28 AM Peter Eisentraut <peter@eisentraut.org> wrote: >> In terms of $subject, this patch seems sufficient for now. > > WFM. I noticed you don't have an EINTR retry loop, but the man page > doesn't say you need one, so overall this patch LGTM. > > + * posix_fadvise() is the simplest standardized interface that accomplishes > + * this. We could add an implementation using libaio in the future; but note > + * that this API is inappropriate for libaio, which wants to have a buffer > + * provided to read into. > > I would consider just dropping that comment about libaio, if touching > this paragraph. Proposals exist for AIO of course, but not with > libaio, and better predictions with useful discussion of the topic > seem unlikely to fit in this margin. committed with that change
Oh, I missed something: I think we're missing FileAcces(), in case the vfd has to be re-opened, no?
On 29.08.24 03:22, Thomas Munro wrote: > Oh, I missed something: I think we're missing FileAcces(), in case the > vfd has to be re-opened, no? Fixed, thanks.
On 03.09.24 03:47, Thomas Munro wrote: > On Mon, Aug 19, 2024 at 1:35 AM Peter Eisentraut <peter@eisentraut.org> wrote: >> On 17.08.24 00:01, Thomas Munro wrote: >>> I think that's fine. I don't really like the word "prefetch", could >>> mean many different things. What about "requires OS support for >>> issuing read-ahead advice", which uses a word that appears in both of >>> those interfaces? >> >> I like that term. > > A couple of other places still use the old specific terminology. PSA. looks good to me