Thread: Way to check whether a particular block is on the shared_buffer?
Hello,

Do we have a reliable way to check whether a particular heap block is already on the shared buffer, without modifying it? Right now, ReadBuffer and ReadBufferExtended are the entry points of the buffer manager for extensions. However, they try to acquire an available buffer page, evicting a victim buffer if necessary, regardless of the ReadBufferMode. That is different from what I want to do:

1. Check whether the supplied BlockNum is already loaded on the shared buffer.
2. If yes, the caller gets the buffer descriptor as with the usual ReadBuffer.
3. If not, the caller gets InvalidBuffer, with no modification of the shared buffer and no victim buffer eviction.

It would allow extensions (likely a custom scan provider) to take different strategies for a large table's scan, according to the latest status of individual blocks. If we don't have such an interface, it seems to me that an enhancement of ReadBuffer_common and (Local)BufferAlloc is the only way to implement the feature. Of course, we need careful investigation of the definition of a 'valid' buffer page. How about a buffer with BM_IO_IN_PROGRESS? How about a buffer that needs storage extension (thus no relevant physical storage exists yet)? ... and so on.

As an aside, the background of my motivation is the slide below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides at the JPUG conference last Dec)

I'm investigating an SSD-to-GPU direct feature on top of the custom-scan interface. It intends to load a bunch of data blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data loading onto CPU/RAM, to preprocess the data to be filtered out. It only makes sense if the target blocks are not loaded to the CPU/RAM yet, because an SSD device is essentially slower than RAM. So, I'd like to have a reliable way to check the latest status of the shared buffer, to know whether a particular block is already loaded or not.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
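The three-step behavior being requested can be illustrated with a tiny self-contained mock (the names here, such as lookup_block_readonly and BufEntry, are hypothetical and not PostgreSQL's actual API): the lookup either returns the id of an already-loaded buffer, or a sentinel InvalidBuffer value, and in neither case does it allocate or evict anything.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_BUFFER (-1)

/* toy stand-in for the shared buffer mapping table */
typedef struct
{
    int blocknum;   /* block number of the cached page */
    int buf_id;     /* id of the buffer pool slot holding it */
} BufEntry;

/* Return the buffer id if blocknum is already cached, else
 * INVALID_BUFFER.  The table is never modified and no victim buffer
 * is evicted -- the read-only semantics the proposal asks for. */
static int
lookup_block_readonly(const BufEntry *table, size_t n, int blocknum)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].blocknum == blocknum)
            return table[i].buf_id;
    return INVALID_BUFFER;
}
```

In the real buffer manager the mapping table is a partitioned shared hash, so the equivalent lookup would go through BufTableLookup() under the matching BufMappingPartitionLock, as discussed later in the thread.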
On 1/31/16 7:38 PM, Kouhei Kaigai wrote: > I'm under investigation of SSD-to-GPU direct feature on top of > the custom-scan interface. It intends to load a bunch of data > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data > loading onto CPU/RAM, to preprocess the data to be filtered out. > It only makes sense if the target blocks are not loaded to the > CPU/RAM yet, because SSD device is essentially slower than RAM. > So, I like to have a reliable way to check the latest status of > the shared buffer, to kwon whether a particular block is already > loaded or not. That completely ignores the OS cache though... wouldn't that be a major issue? To answer your direct question, I'm no expert, but I haven't seen any functions that do exactly what you want. You'd have to pull relevant bits from ReadBuffer_*. Or maybe a better method would just be to call BufTableLookup() without any locks and if you get a result > -1 just call the relevant ReadBuffer function. Sometimes you'll end up calling ReadBuffer even though the buffer isn't in shared buffers, but I would think that would be a rare occurrence. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
> On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
> > I'm under investigation of SSD-to-GPU direct feature on top of
> > the custom-scan interface. It intends to load a bunch of data
> > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > It only makes sense if the target blocks are not loaded to the
> > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > So, I like to have a reliable way to check the latest status of
> > the shared buffer, to kwon whether a particular block is already
> > loaded or not.
>
> That completely ignores the OS cache though... wouldn't that be a major
> issue?
>
Once we can ensure the target block is not cached in the shared buffer, it is the job of the driver that supports P2P DMA to handle the OS page cache. Once the driver gets a P2P DMA request from PostgreSQL, it checks the OS page cache status and determines the DMA source: either the OS buffer or the SSD block.

> To answer your direct question, I'm no expert, but I haven't seen any
> functions that do exactly what you want. You'd have to pull relevant
> bits from ReadBuffer_*. Or maybe a better method would just be to call
> BufTableLookup() without any locks and if you get a result > -1 just
> call the relevant ReadBuffer function. Sometimes you'll end up calling
> ReadBuffer even though the buffer isn't in shared buffers, but I would
> think that would be a rare occurrence.
>
Thanks; indeed, an extension can call BufTableLookup(). PrefetchBuffer() has a good example of this.

If it returns a valid buf_id, we have nothing difficult; just call ReadBuffer() to pin the buffer. Otherwise, when BufTableLookup() returns a negative value, it means the pair of (relation, forknum, blocknum) does not exist on the shared buffer, so the extension enqueues a P2P DMA request for asynchronous execution, and the driver processes the P2P DMA shortly afterwards. Concurrent access may always happen.
PostgreSQL uses MVCC, so the backend which issued the P2P DMA does not need to pay attention to new tuples that didn't exist at executor start time, even if another backend loads and updates the same buffer just after the above BufTableLookup(). On the other hand, we have to pay attention to whether a fraction of the buffer page gets partially written to the OS buffer or storage. That is in the scope of the operating system, so it is not controllable by us.

One idea I can come up with is a temporary suspension of FlushBuffer() for particular pairs of (relation, forknum, blocknum) until the P2P DMA gets completed. Even if a concurrent backend updates the buffer page after the BufTableLookup(), this prevents the OS cache and storage from getting dirty during the P2P DMA.

What are people's thoughts?
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
KaiGai-san,

On 2016/02/01 10:38, Kouhei Kaigai wrote:
> As an aside, background of my motivation is the slide below:
> http://www.slideshare.net/kaigai/sqlgpussd-english
> (LT slides in JPUG conference last Dec)
>
> I'm under investigation of SSD-to-GPU direct feature on top of
> the custom-scan interface. It intends to load a bunch of data
> blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> loading onto CPU/RAM, to preprocess the data to be filtered out.
> It only makes sense if the target blocks are not loaded to the
> CPU/RAM yet, because SSD device is essentially slower than RAM.
> So, I like to have a reliable way to check the latest status of
> the shared buffer, to kwon whether a particular block is already
> loaded or not.

Quite interesting stuff, thanks for sharing!

I'm in no way an expert on this, but could this generally be attacked from the smgr API perspective? Currently, we have only one implementation - md.c (the hard-coded RelationData.smgr_which = 0). If we extended that and provided end-to-end support so that there would be md.c alternatives to storage operations, I guess that would open up opportunities for extensions to specify smgr_which as an argument to ReadBufferExtended(), provided there is already support in place to install md.c alternatives (perhaps in a .so). Of course, these are just musings and perhaps do not really concern the requirements of the custom scan methods you have been developing.

Thanks,
Amit
> KaiGai-san,
>
> On 2016/02/01 10:38, Kouhei Kaigai wrote:
> > As an aside, background of my motivation is the slide below:
> > http://www.slideshare.net/kaigai/sqlgpussd-english
> > (LT slides in JPUG conference last Dec)
> >
> > I'm under investigation of SSD-to-GPU direct feature on top of
> > the custom-scan interface. It intends to load a bunch of data
> > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > It only makes sense if the target blocks are not loaded to the
> > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > So, I like to have a reliable way to check the latest status of
> > the shared buffer, to kwon whether a particular block is already
> > loaded or not.
>
> Quite interesting stuff, thanks for sharing!
>
> I'm in no way expert on this but could this generally be attacked from the
> smgr API perspective? Currently, we have only one implementation - md.c
> (the hard-coded RelationData.smgr_which = 0). If we extended that and
> provided end-to-end support so that there would be md.c alternatives to
> storage operations, I guess that would open up opportunities for
> extensions to specify smgr_which as an argument to ReadBufferExtended(),
> provided there is already support in place to install md.c alternatives
> (perhaps in .so). Of course, these are just musings and, perhaps does not
> really concern the requirements of custom scan methods you have been
> developing.
>
Thanks for your idea. Indeed, smgr hooks are a good candidate to implement the feature; however, what I need is a thin intermediation layer rather than an alternative storage engine.

It becomes clear we need two features here.

1. A feature to check whether a particular block is already on the shared buffer pool.
   It is available: BufTableLookup() under the BufMappingPartitionLock gives us the information we want.

2.
A feature to suspend i/o write-out towards particular blocks that are registered by another concurrent backend, until they are unregistered (usually at the end of the P2P DMA).
   ==> to be discussed.

When we call smgrwrite(), like FlushBuffer() does, it fetches the function pointer from the 'smgrsw' array, then calls smgr_write.

  void
  smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
            char *buffer, bool skipFsync)
  {
      (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
                                                buffer, skipFsync);
  }

If an extension overwrote the smgrsw[] array, then called the original function under its own control, it could suspend the call of the original smgr_write until completion of the P2P DMA. It may be a minimally invasive way to implement this, and portable to any further storage layers.

What is your thought? It is a bit different from your original proposition, though.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
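The interception pattern described here -- save the original smgr_write pointer, install a wrapper, and delegate after a check -- can be sketched with a self-contained mock. None of these names are PostgreSQL's actual API: md_write stands in for mdwrite, the table holds only one toy entry point, and instead of truly blocking the caller the wrapper just counts deferred writes so the sketch stays testable (a real extension would wait until the block is unregistered, never drop the write).

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for PostgreSQL's f_smgr dispatch table:
 * only the write entry point, with toy arguments. */
typedef struct
{
    void (*smgr_write) (int blocknum, int value);
} f_smgr;

static int  storage[16];            /* toy "disk" */
static int  writes_deferred = 0;    /* writes held back by the wrapper */
static bool block_suspended[16];    /* per-block suspend flag */

static void
md_write(int blocknum, int value)
{
    storage[blocknum] = value;
}

static f_smgr smgrsw[] = {{md_write}};

/* the original handler, saved before installing the wrapper */
static void (*orig_write) (int blocknum, int value);

static void
wrapped_write(int blocknum, int value)
{
    if (block_suspended[blocknum])
    {
        /* a real extension would block here until the P2P DMA
         * completes, then perform the write; we only count it */
        writes_deferred++;
        return;
    }
    orig_write(blocknum, value);    /* chain to the original */
}

static void
install_wrapper(void)
{
    orig_write = smgrsw[0].smgr_write;
    smgrsw[0].smgr_write = wrapped_write;
}
```

Callers that go through the table (as FlushBuffer does via smgrwrite) are intercepted transparently, which is the point of overwriting smgrsw[] rather than patching every call site.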
Kouhei Kaigai wrote: > > On 1/31/16 7:38 PM, Kouhei Kaigai wrote: > > To answer your direct question, I'm no expert, but I haven't seen any > > functions that do exactly what you want. You'd have to pull relevant > > bits from ReadBuffer_*. Or maybe a better method would just be to call > > BufTableLookup() without any locks and if you get a result > -1 just > > call the relevant ReadBuffer function. Sometimes you'll end up calling > > ReadBuffer even though the buffer isn't in shared buffers, but I would > > think that would be a rare occurrence. > > > Thanks, indeed, extension can call BufTableLookup(). PrefetchBuffer() > has a good example for this. > > If it returned a valid buf_id, we have nothing difficult; just call > ReadBuffer() to pin the buffer. Isn't this what (or very similar to) ReadBufferExtended(RBM_ZERO_AND_LOCK) is already doing? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> > > On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
> >
> > > To answer your direct question, I'm no expert, but I haven't seen any
> > > functions that do exactly what you want. You'd have to pull relevant
> > > bits from ReadBuffer_*. Or maybe a better method would just be to call
> > > BufTableLookup() without any locks and if you get a result > -1 just
> > > call the relevant ReadBuffer function. Sometimes you'll end up calling
> > > ReadBuffer even though the buffer isn't in shared buffers, but I would
> > > think that would be a rare occurrence.
> >
> > Thanks, indeed, extension can call BufTableLookup(). PrefetchBuffer()
> > has a good example for this.
> >
> > If it returned a valid buf_id, we have nothing difficult; just call
> > ReadBuffer() to pin the buffer.
>
> Isn't this what (or very similar to)
> ReadBufferExtended(RBM_ZERO_AND_LOCK) is already doing?
>
This operation actually acquires a buffer page and fills it with zeroes; a valid buffer page gets evicted if there is no free buffer page. I want to keep the contents of the shared buffers already loaded onto main memory: the P2P DMA and GPU preprocessing intend to minimize main memory consumption by rows that will be filtered out by scan qualifiers anyway.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
> > KaiGai-san, > > > > On 2016/02/01 10:38, Kouhei Kaigai wrote: > > > As an aside, background of my motivation is the slide below: > > > http://www.slideshare.net/kaigai/sqlgpussd-english > > > (LT slides in JPUG conference last Dec) > > > > > > I'm under investigation of SSD-to-GPU direct feature on top of > > > the custom-scan interface. It intends to load a bunch of data > > > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data > > > loading onto CPU/RAM, to preprocess the data to be filtered out. > > > It only makes sense if the target blocks are not loaded to the > > > CPU/RAM yet, because SSD device is essentially slower than RAM. > > > So, I like to have a reliable way to check the latest status of > > > the shared buffer, to kwon whether a particular block is already > > > loaded or not. > > > > Quite interesting stuff, thanks for sharing! > > > > I'm in no way expert on this but could this generally be attacked from the > > smgr API perspective? Currently, we have only one implementation - md.c > > (the hard-coded RelationData.smgr_which = 0). If we extended that and > > provided end-to-end support so that there would be md.c alternatives to > > storage operations, I guess that would open up opportunities for > > extensions to specify smgr_which as an argument to ReadBufferExtended(), > > provided there is already support in place to install md.c alternatives > > (perhaps in .so). Of course, these are just musings and, perhaps does not > > really concern the requirements of custom scan methods you have been > > developing. > > > Thanks for your idea. Indeed, smgr hooks are good candidate to implement > the feature, however, what I need is a thin intermediation layer rather > than alternative storage engine. > > It becomes clear we need two features here. > 1. A feature to check whether a particular block is already on the shared > buffer pool. > It is available. 
BufTableLookup() under the BufMappingPartitionLock
> gives us the information we want.
>
> 2. A feature to suspend i/o write-out towards a particular blocks
>    that are registered by other concurrent backend, unless it is not
>    unregistered (usually, at the end of P2P DMA).
>    ==> to be discussed.
>
> When we call smgrwrite(), like FlushBuffer(), it fetches function pointer
> from the 'smgrsw' array, then calls smgr_write.
>
>   void
>   smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>             char *buffer, bool skipFsync)
>   {
>       (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
>                                                 buffer, skipFsync);
>   }
>
> If extension would overwrite smgrsw[] array, then call the original
> function under the control by extension, it allows to suspend the call
> of the original smgr_write until completion of P2P DMA.
>
> It may be a minimum invasive way to implement, and portable to any
> further storage layers.
>
> How about your thought? Even though it is a bit different from your
> original proposition.
>
I tried to design a draft of an enhancement to realize the above i/o write-out suspend/resume, in a way as minimally invasive as possible.

ASSUMPTION: I intend to implement this feature as part of an extension, because these i/o suspend/resume checks are pure overhead for the core features unless an extension utilizes them.

Three functions shall be added:

  extern int    GetStorageMgrNumbers(void);
  extern f_smgr GetStorageMgrHandlers(int smgr_which);
  extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);

As the names suggest, GetStorageMgrNumbers() returns the number of storage managers currently installed; it always returns 1 right now. GetStorageMgrHandlers() returns the currently configured f_smgr table for the supplied smgr_which. It allows extensions to know the current configuration of the storage manager, even if another extension has already modified it. SetStorageMgrHandlers() installs the supplied 'smgr_handlers' in place of the current one.
If an extension wants to intermediate 'smgr_write', it replaces 'smgr_write' with its own function, then calls the original function, likely mdwrite, from the alternative function.

In this case, the call chain shall be:

  FlushBuffer, and others...
   +-- smgrwrite(...)
        +-- (extension's own function)
             +-- mdwrite

Once the extension's own function blocks write i/o until the P2P DMA is completed by the concurrent process, we don't need to care about partial updates of the OS cache or the storage device. It is not difficult for extensions to implement a feature to track/untrack a pair of (relFileNode, forkNum, blockNum), automatic untracking according to the resource owner, and a mechanism to block the caller until P2P DMA completion.

On the other hand, its flexibility seems to me a bit larger than necessary (what I want to implement is just a blocker of buffer write i/o), and it may give people a wrong impression about the feature of pluggable storage.

What are folks' thoughts?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
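The three-function proposal can be mocked up end to end in a few lines. This is a self-contained sketch under the assumptions stated in the message (one built-in storage manager, an f_smgr struct of function pointers); none of it is existing PostgreSQL code, and mdwrite_mock only simulates mdwrite. The extension side shows the save-then-chain pattern that lets several extensions stack on the same slot:

```c
#include <assert.h>

/* simplified f_smgr: only the write entry point, with toy arguments */
typedef struct
{
    void (*smgr_write) (int blocknum, int value);
} f_smgr;

static int storage[16];            /* toy "disk" */

static void
mdwrite_mock(int blocknum, int value)
{
    storage[blocknum] = value;
}

/* core-side handler table: one built-in storage manager (md) */
static f_smgr smgrsw[1] = {{mdwrite_mock}};

int
GetStorageMgrNumbers(void)
{
    return 1;                      /* only md.c exists today */
}

f_smgr
GetStorageMgrHandlers(int smgr_which)
{
    return smgrsw[smgr_which];
}

void
SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers)
{
    smgrsw[smgr_which] = smgr_handlers;
}

/* --- extension side: wrap smgr_write, chaining to whatever was there --- */

static f_smgr saved;               /* handlers as configured before us */
static int    intercepted = 0;     /* how many writes we observed */

static void
extension_write(int blocknum, int value)
{
    intercepted++;                 /* a real extension would suspend here
                                    * while a P2P DMA is in flight */
    saved.smgr_write(blocknum, value);  /* call the original (mdwrite) */
}

void
install_extension(void)
{
    f_smgr mine;

    saved = GetStorageMgrHandlers(0);   /* respect prior modifications */
    mine = saved;
    mine.smgr_write = extension_write;
    SetStorageMgrHandlers(0, mine);
}
```

Because each extension reads the current table with GetStorageMgrHandlers() before replacing it, a second extension installed later would wrap extension_write in turn, which addresses the "even if another extension already modified it" requirement.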
On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
>> 2. A feature to suspend i/o write-out towards a particular blocks
>>    that are registered by other concurrent backend, unless it is not
>>    unregistered (usually, at the end of P2P DMA).
>>    ==> to be discussed.

I think there's still a race condition here though...

A
  finds buffer not in shared buffers

B
  reads buffer in
  modifies buffer
  starts writing buffer to OS

A
  Makes call to block write, but write is already in process; thinks
  writes are now blocked
  Reads corrupted block
  Much hilarity ensues

Or maybe you were just glossing over that part for brevity.

...

> I tried to design a draft of enhancement to realize the above i/o write-out
> suspend/resume, with less invasive way as possible as we can.
>
> ASSUMPTION: I intend to implement this feature as a part of extension,
> because this i/o suspend/resume checks are pure overhead increment
> for the core features, unless extension which utilizes it.
>
> Three functions shall be added:
>
> extern int GetStorageMgrNumbers(void);
> extern f_smgr GetStorageMgrHandlers(int smgr_which);
> extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
>
> As literal, GetStorageMgrNumbers() returns the number of storage manager
> currently installed. It always return 1 right now.
> GetStorageMgrHandlers() returns the currently configured f_smgr table to
> the supplied smgr_which. It allows extensions to know current configuration
> of the storage manager, even if other extension already modified it.
> SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> the current one.
> If extension wants to intermediate 'smgr_write', extension will replace
> the 'smgr_write' by own function, then call the original function, likely
> mdwrite, from the alternative function.
>
> In this case, call chain shall be:
>
> FlushBuffer, and others...
>  +-- smgrwrite(...)
>       +-- (extension's own function)
>            +-- mdwrite

ISTR someone (Robert Haas?)
complaining that this method of hooks is cumbersome to use and can be fragile if multiple hooks are being installed, so maybe we don't want to extend its usage...

I'm also not sure whether this is better done with an smgr hook or a hook into shared buffer handling...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
> On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
> >> 2. A feature to suspend i/o write-out towards a particular blocks
> >>    that are registered by other concurrent backend, unless it is not
> >>    unregistered (usually, at the end of P2P DMA).
> >>    ==> to be discussed.
>
> I think there's still a race condition here though...
>
> A
>   finds buffer not in shared buffers
>
> B
>   reads buffer in
>   modifies buffer
>   starts writing buffer to OS
>
> A
>   Makes call to block write, but write is already in process; thinks
>   writes are now blocked
>   Reads corrupted block
>   Much hilarity ensues
>
> Or maybe you were just glossing over that part for brevity.
>
Thanks, this part was not clear from my previous description.

At the time when B starts writing the buffer to the OS, the extension catches this i/o request using a hook around smgrwrite, and the mechanism registers the block to block P2P DMA requests during B's write operation. (Of course, it unregisters the block at the end of smgrwrite.) So, even if A wants to issue a P2P DMA concurrently, it cannot register the block until B's write operation finishes.

In practice, this operation shall be a "try lock", because B's write operation implies the existence of the buffer in main memory, so A does not need to wait for B's write operation if A switches its DMA source from SSD to main memory.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> ...
>
> > I tried to design a draft of enhancement to realize the above i/o write-out
> > suspend/resume, with less invasive way as possible as we can.
> >
> > ASSUMPTION: I intend to implement this feature as a part of extension,
> > because this i/o suspend/resume checks are pure overhead increment
> > for the core features, unless extension which utilizes it.
> > > > Three functions shall be added: > > > > extern int GetStorageMgrNumbers(void); > > extern f_smgr GetStorageMgrHandlers(int smgr_which); > > extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers); > > > > As literal, GetStorageMgrNumbers() returns the number of storage manager > > currently installed. It always return 1 right now. > > GetStorageMgrHandlers() returns the currently configured f_smgr table to > > the supplied smgr_which. It allows extensions to know current configuration > > of the storage manager, even if other extension already modified it. > > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of > > the current one. > > If extension wants to intermediate 'smgr_write', extension will replace > > the 'smgr_write' by own function, then call the original function, likely > > mdwrite, from the alternative function. > > > > In this case, call chain shall be: > > > > FlushBuffer, and others... > > +-- smgrwrite(...) > > +-- (extension's own function) > > +-- mdwrite > > ISTR someone (Robert Haas?) complaining that this method of hooks is > cumbersome to use and can be fragile if multiple hooks are being > installed. So maybe we don't want to extend it's usage... > > I'm also not sure whether this is better done with an smgr hook or a > hook into shared buffer handling... > -- > Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX > Experts in Analytics, Data Architecture and PostgreSQL > Data in Trouble? Get it in Treble! http://BlueTreble.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
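The "try lock" behavior KaiGai describes above can be sketched with a self-contained mock. All names are hypothetical and there is no real synchronization; a production version would key a shared hash by (relation, forknum, blocknum) and protect it with a lock. The writer registers a block for the duration of its write; a concurrent P2P DMA request try-registers the same block and, on failure, switches its source from SSD to main memory instead of waiting.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 16

/* toy registry of blocks with an i/o operation in flight */
static bool in_flight[MAX_BLOCKS];

/* Try to claim a block.  Returns false if someone else holds it,
 * in which case the caller must fall back rather than proceed. */
static bool
try_register_block(int block)
{
    if (in_flight[block])
        return false;
    in_flight[block] = true;
    return true;
}

static void
unregister_block(int block)
{
    in_flight[block] = false;
}

/* DMA source chosen by the try-lock outcome */
typedef enum { DMA_FROM_SSD, DMA_FROM_RAM } DmaSource;

static DmaSource
choose_dma_source(int block)
{
    if (try_register_block(block))
        return DMA_FROM_SSD;   /* no write in flight: safe to read the SSD */
    return DMA_FROM_RAM;       /* write in progress: the buffer is in RAM anyway */
}
```

Note the fallback is always safe here precisely because of the argument in the message: a write in flight implies the page exists in main memory, so RAM is a valid (and faster) DMA source.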
> -----Original Message----- > From: Jim Nasby [mailto:Jim.Nasby@BlueTreble.com] > Sent: Friday, February 05, 2016 9:17 AM > To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org; Robert Haas > Cc: Amit Langote > Subject: Re: [HACKERS] Way to check whether a particular block is on the > shared_buffer? > > On 2/4/16 12:30 AM, Kouhei Kaigai wrote: > >> 2. A feature to suspend i/o write-out towards a particular blocks > >> > that are registered by other concurrent backend, unless it is not > >> > unregistered (usually, at the end of P2P DMA). > >> > ==> to be discussed. > > I think there's still a race condition here though... > > A > finds buffer not in shared buffers > > B > reads buffer in > modifies buffer > starts writing buffer to OS > > A > Makes call to block write, but write is already in process; thinks > writes are now blocked > Reads corrupted block > Much hilarity ensues > > Or maybe you were just glossing over that part for brevity. > > ... > > > I tried to design a draft of enhancement to realize the above i/o write-out > > suspend/resume, with less invasive way as possible as we can. > > > > ASSUMPTION: I intend to implement this feature as a part of extension, > > because this i/o suspend/resume checks are pure overhead increment > > for the core features, unless extension which utilizes it. > > > > Three functions shall be added: > > > > extern int GetStorageMgrNumbers(void); > > extern f_smgr GetStorageMgrHandlers(int smgr_which); > > extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers); > > > > As literal, GetStorageMgrNumbers() returns the number of storage manager > > currently installed. It always return 1 right now. > > GetStorageMgrHandlers() returns the currently configured f_smgr table to > > the supplied smgr_which. It allows extensions to know current configuration > > of the storage manager, even if other extension already modified it. 
> > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> > the current one.
> > If extension wants to intermediate 'smgr_write', extension will replace
> > the 'smgr_write' by own function, then call the original function, likely
> > mdwrite, from the alternative function.
> >
> > In this case, call chain shall be:
> >
> > FlushBuffer, and others...
> >  +-- smgrwrite(...)
> >       +-- (extension's own function)
> >            +-- mdwrite
>
> ISTR someone (Robert Haas?) complaining that this method of hooks is
> cumbersome to use and can be fragile if multiple hooks are being
> installed. So maybe we don't want to extend it's usage...
>
> I'm also not sure whether this is better done with an smgr hook or a
> hook into shared buffer handling...
>
# Sorry, I overlooked the latter part of your reply.

I can agree that smgr hooks shall be primarily designed to make storage systems pluggable, even if we can use these hooks for the suspend & resume of write i/o. In addition, "pluggable storage" is a long-standing feature, and it is not certain whether the existing smgr hooks are a good starting point for it. It would be a risk to implement a grand feature on top of the hooks but outside their primary purpose.

So, my preference is a mechanism to hook buffer writes to implement this feature. (Or maybe a built-in write i/o suspend/resume facility, if it has nearly zero cost when no extension activates the feature.) One downside of this approach is the larger number of hook points: we have to deploy the hook near the existing smgrwrite calls in LocalBufferAlloc and FlushRelationBuffers, in addition to FlushBuffer, at least.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Thu, Feb 4, 2016 at 11:34 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I can agree that smgr hooks shall be primarily designed to make storage
> systems pluggable, even if we can use this hooks for suspend & resume of
> write i/o stuff.
> In addition, "pluggable storage" is a long-standing feature, even though
> it is not certain whether existing smgr hooks are good starting point.
> It may be a risk if we implement a grand feature on top of the hooks
> but out of its primary purpose.
>
> So, my preference is a mechanism to hook buffer write to implement this
> feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
> has nearly zero cost when no extension activate the feature.)
> One downside of this approach is larger number of hook points.
> We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
> and FlushRelationBuffers, in addition to FlushBuffer, at least.

I don't understand what you're hoping to achieve by introducing pluggability at the smgr layer. I mean, md.c is pretty much good for reading and writing from anything that looks like a directory of files. Another smgr layer would make sense if we wanted to read and write via some kind of network protocol, or if we wanted to have some kind of storage substrate that did internally to itself some of the tasks for which we are currently relying on the filesystem - e.g. if we wanted to be able to use a raw device, or perhaps more plausibly if we wanted to reduce the number of separate files we need, or provide a substrate that can clip an unused extent out of the middle of a relation efficiently. But I don't understand what this has to do with what you're trying to do here.

The subject of this thread is about whether you can check for the presence of a block in shared_buffers, and as discussed upthread, you can. I don't quite follow how we made the jump from there to smgr pluggability.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company