Thread: Way to check whether a particular block is on the shared_buffer?
Hello,

Do we have a reliable way to check whether a particular heap block is already on the shared buffer, without modifying it? Right now, ReadBuffer and ReadBufferExtended are the entry points of the buffer manager for extensions. However, they try to acquire an available buffer page, evicting a victim buffer if necessary, regardless of the ReadBufferMode. That is different from what I want to do:

1. Check whether the supplied BlockNum is already loaded on the shared buffer.
2. If yes, the caller gets the buffer descriptor as with the usual ReadBuffer.
3. If not, the caller gets InvalidBuffer, with no modification of the shared buffer and no victim buffer eviction.

It would allow extensions (likely a custom scan provider) to take different strategies for a large table's scan, according to the latest status of individual blocks. If we don't have such an interface, it seems to me that an enhancement of ReadBuffer_common and (Local)BufferAlloc is the only way to implement the feature. Of course, we need careful investigation of the definition of a 'valid' buffer page. How about a buffer with BM_IO_IN_PROGRESS? How about a buffer that needs storage extension (thus no relevant physical storage exists yet)? ... and so on.

As an aside, the background of my motivation is the slide below:
http://www.slideshare.net/kaigai/sqlgpussd-english
(LT slides at the JPUG conference last Dec)

I'm investigating an SSD-to-GPU direct feature on top of the custom-scan interface. It intends to load a bunch of data blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data loading onto CPU/RAM, to preprocess the data to be filtered out. It only makes sense if the target blocks are not loaded to the CPU/RAM yet, because an SSD device is essentially slower than RAM. So, I'd like to have a reliable way to check the latest status of the shared buffer, to know whether a particular block is already loaded or not.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
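The three-step behavior being requested can be illustrated with a tiny self-contained mock (the names here, such as lookup_block_readonly and BufEntry, are hypothetical and not PostgreSQL's actual API): the lookup either returns the id of an already-loaded buffer, or a sentinel InvalidBuffer value, and in neither case does it allocate or evict anything.

```c
#include <assert.h>
#include <stddef.h>

#define INVALID_BUFFER (-1)

/* toy stand-in for the shared buffer mapping table */
typedef struct
{
    int blocknum;   /* block number of the cached page */
    int buf_id;     /* id of the buffer pool slot holding it */
} BufEntry;

/* Return the buffer id if blocknum is already cached, else
 * INVALID_BUFFER.  The table is never modified and no victim buffer
 * is evicted -- the read-only semantics the proposal asks for. */
static int
lookup_block_readonly(const BufEntry *table, size_t n, int blocknum)
{
    for (size_t i = 0; i < n; i++)
        if (table[i].blocknum == blocknum)
            return table[i].buf_id;
    return INVALID_BUFFER;
}
```

In the real buffer manager the mapping table is a partitioned shared hash, so the equivalent lookup would go through BufTableLookup() under the matching BufMappingPartitionLock, as discussed later in the thread.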
On 1/31/16 7:38 PM, Kouhei Kaigai wrote: > I'm under investigation of SSD-to-GPU direct feature on top of > the custom-scan interface. It intends to load a bunch of data > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data > loading onto CPU/RAM, to preprocess the data to be filtered out. > It only makes sense if the target blocks are not loaded to the > CPU/RAM yet, because SSD device is essentially slower than RAM. > So, I like to have a reliable way to check the latest status of > the shared buffer, to kwon whether a particular block is already > loaded or not. That completely ignores the OS cache though... wouldn't that be a major issue? To answer your direct question, I'm no expert, but I haven't seen any functions that do exactly what you want. You'd have to pull relevant bits from ReadBuffer_*. Or maybe a better method would just be to call BufTableLookup() without any locks and if you get a result > -1 just call the relevant ReadBuffer function. Sometimes you'll end up calling ReadBuffer even though the buffer isn't in shared buffers, but I would think that would be a rare occurrence. -- Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX Experts in Analytics, Data Architecture and PostgreSQL Data in Trouble? Get it in Treble! http://BlueTreble.com
> On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
> > I'm under investigation of SSD-to-GPU direct feature on top of
> > the custom-scan interface. It intends to load a bunch of data
> > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > It only makes sense if the target blocks are not loaded to the
> > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > So, I like to have a reliable way to check the latest status of
> > the shared buffer, to kwon whether a particular block is already
> > loaded or not.
>
> That completely ignores the OS cache though... wouldn't that be a major
> issue?
>
Once we can ensure the target block is not cached in the shared buffer, it is the job of the driver that supports P2P DMA to handle the OS page cache. Once the driver gets a P2P DMA request from PostgreSQL, it checks the OS page cache status and determines the DMA source: either the OS buffer or the SSD block.

> To answer your direct question, I'm no expert, but I haven't seen any
> functions that do exactly what you want. You'd have to pull relevant
> bits from ReadBuffer_*. Or maybe a better method would just be to call
> BufTableLookup() without any locks and if you get a result > -1 just
> call the relevant ReadBuffer function. Sometimes you'll end up calling
> ReadBuffer even though the buffer isn't in shared buffers, but I would
> think that would be a rare occurrence.
>
Thanks; indeed, an extension can call BufTableLookup(). PrefetchBuffer() has a good example of this.

If it returns a valid buf_id, we have nothing difficult; just call ReadBuffer() to pin the buffer. Otherwise, when BufTableLookup() returns a negative value, it means the pair of (relation, forknum, blocknum) does not exist on the shared buffer, so the extension enqueues a P2P DMA request for asynchronous execution, and the driver processes the P2P DMA shortly afterwards. Concurrent access may always happen.
PostgreSQL uses MVCC, so the backend which issued the P2P DMA does not need to pay attention to new tuples that didn't exist at executor start time, even if another backend loads and updates the same buffer just after the above BufTableLookup(). On the other hand, we have to pay attention to whether a fraction of the buffer page gets partially written to the OS buffer or storage. That is in the scope of the operating system, so it is not controllable by us.

One idea I can come up with is a temporary suspension of FlushBuffer() for particular pairs of (relation, forknum, blocknum) until the P2P DMA gets completed. Even if a concurrent backend updates the buffer page after the BufTableLookup(), this prevents the OS cache and storage from getting dirty during the P2P DMA.

What are people's thoughts?
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
KaiGai-san,

On 2016/02/01 10:38, Kouhei Kaigai wrote:
> As an aside, background of my motivation is the slide below:
> http://www.slideshare.net/kaigai/sqlgpussd-english
> (LT slides in JPUG conference last Dec)
>
> I'm under investigation of SSD-to-GPU direct feature on top of
> the custom-scan interface. It intends to load a bunch of data
> blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> loading onto CPU/RAM, to preprocess the data to be filtered out.
> It only makes sense if the target blocks are not loaded to the
> CPU/RAM yet, because SSD device is essentially slower than RAM.
> So, I like to have a reliable way to check the latest status of
> the shared buffer, to kwon whether a particular block is already
> loaded or not.

Quite interesting stuff, thanks for sharing!

I'm in no way an expert on this, but could this generally be attacked from the smgr API perspective? Currently, we have only one implementation - md.c (the hard-coded RelationData.smgr_which = 0). If we extended that and provided end-to-end support so that there would be md.c alternatives to storage operations, I guess that would open up opportunities for extensions to specify smgr_which as an argument to ReadBufferExtended(), provided there is already support in place to install md.c alternatives (perhaps in a .so). Of course, these are just musings and perhaps do not really concern the requirements of the custom scan methods you have been developing.

Thanks,
Amit
> KaiGai-san,
>
> On 2016/02/01 10:38, Kouhei Kaigai wrote:
> > As an aside, background of my motivation is the slide below:
> > http://www.slideshare.net/kaigai/sqlgpussd-english
> > (LT slides in JPUG conference last Dec)
> >
> > I'm under investigation of SSD-to-GPU direct feature on top of
> > the custom-scan interface. It intends to load a bunch of data
> > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data
> > loading onto CPU/RAM, to preprocess the data to be filtered out.
> > It only makes sense if the target blocks are not loaded to the
> > CPU/RAM yet, because SSD device is essentially slower than RAM.
> > So, I like to have a reliable way to check the latest status of
> > the shared buffer, to kwon whether a particular block is already
> > loaded or not.
>
> Quite interesting stuff, thanks for sharing!
>
> I'm in no way expert on this but could this generally be attacked from the
> smgr API perspective? Currently, we have only one implementation - md.c
> (the hard-coded RelationData.smgr_which = 0). If we extended that and
> provided end-to-end support so that there would be md.c alternatives to
> storage operations, I guess that would open up opportunities for
> extensions to specify smgr_which as an argument to ReadBufferExtended(),
> provided there is already support in place to install md.c alternatives
> (perhaps in .so). Of course, these are just musings and, perhaps does not
> really concern the requirements of custom scan methods you have been
> developing.
>
Thanks for your idea. Indeed, smgr hooks are a good candidate to implement the feature; however, what I need is a thin intermediation layer rather than an alternative storage engine.

It becomes clear we need two features here.

1. A feature to check whether a particular block is already on the shared buffer pool.
   It is available: BufTableLookup() under the BufMappingPartitionLock gives us the information we want.

2.
A feature to suspend i/o write-out towards particular blocks that are registered by another concurrent backend, until they are unregistered (usually at the end of the P2P DMA).
   ==> to be discussed.

When we call smgrwrite(), like FlushBuffer() does, it fetches the function pointer from the 'smgrsw' array, then calls smgr_write.

  void
  smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
            char *buffer, bool skipFsync)
  {
      (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
                                                buffer, skipFsync);
  }

If an extension overwrote the smgrsw[] array, then called the original function under its own control, it could suspend the call of the original smgr_write until completion of the P2P DMA. It may be a minimally invasive way to implement this, and portable to any further storage layers.

What is your thought? It is a bit different from your original proposition, though.
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
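The interception pattern described here -- save the original smgr_write pointer, install a wrapper, and delegate after a check -- can be sketched with a self-contained mock. None of these names are PostgreSQL's actual API: md_write stands in for mdwrite, the table holds only one toy entry point, and instead of truly blocking the caller the wrapper just counts deferred writes so the sketch stays testable (a real extension would wait until the block is unregistered, never drop the write).

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified stand-in for PostgreSQL's f_smgr dispatch table:
 * only the write entry point, with toy arguments. */
typedef struct
{
    void (*smgr_write) (int blocknum, int value);
} f_smgr;

static int  storage[16];            /* toy "disk" */
static int  writes_deferred = 0;    /* writes held back by the wrapper */
static bool block_suspended[16];    /* per-block suspend flag */

static void
md_write(int blocknum, int value)
{
    storage[blocknum] = value;
}

static f_smgr smgrsw[] = {{md_write}};

/* the original handler, saved before installing the wrapper */
static void (*orig_write) (int blocknum, int value);

static void
wrapped_write(int blocknum, int value)
{
    if (block_suspended[blocknum])
    {
        /* a real extension would block here until the P2P DMA
         * completes, then perform the write; we only count it */
        writes_deferred++;
        return;
    }
    orig_write(blocknum, value);    /* chain to the original */
}

static void
install_wrapper(void)
{
    orig_write = smgrsw[0].smgr_write;
    smgrsw[0].smgr_write = wrapped_write;
}
```

Callers that go through the table (as FlushBuffer does via smgrwrite) are intercepted transparently, which is the point of overwriting smgrsw[] rather than patching every call site.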
Kouhei Kaigai wrote: > > On 1/31/16 7:38 PM, Kouhei Kaigai wrote: > > To answer your direct question, I'm no expert, but I haven't seen any > > functions that do exactly what you want. You'd have to pull relevant > > bits from ReadBuffer_*. Or maybe a better method would just be to call > > BufTableLookup() without any locks and if you get a result > -1 just > > call the relevant ReadBuffer function. Sometimes you'll end up calling > > ReadBuffer even though the buffer isn't in shared buffers, but I would > > think that would be a rare occurrence. > > > Thanks, indeed, extension can call BufTableLookup(). PrefetchBuffer() > has a good example for this. > > If it returned a valid buf_id, we have nothing difficult; just call > ReadBuffer() to pin the buffer. Isn't this what (or very similar to) ReadBufferExtended(RBM_ZERO_AND_LOCK) is already doing? -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> > > On 1/31/16 7:38 PM, Kouhei Kaigai wrote:
> >
> > > To answer your direct question, I'm no expert, but I haven't seen any
> > > functions that do exactly what you want. You'd have to pull relevant
> > > bits from ReadBuffer_*. Or maybe a better method would just be to call
> > > BufTableLookup() without any locks and if you get a result > -1 just
> > > call the relevant ReadBuffer function. Sometimes you'll end up calling
> > > ReadBuffer even though the buffer isn't in shared buffers, but I would
> > > think that would be a rare occurrence.
> >
> > Thanks, indeed, extension can call BufTableLookup(). PrefetchBuffer()
> > has a good example for this.
> >
> > If it returned a valid buf_id, we have nothing difficult; just call
> > ReadBuffer() to pin the buffer.
>
> Isn't this what (or very similar to)
> ReadBufferExtended(RBM_ZERO_AND_LOCK) is already doing?
>
This operation actually acquires a buffer page and fills it with zeroes; a valid buffer page gets evicted if there is no free buffer page. I want to keep the contents of the shared buffers already loaded onto main memory: the P2P DMA and GPU preprocessing intend to minimize main memory consumption by rows that will be filtered out by scan qualifiers anyway.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
> > KaiGai-san, > > > > On 2016/02/01 10:38, Kouhei Kaigai wrote: > > > As an aside, background of my motivation is the slide below: > > > http://www.slideshare.net/kaigai/sqlgpussd-english > > > (LT slides in JPUG conference last Dec) > > > > > > I'm under investigation of SSD-to-GPU direct feature on top of > > > the custom-scan interface. It intends to load a bunch of data > > > blocks on NVMe-SSD to GPU RAM using P2P DMA, prior to the data > > > loading onto CPU/RAM, to preprocess the data to be filtered out. > > > It only makes sense if the target blocks are not loaded to the > > > CPU/RAM yet, because SSD device is essentially slower than RAM. > > > So, I like to have a reliable way to check the latest status of > > > the shared buffer, to kwon whether a particular block is already > > > loaded or not. > > > > Quite interesting stuff, thanks for sharing! > > > > I'm in no way expert on this but could this generally be attacked from the > > smgr API perspective? Currently, we have only one implementation - md.c > > (the hard-coded RelationData.smgr_which = 0). If we extended that and > > provided end-to-end support so that there would be md.c alternatives to > > storage operations, I guess that would open up opportunities for > > extensions to specify smgr_which as an argument to ReadBufferExtended(), > > provided there is already support in place to install md.c alternatives > > (perhaps in .so). Of course, these are just musings and, perhaps does not > > really concern the requirements of custom scan methods you have been > > developing. > > > Thanks for your idea. Indeed, smgr hooks are good candidate to implement > the feature, however, what I need is a thin intermediation layer rather > than alternative storage engine. > > It becomes clear we need two features here. > 1. A feature to check whether a particular block is already on the shared > buffer pool. > It is available. 
BufTableLookup() under the BufMappingPartitionLock
> gives us the information we want.
>
> 2. A feature to suspend i/o write-out towards a particular blocks
>    that are registered by other concurrent backend, unless it is not
>    unregistered (usually, at the end of P2P DMA).
>    ==> to be discussed.
>
> When we call smgrwrite(), like FlushBuffer(), it fetches function pointer
> from the 'smgrsw' array, then calls smgr_write.
>
>   void
>   smgrwrite(SMgrRelation reln, ForkNumber forknum, BlockNumber blocknum,
>             char *buffer, bool skipFsync)
>   {
>       (*(smgrsw[reln->smgr_which].smgr_write)) (reln, forknum, blocknum,
>                                                 buffer, skipFsync);
>   }
>
> If extension would overwrite smgrsw[] array, then call the original
> function under the control by extension, it allows to suspend the call
> of the original smgr_write until completion of P2P DMA.
>
> It may be a minimum invasive way to implement, and portable to any
> further storage layers.
>
> How about your thought? Even though it is a bit different from your
> original proposition.
>
I tried to design a draft of an enhancement to realize the above i/o write-out suspend/resume, in a way as minimally invasive as possible.

ASSUMPTION: I intend to implement this feature as part of an extension, because these i/o suspend/resume checks are pure overhead for the core features unless an extension utilizes them.

Three functions shall be added:

  extern int    GetStorageMgrNumbers(void);
  extern f_smgr GetStorageMgrHandlers(int smgr_which);
  extern void   SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);

As the names suggest, GetStorageMgrNumbers() returns the number of storage managers currently installed; it always returns 1 right now. GetStorageMgrHandlers() returns the currently configured f_smgr table for the supplied smgr_which. It allows extensions to know the current configuration of the storage manager, even if another extension has already modified it. SetStorageMgrHandlers() installs the supplied 'smgr_handlers' in place of the current one.
If an extension wants to intermediate 'smgr_write', it replaces 'smgr_write' with its own function, then calls the original function, likely mdwrite, from the alternative function.

In this case, the call chain shall be:

  FlushBuffer, and others...
   +-- smgrwrite(...)
        +-- (extension's own function)
             +-- mdwrite

Once the extension's own function blocks write i/o until the P2P DMA is completed by the concurrent process, we don't need to care about partial updates of the OS cache or the storage device. It is not difficult for extensions to implement a feature to track/untrack a pair of (relFileNode, forkNum, blockNum), automatic untracking according to the resource owner, and a mechanism to block the caller until P2P DMA completion.

On the other hand, its flexibility seems to me a bit larger than necessary (what I want to implement is just a blocker of buffer write i/o), and it may give people a wrong impression about the feature of pluggable storage.

What are folks' thoughts?

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
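The three-function proposal can be mocked up end to end in a few lines. This is a self-contained sketch under the assumptions stated in the message (one built-in storage manager, an f_smgr struct of function pointers); none of it is existing PostgreSQL code, and mdwrite_mock only simulates mdwrite. The extension side shows the save-then-chain pattern that lets several extensions stack on the same slot:

```c
#include <assert.h>

/* simplified f_smgr: only the write entry point, with toy arguments */
typedef struct
{
    void (*smgr_write) (int blocknum, int value);
} f_smgr;

static int storage[16];            /* toy "disk" */

static void
mdwrite_mock(int blocknum, int value)
{
    storage[blocknum] = value;
}

/* core-side handler table: one built-in storage manager (md) */
static f_smgr smgrsw[1] = {{mdwrite_mock}};

int
GetStorageMgrNumbers(void)
{
    return 1;                      /* only md.c exists today */
}

f_smgr
GetStorageMgrHandlers(int smgr_which)
{
    return smgrsw[smgr_which];
}

void
SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers)
{
    smgrsw[smgr_which] = smgr_handlers;
}

/* --- extension side: wrap smgr_write, chaining to whatever was there --- */

static f_smgr saved;               /* handlers as configured before us */
static int    intercepted = 0;     /* how many writes we observed */

static void
extension_write(int blocknum, int value)
{
    intercepted++;                 /* a real extension would suspend here
                                    * while a P2P DMA is in flight */
    saved.smgr_write(blocknum, value);  /* call the original (mdwrite) */
}

void
install_extension(void)
{
    f_smgr mine;

    saved = GetStorageMgrHandlers(0);   /* respect prior modifications */
    mine = saved;
    mine.smgr_write = extension_write;
    SetStorageMgrHandlers(0, mine);
}
```

Because each extension reads the current table with GetStorageMgrHandlers() before replacing it, a second extension installed later would wrap extension_write in turn, which addresses the "even if another extension already modified it" requirement.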
On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
>> 2. A feature to suspend i/o write-out towards a particular blocks
>>    that are registered by other concurrent backend, unless it is not
>>    unregistered (usually, at the end of P2P DMA).
>>    ==> to be discussed.

I think there's still a race condition here though...

A
  finds buffer not in shared buffers

B
  reads buffer in
  modifies buffer
  starts writing buffer to OS

A
  Makes call to block write, but write is already in process; thinks
  writes are now blocked
  Reads corrupted block
  Much hilarity ensues

Or maybe you were just glossing over that part for brevity.

...

> I tried to design a draft of enhancement to realize the above i/o write-out
> suspend/resume, with less invasive way as possible as we can.
>
> ASSUMPTION: I intend to implement this feature as a part of extension,
> because this i/o suspend/resume checks are pure overhead increment
> for the core features, unless extension which utilizes it.
>
> Three functions shall be added:
>
> extern int GetStorageMgrNumbers(void);
> extern f_smgr GetStorageMgrHandlers(int smgr_which);
> extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers);
>
> As literal, GetStorageMgrNumbers() returns the number of storage manager
> currently installed. It always return 1 right now.
> GetStorageMgrHandlers() returns the currently configured f_smgr table to
> the supplied smgr_which. It allows extensions to know current configuration
> of the storage manager, even if other extension already modified it.
> SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> the current one.
> If extension wants to intermediate 'smgr_write', extension will replace
> the 'smgr_write' by own function, then call the original function, likely
> mdwrite, from the alternative function.
>
> In this case, call chain shall be:
>
> FlushBuffer, and others...
>  +-- smgrwrite(...)
>       +-- (extension's own function)
>            +-- mdwrite

ISTR someone (Robert Haas?)
complaining that this method of hooks is cumbersome to use and can be fragile if multiple hooks are being installed, so maybe we don't want to extend its usage...

I'm also not sure whether this is better done with an smgr hook or a hook into shared buffer handling...
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
> On 2/4/16 12:30 AM, Kouhei Kaigai wrote:
> >> 2. A feature to suspend i/o write-out towards a particular blocks
> >>    that are registered by other concurrent backend, unless it is not
> >>    unregistered (usually, at the end of P2P DMA).
> >>    ==> to be discussed.
>
> I think there's still a race condition here though...
>
> A
>   finds buffer not in shared buffers
>
> B
>   reads buffer in
>   modifies buffer
>   starts writing buffer to OS
>
> A
>   Makes call to block write, but write is already in process; thinks
>   writes are now blocked
>   Reads corrupted block
>   Much hilarity ensues
>
> Or maybe you were just glossing over that part for brevity.
>
Thanks, this part was not clear from my previous description.

At the time when B starts writing the buffer to the OS, the extension catches this i/o request using a hook around smgrwrite, and the mechanism registers the block to block P2P DMA requests during B's write operation. (Of course, it unregisters the block at the end of smgrwrite.) So, even if A wants to issue a P2P DMA concurrently, it cannot register the block until B's write operation finishes.

In practice, this operation shall be a "try lock", because B's write operation implies the existence of the buffer in main memory, so A does not need to wait for B's write operation if A switches its DMA source from SSD to main memory.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>

> ...
>
> > I tried to design a draft of enhancement to realize the above i/o write-out
> > suspend/resume, with less invasive way as possible as we can.
> >
> > ASSUMPTION: I intend to implement this feature as a part of extension,
> > because this i/o suspend/resume checks are pure overhead increment
> > for the core features, unless extension which utilizes it.
> > > > Three functions shall be added: > > > > extern int GetStorageMgrNumbers(void); > > extern f_smgr GetStorageMgrHandlers(int smgr_which); > > extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers); > > > > As literal, GetStorageMgrNumbers() returns the number of storage manager > > currently installed. It always return 1 right now. > > GetStorageMgrHandlers() returns the currently configured f_smgr table to > > the supplied smgr_which. It allows extensions to know current configuration > > of the storage manager, even if other extension already modified it. > > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of > > the current one. > > If extension wants to intermediate 'smgr_write', extension will replace > > the 'smgr_write' by own function, then call the original function, likely > > mdwrite, from the alternative function. > > > > In this case, call chain shall be: > > > > FlushBuffer, and others... > > +-- smgrwrite(...) > > +-- (extension's own function) > > +-- mdwrite > > ISTR someone (Robert Haas?) complaining that this method of hooks is > cumbersome to use and can be fragile if multiple hooks are being > installed. So maybe we don't want to extend it's usage... > > I'm also not sure whether this is better done with an smgr hook or a > hook into shared buffer handling... > -- > Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX > Experts in Analytics, Data Architecture and PostgreSQL > Data in Trouble? Get it in Treble! http://BlueTreble.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
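The "try lock" behavior KaiGai describes above can be sketched with a self-contained mock. All names are hypothetical and there is no real synchronization; a production version would key a shared hash by (relation, forknum, blocknum) and protect it with a lock. The writer registers a block for the duration of its write; a concurrent P2P DMA request try-registers the same block and, on failure, switches its source from SSD to main memory instead of waiting.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BLOCKS 16

/* toy registry of blocks with an i/o operation in flight */
static bool in_flight[MAX_BLOCKS];

/* Try to claim a block.  Returns false if someone else holds it,
 * in which case the caller must fall back rather than proceed. */
static bool
try_register_block(int block)
{
    if (in_flight[block])
        return false;
    in_flight[block] = true;
    return true;
}

static void
unregister_block(int block)
{
    in_flight[block] = false;
}

/* DMA source chosen by the try-lock outcome */
typedef enum { DMA_FROM_SSD, DMA_FROM_RAM } DmaSource;

static DmaSource
choose_dma_source(int block)
{
    if (try_register_block(block))
        return DMA_FROM_SSD;   /* no write in flight: safe to read the SSD */
    return DMA_FROM_RAM;       /* write in progress: the buffer is in RAM anyway */
}
```

Note the fallback is always safe here precisely because of the argument in the message: a write in flight implies the page exists in main memory, so RAM is a valid (and faster) DMA source.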
> -----Original Message----- > From: Jim Nasby [mailto:Jim.Nasby@BlueTreble.com] > Sent: Friday, February 05, 2016 9:17 AM > To: Kaigai Kouhei(海外 浩平); pgsql-hackers@postgresql.org; Robert Haas > Cc: Amit Langote > Subject: Re: [HACKERS] Way to check whether a particular block is on the > shared_buffer? > > On 2/4/16 12:30 AM, Kouhei Kaigai wrote: > >> 2. A feature to suspend i/o write-out towards a particular blocks > >> > that are registered by other concurrent backend, unless it is not > >> > unregistered (usually, at the end of P2P DMA). > >> > ==> to be discussed. > > I think there's still a race condition here though... > > A > finds buffer not in shared buffers > > B > reads buffer in > modifies buffer > starts writing buffer to OS > > A > Makes call to block write, but write is already in process; thinks > writes are now blocked > Reads corrupted block > Much hilarity ensues > > Or maybe you were just glossing over that part for brevity. > > ... > > > I tried to design a draft of enhancement to realize the above i/o write-out > > suspend/resume, with less invasive way as possible as we can. > > > > ASSUMPTION: I intend to implement this feature as a part of extension, > > because this i/o suspend/resume checks are pure overhead increment > > for the core features, unless extension which utilizes it. > > > > Three functions shall be added: > > > > extern int GetStorageMgrNumbers(void); > > extern f_smgr GetStorageMgrHandlers(int smgr_which); > > extern void SetStorageMgrHandlers(int smgr_which, f_smgr smgr_handlers); > > > > As literal, GetStorageMgrNumbers() returns the number of storage manager > > currently installed. It always return 1 right now. > > GetStorageMgrHandlers() returns the currently configured f_smgr table to > > the supplied smgr_which. It allows extensions to know current configuration > > of the storage manager, even if other extension already modified it. 
> > SetStorageMgrHandlers() assigns the supplied 'smgr_handlers', instead of
> > the current one.
> > If extension wants to intermediate 'smgr_write', extension will replace
> > the 'smgr_write' by own function, then call the original function, likely
> > mdwrite, from the alternative function.
> >
> > In this case, call chain shall be:
> >
> > FlushBuffer, and others...
> >  +-- smgrwrite(...)
> >       +-- (extension's own function)
> >            +-- mdwrite
>
> ISTR someone (Robert Haas?) complaining that this method of hooks is
> cumbersome to use and can be fragile if multiple hooks are being
> installed. So maybe we don't want to extend it's usage...
>
> I'm also not sure whether this is better done with an smgr hook or a
> hook into shared buffer handling...
>
# Sorry, I overlooked the latter part of your reply.

I can agree that smgr hooks shall be primarily designed to make storage systems pluggable, even if we can use these hooks for the suspend & resume of write i/o. In addition, "pluggable storage" is a long-standing feature, and it is not certain whether the existing smgr hooks are a good starting point for it. It would be a risk to implement a grand feature on top of the hooks but outside their primary purpose.

So, my preference is a mechanism to hook buffer writes to implement this feature. (Or maybe a built-in write i/o suspend/resume facility, if it has nearly zero cost when no extension activates the feature.) One downside of this approach is the larger number of hook points: we have to deploy the hook near the existing smgrwrite calls in LocalBufferAlloc and FlushRelationBuffers, in addition to FlushBuffer, at least.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>
On Thu, Feb 4, 2016 at 11:34 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I can agree that smgr hooks shall be primarily designed to make storage
> systems pluggable, even if we can use this hooks for suspend & resume of
> write i/o stuff.
> In addition, "pluggable storage" is a long-standing feature, even though
> it is not certain whether existing smgr hooks are good starting point.
> It may be a risk if we implement a grand feature on top of the hooks
> but out of its primary purpose.
>
> So, my preference is a mechanism to hook buffer write to implement this
> feature. (Or, maybe a built-in write i/o suspend / resume stuff if it
> has nearly zero cost when no extension activate the feature.)
> One downside of this approach is larger number of hook points.
> We have to deploy the hook nearby existing smgrwrite of LocalBufferAlloc
> and FlushRelationBuffers, in addition to FlushBuffer, at least.

I don't understand what you're hoping to achieve by introducing pluggability at the smgr layer. I mean, md.c is pretty much good for reading and writing from anything that looks like a directory of files. Another smgr layer would make sense if we wanted to read and write via some kind of network protocol, or if we wanted to have some kind of storage substrate that did internally to itself some of the tasks for which we are currently relying on the filesystem - e.g. if we wanted to be able to use a raw device, or perhaps more plausibly if we wanted to reduce the number of separate files we need, or provide a substrate that can clip an unused extent out of the middle of a relation efficiently. But I don't understand what this has to do with what you're trying to do here.

The subject of this thread is about whether you can check for the presence of a block in shared_buffers, and as discussed upthread, you can. I don't quite follow how we made the jump from there to smgr pluggability.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company