Thread: Re: Way to check whether a particular block is on the shared_buffer?

Re: Way to check whether a particular block is on the shared_buffer?

From
Kouhei Kaigai
Date:
> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: Wednesday, February 10, 2016 1:58 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is
> on the shared_buffer?
> 
> On Sun, Feb 7, 2016 at 9:49 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > On the other hand, it also became clear that we have to guarantee the
> > OS buffer or storage block is not partially updated during the P2P DMA.
> > My motivation is the potential use of SSD-to-GPU P2P DMA to filter out
> > unnecessary rows and columns prior to loading them to CPU/RAM.
> > It needs to ensure PostgreSQL does not write out buffers to OS buffers,
> > to avoid unexpected data corruption.
> >
> > What I want to achieve is suspending buffer writes for a particular
> > (relnode, forknum, blocknum) triple for a short time, until the GPU (or
> > other external device) completes its data processing.
> > In addition, it should preferably work regardless of the choice of
> > storage manager, even if we have multiple options on top of a
> > pluggable smgr in the future.
> 
> It seems like you just need to take an exclusive content lock on the
> buffer, or maybe a shared content lock would be sufficient.
>
Unfortunately, it was not sufficient.

Under the assumption here, the buffer page to be suspended does not yet
exist in shared buffers when a backend process issues a series of P2P
DMA commands. (If the block were already loaded into shared buffers,
there would be no need to issue P2P DMA; the usual memory<->device DMA
would be used instead, because RAM is much faster than SSD.)
The backend knows the (rel,fork,block) triple, but no BufferDesc exists
for this block. Thus, it cannot acquire any lock in the BufferDesc
structure.

Even if the block is not loaded at this point, a concurrent process may
load the same page. A BufferDesc is assigned to the page at that moment;
however, the process that issued the P2P DMA command has no chance to
lock anything in that BufferDesc.

That is why I assume the suspend/resume mechanism should take the
(rel,fork,block) triple as the identifier of the target block.
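
To make the idea concrete, here is a minimal sketch of a suspend/resume
registry keyed by the block identifier rather than by a BufferDesc. It is
not taken from any patch: BlockTag, suspend_block_write(),
resume_block_write() and block_write_is_suspended() are hypothetical
names, and the table is a plain process-local array. A real mechanism
would have to live in shared memory under a lock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical identifier of one block: the (rel,fork,block) triple. */
typedef struct BlockTag
{
    uint32_t relnode;
    int32_t  forknum;
    uint32_t blocknum;
} BlockTag;

#define MAX_SUSPENDED 64

/* Toy registry; a real one would need shared memory and a lock. */
static BlockTag suspended[MAX_SUSPENDED];
static int      nsuspended = 0;

/* Return the slot holding this tag, or -1 if it is not registered. */
static int
find_tag(BlockTag tag)
{
    assert(nsuspended >= 0 && nsuspended <= MAX_SUSPENDED);
    for (int i = 0; i < nsuspended; i++)
    {
        if (suspended[i].relnode == tag.relnode &&
            suspended[i].forknum == tag.forknum &&
            suspended[i].blocknum == tag.blocknum)
            return i;
    }
    return -1;
}

/* A writer would consult this before flushing the block. */
bool
block_write_is_suspended(BlockTag tag)
{
    return find_tag(tag) >= 0;
}

/* Register the tag; false if already registered or the table is full. */
bool
suspend_block_write(BlockTag tag)
{
    if (find_tag(tag) >= 0 || nsuspended >= MAX_SUSPENDED)
        return false;
    suspended[nsuspended++] = tag;
    return true;
}

/* Drop the tag by moving the last entry into its slot. */
void
resume_block_write(BlockTag tag)
{
    int i = find_tag(tag);

    if (i >= 0)
        suspended[i] = suspended[--nsuspended];
}
```

The point of the sketch is only that the key can be built before any
buffer exists for the block, which is what the BufferDesc locks cannot
offer.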

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: Way to check whether a particular block is on the shared_buffer?

From
Robert Haas
Date:
On Tue, Feb 9, 2016 at 6:35 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Unfortunately, it was not sufficient.
>
> Due to the assumption, the buffer page to be suspended does not exist
> when a backend process issues a series P2P DMA command. (If block would
> be already loaded to the shared buffer, it don't need to issue P2P DMA,
> but just use usual memory<->device DMA because RAM is much faster than
> SSD.)
> It knows the pair of (rel,fork,block), but no BufferDesc of this block
> exists. Thus, it cannot acquire locks in BufferDesc structure.
>
> Even if the block does not exist at this point, concurrent process may
> load the same page. BufferDesc of this page shall be assigned at this
> point, however, here is no chance to lock something in BufferDesc for
> the process which issues P2P DMA command.
>
> It is the reason why I assume the suspend/resume mechanism shall take
> a pair of (rel,fork,block) as identifier of the target block.

I see the problem, but I'm not terribly keen on putting in the hooks
that it would take to let you solve it without hacking core.  It
sounds like an awfully invasive thing for a pretty niche requirement.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Way to check whether a particular block is on the shared_buffer?

From
Kouhei Kaigai
Date:
> On Tue, Feb 9, 2016 at 6:35 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > Unfortunately, it was not sufficient.
> >
> > Due to the assumption, the buffer page to be suspended does not exist
> > when a backend process issues a series P2P DMA command. (If block would
> > be already loaded to the shared buffer, it don't need to issue P2P DMA,
> > but just use usual memory<->device DMA because RAM is much faster than
> > SSD.)
> > It knows the pair of (rel,fork,block), but no BufferDesc of this block
> > exists. Thus, it cannot acquire locks in BufferDesc structure.
> >
> > Even if the block does not exist at this point, concurrent process may
> > load the same page. BufferDesc of this page shall be assigned at this
> > point, however, here is no chance to lock something in BufferDesc for
> > the process which issues P2P DMA command.
> >
> > It is the reason why I assume the suspend/resume mechanism shall take
> > a pair of (rel,fork,block) as identifier of the target block.
> 
> I see the problem, but I'm not terribly keen on putting in the hooks
> that it would take to let you solve it without hacking core.  It
> sounds like an awfully invasive thing for a pretty niche requirement.
>
Hmm. In my experience, discussing whether a feature is niche or
commodity is often not productive. So, let me change the viewpoint.

We may be able to use an OS-level locking mechanism here.

Although it depends on the filesystem implementation under the VFS,
we may use the inode->i_mutex lock, which is acquired during the buffer
copy from user space to the kernel, at least on a few major filesystems:
ext4, xfs and btrfs, according to my research. Likewise, the modified
NVMe SSD driver can acquire inode->i_mutex during the P2P DMA transfer.

Once we can assume the lock makes updates to the OS buffer atomic,
we don't need to worry about corrupted pages, but we still need to pay
attention to the scenario where an updated buffer page is moved to GPU.

In this case, PD_ALL_VISIBLE may give us a hint. The GPU side has no
MVCC infrastructure, so I intend to move all-visible pages only.
If someone updates the buffer concurrently and the page is then written
out with invisible tuples, the PD_ALL_VISIBLE flag will have been
cleared, because the updated tuples must not be visible to the
transaction that issued the P2P DMA.

Once the GPU meets a page without PD_ALL_VISIBLE, it can return an error
status that tells the CPU to retry this page. In this case, the page is
likely already loaded in shared buffers, so the retry penalty is small.

I'll investigate an implementation along these lines.
Please correct me if I misunderstand something (especially the
treatment of PD_ALL_VISIBLE).
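
A rough sketch of the check described above, written as plain C rather
than GPU code. PD_ALL_VISIBLE uses the value from PostgreSQL's bufpage.h;
the flags offset of 10 bytes (after the 8-byte pd_lsn and the 2-byte
pd_checksum) is an assumption about the page header layout, and
check_page_for_device() and collect_retry_blocks() are hypothetical
names, not existing functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ          8192
#define PD_ALL_VISIBLE  0x0004  /* same value as in bufpage.h */
#define PD_FLAGS_OFFSET 10      /* assumed offset of pd_flags in the header */

typedef enum
{
    PAGE_ACCEPTED,      /* all-visible: safe to process without MVCC */
    PAGE_RETRY_ON_CPU   /* not all-visible: hand back to the CPU path */
} PageStatus;

/* Decide whether one raw page may be processed on the device side. */
PageStatus
check_page_for_device(const unsigned char *page)
{
    uint16_t flags;

    assert(page != NULL);
    /* memcpy avoids unaligned access on the raw byte buffer */
    memcpy(&flags, page + PD_FLAGS_OFFSET, sizeof(flags));
    return (flags & PD_ALL_VISIBLE) ? PAGE_ACCEPTED : PAGE_RETRY_ON_CPU;
}

/*
 * Scan a chunk of nblocks pages starting at first_block and record the
 * block numbers that must be retried. Returns the number recorded.
 */
size_t
collect_retry_blocks(const unsigned char *chunk, size_t nblocks,
                     uint32_t first_block,
                     uint32_t *retry, size_t retry_cap)
{
    size_t nretry = 0;

    for (size_t i = 0; i < nblocks && nretry < retry_cap; i++)
    {
        if (check_page_for_device(chunk + i * BLCKSZ) == PAGE_RETRY_ON_CPU)
            retry[nretry++] = first_block + (uint32_t) i;
    }
    return nretry;
}
```

Note that this only inspects the flag as it appears in the copied page
image; whether that is sufficient is exactly the part I would like to
have checked.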

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: Way to check whether a particular block is on the shared_buffer?

From
Robert Haas
Date:
On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> Hmm. In my experience, it is often not a productive discussion whether
> a feature is niche or commodity. So, let me change the viewpoint.
>
> We may utilize OS-level locking mechanism here.
>
> Even though it depends on filesystem implementation under the VFS,
> we may use inode->i_mutex lock that shall be acquired during the buffer
> copy from user to kernel, at least, on a few major filesystems; ext4,
> xfs and btrfs in my research. As well, the modified NVMe SSD driver can
> acquire the inode->i_mutex lock during P2P DMA transfer.
>
> Once we can consider the OS buffer is updated atomically by the lock,
> we don't need to worry about corrupted pages, but still needs to pay
> attention to the scenario when updated buffer page is moved to GPU.
>
> In this case, PD_ALL_VISIBLE may give us a hint. GPU side has no MVCC
> infrastructure, so I intend to move all-visible pages only.
> If someone updates the buffer concurrently, then write out the page
> including invisible tuples, PD_ALL_VISIBLE flag shall be cleared because
> updated tuples should not be visible to the transaction which issued
> P2P DMA.
>
> Once GPU met a page with !PD_ALL_VISIBLE, it can return an error status
> that indicates CPU to retry this page again. In this case, this page is
> likely loaded to the shared buffer already, so retry penalty is not so
> much.
>
> I'll try to investigate the implementation in this way.
> Please correct me, if I misunderstand something (especially, treatment
> of PD_ALL_VISIBLE).

I suppose there's no theoretical reason why the buffer couldn't go
from all-visible to not-all-visible and back to all-visible again all
during the time you are copying it.

Honestly, I think trying to access buffers without going through
shared_buffers is likely to be very hard to make correct and probably
a loser.  Copying the data into shared_buffers and then to the GPU is,
doubtless, at least somewhat slower.  But I kind of doubt that it's
enough slower to make up for all of the problems you're going to have
with the approach you've chosen.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Way to check whether a particular block is on the shared_buffer?

From
Kouhei Kaigai
Date:



> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> Sent: Saturday, February 13, 2016 1:46 PM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On Thu, Feb 11, 2016 at 9:05 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > Hmm. In my experience, it is often not a productive discussion whether
> > a feature is niche or commodity. So, let me change the viewpoint.
> >
> > We may utilize OS-level locking mechanism here.
> >
> > Even though it depends on filesystem implementation under the VFS,
> > we may use inode->i_mutex lock that shall be acquired during the buffer
> > copy from user to kernel, at least, on a few major filesystems; ext4,
> > xfs and btrfs in my research. As well, the modified NVMe SSD driver can
> > acquire the inode->i_mutex lock during P2P DMA transfer.
> >
> > Once we can consider the OS buffer is updated atomically by the lock,
> > we don't need to worry about corrupted pages, but still needs to pay
> > attention to the scenario when updated buffer page is moved to GPU.
> >
> > In this case, PD_ALL_VISIBLE may give us a hint. GPU side has no MVCC
> > infrastructure, so I intend to move all-visible pages only.
> > If someone updates the buffer concurrently, then write out the page
> > including invisible tuples, PD_ALL_VISIBLE flag shall be cleared because
> > updated tuples should not be visible to the transaction which issued
> > P2P DMA.
> >
> > Once GPU met a page with !PD_ALL_VISIBLE, it can return an error status
> > that indicates CPU to retry this page again. In this case, this page is
> > likely loaded to the shared buffer already, so retry penalty is not so
> > much.
> >
> > I'll try to investigate the implementation in this way.
> > Please correct me, if I misunderstand something (especially, treatment
> > of PD_ALL_VISIBLE).
> 
> I suppose there's no theoretical reason why the buffer couldn't go
> from all-visible to not-all-visible and back to all-visible again all
> during the time you are copying it.
>
The backend process that is copying the data to the GPU has a
transaction in progress (= not committed). Is it possible for the
updated buffer page to return to the all-visible state?
I expect that in-progress transactions block the page from going back
to all-visible. Right?

> Honestly, I think trying to access buffers without going through
> shared_buffers is likely to be very hard to make correct and probably
> a loser.
>
No challenge, no outcome. ;-)

> Copying the data into shared_buffers and then to the GPU is,
> doubtless, at least somewhat slower.  But I kind of doubt that it's
> enough slower to make up for all of the problems you're going to have
> with the approach you've chosen.
>
Honestly, I'm still uncertain whether it will work as well as I expect.
However, scan workloads on tables larger than main memory are a
headache for PG-Strom, so I'd like to try whatever ideas we can
implement.

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: Way to check whether a particular block is on the shared_buffer?

From
Robert Haas
Date:
On Sat, Feb 13, 2016 at 7:29 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> I suppose there's no theoretical reason why the buffer couldn't go
>> from all-visible to not-all-visible and back to all-visible again all
>> during the time you are copying it.
>>
> The backend process that is copying the data to GPU has a transaction
> in-progress (= not committed). Is it possible to get the updated buffer
> page back to the all-visible state again?
> I expect that in-progress transactions works as a blocker for backing
> to all-visible. Right?

Yeah, probably.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Way to check whether a particular block is on the shared_buffer?

From
Kouhei Kaigai
Date:
I found one other, but tiny, problem in implementing the SSD-to-GPU
direct data transfer feature under the PostgreSQL storage layer.

An extension cannot know the raw file descriptor opened by smgr.

I expect an extension to issue an ioctl(2) on a special device file
provided by a special kernel driver, to control the P2P DMA.
This ioctl(2) will pack the file descriptor of the DMA source together
with other information (such as base position, range, destination
device pointer, ...).

However, the raw file descriptor is wrapped inside fd.c, which exposes
only the File handle, and thus it is not visible to extensions. Oops...

The attached patch provides a way to obtain the raw file descriptor (and
relevant flags) of a particular File virtual file descriptor in
PostgreSQL. (Needless to say, an extension has to treat the raw
descriptor carefully so as not to adversely affect the storage manager.)

How about this tiny enhancement?
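
For illustration only, the kind of ioctl(2) argument I have in mind
could look like the sketch below. Everything here is hypothetical:
the P2PDmaRequest layout, the P2P_DMA_IOCTL_SUBMIT request number, and
the helper names are invented for this example and do not describe an
existing driver interface. The block number is assumed to be relative to
the start of the file behind source_fd.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/ioctl.h>

#define BLCKSZ 8192

/* Hypothetical request number understood by the special driver. */
#define P2P_DMA_IOCTL_SUBMIT 0x5001UL

/* Hypothetical argument block passed through ioctl(2). */
typedef struct P2PDmaRequest
{
    int      source_fd;     /* raw descriptor of the DMA source file */
    uint64_t file_offset;   /* base position within that file, in bytes */
    uint64_t length;        /* range to transfer, in bytes */
    uint64_t device_ptr;    /* destination address on the device */
} P2PDmaRequest;

/* Fill the request for nblocks pages starting at first_block. */
bool
p2p_dma_fill_request(P2PDmaRequest *req, int source_fd,
                     uint32_t first_block, uint32_t nblocks,
                     uint64_t device_ptr)
{
    assert(req != NULL);
    if (source_fd < 0 || nblocks == 0)
        return false;
    req->source_fd = source_fd;
    req->file_offset = (uint64_t) first_block * BLCKSZ;
    req->length = (uint64_t) nblocks * BLCKSZ;
    req->device_ptr = device_ptr;
    return true;
}

/* Hand the request to the (hypothetical) driver via its device file. */
int
p2p_dma_submit(int device_fd, const P2PDmaRequest *req)
{
    return ioctl(device_fd, P2P_DMA_IOCTL_SUBMIT, req);
}
```

The source_fd member is the piece an extension cannot fill in today,
because only the File handle is available to it.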



Attachment

Re: Way to check whether a particular block is on the shared_buffer?

From
Robert Haas
Date:
On Thu, Mar 3, 2016 at 8:54 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> I found one other, but tiny, problem to implement SSD-to-GPU direct
> data transfer feature under the PostgreSQL storage.
>
> Extension cannot know the raw file descriptor opened by smgr.
>
> I expect an extension issues an ioctl(2) on the special device file
> on behalf of the special kernel driver, to control the P2P DMA.
> This ioctl(2) will pack file descriptor of the DMA source and some
> various information (like base position, range, destination device
> pointer, ...).
>
> However, the raw file descriptor is wrapped in the fd.c, instead of
> the File handler, thus, not visible to extension. oops...
>
> The attached patch provides a way to obtain raw file descriptor (and
> relevant flags) of a particular File virtual file descriptor on
> PostgreSQL. (No need to say, extension has to treat the raw descriptor
> carefully not to give an adverse effect to the storage manager.)
>
> How about this tiny enhancement?

Why not FileDescriptor(), FileFlags(), FileMode() as separate
functions like FilePathName()?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Way to check whether a particular block is on the shared_buffer?

From
Kouhei Kaigai
Date:



> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org
> [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Robert Haas
> Sent: Saturday, March 05, 2016 2:42 AM
> To: Kaigai Kouhei(海外 浩平)
> Cc: Jim Nasby; pgsql-hackers@postgresql.org; Amit Langote
> Subject: Re: [HACKERS] Way to check whether a particular block is on the
> shared_buffer?
> 
> On Thu, Mar 3, 2016 at 8:54 PM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
> > I found one other, but tiny, problem to implement SSD-to-GPU direct
> > data transfer feature under the PostgreSQL storage.
> >
> > Extension cannot know the raw file descriptor opened by smgr.
> >
> > I expect an extension issues an ioctl(2) on the special device file
> > on behalf of the special kernel driver, to control the P2P DMA.
> > This ioctl(2) will pack file descriptor of the DMA source and some
> > various information (like base position, range, destination device
> > pointer, ...).
> >
> > However, the raw file descriptor is wrapped in the fd.c, instead of
> > the File handler, thus, not visible to extension. oops...
> >
> > The attached patch provides a way to obtain raw file descriptor (and
> > relevant flags) of a particular File virtual file descriptor on
> > PostgreSQL. (No need to say, extension has to treat the raw descriptor
> > carefully not to give an adverse effect to the storage manager.)
> >
> > How about this tiny enhancement?
> 
> Why not FileDescriptor(), FileFlags(), FileMode() as separate
> functions like FilePathName()?
>
There is no deep reason. The attached patch adds three individual
functions.
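
As a toy model of the "separate functions" shape (this is not the
patch itself), the sketch below keeps a small table of virtual file
descriptors and exposes one accessor per field. ToyVfd, toy_register()
and the toy_file_*() names are hypothetical and only mirror the
FileDescriptor()/FileFlags()/FileMode() split suggested above; the real
table in fd.c is not reproduced here.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

typedef int File;               /* virtual descriptor: index into the table */

/* Hypothetical, simplified stand-in for one virtual descriptor entry. */
typedef struct ToyVfd
{
    int    fd;                  /* raw kernel file descriptor */
    int    fileFlags;           /* open(2) flags used for this file */
    mode_t fileMode;            /* open(2) mode used for this file */
} ToyVfd;

#define TOY_VFD_MAX 8

static ToyVfd toy_vfd_cache[TOY_VFD_MAX];
static int    toy_vfd_count = 0;

static int
toy_file_is_valid(File file)
{
    assert(toy_vfd_count >= 0 && toy_vfd_count <= TOY_VFD_MAX);
    return file >= 0 && file < toy_vfd_count;
}

/* Record an already opened descriptor; returns -1 if the table is full. */
File
toy_register(int fd, int flags, mode_t mode)
{
    if (toy_vfd_count >= TOY_VFD_MAX)
        return -1;
    toy_vfd_cache[toy_vfd_count].fd = fd;
    toy_vfd_cache[toy_vfd_count].fileFlags = flags;
    toy_vfd_cache[toy_vfd_count].fileMode = mode;
    return toy_vfd_count++;
}

/* One accessor per field, each returning a fallback for an invalid File. */
int
toy_file_descriptor(File file)
{
    return toy_file_is_valid(file) ? toy_vfd_cache[file].fd : -1;
}

int
toy_file_flags(File file)
{
    return toy_file_is_valid(file) ? toy_vfd_cache[file].fileFlags : -1;
}

mode_t
toy_file_mode(File file)
{
    return toy_file_is_valid(file) ? toy_vfd_cache[file].fileMode : 0;
}
```

Separate accessors let a caller ask only for what it needs, for example
just the raw descriptor to pass along to a driver.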

Thanks,
--
NEC Business Creation Division / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Attachment

Re: Way to check whether a particular block is on the shared_buffer?

From
Robert Haas
Date:
On Mon, Mar 7, 2016 at 4:32 AM, Kouhei Kaigai <kaigai@ak.jp.nec.com> wrote:
>> Why not FileDescriptor(), FileFlags(), FileMode() as separate
>> functions like FilePathName()?
>>
> Here is no deep reason. The attached patch adds three individual
> functions.

This seems unobjectionable to me, so committed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company