Thread: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Kouhei Kaigai
Date:
Hello,

I recently ran into a problem while developing an extension that
utilizes shared buffers; the trouble appears when it releases each
buffer page.

This extension transfers the contents of shared buffers to a GPU device
using DMA, then kicks off a device kernel.
Usually 8KB (= BLCKSZ) is too small as a unit size for the calculation,
so the extension pins multiple pages prior to the DMA transfer, then
releases them after the device kernel finishes.
For performance reasons, 16MB-64MB is a preferable data size per device
kernel execution. A DMA transfer of 16MB-64MB needs 2048-8192 pages
pinned simultaneously.

Once a backend/extension calls ReadBuffer(), resowner.c tracks which
buffers are referenced by the current resource owner, to ensure they
are released at the end of the transaction.
However, it seems to me the implementation of resowner.c did not assume
that many buffers are referenced by a particular resource owner
simultaneously.
It manages the buffer indexes in an expandable array, and looks up the
target buffer by a sequential walk from the tail, because recently
pinned buffers tend to be released first.
That caused trouble in my case. My extension pinned several thousand
buffers, so owner->buffers[] was enlarged and became expensive to walk.
In my measurements, ResourceOwnerForgetBuffer() takes 36 seconds in
total while hash-joining 2M rows, even though the hash join itself
takes less than 16 seconds.
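
For reference, the forget logic in resowner.c looks roughly like this
(my paraphrase from memory, not the exact source):

/* paraphrase of the current resowner.c behavior, not the exact code:
 * scan owner->buffers[] from the tail, then compact the array */
void
ResourceOwnerForgetBuffer(ResourceOwner owner, Buffer buffer)
{
    Buffer *buffers = owner->buffers;
    int     nb1 = owner->nbuffers - 1;
    int     i;

    for (i = nb1; i >= 0; i--)          /* O(N) walk per forget */
    {
        if (buffers[i] == buffer)
        {
            while (i < nb1)             /* shift later entries down */
            {
                buffers[i] = buffers[i + 1];
                i++;
            }
            owner->nbuffers = nb1;
            return;
        }
    }
    elog(ERROR, "buffer %d is not owned by resource owner %s",
         buffer, owner->name);
}

Releasing N pins in FIFO order therefore costs O(N^2) comparisons in
total.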

What is the best way to solve the problem?

Idea-1) Give ResourceOwnerForgetBuffer() O(1) logic, instead of O(N^2).
The source of the problem is the data structure in ResourceOwnerData,
so a straightforward fix is to apply O(1) lookup based on hashing
instead of the linear search.
The open question is whether this is beneficial, or at least harmless,
to the core code, not only to my extension. It is probably "potentially"
beneficial to the core backend as well, but the effect is hard to
observe right now, because a usual workload holds only a small number
of buffers at the same time.

The attached patch applies O(1) logic to ResourceOwnerForgetBuffer().
It reduces the time consumed from 36sec to 0.09sec during a hash join
over 20M rows.
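
For illustration, the direction of the patch is something like the
following (a minimal sketch with made-up names, not the attached patch
itself): an open-addressing hash table maps a Buffer to its slot, so
both remember and forget are O(1) on average.

#include "postgres.h"
#include "storage/buf.h"

typedef struct BufferPinTable
{
    Buffer  *slots;     /* power-of-two sized; InvalidBuffer = never used */
    bool    *deleted;   /* tombstones keep the probe chains intact */
    uint32   mask;      /* capacity - 1 */
    uint32   nitems;
} BufferPinTable;

static void
pin_table_remember(BufferPinTable *t, Buffer buf)
{
    uint32  i = ((uint32) buf * 2654435761U) & t->mask;

    /* linear probing; empty or tombstoned slots can be reused */
    while (t->slots[i] != InvalidBuffer && !t->deleted[i])
        i = (i + 1) & t->mask;
    t->slots[i] = buf;
    t->deleted[i] = false;
    t->nitems++;        /* enlarge and rehash well before the table fills */
}

static void
pin_table_forget(BufferPinTable *t, Buffer buf)
{
    uint32  i = ((uint32) buf * 2654435761U) & t->mask;

    while (t->slots[i] != buf || t->deleted[i])
    {
        if (t->slots[i] == InvalidBuffer)
            elog(ERROR, "buffer %d is not tracked", buf);
        i = (i + 1) & t->mask;
    }
    t->deleted[i] = true;   /* tombstone, so later probes keep going */
    t->nitems--;
}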

Idea-2) Track shared buffers being referenced by the extension itself
Another, less preferable, option is to call ResourceOwnerForgetBuffer()
just after ReadBuffer() on the extension side.
Once the resource owner forgets a buffer, the extension becomes
responsible for releasing it at the end of the transaction, even if the
transaction aborted.
It also makes plain ReleaseBuffer() unusable, so the extension has to
carry a duplicate of ReleaseBuffer(), minus its
ResourceOwnerForgetBuffer() call.
This idea has few advantages over idea-1; the only one is that it
avoids changes to core PostgreSQL.
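
A rough sketch of the extension side (my_track_buffer() and
my_untrack_buffer() are hypothetical extension-local bookkeeping, which
would also have to run from an abort callback):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/rel.h"
#include "utils/resowner.h"

extern void my_track_buffer(Buffer buf);    /* hypothetical bookkeeping */
extern void my_untrack_buffer(Buffer buf);

static Buffer
my_read_buffer(Relation rel, BlockNumber blkno)
{
    Buffer  buf = ReadBuffer(rel, blkno);

    /* the resource owner forgets this pin; we are responsible now */
    ResourceOwnerForgetBuffer(CurrentResourceOwner, buf);
    my_track_buffer(buf);
    return buf;
}

static void
my_release_buffer(Buffer buf)
{
    /* re-register the pin so that ReleaseBuffer()'s internal forget
     * finds it; with a single entry the linear search is trivial */
    ResourceOwnerEnlargeBuffers(CurrentResourceOwner);
    ResourceOwnerRememberBuffer(CurrentResourceOwner, buf);
    ReleaseBuffer(buf);
    my_untrack_buffer(buf);
}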

Any comments?

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>


Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Tom Lane
Date:
Kouhei Kaigai <kaigai@ak.jp.nec.com> writes:
> Idea-1) Give ResourceOwnerForgetBuffer() O(1) logic, instead of O(N^2).
> The source of the problem is the data structure in ResourceOwnerData,
> so a straightforward fix is to apply O(1) lookup based on hashing
> instead of the linear search.

I will bet that this is a dead loss for all normal usage patterns,
because queries seldom have more than a few buffers pinned.  More
than that: I do not think we should encourage coding patterns that
pin lots of buffers.  There is no way that holding pins on thousands
of buffers to do one operation is a sane design.
        regards, tom lane



Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Kouhei Kaigai
Date:
> Kouhei Kaigai <kaigai@ak.jp.nec.com> writes:
> > Idea-1) Give ResourceOwnerForgetBuffer() O(1) logic, instead of O(N^2).
> > The source of the problem is the data structure in ResourceOwnerData,
> > so a straightforward fix is to apply O(1) lookup based on hashing
> > instead of the linear search.
>
> I will bet that this is a dead loss for all normal usage patterns, because
> queries seldom have more than a few buffers pinned.  More than that: I do
> not think we should encourage coding patterns that pin lots of buffers.
> There is no way that holding pins on thousands of buffers to do one operation
> is a sane design.
>
Yep, that's my pain. Even though a usual query does not keep many
buffers pinned, my use case needs to fetch megabytes of data at once
for performance reasons; a page-by-page synchronous scan leaves the
GPU idle.
That is also why I cared about whether the O(1) logic is harmless to
existing workloads (which are the major portion).

BTW, why do we assume that only a small number of pages are pinned at
the same time? Does it come from a hardware assumption of much less RAM
capacity than the data set being processed?
Once we consider hardware with several dozen GB of RAM or more, which
can keep the data set in memory, it may lead to different results.

At least, it seems valuable to me to support both styles of buffer
usage, as long as it is harmless to existing workloads.
I'd like to dig into the deeper reason for this.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Heikki Linnakangas
Date:
On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> Hello,
> 
> I recently ran into a problem while developing an extension that
> utilizes shared buffers; the trouble appears when it releases each
> buffer page.
>
> This extension transfers the contents of shared buffers to a GPU
> device using DMA, then kicks off a device kernel.

Wow, that sounds crazy.

> Once a backend/extension calls ReadBuffer(), resowner.c tracks which
> buffers are referenced by the current resource owner, to ensure they
> are released at the end of the transaction.
> However, it seems to me the implementation of resowner.c did not assume
> that many buffers are referenced by a particular resource owner
> simultaneously.
> It manages the buffer indexes in an expandable array, and looks up the
> target buffer by a sequential walk from the tail, because recently
> pinned buffers tend to be released first.
> That caused trouble in my case. My extension pinned several thousand
> buffers, so owner->buffers[] was enlarged and became expensive to walk.
> In my measurements, ResourceOwnerForgetBuffer() takes 36 seconds in
> total while hash-joining 2M rows, even though the hash join itself
> takes less than 16 seconds.
> 
> What is the best way to solve the problem?

How about creating a separate ResourceOwner for these buffer pins, and
doing a wholesale ResourceOwnerRelease() on it when you're done?
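
Something like this, I mean (untested sketch):

ResourceOwner pins_owner;
ResourceOwner save_owner;

pins_owner = ResourceOwnerCreate(CurrentResourceOwner, "buffer pins");

save_owner = CurrentResourceOwner;
CurrentResourceOwner = pins_owner;
/* ... ReadBuffer() every page of the chunk ... */
CurrentResourceOwner = save_owner;

/* ... DMA transfer, device kernel ... */

/* drop all the pins tracked by pins_owner in one sweep */
ResourceOwnerRelease(pins_owner, RESOURCE_RELEASE_BEFORE_LOCKS,
                     false, false);
ResourceOwnerDelete(pins_owner);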

- Heikki




Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Andres Freund
Date:
On 2014-10-03 10:35:42 +0300, Heikki Linnakangas wrote:
> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> > Hello,
> > 
> > I recently ran into a problem while developing an extension that
> > utilizes shared buffers; the trouble appears when it releases each
> > buffer page.
> >
> > This extension transfers the contents of shared buffers to a GPU
> > device using DMA, then kicks off a device kernel.
> 
> Wow, that sounds crazy.

Agreed. I doubt that pinning that many buffers is a sane thing to do. At
the very least you'll heavily interfere with vacuum and such.

> > Once a backend/extension calls ReadBuffer(), resowner.c tracks which
> > buffers are referenced by the current resource owner, to ensure they
> > are released at the end of the transaction.
> > However, it seems to me the implementation of resowner.c did not
> > assume that many buffers are referenced by a particular resource
> > owner simultaneously.
> > It manages the buffer indexes in an expandable array, and looks up
> > the target buffer by a sequential walk from the tail, because
> > recently pinned buffers tend to be released first.
> > That caused trouble in my case. My extension pinned several thousand
> > buffers, so owner->buffers[] was enlarged and became expensive to
> > walk.
> > In my measurements, ResourceOwnerForgetBuffer() takes 36 seconds in
> > total while hash-joining 2M rows, even though the hash join itself
> > takes less than 16 seconds.

> > What is the best way to solve the problem?
> 
> How about creating a separate ResourceOwner for these buffer pins, and
> doing a wholesale ResourceOwnerRelease() on it when you're done?

Or even just unpinning them in reverse order? That should already fix
the performance issues?
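
I.e. something like (untested):

#include "storage/bufmgr.h"

/* release in LIFO order, so resowner.c's tail-first search always
 * succeeds at the first comparison */
static void
release_chunk_buffers(Buffer *buffers, int nbuffers)
{
    int     i;

    for (i = nbuffers - 1; i >= 0; i--)
        ReleaseBuffer(buffers[i]);
}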

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Kouhei Kaigai
Date:
> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> > Hello,
> >
> > I recently ran into a problem while developing an extension that
> > utilizes shared buffers; the trouble appears when it releases each
> > buffer page.
> >
> > This extension transfers the contents of shared buffers to a GPU
> > device using DMA, then kicks off a device kernel.
>
> Wow, that sounds crazy.
>
> > Once a backend/extension calls ReadBuffer(), resowner.c tracks which
> > buffers are referenced by the current resource owner, to ensure they
> > are released at the end of the transaction.
> > However, it seems to me the implementation of resowner.c did not
> > assume that many buffers are referenced by a particular resource
> > owner simultaneously.
> > It manages the buffer indexes in an expandable array, and looks up
> > the target buffer by a sequential walk from the tail, because
> > recently pinned buffers tend to be released first.
> > That caused trouble in my case. My extension pinned several thousand
> > buffers, so owner->buffers[] was enlarged and became expensive to
> > walk.
> > In my measurements, ResourceOwnerForgetBuffer() takes 36 seconds in
> > total while hash-joining 2M rows, even though the hash join itself
> > takes less than 16 seconds.
> >
> > What is the best way to solve the problem?
>
> How about creating a separate ResourceOwner for these buffer pins, and doing
> a wholesale ResourceOwnerRelease() on it when you're done?
>
Let me clarify your idea.

1. Create a separate ResourceOwner under the CurrentResourceOwner.
2. Switch CurrentResourceOwner to the self-constructed resource owner
   during ReadBuffer() on thousands of buffers.
3. Switch back to the original CurrentResourceOwner.
4. Kick DMA and run the GPU device kernel (actual job running...)
5. Switch CurrentResourceOwner to the self-constructed resource owner
   again, during ReleaseBuffer() in reverse order.
6. Switch back to the original CurrentResourceOwner, then call
   ResourceOwnerDelete().
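
In code, the flow I have in mind is roughly the following (a sketch;
rel, first_block and chunk_buffers[] stand for my extension's own data,
and error handling is omitted):

#include "postgres.h"
#include "storage/bufmgr.h"
#include "utils/resowner.h"

static void
run_chunk(Relation rel, BlockNumber first_block,
          Buffer *chunk_buffers, int nbuffers)
{
    ResourceOwner   chunk_owner;
    ResourceOwner   saved;
    int             i;

    /* (1) a separate resource owner under the current one */
    chunk_owner = ResourceOwnerCreate(CurrentResourceOwner, "chunk pins");

    /* (2) pin thousands of pages under chunk_owner */
    saved = CurrentResourceOwner;
    CurrentResourceOwner = chunk_owner;
    for (i = 0; i < nbuffers; i++)
        chunk_buffers[i] = ReadBuffer(rel, first_block + i);
    /* (3) switch back to the original owner */
    CurrentResourceOwner = saved;

    /* (4) kick DMA, run the device kernel ... */

    /* (5) release the pins in reverse order under chunk_owner */
    saved = CurrentResourceOwner;
    CurrentResourceOwner = chunk_owner;
    for (i = nbuffers - 1; i >= 0; i--)
        ReleaseBuffer(chunk_buffers[i]);
    /* (6) back to the original owner; chunk_owner is now empty */
    CurrentResourceOwner = saved;
    ResourceOwnerDelete(chunk_owner);
}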

Hmm... at the moment, I don't see anything here that would be hard to
implement. I'd like to try this idea and report back later.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>




Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Kouhei Kaigai
Date:
> On 2014-10-03 10:35:42 +0300, Heikki Linnakangas wrote:
> > On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> > > Hello,
> > >
> > > I recently ran into a problem while developing an extension that
> > > utilizes shared buffers; the trouble appears when it releases each
> > > buffer page.
> > >
> > > This extension transfers the contents of shared buffers to a GPU
> > > device using DMA, then kicks off a device kernel.
> >
> > Wow, that sounds crazy.
>
> Agreed. I doubt that pinning that many buffers is a sane thing to do. At
> the very least you'll heavily interfere with vacuum and such.
>
My assumption is that this extension is used to handle OLAP-type
workloads, thus with relatively little write traffic to the database.
Sorry, I forgot to mention that.

> > > Once a backend/extension calls ReadBuffer(), resowner.c tracks
> > > which buffers are referenced by the current resource owner, to
> > > ensure they are released at the end of the transaction.
> > > However, it seems to me the implementation of resowner.c did not
> > > assume that many buffers are referenced by a particular resource
> > > owner simultaneously.
> > > It manages the buffer indexes in an expandable array, and looks up
> > > the target buffer by a sequential walk from the tail, because
> > > recently pinned buffers tend to be released first.
> > > That caused trouble in my case. My extension pinned several
> > > thousand buffers, so owner->buffers[] was enlarged and became
> > > expensive to walk.
> > > In my measurements, ResourceOwnerForgetBuffer() takes 36 seconds in
> > > total while hash-joining 2M rows, even though the hash join itself
> > > takes less than 16 seconds.
>
> > > What is the best way to solve the problem?
> >
> > How about creating a separate ResourceOwner for these buffer pins, and
> > doing a wholesale ResourceOwnerRelease() on it when you're done?
>
> Or even just unpinning them in reverse order? That should already fix the
> performance issues?
>
When multiple chunks (note: a chunk contains thousands of buffers, as
the unit of device kernel execution) are running asynchronously, the
order of GPU job completion is not predictable.
So it does not help my situation if one resource owner tracks all the
buffers.

Probably, Heikki's suggestion is to create a separate resource owner
per chunk. In that case, all the buffers in a particular chunk are
released at the same time, so ReleaseBuffer() in reverse order makes
sense.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>




On 03/10/2014 05:53, Kouhei Kaigai wrote:
> Yep, that's my pain. Even though a usual query does not keep many
> buffers pinned, my use case needs to fetch megabytes of data at once
> for performance reasons; a page-by-page synchronous scan leaves the
> GPU idle.
Doesn't your GPU have an async queue and exec mechanism? Then you could
do an async DMA to the GPU with an event, use that event in the GPU to
start the kernel, and in the DB to release the pin?




Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Tom Lane
Date:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
>> What is the best way to solve the problem?

> How about creating a separate ResourceOwner for these buffer pins, and
> doing a wholesale ResourceOwnerRelease() on it when you're done?

That's a thought.  Another point is that if you can release the buffer
pins in reverse order of acquisition, the existing ResourceOwner code
works just fine.

I have a larger question though: how is it useful to transfer
raw contents of shared buffers to a GPU in the first place?
Surely you're not going to be putting tasks like tuple visibility
verification into the GPU.  So it seems like this whole thread is
based on a dubious architectural assumption.
        regards, tom lane



Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Kouhei Kaigai
Date:
> On 03/10/2014 05:53, Kouhei Kaigai wrote:
> > Yep, that's my pain. Even though a usual query does not keep many
> > buffers pinned, my use case needs to fetch megabytes of data at once
> > for performance reasons; a page-by-page synchronous scan leaves the
> > GPU idle.
> Doesn't your GPU have an async queue and exec mechanism? Then you could
> do an async DMA to the GPU with an event, use that event in the GPU to
> start the kernel, and in the DB to release the pin?
>
That is exactly what I'm doing now. The problem is that it needs to
keep many buffers (likely more than a few tens of thousands) in the
queue so the GPU is not left waiting for data to process. Thus, I need
to pin considerably more buffers than usual query workloads do.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>



Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Kouhei Kaigai
Date:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> > On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> >> What is the best way to solve the problem?
>
> > How about creating a separate ResourceOwner for these buffer pins, and
> > doing a wholesale ResourceOwnerRelease() on it when you're done?
>
> That's a thought.  Another point is that if you can release the buffer pins
> in reverse order of acquisition, the existing ResourceOwner code works just
> fine.
>
Yep, I'm trying that now.

> I have a larger question though: how is it useful to transfer raw contents
> of shared buffers to a GPU in the first place?
> Surely you're not going to be putting tasks like tuple visibility
> verification into the GPU.  So it seems like this whole thread is based
> on a dubious architectural assumption.
>
In my implementation, it is the backend's job to check tuple
visibility; it then also sends the GPU an array of visible item
indexes, in addition to the buffer contents themselves. The GPU code
picks up only the visible tuples using this index.
OLAP workloads tend to have all-visible pages, so the cost of the
visibility check was not expensive.

Thanks,
--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>




Re: How to make ResourceOwnerForgetBuffer() O(1), instead of O(N^2) scale

From
Kouhei Kaigai
Date:
> > On 10/03/2014 07:08 AM, Kouhei Kaigai wrote:
> > > Hello,
> > >
> > > I recently ran into a problem while developing an extension that
> > > utilizes shared buffers; the trouble appears when it releases each
> > > buffer page.
> > >
> > > This extension transfers the contents of shared buffers to a GPU
> > > device using DMA, then kicks off a device kernel.
> >
> > Wow, that sounds crazy.
> >
> > > Once a backend/extension calls ReadBuffer(), resowner.c tracks
> > > which buffers are referenced by the current resource owner, to
> > > ensure they are released at the end of the transaction.
> > > However, it seems to me the implementation of resowner.c did not
> > > assume that many buffers are referenced by a particular resource
> > > owner simultaneously.
> > > It manages the buffer indexes in an expandable array, and looks up
> > > the target buffer by a sequential walk from the tail, because
> > > recently pinned buffers tend to be released first.
> > > That caused trouble in my case. My extension pinned several
> > > thousand buffers, so owner->buffers[] was enlarged and became
> > > expensive to walk.
> > > In my measurements, ResourceOwnerForgetBuffer() takes 36 seconds in
> > > total while hash-joining 2M rows, even though the hash join itself
> > > takes less than 16 seconds.
> > >
> > > What is the best way to solve the problem?
> >
> > How about creating a separate ResourceOwner for these buffer pins, and
> > doing a wholesale ResourceOwnerRelease() on it when you're done?
> >
> Let me clarify your idea.
>
> 1. Create a separate ResourceOwner under the CurrentResourceOwner.
> 2. Switch CurrentResourceOwner to the self-constructed resource owner
>    during ReadBuffer() on thousands of buffers.
> 3. Switch back to the original CurrentResourceOwner.
> 4. Kick DMA and run the GPU device kernel (actual job running...)
> 5. Switch CurrentResourceOwner to the self-constructed resource owner
>    again, during ReleaseBuffer() in reverse order.
> 6. Switch back to the original CurrentResourceOwner, then call
>    ResourceOwnerDelete().
>
> Hmm... at the moment, I don't see anything here that would be hard
> to implement. I'd like to try this idea and report back later.
>
Let me share the result.

The above approach eliminated my problem. An individual resource owner
for each set of shared buffers, plus ReleaseBuffer() in reverse order,
reduced the time to release pinned buffers from 36sec to 67.3msec when
running a hash join on 20M records; at most 59168 buffers were pinned
concurrently (32 asynchronous requests, each packing 1849 buffers).

Thanks for the suggestion!


postgres=# explain analyze select * from t0 natural join t1 natural join t2;
INFO:  time to release buffers: 67.289ms
                                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Custom (GpuHashJoin)  (cost=3468.00..471640.11 rows=19740924 width=139) (actual time=193.086..5913.459 rows=20000000 loops=1)
   pseudo scan tlist: 1:(t0.bid), 2:(t0.aid), 3:<t0.id>, 4:<t0.cat>, 5:<t0.cid>, 6:<t0.did>, 7:<t0.x>, 8:<t0.y>, 9:<t0.z>, 11:<t2.btext>, 12:[t2.bid], 10:<t1.atext>, 13:[t1.aid]
   hash clause 1: (t0.bid = t2.bid)
   hash clause 2: (t0.aid = t1.aid)
   ->  Custom (GpuScan) on t0  (cost=500.00..467167.24 rows=20000024 width=73) (actual time=7.757..1056.426 rows=20000000 loops=1)
   ->  Custom (MultiHash)  (cost=734.00..734.00 rows=40000 width=37) (actual time=23.382..23.382 rows=40000 loops=1)
         hash keys: bid
         ->  Seq Scan on t2  (cost=0.00..734.00 rows=40000 width=37) (actual time=0.007..5.124 rows=40000 loops=1)
         ->  Custom (MultiHash)  (cost=734.00..734.00 rows=40000 width=37) (actual time=11.919..11.919 rows=40000 loops=1)
               hash keys: aid
               ->  Seq Scan on t1  (cost=0.00..734.00 rows=40000 width=37) (actual time=0.010..5.299 rows=40000 loops=1)
 Planning time: 0.904 ms
 Execution time: 6667.986 ms
(13 rows)

--
NEC OSS Promotion Center / PG-Strom Project
KaiGai Kohei <kaigai@ak.jp.nec.com>