Thread: Keeping temporary tables in shared buffers

Keeping temporary tables in shared buffers

From
Asim Praveen
Date:
Hello

We are evaluating the use of shared buffers for temporary tables.  The advantage is that queries involving temporary tables could then make use of parallel workers.

Challenges:
1. We lose the performance benefit of local buffers.
2. Additional complexity in shared buffer manager - BufferTag needs to include backend ID to distinguish 'my' temp buffers from non-temp or 'others' temp buffers.
3. Checkpointer needs to skip them for fsync but bgwriter needs to write them out.

Has someone already thought about this?  Any workloads that might benefit from this idea?

Asim and David

Re: Keeping temporary tables in shared buffers

From
Tom Lane
Date:
Asim Praveen <apraveen@pivotal.io> writes:
> We are evaluating the use of shared buffers for temporary tables.  The
> advantage being queries involving temporary tables can make use of parallel
> workers.

Hm...

> Challenges:

> 1. We lose the performance benefit of local buffers.

Yeah.  This would be an absolute dead loss for any situation where you
couldn't get major parallelism wins, which I'm afraid would be often.
So then you have to think about how to transition smoothly between "rel
is in local buffers" and "rel is in shared buffers", bearing in mind that
ever having the same page in two different buffers would be disastrous.

> 2. Additional complexity in shared buffer manager - BufferTag needs to
> include backend ID to distinguish 'my' temp buffers from non-temp or
> 'others' temp buffers.

I think that would be a deal breaker right there, because of the
distributed overhead of making the tags bigger.  However, I don't
actually understand why you would need to do that.  Temp tables
have unique OIDs/relfilenodes anyway, don't they?  Or if I'm
misremembering and they don't, couldn't we make them so?

> 3. Checkpointer needs to skip them for fsync but bgwriter needs to write
> them out.

Not really sure why that would be a "must", either.  If the checkpointer
performs some useless writes, that's a bit of a performance drag, but
taking a performance hit to avoid it could be a net loss.  The only reason
why you'd care about writing at all is to try to make the buffers clean
in case they have to be reclaimed for some other use --- and if the
checkpointer does such a write instead of the bgwriter, why's that bad?

            regards, tom lane


Re: Keeping temporary tables in shared buffers

From
Asim Praveen
Date:


On Thu, May 24, 2018 at 8:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So then you have to think about how to transition smoothly between "rel
> is in local buffers" and "rel is in shared buffers", bearing in mind that
> ever having the same page in two different buffers would be disastrous.

Local buffers would not be used at all if temp tables start residing in shared buffers.  The transition mentioned above shouldn't be needed.

>
> I think that would be a deal breaker right there, because of the
> distributed overhead of making the tags bigger.  However, I don't
> actually understand why you would need to do that.  Temp tables
> have unique OIDs/relfilenodes anyway, don't they?  Or if I'm
> misremembering and they don't, couldn't we make them so?

My parochial vision of the overhead is restricted to 4 * NBuffers of additional shared memory, as 4 bytes are being added to BufferTag.  May I please get some enlightenment?

Temp tables have unique filenames on disk: t_<backendID>_<relfilenode>.  The logic that assigns OIDs and relfilenodes, however, doesn't differ.  Given a RelFileNode, it is not possible to tell whether it belongs to a temp table.  RelFileNodeBackend allows for that distinction, but it is not used by the buffer manager.

>
> taking a performance hit to avoid it could be a net loss.  The only reason
> why you'd care about writing at all is to try to make the buffers clean
> in case they have to be reclaimed for some other use --- and if the
> checkpointer does such a write instead of the bgwriter, why's that bad?

Yes, a temp table's buffer would need to be written out only if it needs to be repurposed for a different page.  It is not bad; our description wasn't clear enough.

Asim

Re: Keeping temporary tables in shared buffers

From
Heikki Linnakangas
Date:
On 25/05/18 09:25, Asim Praveen wrote:
> On Thu, May 24, 2018 at 8:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>
>> So then you have to think about how to transition smoothly between "rel
>> is in local buffers" and "rel is in shared buffers", bearing in mind that
>> ever having the same page in two different buffers would be disastrous.
> 
> Local buffers would not be used at all if temp tables start residing in
> shared buffers.  The transition mentioned above shouldn't be needed.

What is the performance difference between the local buffer manager and 
the shared buffer manager? The local buffer manager avoids all the 
locking overhead, which has to amount to something, but how big a 
difference is it?

>> I think that would be a deal breaker right there, because of the
>> distributed overhead of making the tags bigger.  However, I don't
>> actually understand why you would need to do that.  Temp tables
>> have unique OIDs/relfilenodes anyway, don't they?  Or if I'm
>> misremembering and they don't, couldn't we make them so?
> 
> My parochial vision of the overhead is restricted to 4 * NBuffers of
> additional shared memory, as 4 bytes are being added to BufferTag.  May I
> please get some enlightenment?

Any extra fields in BufferTag make computing the hash more expensive. 
It's a very hot code path, so any cycles spent are significant.

In relation to Andres' patches to rewrite the buffer manager with a 
radix tree, there was actually some discussion of trying to make 
BufferTag *smaller*. For example, we could rearrange things so that 
pg_class.relfilenode is 64 bits wide. Then you could assume that it 
never wraps around, and is unique across all relations in the cluster. 
Then you could replace the 12-byte relfilenode+dbid+spcid triplet, with 
just the 8-byte relfilenode. Doing something like that might be the 
solution here, too.

> Temp tables have unique filename on disk: t_<backendID>_<relfilenode>.  The
> logic to assign OIDs and relfilenodes, however, doesn't differ.  Given a
> RelFileNode, it is not possible to tell if it's a temp table or not.
> RelFileNodeBackend allows for that distinction but it's not used by buffer
> manager.

Could you store the backendid in BufferDesc, outside of BufferTag? Is it 
possible for a normal table and a temporary table to have the same 
relfilenode+dbid+spcid triplet?

- Heikki


Re: Keeping temporary tables in shared buffers

From
Andres Freund
Date:
On 2018-05-25 09:40:10 +0300, Heikki Linnakangas wrote:
> On 25/05/18 09:25, Asim Praveen wrote:
> > My parochial vision of the overhead is restricted to 4 * NBuffers of
> > additional shared memory, as 4 bytes are being added to BufferTag.  May I
> > please get some enlightenment?
> 
> Any extra fields in BufferTag make computing the hash more expensive. It's a
> very hot code path, so any cycles spent are significant.

Indeed, very much so.

But I'm not sure we need anything in the tags themselves. We don't
denote buffers for unlogged tables in the tag itself either. As Tom
observed, the oids for temp tables are either unique or can be made
unique easily enough.  And the temporariness can be declared in a bit in
the buffer header, rather than the tag itself. I don't see why a hash
lookup would need to know that.


> In relation to Andres' patches to rewrite the buffer manager with a radix
> tree, there was actually some discussion of trying to make BufferTag
> *smaller*.

FWIW, in the latest version that doesn't matter that much
anymore. Instead of one big tree it's a hashtable of trees (although it
potentially should rather be a tree of trees). The hashtable maps to a
radix tree, and that radix tree is just indexed by the offset.  The root
of the tree is then cached inside the smgr, avoiding the need to
repeatedly look it up.


> For example, we could rearrange things so that
> pg_class.relfilenode is 64 bits wide. Then you could assume that it never
> wraps around, and is unique across all relations in the cluster. Then you
> could replace the 12-byte relfilenode+dbid+spcid triplet, with just the
> 8-byte relfilenode. Doing something like that might be the solution here,
> too.

OTOH it's quite useful to have the buffer tag be something that can (or
rather could) be efficiently searched for in a hierarchical
fashion. While, by far, not as crucial performance-wise as dropping an
individual relation, it would be nice not to have to scan all of s_b to
drop a database.


> > Temp tables have unique filename on disk: t_<backendID>_<relfilenode>.  The
> > logic to assign OIDs and relfilenodes, however, doesn't differ.  Given a
> > RelFileNode, it is not possible to tell if it's a temp table or not.
> > RelFileNodeBackend allows for that distinction but it's not used by buffer
> > manager.
> 
> Could you store the backendid in BufferDesc, outside of BufferTag? Is it
> possible for a normal table and a temporary table to have the same
> relfilenode+dbid+spcid triplet?

When starting to work on the radix tree stuff I had, to address the
buffer-tag size issue you mention above, a prototype patch that created a
shared 'relfilenode' table. That guaranteed that relfilenodes are
unique.  That'd work here as well, and would allow getting rid of a good
chunk of ugliness we have around allocating relfilenodes right now (like
not unlinking files etc).

But more generally, I don't see why it'd be that problematic to just get
rid of the backendid? I don't really see any technical necessity to have
it.

Greetings,

Andres Freund


Re: Keeping temporary tables in shared buffers

From
Ashwin Agrawal
Date:

On Thu, May 24, 2018 at 11:50 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-05-25 09:40:10 +0300, Heikki Linnakangas wrote:
> > On 25/05/18 09:25, Asim Praveen wrote:
> > > My parochial vision of the overhead is restricted to 4 * NBuffers of
> > > additional shared memory, as 4 bytes are being added to BufferTag.  May I
> > > please get some enlightenment?
> >
> > Any extra fields in BufferTag make computing the hash more expensive. It's a
> > very hot code path, so any cycles spent are significant.
>
> Indeed, very much so.
>
> But I'm not sure we need anything in the tags themselves. We don't
> denote buffers for unlogged tables in the tag itself either. As Tom
> observed, the oids for temp tables are either unique or can be made
> unique easily enough.  And the temporariness can be declared in a bit in
> the buffer header, rather than the tag itself. I don't see why a hash
> lookup would need to know that.

Currently, the (spcid, dbid, relfilenode) triplets for temp and regular tables can collide, because temp files are distinguished only by their "t_nnn" on-disk representation; the relfilenode allocation logic itself can assign the same relfilenode to a temp and a non-temp table.  If relfilenode uniqueness can be achieved, the need to add anything to the buffer tag goes away.

> When starting to work on the radix tree stuff I had, to address the
> buffer-tag size issue you mention above, a prototype patch that created a
> shared 'relfilenode' table. That guaranteed that relfilenodes are
> unique.  That'd work here as well, and would allow getting rid of a good
> chunk of ugliness we have around allocating relfilenodes right now (like
> not unlinking files etc).

That would be great!

> But more generally, I don't see why it'd be that problematic to just get
> rid of the backendid? I don't really see any technical necessity to have
> it.

It seems the backend ID was also added for the same reason: temp tables not having unique relfilenodes.  So yes, with uniqueness guaranteed, it can go away as well.

Re: Keeping temporary tables in shared buffers

From
Amit Kapila
Date:
On Fri, May 25, 2018 at 6:33 AM, Asim Praveen <apraveen@pivotal.io> wrote:
> Hello
>
> We are evaluating the use of shared buffers for temporary tables.  The
> advantage being queries involving temporary tables can make use of parallel
> workers.
>

This is one way, but I think there are other choices as well.  We can
identify and flush all the dirty (local) buffers for the relation
being accessed in parallel.  Then, once the parallel operation has
started, we won't allow any write operations on them.  This could be
expensive if we have a lot of dirty local buffers for a particular
relation.  If we are worried about the cost of writes, we can try a
different way to parallelize a temporary table scan.  At the beginning
of the scan, the leader backend remembers the dirty blocks present in
its local buffers; it then shares the list with the parallel workers,
which skip scanning those blocks, and at the end the leader scans
those blocks itself.  This shouldn't incur much additional cost, as
the skipped blocks should be present in the leader's local buffers.

I understand that none of these alternatives are straightforward, but
I think it is worth considering whether we have any better way to
allow parallel temporary table scans.


> Challenges:
> 1. We lose the performance benefit of local buffers.

Yeah, I think cases where we need to drop temp relations will become
costlier, as they have to traverse all the shared buffers instead of
just the local buffers.

I think if we use shared buffers for temp relations, there will be
some overhead for other backends as well, especially for the cases
when backends need to evict buffers.  It is quite possible that if the
relation is in local buffers, we might not write it at all, but moving
it to shared buffers will increase its probability of being written to
disk.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Keeping temporary tables in shared buffers

From
Asim Praveen
Date:
Hi Amit

On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> This is one way, but I think there are other choices as well.  We can
> identify and flush all the dirty (local) buffers for the relation
> being accessed parallelly.  Now, once the parallel operation is
> started, we won't allow performing any write operation on them.  It

We talked about this in person in Ottawa and it was great meeting you!
 To summarize, the above proposal to continue using local buffers for
temp tables is a step forward, however, it enables only certain kinds
of queries to be parallelized for temp tables.  E.g. queries changing
a temp table in any way cannot be parallelized due to the restriction
of no writes during parallel operation.

>
> Yeah, I think cases, where we need to drop temp relations, will become
> costlier as they have to traverse all the shared buffers instead of
> just local buffers.
>

This is a valid concern.  The idea of using a radix tree of block
numbers, as proposed by Andres [1], is worth pursuing.  The cost of
identifying and dropping the shared buffers belonging to a relation
would then be reduced to O(log n).

Asim

[1] https://www.postgresql.org/message-id/20150912201941.GA8311%40alap3.anarazel.de


Re: Keeping temporary tables in shared buffers

From
Bruce Momjian
Date:
On Sat, Jun  2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
> Hi Amit
> 
> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > This is one way, but I think there are other choices as well.  We can
> > identify and flush all the dirty (local) buffers for the relation
> > being accessed parallelly.  Now, once the parallel operation is
> > started, we won't allow performing any write operation on them.  It
> 
> We talked about this in person in Ottawa and it was great meeting you!
>  To summarize, the above proposal to continue using local buffers for
> temp tables is a step forward, however, it enables only certain kinds
> of queries to be parallelized for temp tables.  E.g. queries changing
> a temp table in any way cannot be parallelized due to the restriction
> of no writes during parallel operation.

Should this be a TODO item?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +


Re: Keeping temporary tables in shared buffers

From
Amit Kapila
Date:
On Wed, Jun 20, 2018 at 8:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Jun  2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
>> Hi Amit
>>
>> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > This is one way, but I think there are other choices as well.  We can
>> > identify and flush all the dirty (local) buffers for the relation
>> > being accessed parallelly.  Now, once the parallel operation is
>> > started, we won't allow performing any write operation on them.  It
>>
>> We talked about this in person in Ottawa and it was great meeting you!
>>  To summarize, the above proposal to continue using local buffers for
>> temp tables is a step forward, however, it enables only certain kinds
>> of queries to be parallelized for temp tables.  E.g. queries changing
>> a temp table in any way cannot be parallelized due to the restriction
>> of no writes during parallel operation.
>
> Should this be a TODO item?
>

+1.  I think we have not hammered out the design completely, but if
somebody is willing to put in the effort, it is not an unsolvable
problem.  AFAIU, this thread is about parallelizing queries that refer
to temp tables; however, that is not clear from the title of this
thread.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Keeping temporary tables in shared buffers

From
Robert Haas
Date:
On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, May 25, 2018 at 6:33 AM, Asim Praveen <apraveen@pivotal.io> wrote:
>> We are evaluating the use of shared buffers for temporary tables.  The
>> advantage being queries involving temporary tables can make use of parallel
>> workers.
> This is one way, but I think there are other choices as well.  We can
> identify and flush all the dirty (local) buffers for the relation
> being accessed in parallel.  Then, once the parallel operation has
> started, we won't allow any write operations on them.  This could be
> expensive if we have a lot of dirty local buffers for a particular
> relation.  If we are worried about the cost of writes, we can try a
> different way to parallelize a temporary table scan.  At the beginning
> of the scan, the leader backend remembers the dirty blocks present in
> its local buffers; it then shares the list with the parallel workers,
> which skip scanning those blocks, and at the end the leader scans
> those blocks itself.  This shouldn't incur much additional cost, as
> the skipped blocks should be present in the leader's local buffers.

This sounds awkward and limiting.  How about using DSA to allocate
space for the backend's temporary buffers and a dshash for lookups?
Then all backends can share, but we don't have to go through
shared_buffers.

Honestly, I don't see how pushing this through the main shared_buffers
arena is ever going to work.  Widening the buffer tags by another 4
bytes is a non-starter, I think.  Maybe with some large rejiggering we
could get to a point where every relation, temporary or permanent, can
be identified by a single OID, regardless of tablespace, etc.  But if
you wait for that, this might not happen for a loooong time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Keeping temporary tables in shared buffers

From
Amit Kapila
Date:
On Fri, Jun 22, 2018 at 6:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, May 25, 2018 at 6:33 AM, Asim Praveen <apraveen@pivotal.io> wrote:
>>> We are evaluating the use of shared buffers for temporary tables.  The
>>> advantage being queries involving temporary tables can make use of parallel
>>> workers.
>> This is one way, but I think there are other choices as well.  We can
>> identify and flush all the dirty (local) buffers for the relation
>> being accessed in parallel.  Then, once the parallel operation has
>> started, we won't allow any write operations on them.  This could be
>> expensive if we have a lot of dirty local buffers for a particular
>> relation.  If we are worried about the cost of writes, we can try a
>> different way to parallelize a temporary table scan.  At the beginning
>> of the scan, the leader backend remembers the dirty blocks present in
>> its local buffers; it then shares the list with the parallel workers,
>> which skip scanning those blocks, and at the end the leader scans
>> those blocks itself.  This shouldn't incur much additional cost, as
>> the skipped blocks should be present in the leader's local buffers.
>
> This sounds awkward and limiting.  How about using DSA to allocate
> space for the backend's temporary buffers and a dshash for lookups?
>

That's a better idea.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Keeping temporary tables in shared buffers

From
Bruce Momjian
Date:
On Thu, Jun 21, 2018 at 07:42:54AM +0530, Amit Kapila wrote:
> On Wed, Jun 20, 2018 at 8:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Sat, Jun  2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
> >> Hi Amit
> >>
> >> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> >
> >> > This is one way, but I think there are other choices as well.  We can
> >> > identify and flush all the dirty (local) buffers for the relation
> >> > being accessed parallelly.  Now, once the parallel operation is
> >> > started, we won't allow performing any write operation on them.  It
> >>
> >> We talked about this in person in Ottawa and it was great meeting you!
> >>  To summarize, the above proposal to continue using local buffers for
> >> temp tables is a step forward, however, it enables only certain kinds
> >> of queries to be parallelized for temp tables.  E.g. queries changing
> >> a temp table in any way cannot be parallelized due to the restriction
> >> of no writes during parallel operation.
> >
> > Should this be a TODO item?
> >
> 
> +1.  I think we have not hammered out the design completely, but if
> somebody is willing to put in the effort, it is not an unsolvable
> problem.  AFAIU, this thread is about parallelizing queries that refer
> to temp tables; however, that is not clear from the title of this
> thread.

Seems it is already documented on the wiki:

    https://wiki.postgresql.org/wiki/Parallel_Query#What_Parts_of_a_Query_Can_Run_In_Parallel.3F

    o  Scans of plain tables may not appear below Gather if (1) they are
       temporary tables ...
       ----------------

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +