Thread: Keeping temporary tables in shared buffers
Hello
We are evaluating the use of shared buffers for temporary tables. The advantage is that queries involving temporary tables could then make use of parallel workers.
Challenges:
1. We lose the performance benefit of local buffers.
2. Additional complexity in shared buffer manager - BufferTag needs to include backend ID to distinguish 'my' temp buffers from non-temp or 'others' temp buffers.
3. Checkpointer needs to skip them for fsync but bgwriter needs to write them out.
Has someone already thought about this? Any workloads that might benefit from this idea?
Asim and David
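The BufferTag change contemplated in challenge 2 can be sketched as follows. The typedefs are simplified stand-ins for the real definitions in buf_internals.h and relfilenode.h, and the widened tag is the proposal under discussion, not existing code:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;
typedef int32_t  ForkNumber;    /* an enum in PostgreSQL proper */
typedef int32_t  BackendId;

/* spcNode/dbNode/relNode triplet, as in storage/relfilenode.h */
typedef struct RelFileNode
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
} RelFileNode;

/* Today's tag: 20 bytes, hashed on every shared-buffer lookup */
typedef struct BufferTag
{
    RelFileNode rnode;
    ForkNumber  forkNum;
    BlockNumber blockNum;
} BufferTag;

/* The proposal: widen the tag so 'my' temp buffers can be told apart
 * from non-temp or 'others' temp buffers */
typedef struct BufferTagWithBackend
{
    RelFileNode rnode;
    BackendId   backend;        /* new: owning backend for temp rels */
    ForkNumber  forkNum;
    BlockNumber blockNum;
} BufferTagWithBackend;
```

The extra field accounts for the 4 * NBuffers of additional shared memory mentioned later in the thread, on top of the per-lookup hashing cost.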
Asim Praveen <apraveen@pivotal.io> writes:
> We are evaluating the use of shared buffers for temporary tables. The
> advantage being queries involving temporary tables can make use of parallel
> workers.

Hm...

> Challenges:
> 1. We lose the performance benefit of local buffers.

Yeah.  This would be an absolute dead loss for any situation where you
couldn't get major parallelism wins, which I'm afraid would be often.
So then you have to think about how to transition smoothly between "rel
is in local buffers" and "rel is in shared buffers", bearing in mind that
ever having the same page in two different buffers would be disastrous.

> 2. Additional complexity in shared buffer manager - BufferTag needs to
> include backend ID to distinguish 'my' temp buffers from non-temp or
> 'others' temp buffers.

I think that would be a deal breaker right there, because of the
distributed overhead of making the tags bigger.  However, I don't
actually understand why you would need to do that.  Temp tables
have unique OIDs/relfilenodes anyway, don't they?  Or if I'm
misremembering and they don't, couldn't we make them so?

> 3. Checkpointer needs to skip them for fsync but bgwriter needs to write
> them out.

Not really sure why that would be a "must", either.  If the checkpointer
performs some useless writes, that's a bit of a performance drag, but
taking a performance hit to avoid it could be a net loss.  The only reason
why you'd care about writing at all is to try to make the buffers clean
in case they have to be reclaimed for some other use --- and if the
checkpointer does such a write instead of the bgwriter, why's that bad?

			regards, tom lane
On Thu, May 24, 2018 at 8:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> So then you have to think about how to transition smoothly between "rel
> is in local buffers" and "rel is in shared buffers", bearing in mind that
> ever having the same page in two different buffers would be disastrous.
Local buffers would not be used at all if temp tables start residing in shared buffers. The transition mentioned above shouldn't be needed.
> I think that would be a deal breaker right there, because of the
> distributed overhead of making the tags bigger. However, I don't
> actually understand why you would need to do that. Temp tables
> have unique OIDs/relfilenodes anyway, don't they? Or if I'm
> misremembering and they don't, couldn't we make them so?
My parochial vision of the overhead is restricted to 4 * NBuffers of additional shared memory, as 4 bytes are being added to BufferTag. May I please get some enlightenment?
Temp tables have a unique filename on disk: t_<backendID>_<relfilenode>. The logic to assign OIDs and relfilenodes, however, doesn't differ. Given a RelFileNode, it is not possible to tell whether it belongs to a temp table. RelFileNodeBackend allows for that distinction, but it is not used by the buffer manager.
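For reference, that distinction can be sketched like this. The struct shapes follow storage/relfilenode.h, but the typedefs are stand-ins, and relpath_sketch is a hypothetical reduction of the real relpath machinery (temp rels are named like "t3_16384" on disk):

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef uint32_t Oid;
typedef int32_t  BackendId;
#define InvalidBackendId (-1)

typedef struct RelFileNode
{
    Oid spcNode;
    Oid dbNode;
    Oid relNode;
} RelFileNode;

/* What smgr uses: carries the temp-ness, unlike the bare RelFileNode */
typedef struct RelFileNodeBackend
{
    RelFileNode node;
    BackendId   backend;    /* owning backend, or InvalidBackendId */
} RelFileNodeBackend;

/* Mirrors the on-disk naming: only the filename encodes temp-ness */
static void
relpath_sketch(const RelFileNodeBackend *rnode, char *buf, size_t len)
{
    if (rnode->backend == InvalidBackendId)
        snprintf(buf, len, "%u", rnode->node.relNode);
    else
        snprintf(buf, len, "t%d_%u", rnode->backend, rnode->node.relNode);
}
```

Since BufferTag holds only the bare RelFileNode, nothing in the buffer manager can recover the backend field shown here.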
>
> taking a performance hit to avoid it could be a net loss. The only reason
> why you'd care about writing at all is to try to make the buffers clean
> in case they have to be reclaimed for some other use --- and if the
> checkpointer does such a write instead of the bgwriter, why's that bad?
Yes, a temp table's buffer would need to be written out only if it needs to be repurposed for a different page. That is not bad; our description wasn't clear enough.
Asim
On 25/05/18 09:25, Asim Praveen wrote:
> On Thu, May 24, 2018 at 8:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> So then you have to think about how to transition smoothly between "rel
>> is in local buffers" and "rel is in shared buffers", bearing in mind that
>> ever having the same page in two different buffers would be disastrous.
>
> Local buffers would not be used at all if temp tables start residing in
> shared buffers. The transition mentioned above shouldn't be needed.

What is the performance difference between the local buffer manager and
the shared buffer manager? The local buffer manager avoids all the
locking overhead, which has to amount to something, but how big a
difference is it?

>> I think that would be a deal breaker right there, because of the
>> distributed overhead of making the tags bigger.  However, I don't
>> actually understand why you would need to do that.  Temp tables
>> have unique OIDs/relfilenodes anyway, don't they?  Or if I'm
>> misremembering and they don't, couldn't we make them so?
>
> My parochial vision of the overhead is restricted to 4 * NBuffers of
> additional shared memory, as 4 bytes are being added to BufferTag. May I
> please get some enlightenment?

Any extra fields in BufferTag make computing the hash more expensive.
It's a very hot code path, so any cycles spent are significant.

In relation to Andres' patches to rewrite the buffer manager with a
radix tree, there was actually some discussion of trying to make
BufferTag *smaller*. For example, we could rearrange things so that
pg_class.relfilenode is 64 bits wide. Then you could assume that it
never wraps around, and is unique across all relations in the cluster.
Then you could replace the 12-byte relfilenode+dbid+spcid triplet with
just the 8-byte relfilenode. Doing something like that might be the
solution here, too.

> Temp tables have unique filename on disk: t_<backendID>_<relfilenode>. The
> logic to assign OIDs and relfilenodes, however, doesn't differ. Given a
> RelFileNode, it is not possible to tell if it's a temp table or not.
> RelFileNodeBackend allows for that distinction but it's not used by buffer
> manager.

Could you store the backendid in BufferDesc, outside of BufferTag? Is it
possible for a normal table and a temporary table to have the same
relfilenode+dbid+spcid triplet?

- Heikki
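The size arithmetic of the 64-bit-relfilenode idea works out as follows. The typedefs are stand-ins, and BufferTag64 is a hypothetical layout, not a real structure:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;
typedef int32_t  ForkNumber;

/* Current tag: spcid+dbid+relfilenode triplet, 20 bytes in all */
typedef struct BufferTag
{
    Oid         spcNode;
    Oid         dbNode;
    Oid         relNode;
    ForkNumber  forkNum;
    BlockNumber blockNum;
} BufferTag;

/* With a cluster-wide-unique 64-bit relfilenode that never wraps,
 * the 12-byte triplet collapses into a single 8-byte field */
typedef struct BufferTag64
{
    uint64_t    relNode64;
    ForkNumber  forkNum;
    BlockNumber blockNum;
} BufferTag64;
```

That shrinks the hashed key from 20 bytes to 16, rather than growing it to 24 as adding a backend ID would.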
On 2018-05-25 09:40:10 +0300, Heikki Linnakangas wrote:
> On 25/05/18 09:25, Asim Praveen wrote:
> > My parochial vision of the overhead is restricted to 4 * NBuffers of
> > additional shared memory, as 4 bytes are being added to BufferTag. May I
> > please get some enlightenment?
>
> Any extra fields in BufferTag make computing the hash more expensive. It's a
> very hot code path, so any cycles spent are significant.

Indeed, very much so.

But I'm not sure we need anything in the tags themselves. We don't
denote buffers for unlogged tables in the tag itself either. As Tom
observed, the oids for temp tables are either unique or can be made
unique easily enough. And the temporariness can be declared in a bit in
the buffer header, rather than the tag itself. I don't see why a hash
lookup would need to know that.

> In relation to Andres' patches to rewrite the buffer manager with a radix
> tree, there was actually some discussion of trying to make BufferTag
> *smaller*.

FWIW, in the latest version that doesn't matter that much anymore.
Instead of one big tree it's a hashtable of trees (although it
potentially should rather be a tree of trees). The hashtable maps to a
radix tree, and that radix tree is just indexed by the offset. The root
of the tree is then cached inside the smgr, avoiding the need to
repeatedly look it up.

> For example, we could rearrange things so that
> pg_class.relfilenode is 64 bits wide. Then you could assume that it never
> wraps around, and is unique across all relations in the cluster. Then you
> could replace the 12-byte relfilenode+dbid+spcid triplet, with just the
> 8-byte relfilenode. Doing something like that might be the solution here,
> too.

OTOH it's quite useful to have the buffertag be something that can (or
rather could) be efficiently searched for in a hierarchical fashion.
While, by far, not as crucial performancewise as dropping an individual
relation, it would be nice not to have to scan all of s_b to drop a
database.

> > Temp tables have unique filename on disk: t_<backendID>_<relfilenode>. The
> > logic to assign OIDs and relfilenodes, however, doesn't differ. Given a
> > RelFileNode, it is not possible to tell if it's a temp table or not.
> > RelFileNodeBackend allows for that distinction but it's not used by buffer
> > manager.
>
> Could you store the backendid in BufferDesc, outside of BufferTag? Is it
> possible for a normal table and a temporary table to have the same
> relfilenode+dbid+spcid triplet?

When starting to work on the radix tree stuff I had, to address the size
of buffer tag issue you mention above, a prototype patch that created a
shared 'relfilenode' table. That guaranteed that relfilenodes are
unique. That'd work here as well, and would allow to get rid of a good
chunk of ugliness we have around allocating relfilenodes right now (like
not unlinking files etc).

But more generally, I don't see why it'd be that problematic to just get
rid of the backendid? I don't really see any technical necessity to have
it.

Greetings,

Andres Freund
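The bit-in-the-buffer-header idea can be sketched as below. BM_TEMP and its bit position are hypothetical, and the descriptor is heavily simplified (in PostgreSQL the state word is a pg_atomic_uint32 holding flags, refcount, and usage count), but it shows how temp-ness stays out of the hashed tag:

```c
#include <assert.h>
#include <stdint.h>

/* Flag bits live in the high bits of the buffer state word; the exact
 * values here are illustrative, and BM_TEMP is a hypothetical addition */
#define BM_DIRTY    (1U << 23)
#define BM_VALID    (1U << 25)
#define BM_TEMP     (1U << 28)  /* hypothetical: buffer holds a temp page */

typedef struct BufferDescSketch
{
    /* tag omitted; it is unchanged by this scheme */
    uint32_t state;             /* flags + counts, atomic in reality */
} BufferDescSketch;

/* E.g. the checkpointer could skip fsync bookkeeping for temp pages
 * without the hash lookup ever needing to see the flag */
static int
skip_fsync(const BufferDescSketch *buf)
{
    return (buf->state & BM_TEMP) != 0;
}
```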
On Thu, May 24, 2018 at 11:50 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2018-05-25 09:40:10 +0300, Heikki Linnakangas wrote:
> > On 25/05/18 09:25, Asim Praveen wrote:
> > > My parochial vision of the overhead is restricted to 4 * NBuffers of
> > > additional shared memory, as 4 bytes are being added to BufferTag. May I
> > > please get some enlightenment?
> >
> > Any extra fields in BufferTag make computing the hash more expensive. It's a
> > very hot code path, so any cycles spent are significant.
>
> Indeed, very much so.
>
> But I'm not sure we need anything in the tags themselves. We don't
> denote buffers for unlogged tables in the tag itself either. As Tom
> observed, the oids for temp tables are either unique or can be made
> unique easily enough. And the temporariness can be declared in a bit in
> the buffer header, rather than the tag itself. I don't see why a hash
> lookup would need to know that.

Currently, relfilenodes (specifically the spcid, dbid, relfilenode
triplet) for temp and regular tables can collide, because temp files
differ only in their "t_nnn" on-disk naming. Because of this, the
relfilenode allocation logic can assign the same relfilenode to a temp
and a non-temp table. If relfilenode uniqueness can be achieved, the
need to add anything to the buffer tag goes away.

> When starting to work on the radix tree stuff I had, to address the size
> of buffer tag issue you mention above, a prototype patch that created a
> shared 'relfilenode' table. That guaranteed that relfilenodes are
> unique. That'd work here as well, and would allow to get rid of a good
> chunk of ugliness we have around allocating relfilenodes right now (like
> not unlinking files etc).

That would be great!

> But more generally, I don't see why it'd be that problematic to just get
> rid of the backendid? I don't really see any technical necessity to have
> it.

It seems backendId, too, was added because relfilenodes are not unique
for temp tables. So yes, with uniqueness guaranteed, it can go away as
well.
On Fri, May 25, 2018 at 6:33 AM, Asim Praveen <apraveen@pivotal.io> wrote:
> Hello
>
> We are evaluating the use of shared buffers for temporary tables. The
> advantage being queries involving temporary tables can make use of parallel
> workers.

This is one way, but I think there are other choices as well. We can
identify and flush all the dirty (local) buffers for the relation being
accessed parallelly. Now, once the parallel operation is started, we
won't allow performing any write operation on them. It could be
expensive if we have a lot of dirty local buffers for a particular
relation. If we are worried about the cost of writes, then we can try a
different way to parallelize a temporary table scan. At the beginning of
the scan, the leader backend will remember the dirty blocks present in
its local buffers; it can then share the list with the parallel workers,
which will skip scanning those blocks, and at the end the leader ensures
that all those blocks are scanned by the leader itself. This shouldn't
incur much additional cost, as the skipped blocks should be present in
the backend's local buffers.

I understand that none of these alternatives is straightforward, but I
think it is worth considering whether we have any better way to allow
parallel temporary table scans.

> Challenges:
> 1. We lose the performance benefit of local buffers.

Yeah, and I think cases where we need to drop temp relations will become
costlier, as they will have to traverse all of shared buffers instead of
just local buffers.

I think if we use shared buffers for temp relations, there will be some
overhead for other backends as well, especially when backends need to
evict buffers. It is quite possible that if the relation stays in local
buffers we might not write it at all, but moving it to shared buffers
will increase the probability of its being written to disk.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
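The block-skipping scheme described above could look roughly like this; every name here is hypothetical, and the fixed-size bitmap stands in for whatever shared-memory structure the leader would actually publish:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t BlockNumber;

/* Shared by the leader with its workers: the blocks the leader found
 * dirty in its local buffers at scan start.  Toy-sized bitmap. */
#define MAX_BLOCKS 8192

typedef struct DirtyBlockSet
{
    BlockNumber nblocks;                /* relation size in blocks */
    uint8_t     bitmap[MAX_BLOCKS / 8];
} DirtyBlockSet;

static void
dirtyset_init(DirtyBlockSet *set, BlockNumber nblocks)
{
    memset(set, 0, sizeof(*set));
    set->nblocks = nblocks;
}

static void
dirtyset_add(DirtyBlockSet *set, BlockNumber blk)
{
    set->bitmap[blk / 8] |= (uint8_t) (1U << (blk % 8));
}

/* Workers scan a block only if the leader did not mark it dirty;
 * the leader itself scans the marked blocks at the end */
static int
worker_should_scan(const DirtyBlockSet *set, BlockNumber blk)
{
    return (set->bitmap[blk / 8] & (1U << (blk % 8))) == 0;
}
```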
Hi Amit

On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> This is one way, but I think there are other choices as well. We can
> identify and flush all the dirty (local) buffers for the relation
> being accessed parallelly. Now, once the parallel operation is
> started, we won't allow performing any write operation on them. It

We talked about this in person in Ottawa and it was great meeting you!
To summarize, the above proposal to continue using local buffers for
temp tables is a step forward; however, it enables only certain kinds
of queries to be parallelized for temp tables. E.g. queries changing a
temp table in any way cannot be parallelized, due to the restriction of
no writes during a parallel operation.

> Yeah, I think cases, where we need to drop temp relations, will become
> costlier as they have to traverse all the shared buffers instead of
> just local buffers.

This is a valid concern. The idea of using a radix tree of block
numbers, as proposed by Andres [1], is worth pursuing. The cost of
identifying and dropping the shared buffers belonging to a relation
using a radix tree would be reduced to O(log n).

Asim

[1] https://www.postgresql.org/message-id/20150912201941.GA8311%40alap3.anarazel.de
On Sat, Jun 2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
> Hi Amit
>
> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > This is one way, but I think there are other choices as well. We can
> > identify and flush all the dirty (local) buffers for the relation
> > being accessed parallelly. Now, once the parallel operation is
> > started, we won't allow performing any write operation on them. It
>
> We talked about this in person in Ottawa and it was great meeting you!
> To summarize, the above proposal to continue using local buffers for
> temp tables is a step forward, however, it enables only certain kinds
> of queries to be parallelized for temp tables. E.g. queries changing
> a temp table in any way cannot be parallelized due to the restriction
> of no writes during parallel operation.

Should this be a TODO item?

--
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On Wed, Jun 20, 2018 at 8:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
> On Sat, Jun 2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
>> Hi Amit
>>
>> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> >
>> > This is one way, but I think there are other choices as well. We can
>> > identify and flush all the dirty (local) buffers for the relation
>> > being accessed parallelly. Now, once the parallel operation is
>> > started, we won't allow performing any write operation on them. It
>>
>> We talked about this in person in Ottawa and it was great meeting you!
>> To summarize, the above proposal to continue using local buffers for
>> temp tables is a step forward, however, it enables only certain kinds
>> of queries to be parallelized for temp tables. E.g. queries changing
>> a temp table in any way cannot be parallelized due to the restriction
>> of no writes during parallel operation.
>
> Should this be a TODO item?

+1. I think we have not hammered out the design completely, but if
somebody is willing to put in the effort, it is not an unsolvable
problem. AFAIU, this thread is about parallelizing queries that refer to
temp tables; however, that is not clear from the title of this thread.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Fri, May 25, 2018 at 6:33 AM, Asim Praveen <apraveen@pivotal.io> wrote:
>> We are evaluating the use of shared buffers for temporary tables. The
>> advantage being queries involving temporary tables can make use of parallel
>> workers.
>
> This is one way, but I think there are other choices as well. We can
> identify and flush all the dirty (local) buffers for the relation
> being accessed parallelly. Now, once the parallel operation is
> started, we won't allow performing any write operation on them. It
> could be expensive if we have a lot of dirty local buffers for a
> particular relation. I think if we are worried about the cost of
> writes, then we can try some different way to parallelize temporary
> table scan. At the beginning of the scan, leader backend will
> remember the dirty blocks present in local buffers, it can then share
> the list with parallel workers which will skip scanning those blocks
> and in the end leader ensures that all those blocks will be scanned by
> the leader. This shouldn't incur a much additional cost as the
> skipped blocks should be present in local buffers of backend.

This sounds awkward and limiting. How about using DSA to allocate space
for the backend's temporary buffers and a dshash for lookups? Then all
backends can share, but we don't have to go through shared_buffers.

Honestly, I don't see how pushing this through the main shared_buffers
arena is ever going to work. Widening the buffer tags by another 4 bytes
is a non-starter, I think. Maybe with some large rejiggering we could
get to a point where every relation, temporary or permanent, can be
identified by a single OID, regardless of tablespace, etc. But if you
wait for that, this might not happen for a loooong time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
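The lookup side of that idea, reduced to a plain-C stand-in: the real interface would be dshash.h (dshash_create, dshash_find_or_insert) over DSA-allocated memory, so everything below, including the names and the fixed-size probing table, is a simplified illustration rather than that API:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t Oid;
typedef uint32_t BlockNumber;

/* Key for finding a temp-table page shared among cooperating backends */
typedef struct TempBufKey
{
    Oid         relNode;
    BlockNumber blockNum;
} TempBufKey;

typedef struct TempBufEntry
{
    TempBufKey key;
    int        used;
    int        buf_id;      /* which DSA-allocated buffer holds the page */
} TempBufEntry;

#define TBL_SIZE 64         /* power of two; toy-sized for the sketch */
static TempBufEntry table[TBL_SIZE];

static uint32_t
key_hash(const TempBufKey *k)
{
    return (k->relNode * 2654435761u) ^ (k->blockNum * 40503u);
}

/* Linear-probing find-or-insert, standing in for dshash_find_or_insert;
 * a fresh entry comes back with buf_id = -1 for the caller to fill in */
static TempBufEntry *
tempbuf_find_or_insert(const TempBufKey *k)
{
    uint32_t i = key_hash(k) & (TBL_SIZE - 1);

    for (;;)
    {
        TempBufEntry *e = &table[i];

        if (!e->used)
        {
            e->key = *k;
            e->used = 1;
            e->buf_id = -1;
            return e;
        }
        if (memcmp(&e->key, k, sizeof(*k)) == 0)
            return e;
        i = (i + 1) & (TBL_SIZE - 1);
    }
}
```

Because the table lives outside shared_buffers, BufferTag stays untouched and only backends attached to the same parallel query pay for these lookups.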
On Fri, Jun 22, 2018 at 6:09 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> On Fri, May 25, 2018 at 6:33 AM, Asim Praveen <apraveen@pivotal.io> wrote:
>>> We are evaluating the use of shared buffers for temporary tables. The
>>> advantage being queries involving temporary tables can make use of parallel
>>> workers.
>>
>> This is one way, but I think there are other choices as well. We can
>> identify and flush all the dirty (local) buffers for the relation
>> being accessed parallelly. Now, once the parallel operation is
>> started, we won't allow performing any write operation on them. It
>> could be expensive if we have a lot of dirty local buffers for a
>> particular relation. I think if we are worried about the cost of
>> writes, then we can try some different way to parallelize temporary
>> table scan. At the beginning of the scan, leader backend will
>> remember the dirty blocks present in local buffers, it can then share
>> the list with parallel workers which will skip scanning those blocks
>> and in the end leader ensures that all those blocks will be scanned by
>> the leader. This shouldn't incur a much additional cost as the
>> skipped blocks should be present in local buffers of backend.
>
> This sounds awkward and limiting. How about using DSA to allocate
> space for the backend's temporary buffers and a dshash for lookups?

That's a better idea.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 21, 2018 at 07:42:54AM +0530, Amit Kapila wrote:
> On Wed, Jun 20, 2018 at 8:47 PM, Bruce Momjian <bruce@momjian.us> wrote:
> > On Sat, Jun 2, 2018 at 05:18:17PM -0400, Asim Praveen wrote:
> >> Hi Amit
> >>
> >> On Mon, May 28, 2018 at 4:25 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> >> >
> >> > This is one way, but I think there are other choices as well. We can
> >> > identify and flush all the dirty (local) buffers for the relation
> >> > being accessed parallelly. Now, once the parallel operation is
> >> > started, we won't allow performing any write operation on them. It
> >>
> >> We talked about this in person in Ottawa and it was great meeting you!
> >> To summarize, the above proposal to continue using local buffers for
> >> temp tables is a step forward, however, it enables only certain kinds
> >> of queries to be parallelized for temp tables. E.g. queries changing
> >> a temp table in any way cannot be parallelized due to the restriction
> >> of no writes during parallel operation.
> >
> > Should this be a TODO item?
>
> +1. I think we have not hammered out the design completely, but if
> somebody is willing to put effort, it is not an unsolvable problem.
> AFAIU, this thread is about parallelizing queries that refer temp
> tables, however, it is not clear from the title of this thread.

Seems it is already documented on the wiki:

	https://wiki.postgresql.org/wiki/Parallel_Query#What_Parts_of_a_Query_Can_Run_In_Parallel.3F

	o Scans of plain tables may not appear below Gather if (1) they
	  are temporary tables ...

----------------

--
Bruce Momjian <bruce@momjian.us>    http://momjian.us
EnterpriseDB                        http://enterprisedb.com