Re: Use generation memory context for tuplestore.c - Mailing list pgsql-hackers
From | David Rowley |
---|---|
Subject | Re: Use generation memory context for tuplestore.c |
Date | |
Msg-id | CAApHDvoSF_NL9zBmT7TA9Lu8=qPTUr3bbRHTReTqWjejcLt7Qg@mail.gmail.com Whole thread Raw |
In response to | Re: Use generation memory context for tuplestore.c (Matthias van de Meent <boekewurm+postgres@gmail.com>) |
Responses |
Re: Use generation memory context for tuplestore.c
|
List | pgsql-hackers |
On Sat, 4 May 2024 at 03:51, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > Was a bump context considered? If so, why didn't it make the cut? > If tuplestore_trim is the only reason why the type of context in patch > 2 is a generation context, then couldn't we make the type of context > conditional on state->eflags & EXEC_FLAG_REWIND, and use a bump > context if we require rewind capabilities (i.e. where _trim is never > effectively executed)? I didn't really want to raise all this here, but to answer why I didn't use bump... There's a bit more that would need to be done to allow bump to work in use-cases where no trim support is needed. Namely, if you look at writetup_heap(), you'll see a heap_free_minimal_tuple(), which is pfreeing the memory that was allocated for the tuple in either tuplestore_puttupleslot(), tuplestore_puttuple() or tuplestore_putvalues(). So basically, what happens if we're still loading the tuplestore and we've spilled to disk, once the tuplestore_put* function is called, we allocate memory for the tuple that might get stored in RAM (we don't know yet), but then call tuplestore_puttuple_common() which decides if the tuple goes to RAM or disk, then because we're spilling to disk, the write function pfree's the memory we allocate in the tuplestore_put function after the tuple is safely written to the disk buffer. This is a fairly inefficient design. While, we do need to still form a tuple and store it somewhere for tuplestore_putvalues(), we don't need to do that for a heap tuple. I think it should be possible to write directly from the table's page. Overall tuplestore.c seems quite confused as to how it's meant to work. You have tuplestore_begin_heap() function, which is the *only* external function to create a tuplestore. We then pretend we're agnostic about how we store tuples that won't fit in memory by providing callbacks for read/write/copy, but we only have 1 set of functions for those and instead of having some other begin method we use when not dealing with heap tuples, we use some other tuplestore_put* function. It seems like another pass is required to fix all this and I think that should be: 1. Get rid of the function pointers and just hardcode which static functions we call to read/write/copy. 2. Rename tuplestore_begin_heap() to tuplestore_begin(). 3. See if we can rearrange the code so that the copying to the tuple context is only done when we are in TSS_INMEM. I'm not sure what that would look like, but it's required before we could use bump so we don't pfree a bump allocated chunk. Or maybe there's a way to fix this by adding other begin functions and making it work more like tuplesort. I've not really looked into that. I'd rather tackle these problems independently and I believe there are much bigger wins to moving from aset to generation than generation to bump, so that's where I've started. Thanks for having a look at the patch. David
pgsql-hackers by date: