Thread: Rethinking TupleTableSlot deforming
Hi,

I've previously mentioned (e.g. [1]) that tuple deforming is a serious
bottleneck. I've also experimented successfully [2] with making
slot_deform_tuple() faster. But nonetheless, tuple deforming is still a
*major* bottleneck in many cases, if not *the* major bottleneck.

We could partially address that by JITing the work slot_deform_tuple()
does. Various people have, with good but not raving success, played with
that. Alternatively/additionally we can change the tuple format to make
deforming faster.

But I think the bigger issue than the above is actually that we're just
performing a lot of useless work in a number of common scenarios. We're
always deforming all columns up to the one needed. Very often that's a
lot of useless work. I've experimented with selectively replacing
slot_getattr() calls with heap_getattr(), and for some queries that can
yield massive speedups. And obviously significant slowdowns in others.
That's the case even when preceding columns are varlena and/or contain
nulls. I.e. a good chunk of the problem is storing the results of
deforming, not accessing the data.

ISTM we need to change slots so that they contain information about
which columns are interesting. For the hot paths we'd then only ever
allow access to those columns, and we'd only ever deform them. Combined
with the approach in [2], that allows us to deform tuples a lot more
efficiently.

What I'm basically thinking is that expression evaluation would always
make sure the slots have computed the relevant column set, and deform at
the beginning. There are some cases where we'd likely still need to fall
back to a slower path (e.g. whole-row refs), but that seems fine. That
then also allows us to nearly always avoid the slot_getattr() call, and
instead look at tts_values/nulls directly. The checks slot_getattr()
performs, and the call itself, are quite expensive.

What I'm thinking about is:

a) a new ExecInitExpr()/ExecBuildProjectionInfo() which always computes
a set of interesting columns.

b) replacing all accesses to tts_values/isnull with an inline function.
In optimized builds that function won't do anything but reference the
relevant element, but in assert-enabled builds it'd check whether said
column is actually known to be accessed.

c) Make ExecEvalExpr(), ExecProject(), ExecQual() (and perhaps some
other places) call the new deforming function which ensures the relevant
columns are available.

d) Replace nearly all slot_getattr()/slot_getsomeattrs() calls with the
function introduced in b).

To me it seems this work will be a good bit easier once [2] is actually
implemented instead of prototyped, because treating ExecInitExpr()
non-recursively allows us to build such 'column sets' more easily /
naturally.

Comments? Alternative suggestions?

Greetings,

Andres Freund

[1] http://archives.postgresql.org/20160624232953.beub22r6yqux4gcp@alap3.anarazel.de
[2] http://archives.postgresql.org/message-id/20160714011850.bd5zhu35szle3n3c%40alap3.anarazel.de
On Fri, Jul 22, 2016 at 2:56 AM, Andres Freund <andres@anarazel.de> wrote:
>
> But I think the bigger issue than the above is actually that we're just
> performing a lot of useless work in a number of common scenarios. We're
> always deforming all columns up to the one needed. Very often that's a
> lot of useless work. I've experimented with selectively replacing
> slot_getattr() calls with heap_getattr(), and for some queries that can
> yield massive speedups. And obviously significant slowdowns in others.
> That's the case even when preceding columns are varlena and/or contain
> nulls. I.e. a good chunk of the problem is storing the results of
> deforming, not accessing the data.

As I said when we chatted, I'm a bit puzzled. My understanding was that
the whole point of the "slots" thing was precisely to avoid this kind of
extra work. So I would love to learn where my understanding diverges
from reality. I wonder if there's some simple bug in the code leading to
it not doing what it's intended to do.

My understanding is that the idea of slots is that when we decode the
Nth attribute we decode every attribute up until that attribute and have
somewhere to put it -- in the slot. When we later refer to an earlier
attribute we can just fetch it from the slot directly. This is the
difference between using slot_getattr and heap_getattr directly.

--
greg
Greg Stark <stark@mit.edu> writes:
> On Fri, Jul 22, 2016 at 2:56 AM, Andres Freund <andres@anarazel.de> wrote:
>> But I think the bigger issue than the above is actually that we're just
>> performing a lot of useless work in a number of common scenarios. We're
>> always deforming all columns up to the one needed. Very often that's a
>> lot of useless work.

> As I said when we chatted I'm a bit puzzled.

I'm really suspicious of this line of argument as well. It's possible
that if you only consider all-fixed-width, never-null columns, it might
look like deforming the columns before the one you need is a waste of
effort. But as soon as you relax either of those assumptions, you have
to crawl over the earlier columns anyway, and saving aside the results
is going to be close to free.

I can certainly believe that there's some merit in trying to arrange for
the columns we need to be earlier rather than later. In a sorting or
grouping step, we know we only need access to certain columns --- but
those columns are likely to be at the end, not the beginning. If we're
doing a projection step anyway to construct the input, we could probably
rearrange that. Maybe we could even go further, and require the planner
to always set up the input so that the sort/group columns are exactly
1..N in order, removing the need for the executor to cope with any
variation.

			regards, tom lane
On 2016-07-22 14:33:32 +0100, Greg Stark wrote:
> On Fri, Jul 22, 2016 at 2:56 AM, Andres Freund <andres@anarazel.de> wrote:
> >
> > But I think the bigger issue than the above is actually that we're just
> > performing a lot of useless work in a number of common scenarios. We're
> > always deforming all columns up to the one needed. Very often that's a
> > lot of useless work. I've experimented with selectively replacing
> > slot_getattr() calls with heap_getattr(), and for some queries that can
> > yield massive speedups. And obviously significant slowdowns in others.
> > That's the case even when preceding columns are varlena and/or contain
> > nulls. I.e. a good chunk of the problem is storing the results of
> > deforming, not accessing the data.
>
> As I said when we chatted I'm a bit puzzled.

I think that's a separate slant on slots than what I/we talked about
around pgcon.

> My understanding was that the whole point of the "slots" thing was
> precisely to avoid this kind of extra work. So I would love to learn
> where my understanding diverges from reality. I wonder if there's some
> simple bug in the code leading to it not doing what it's intended to do.
>
> My understanding is that the idea of slots is that when we decode the
> Nth attribute we decode every attribute up until that attribute and have
> somewhere to put it -- in the slot. When we later refer to an earlier
> attribute we can just fetch it from the slot directly. This is the
> difference between using slot_getattr and heap_getattr directly.

Sure, it's intentional. But it's not working that well. The problem is
that very often we actually don't ever access the preceding columns.
Sometimes that knowledge would let us skip directly to the interesting
column(s) (fixed-width, no-nulls columns preceding). But even if not,
just the fact that we don't have to copy the column data for columns
never accessed can make for a *significant* speed bump. For !varlena
datatypes we don't ever have to access the data, just the NULL bitmap.
And for varlena ones, we just need to check the varlena header for the
width, not make a variable-width copy.

So what I'm saying is that we should pre-compute the set of columns
we're interested in, and deform just those, but still in bulk.

Andres
On 2016-07-22 10:09:18 -0400, Tom Lane wrote:
> Greg Stark <stark@mit.edu> writes:
> > On Fri, Jul 22, 2016 at 2:56 AM, Andres Freund <andres@anarazel.de> wrote:
> >> But I think the bigger issue than the above is actually that we're just
> >> performing a lot of useless work in a number of common scenarios. We're
> >> always deforming all columns up to the one needed. Very often that's a
> >> lot of useless work.
>
> > As I said when we chatted I'm a bit puzzled.
>
> I'm really suspicious of this line of argument as well. It's possible
> that if you only consider all-fixed-width, never-null columns, it might
> look like deforming the columns before the one you need is a waste of
> effort. But as soon as you relax either of those assumptions, you have
> to crawl over the earlier columns anyway, and saving aside the results
> is going to be close to free.

Not according to my measurements. And that doesn't seem that surprising.
For a null check you'll need to access (not cheaply so!) the null
bitmap; to skip a varlena datum one needs to access the varlena header.
Copying the actual datum, *especially* in the varlena case, is more
expensive than that; especially because otherwise you'll often not have
to touch the source cachelines at all.

> I can certainly believe that there's some merit in trying to arrange for
> the columns we need to be earlier rather than later. In a sorting or
> grouping step, we know we only need access to certain columns --- but
> those columns are likely to be at the end not the beginning. If we're
> doing a projection step anyway to construct the input, we could probably
> rearrange that.

Oh, that's a neat idea as well.

> Maybe we could even go further, and require the planner
> to always set up the input so that the sort/group columns are exactly 1..N
> in order, removing the need for the executor to cope with any variation.

I guess that columns both returned to the upper node and involved in
sorting would make it hard to entirely and efficiently rely on that,
without forcing more expensive projections. But it's probably worth
tinkering with.

Greetings,

Andres Freund
On 7/22/16 1:35 PM, Andres Freund wrote:
> Sure, it's intentional. But not working that well. The problem is that
> very often we actually don't ever access the preceding
> columns. Sometimes that knowledge would let us skip directly to the
> interesting column(s) (fixed width, no nulls columns preceding). But
> even if not, just the fact that we don't have to copy the column data
> for columns never accessed can make for a *significant* speed bump. For
> !varlena datatypes we don't ever have to access the data, just the NULL
> bitmap. And for varlena ones, we just need to check the varlena header
> for the width, not make variable width copy.
>
> So what I'm saying is, that we should pre-compute the set of columns
> we're interested in, and deform just those, but still in bulk.

Another option would be to remember the tuple offsets (NOT attcacheoff)
that have been computed, as well as whether a varlena attribute has
actually been deformed. That eliminates the need to pre-declare what
attributes you're allowed to reference.

Somewhat related, it's occurred to me that it would be nice if expanded
objects supported partial expansion. IE: if we converted JSONB to be an
expanded object, you would not want to proactively expand every value in
a JSON object, because each of those values could itself be very large.
Obviously most of the work there would be up to the type that was
handling the details of the expansion, but IIRC there's some additional
support that would be needed in expanded objects themselves.
--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)   mobile: 512-569-9461
On 2016-07-22 13:51:31 -0500, Jim Nasby wrote:
> Another option would be to remember the tuple offsets (NOT attcacheoff)
> that have been computed as well as whether a varlena attribute has
> actually been deformed. That eliminates the need to pre-declare what
> attributes you're allowed to reference.

That'd make access more expensive, so that doesn't seem a very enticing
solution.
Andres Freund <andres@anarazel.de> writes:
> On 2016-07-22 10:09:18 -0400, Tom Lane wrote:
>> I'm really suspicious of this line of argument as well. It's possible
>> that if you only consider all-fixed-width, never-null columns, it might
>> look like deforming the columns before the one you need is a waste of
>> effort. But as soon as you relax either of those assumptions, you have
>> to crawl over the earlier columns anyway, and saving aside the results
>> is going to be close to free.

> Not according to my measurements. And that doesn't seems that
> surprising. For a null check you'll need to access (not cheaply so!) the
> null bitmap, to skip a varlena datum one needs to access the varlena
> header. Copying the actual datum, *especially* in the varlena case, is a
> more expensive than that; especially because otherwise you'll often not
> have to touch the source cachelines at all.

But that's nonsense. We don't copy varlena datums during tuple
deforming, only save aside pointers to them. And we would have to access
the column's null bit and varlena header in any case if our goal is to
find a column located later. Yeah, we could skip physically storing the
pointer into the slot's Datum array, but not much more than that.

>> Maybe we could even go further, and require the planner
>> to always set up the input so that the sort/group columns are exactly 1..N
>> in order, removing the need for the executor to cope with any variation.

> I guess that columns both returned to the upper node, and involved in
> sorting would make it hard to entirely and efficiently rely on that,
> without forcing more expensive projections. But it's probably worth
> tinkering with.

Well, it's a question of whether an extra projection at the scan level
is worth the savings in column access during the sort or group stage. My
gut feeling is that it very likely would be a win for a multicolumn
sort. (For a single sort key column we already amortize fetching of the
column datum, so maybe not so much in that case.)

Whether the column is needed at upper levels doesn't seem relevant to
me. setrefs.c will find it wherever it is.

			regards, tom lane
On 7/22/16 1:58 PM, Andres Freund wrote:
> On 2016-07-22 13:51:31 -0500, Jim Nasby wrote:
>> Another option would be to remember the tuple offsets (NOT attcacheoff)
>> that have been computed as well as whether a varlena attribute has
>> actually been deformed. That eliminates the need to pre-declare what
>> attributes you're allowed to reference.
>
> That'd make access more expensive, so that doesn't seem a very enticing
> solution.

Yeah, it's certainly a tradeoff. The advantage is that this might be
less invasive; it might also work in more cases, like heap_deform_tuple
(which I think is used a lot more than tuple table slots). It's
certainly possible it's not worth it, but I wanted to put the idea out
there.
--
Jim Nasby
On 2016-07-22 15:00:32 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2016-07-22 10:09:18 -0400, Tom Lane wrote:
> >> I'm really suspicious of this line of argument as well. It's possible
> >> that if you only consider all-fixed-width, never-null columns, it might
> >> look like deforming the columns before the one you need is a waste of
> >> effort. But as soon as you relax either of those assumptions, you have
> >> to crawl over the earlier columns anyway, and saving aside the results
> >> is going to be close to free.
>
> > Not according to my measurements. And that doesn't seems that
> > surprising. For a null check you'll need to access (not cheaply so!) the
> > null bitmap, to skip a varlena datum one needs to access the varlena
> > header. Copying the actual datum, *especially* in the varlena case, is a
> > more expensive than that; especially because otherwise you'll often not
> > have to touch the source cachelines at all.
>
> But that's nonsense. We don't copy varlena datums during tuple deforming,
> only save aside pointers to them.

Indeed. I've hacked up a test for this; I'll dig that up and show the
results.

> Well, it's a question of whether an extra projection at the scan level is
> worth the savings in column access during the sort or group stage. My gut
> feeling is that it very likely would be a win for a multicolumn sort.
> (For a single sort key column we already amortize fetching of the column
> datum, so maybe not so much in that case.)
>
> Whether the column is needed at upper levels doesn't seem relevant to me.
> setrefs.c will find it wherever it is.

Well, the projection is what I was thinking of.

Andres
Andres Freund <andres@anarazel.de> writes:
> On 2016-07-22 15:00:32 -0400, Tom Lane wrote:
>> Well, it's a question of whether an extra projection at the scan level is
>> worth the savings in column access during the sort or group stage.

> Well, the projection is what I was thinking of.

One point here is that unless your test case is a blind "SELECT *",
there's almost certainly going to be a projection before a sort anyway,
because we always want to get rid of any unreferenced columns to
minimize the data volume going through the sort.

			regards, tom lane
Andres Freund wrote:
> Hi,
>
> I've previously mentioned (e.g. [1]) that tuple deforming is a serious
> bottleneck. I've also experimented successfully [2] with making
> slot_deform_tuple() faster.

I'm currently messing with making heapam.c be just one possible
implementation of tuple storage, to allow other modules to implement
tuple storage in different ways. Each such module would have a pg_am
row, and among the set of routines that such a module would have to
provide is the equivalent of slot_deform_tuple, which takes a tuple in
whatever format the storage module uses and puts it into
executor-formatted tts_values/tts_isnull arrays. For tuples coming from
heapam.c-formatted tables that would be pretty much slot_deform_tuple,
but of course other modules could have tuples in completely different
formats.

(This is one part of the larger project of columnar storage, where a
storage module would store tuples in some columnar format rather than
row-oriented. Obviously, tuple deforming means a completely different
thing in that world than in heapam.c.)

No patch to show at present; I mentioned this project in Brussels. I
intend to present it for 10.0.
--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services