Thread: Datum values consistency within one query
Imagine a function that was going to take a table of, say, images, and so something to them over and over, like: SELECT pixel_count(img), clr_name, img_name FROM images img CROSS JOIN colors clr When you run this function, you find that a great amount of time is being spend in the decompression/detoasting routines, so you think: I have a nested loop here, driven on the 'img' side, if I can avoid re-loading the big image object over and over I can make things faster. Getting the datum value is really fast, so I can have a cache that keeps the latest detoasted object around, and update it when the datum changes, and store the cache information in the parent context. Like so: struct { Datum d; bytea *ba; } DatumCache; PG_FUNCTION_INFO_V1(pixel_count); Datum pixel_count(PG_FUNCTION_ARGS) { Datum d = PG_GETARG_DATUM(0); DatumCache *dcache = fcinfo->flinfo->fn_extra; bytea *ba; if (!dcache) { dcache = MemoryContextAllocZero(fcinfo->flinfo->fn_mcxt, sizeof(DatumCache)); fcinfo->flinfo->fn_extra = dcache; } if (dcache->d != d) { if (dcache->ba) pfree(dcache->ba); MemoryContext old_context = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt); dcache->ba = PG_GETARG_BYTEA_P_COPY(0); MemoryContextSwitchTo(old_context); } ba = dcache->ba; /* now do things with ba here */ } Now, notwithstanding any concerns about the particularities of my example (I've found order-of-magnitude improvements on PostGIS workloads avoiding the detoasting overhead this way) is my core assumption correct: within the context of a single SQL statement, will the Datum values for a particular object remain constant? They *seem* to, in the examples I'm running. But do they always?
Paul Ramsey <pramsey@cleverelephant.ca> writes: > Getting the datum value is really fast, so I can have a cache that > keeps the latest detoasted object around, and update it when the datum > changes, and store the cache information in the parent context. Like > so: Jeez, no, not like that. You're just testing a pointer. Most likely, if this is happening in a table scan, the pointer is pointing into some shared buffer. If that buffer gets re-used to hold some other page, you could receive the identical pointer value but it's pointing to completely different data. The risk of false pointer match would be even higher at plan levels above a scan, I think, because it'd possibly just be pointing into a plan node's output tuple slot. The case where this would actually be worth doing, probably, is where you are receiving a toasted-out-of-line datum. In that case you could legitimately use the toast pointer ID values (va_valueid + va_toastrelid) as a lookup key for a cache, as long as it had a lifespan of a statement or less. You'd have to get a bit in bed with the details of toast pointers, but it's not like those are going anywhere. It would be interesting to tie that into the "expanded object" infrastructure, perhaps, especially if the contents of the objects you're interested in aren't just flat blobs of data. regards, tom lane
> On Apr 2, 2020, at 4:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > Paul Ramsey <pramsey@cleverelephant.ca> writes: >> Getting the datum value is really fast, so I can have a cache that >> keeps the latest detoasted object around, and update it when the datum >> changes, and store the cache information in the parent context. Like >> so: > > Jeez, no, not like that. You're just testing a pointer. > ... > The case where this would actually be worth doing, probably, is where > you are receiving a toasted-out-of-line datum. In that case you could > legitimately use the toast pointer ID values (va_valueid + va_toastrelid) > as a lookup key for a cache, as long as it had a lifespan of a statement > or less. You'd have to get a bit in bed with the details of toast > pointers, but it's not like those are going anywhere. So, if I tested for VARATT_IS_EXTENDED(), and then for VARATT_IS_EXTERNAL_ONDISK(attr) and then did VARATT_EXTERNAL_GET_POINTER(toast_pointer,attr), I could use va_valueid + va_toastrelid as keys in the cache for things thatpassed that filter? What about large const values that haven't been stored in a table yet? (eg, ST_Buffer(ST_MakePoint(0, 0), 100, 10000)) isthere a stable key I can use for them? > It would be interesting to tie that into the "expanded object" > infrastructure, perhaps, especially if the contents of the objects > you're interested in aren't just flat blobs of data. Yeah, I'm wrestling with the right place to do this stuff, it's not just the detoasting going on, I also build in-memorytrees on large objects and hold them around for as long as the object keeps showing repeatedly up in the query,I just test the cache right now by using memcmp on the previous value and that's really pricey. P
Paul Ramsey <pramsey@cleverelephant.ca> writes: > So, if I tested for VARATT_IS_EXTENDED(), and then for VARATT_IS_EXTERNAL_ONDISK(attr) and then did VARATT_EXTERNAL_GET_POINTER(toast_pointer,attr), I could use va_valueid + va_toastrelid as keys in the cache for things thatpassed that filter? I'm pretty sure VARATT_IS_EXTERNAL_ONDISK subsumes the other, so you don't need to check VARATT_IS_EXTENDED, but yeah. > What about large const values that haven't been stored in a table yet? (eg, ST_Buffer(ST_MakePoint(0, 0), 100, 10000))is there a stable key I can use for them? Nope. If you could convert them into "expanded datums" then you might have something ... but without any knowledge about where they are coming from it's hard to see how to detect that a value is the same one you dealt with before. regards, tom lane