Hi,
I got pinged about issues (compiler warnings, and some test failures) in
the simple patch version shared in May. So here's a rebased and cleaned
up version addressing that, and a couple additional issues I ran into.
FWIW if you run check-world on this, you may get failures in io_workers
TAP test. That's a pre-existing issue [1], the patch just makes it
easier to hit as it (probably) added AIO in some part of the test.
Otherwise it should pass all tests (and it does for me on CI).
The main changes in the patches and remaining questions:
(1) fixed compiler warnings
These were mostly due to contrib/test AMs with not-updated ambeginscan()
implementations.
(2) GiST fixes
I fixed a bug in how the prefetching handled distances, leading to
"tuples returned out of order" errors. It did not copy the Datums when
batching the reordered values, not realizing it may be FLOAT8, and on
32-bit systems the Datum is just a pointer. Fixed by datumCopy().
I'm not aware of any actual bug in the GiST code, but I'm sure the
memory management there is sketchy and likely leaks memory. Needs some
more thought and testing. The SP-GiST may have similar issues.
(3) ambeginscan(heap, index, ....)
I originally undid the changes to ambeginscan(), i.e. the callback was
restored back to what master has. To to create the ReadStream the AM
needs the heap, but it could build Relation using index->rd_index->indrelid.
That worked, but I did not like it for two reasons. The AM then needs to
manage the relation (close it etc.). And there was no way to know when
ambeginscan() gets called for a bitmap scan, in which case the
read_stream is unnecessary/useless. So it got created, but never used.
Not very expensive, but messy.
So I ended up restoring the ambeginscan() change, i.e. it now gets the
heap relation. I ended up passing it as the first argument, mostly for
consistency with index_beginscan(), which also does (heap, index, ...).
I renamed the index argument from 'rel' to 'index' in a couple of the
indexes, it was confusing to have 'heap' and 'rel'.
(4) lastBlock
I added the optimization to not queue duplicate block numbers, i.e. if
the index returns a sequence of TIDs from the same block, we skip
queueing that and simply use the buffer we already have. This is quite a
bit more efficient.
This is something the read_next callback in each AM needs to do, but
it's pretty simple.
(5) xs_visible
The current patch expects the AM to set the xs_visible even if it's not
using ReadStream (which is required to do that in the callback). If the
AM does not do that, index-only scans are broken.
But it occurs to me we could handle this in index_getnext_tid(). If the
AM does not use a ReadStream (xs_rs==NULL), we can check the VM and
store the value in xs_visible. It'd need moving the vmBuffer to the scan
descriptor (it's now in IndexOnlyScanState), but that seems OK. And the
AMs now add the buffer anyway.
(6) SGML
I added a couple paragraphs to indexam.sgml, documenting the new heap
argument, and also requirements from the read_next callback (e.g. the
lastBlock and xs_visible setting).
(7) remaining annoyances
There's a couple things that still annoy me - the "read_next" callbacks
are very similar, and duplicate a fair amount of code to stuff they're
required to. There's a little bit AM-specific code to get the next item
from the ScanOpaque structs, and then code to skip duplicate block
numbers and check the visibility map (if needed).
I believe both of these things could be refactored into some shared
place. The AMs would just call a function from indexam.c (which seems OK
from layering POV, and there's plenty of such calls).
I believe the same place could also act as the "scan manager" component
managing the prefetching (and related stuff?), as suggested by Peter
Geoghegan some time ago.
I ran out of time to work on this today, but I'll look into this soon.
FWIW I'm still planning to work on the "complex" patch version and see
if it can be moved forward. I've been having some very helpful chats
about this with Peter Geoghegan, and I'm still open to the possibility
of making it work. This simpler version is partially a hedge to have at
least something in case the complex patch does not make it.
regards
[1]
https://www.postgresql.org/message-id/t5aqjhkj6xdkido535pds7fk5z4finoxra4zypefjqnlieevbg%40357aaf6u525j
--
Tomas Vondra