Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
Date
Msg-id a748aa6b-c7e6-4d02-a590-ab404d590448@vondra.me
Whole thread Raw
In response to Re: [WIP]Vertical Clustered Index (columnar store extension) - take2  ("Aya Iwata (Fujitsu)" <iwata.aya@fujitsu.com>)
List pgsql-hackers
Hello Iwata-san,

Thanks for presenting the patch at pgconf.dev last week, and for the
discussions at the APFS session. As promised I did a quick review of the
patch this week, trying to understand the design, the trade-offs etc.

I'm still trying to understand the patch, so please correct me when I
got something wrong.

Before I get to the review comments (mostly about the first part) and
questions, let me comment on a couple items from your earlier response:

On 5/17/25 18:31, Aya Iwata (Fujitsu) wrote:
> Hi hackers,
> 
> I will share the notes on the discussions in PGConf.dev. Thanks all for participation.
>  
> Feedback on the Advanced Patch Feedback Session;
>  
> - A Basic idea that allows both OLTP/OLAP workloads on the same table is OK
>   - Buffering mechanism has already been used by existing code GIN, IIUC

Yes. GIN has a "fastupdate" feature, which allows accumulating entries
in a buffer, and merging them into the index later (as a batch, which
makes it more efficient). This is similar to the WOS/ROS architecture.

AFAIK some other columnar databases use a similar WOS/ROS architecture,
with similar goal (to move the index updates into a background job, and
make it more efficient).

I expect there to be a trade-off between the WOS size and query
performance. Larger WOS allows larger updates / better efficiency, but
it likely also means slower queries (that also have to scan the WOS).

> - IAM seems a more proper approach than TAM
>   - One reason is that VCI can only optimize the data lookup stuff
>   - TAM needs all possible queries; it's too much

I believe this (IAM vs. TAM) is very much an open question.

At the conference the goal of the patch was explained as a way to
improve performance with analytical queries, while maintaining OLTP
performance and features. From this point of view, it makes sense to
consider VCI to be an "accelerator" feature. Which is what indexes do,
i.e. it's a way to speed-up queries, but we can do without them if
that's faster.

If VCI got implemented as TAM, i.e. serving as the primary storage, we
wouldn't have this option - we'd have to do VCI even for OLTP workloads
(e.g. random inserts / selects), and that sounds quite inefficient. Not
to mention the VCI would need to support everything heap does, and it
supports a lot. VCI doesn't even support arbitrary data types, etc.

But this hinges on the assumption that the VCI can be implemented as an
IAM, right? And that IAM is a good fit for the stuff VCI needs to do.
Because if it can't be implemented as IAM, then the simplicity argument
is quite moot. I'm not saying it can't be done as IAM, I don't have a
clear opinion on that yet.

Another reason to think about this is that after the talk, someone in
the audience pointed out this patch originally started on Postgres 9.5,
and that back then there was no TAM interface. So when implementing it
there was no IAM vs. TAM choice, IAM was the only way to do something
like this, and thus it was used. Maybe if TAM was available back then,
it'd be done differently? Not sure.

Also, Alvaro seemed to think TAM is the way to go, and in order to keep
the OLTP performance he suggested to use both heap and VCI at the same
time, in different "forks". I'm not sure how would that work, or if we
can already do that - AFAIK we can't, because ForkNumber does not allow
adding custom forks. We'd have to relax that, or invent some sort of
federated TAM (that just multiplexes it to two TAMs). Maybe.

But it's not like the IAM approach doesn't need to do this. The first
patch had to add stuff to a lot of random places to make this work. And
some of the places touch stuff that we don't expect indexes to worry
about, like ALTER TABLE, etc.

> - Divide the patch more and more
>   - to allow committers to consider pros and cons and push one by one
>   - 15 patches is the maximum amount
>   - Separate features to some committable group
>  

Yes. Both parts need to be divided into small (but meaningful) pieces. I
don't recall where the "15 patches" came from, I don't think there's
some sort of maximum number of patches. All I care about is that the
patches need to be small-ish, and still make sense. Ideally, each part
should be committable on it's own. This makes it easier to test/review,
but also easier to agree on individual parts - in which case we may
commit the "preparatory" pieces and not have to rebase them forever.

The 0001 part is ~83K, which is not entirely crazy, but I can easily
imagine it split into ~10 parts, each addressing a different subset. For
0002 it's even more important, because no one can review a 1.3MB patch.

> Sawada san suggested several points;
> 1. Find parts that are useful not only for the VCI and dispatching new threads
> 2. Develop codes incrementally. E.g., VCI has a custom scan, but we may
> be able to give up on the first version
>  

Yes, I think this makes perfect sense.

I'd probably keep it in a single thread (essentially as a single patch
series), because it's very useful to see the "bigger picture" why the
parts are important. But if you find something that can stand on it's
own, then sure. But you'll still need to keep the part in the main patch
series (it should not be necessary for people to collect patches from
different threads, and it would also break the CI testing commitfests
patches).

Breaking 0002 into smaller parts makes perfect sense. Something like:

- minimum working patch, with only serial sequential scan
- add parallel sequential scan
- add other custom scans, one by one ...
- ...

> 
> Also, I have questions and advice below on May 16th.
>  
> - How to handle visibility?
>   - What if transactions that insert tuples are aborted?
> - VCI vs Index Only Scan
>   - VCI seems the same as Index Only Scan
>   - VCI has compression. So, is the amount of size better than the Index?
> - How about the efficiency for storage? How much data can we store?
> - using TAM (Alvalo's suggestion)
>   - Most of the code can be ported from the heap. ROS and WOS can exist as forks
>     of main files, like .vm and .fm. API can be like:
>  
> ```
> CREATE TABLE foo (id, product, price) USING vci WITH (to_be_indexed=price);
> ```
> 
Yes, I think those questions are important. Not only for understanding
the current patch, but also to understand what trade-offs the solution
can make (e.g. about the consistency/visibility model), etc.

What I really miss in the current patch is some sort of READMEs with

- high-level design of the VCI indexes

- description of the consistency/visibility model (does it behave the
same way as querying the heap, can it be out of sync for, ...)

- WOS/ROS architecture (when are rows inserted into WOS, promoted into
ROS, what triggers that, ...)

- what's the in-memory / on-disk format

- some places in the patch mention "internal VCI tables" but I have no
idea what that is

- how does the execution work? compression? crucial points to consider
for optimal performance, etc.

- limitations (temporary - can be relaxed in the future, permanent -
inherent to the columnar design) and trade-offs

- what are the various custom scan executor nodes

- what "background" processes happen (custom workers, ...)

- anything else substantial for understanding the design

Maybe there's some of this in the 0002 patch, but I haven't stumbled
over it so far.

------------------------------------------------------------------------

Finally, some review comments about the 0001 patch - it's mostly in the
order as appearing in git diff, file by file. Some of the points repeat,
because multiple places made me think about the same thing. And some of
it was mentioned above.

- It seems the VCI is implemented as an index AM, but it seems to be
trying to do some TAM stuff, e.g. with visibility info etc. For example,
I don't think indexes generally care about XIDs that deleted rows, etc.
Similarly, I can't think of another index AM that'd try to hook into
tablecmds.c or execExpr.

- Not sure what's the right choice (makes sense to have it as index on
top of heap, so that we do regular stuff for OLTP queries, and columnar
stuff using the index for analytics). But it seems it does stuff
expected from table AM and not index AM.

- But making a table AM would make it much more demanding (it'd have to
support storing all data types, ...) and would hurt OLTP performance.

- I very much support the idea of implementing this as an extension (so
minimal infrastructure changes to core + extension in contrib or
entirely outside).

- But the patch is not there, not even with 0001, because there's a
number of places mentioning "vci". There should be no reference to VCI
in core, it should be generic extensibility infrastructure. It's a good
indication of places that may need some extensibility.


relation_open
------------------
- Why is this change (disabling assert enforcing proper locking) needed?
There's a comment in lock.c which mentions VCI parallel workers, but
it's suspicious.


reloptions.c
------------------
- Changes that should be moved to the contrib module Why should in-core
reloptions know about this? See for example how "contrib/bloom" defined
reloptions.


heapam.c
------------------
- add_index_delete_hook, not sure I understand the purpose of this,
likely related to the consistency model, needs explanation. Why is it
prefixed with "add_"?

- If we need such DELETE hook, heapam.c is not where to call it, it
should be called from the same "layer" adding tuples to indexes (which
certainly does not happen in heap_insert).

- Furthermore, if we need this, why should it be a hook and not a new
callback in IndexAmRoutine, just like aminsert()?


initscan
------------------
- Isn't it using keep_startblock in a wrong way? it certainly is not
meant to indicate "keep the old number of blocks"

- Why is this even needed? This is in heapam, so why would this even
matter for VCI, which is an index? Shouldn't that be mostly independent
from which TAM is used by the table?

heapam_handler.c
------------------
- Why does this need to preserve the IndexHeapTuple at all? And why
should it be done in heapam_handler? Again, shouldn't IAM be mostly
independent of this?

- Using a global variable does not seem great. We probably can't have
concurrent calls (only a single heapam_index_build_range_scan call at a
time), but globals are dubious.

- If VCI really needs to remember the tuple, there should be a "proper"
way to explicitly pass it (adjust the IndexAmRoutine callback to have a
new argument or something like that).


heapam_visibility.c
-------------------
- It's probably sensible to make visibility snapshots extensible, i.e.
to allow defining custom snapshot types etc. But the patch still does
that by hard-coding stuff in core, even if the actual "visibility" code
is in extension.

- But this doesn't really allow custom snapshot types, those are still
hardcoded SNAPSHOT_VCI_WOS2ROS / SNAPSHOT_VCI_LOCALROS.

- The add_snapshot_satisfies_hook does not seem like the way to make
this extensible - what if two extensions want to do this. First
HeapTupleSatisfiesVisibility would have to know about all the
extensions, and it could easily lead to infinite loop. Because we call
the first hook, and it calls HeapTupleSatisfiesVisibility again.

- This is probably a good example of VCI trying to do TAM stuff. TAM
interface has tuple_satisfies_snapshot, which seems to have the same
purpose as the hook. Maybe not (it still assumes regular snapshots).

- But I still don't understand why should a new index AM require changes
to how heap AM does visibility checks (or any visibility checks in
general, as indexes don't have visibility info in PG). And why would it
make sense for heap to know about this. IAM is the layer *above* TAM, in
a way.


xlogfuncs.c
------------------
- RECOVERY_VCI_PAUSE_REQUESTED - what's that for? why do we need a
separate state from RECOVERY_PAUSE_REQUESTED?


xlogrecovery.c
------------------
- I don't see why we should not log that the recovery is paused because
of VCI. I sure users would still like to know why it's paused:

  /* If pause requested by VCI, the log is not output. */
  else if (GetRecoveryPauseState() != RECOVERY_VCI_PAUSE_REQUESTED)


dependency.c
------------------
- add_drop_relation_hook raises mostly the same questions as
add_index_delete_hook

- Don't we already have event triggers to do this kind of stuff in an
extensible way?

- Why should an index even care about the table getting dropped? I mean,
it'll get dropped too, before the table. So why this?


index.c
------------------

- Why do we need add_reindex_index_hook at all? we already have ambuild
callback, right?

- It seems really suspicious ConstructTupleDescriptor gains the ability
to index system attributes, why is that needed? surely the VCI does not
need to index system attributes, right? Or why is that needed?

- It seems add_reindex_index_hook is used to "skip" rebuilding the
index, but why would it be useful/necessary, and why should it be done here?

- Why couldn't ambuild() just return? How do I rebuild the index if
something goes wrong (e.g. it gets corrupted). Or to remove bloat, if
that applies to VCI.


explain.c
------------------
- T_CustomPlanMarkPos - so this introduces a new type of custom plan
node? How is that different from CustomScan? We don't even have MarkPos
plan in core, right? So what does it do?

- It even appears to be exactly the same as CustomScan in most places,
but it also introduces some new callbacks (ExplainCustomPlanTargetRel
and SetBoundCustomScan).

- If useful, this should be a separate patch (custom scan improvements).
And it definitely needs to update the CustomScan docs in SGML.

- I don't understand why we need ExplainPropertySortGroupKeys, does not
align with the rest of the API (ExplainProperty is for different data
types - float, text, ...). Maybe expose show_sort_group_keys? Or just
copy show_sort_group_keys into the extension ...


indexcmds.c
------------------
- Yet another place referencing "vci" explicitly (not extensible, should
not be in core)

- Not sure I understand the limitations mentioned there.


tablecmds.c
------------------
- add_alter_tablespace_hook, add_alter_table_change_owner_hook,
  add_alter_table_change_schema_hook

- Why does VCI this even need to handle these actions?

- And we have event triggers, so why have these new hooks?

- Not sure what the code in ATExecSetRelOptions does (another example
that handles "vci" AM explicitly, in a non-extensible way).

- Is it maybe rebuilding some reloptions? Maybe it's about the attnums?


vacuum.c
------------------
- add_skip_vacuum_hook is used to "disable" vacuum for internal VCI tables.

- Why. And what is "internal table"?


execAmi.c
------------------
- what's T_CustomPlanMarkPos?


execExpr.c
------------------
- Why does this expose ExecReadyExpr, ExecCreateExprSetupSteps?

- Isn't this a bit too fragile?

- Probably needed because VciExecInitExpr needs a new argument? But why
does it need that?


execExprInterp.c
------------------
- Why does this need ExprEvalVar_hook and ExprEvalParam_hook?

- Doesn't even check if the hooks are != NULL

- CASE_EEOP_VCI_VAR, CASE_EEOP_VCI_PARAM_EXEC - not extensible, should
not reference VCI explicitly

- VciExprEvalVarHook (for llvmjit?)



execGrouping.c
------------------
- LookupTupleHashEntry_internal removes comment, why?


execProcnode.c, execScan.c
------------------
- New MarkPos node, apparetly, but why do we need that? Why can't we
just build the Materialize when constructing the plan?


execUtils.c
------------------
- again, disables locking check, very suspicious


nodeCustom.c
------------------
- Isn't it strange that ExecInitCustomScan short-circuits the init,
skipping most of the initialization for selected scans?

- If anything, it should not hard-code the scan names like this. Why is
this necessary? is the rest of the initialization irrelevant, or what?
Maybe this is sign it should be table AM rather than index AM?

- But IOS scans would also do this initialization, so ...


nodeModifytable.c
------------------
- add_should_index_insert_hook another strange hook

- Doesn't the hook apply to all relations all the time, not just those
with VCI indexes etc.?


gen_node_support.c
------------------
- So is CustomPlanMarkPos a proper node, or why do we need it? why is
CustomPlan not enough?

- What's SmcAllocSetContext for? Doesn't seem to be mentioned anywhere.


params.c
------------------
- ExecInitParam_hook seems unused, no?


allpaths.c
------------------
- set_plain_rel_pathlist - I don't understand why this checks the VCI
vs. parallelism like this, what it's trying to prevent. Maybe if you
don't want certain plan shape, don't generate the partial paths in that
case from the CustomScan planner callback?

- It seems this tries to allow parallel seqscan, but not other more
complex parallel scans (like parallel agg/join ...), but that could be
done in callback, no?

- Again, we should not reference the index AM this explicitly, it's not
extensible.


createplan.c
------------------
- I don't think we want to make copy_plan_costsize part of API. Just a
detail, though.


autovacuum.c
------------------
- Why does this need a custom autovacuum launcher, or what's the purpose
of this code? is there a special autovacuum launcher for VCI indexes?

- Again, explicit references to "VCI", should not be done like this.


bgworker.c
------------------
- I have no idea what's the point of these changes. seems to be enabling
canceling bgworkers, but that seems like a generic improvement. In which
case it should be a separate patch.

- No idea if it's correct etc.

- Seems to be done to allow CountOtherDBBackends() to cancel backends,
so that it can be dropped etc.

- There apparently are bgworkers doing cleanup for VCI, and we want to
treat them like autovacuum workers (which get interrupted in various
cases instead of waiting forever).

- But this will cancel all bgworkers, no? Not just the VCI ones ...


lock.c
------------------
- this seems a bit strange, surely this should use the same locking
approach as other parallel queries

/*
 * Don't lock for VCI parallel workers and other type of workers should go
 * in normal flow, In case if there is any change in background worker
 * name for VCI parallel workers, the following code also needs an update.
 * FIXME: Try to use the community parallelism code, so that we don't need
 * our own VCI parallel infrastructure.
 */
if (AmBackgroundWorkerProcess() && strstr(MyBgworkerEntry->bgw_name,
"backend="))

- Presumably this is why some of the Asserts on locking were commented
out earlier.


lwlock.c
------------------
- Vci_lwlock_kind seems unused.


timestamp.c
------------------
- I believe the IntervalAggState is internal on purpose, why should it
be made external? It also means timestamp.h has to worry about FRONTEND.


relcache.c
------------------
- add_skip_vci_index_hook - yet another hook, same questions as for the
other hooks earlier

- It seems suspicious we have to skip VCI indexes like this during
planning in RelationGetIndexAttrBitmap, not sure why that's needed? Why
should't we cost the index and maybe reject it based on that?

- Again, references VCI explicitly, so not extensible.


mctx.c
------------------
- why MemoryContextMethods shouldn't be "const"?


common.c
pg_dump.c
pg_dump.h
------------------
- isvciview breaks extensibility, tied to a particular IAM / extension.
Needs to be abstracted / made extensible.

- What is "vci view table" for, actually?


reloptions.h
------------------
- Why does this need to add an entry into the relopt_kind enum? It
should be extensible, see the "contrib/bloom".


xlogrecovery.h
------------------
- Why is RECOVERY_VCI_PAUSE_REQUESTED separate from
RECOVERY_PAUSE_REQUESTED?


timestamp.h
------------------
- The FRONTEND stuff is a sign making IntervalAggState public may not be
a good idea.


executor.h
------------------
- How come it's suddenly OK to return entry->additional without the if
conditions in TupleHashEntryGetAdditional?

- Ah, it's now stored as a separate palloc chunk, not right after the
tuple? Well, maybe that's a separately useful improvement? should be a
separate patch, then. No opinion on whether it's a good idea.

- What's with the struct LimitState; ? Why is it needed?


extensible.h
------------------
- Adds SetBoundCustomScan / ExplainCustomPlanTargetRel, but then it
should also add them to the SGML docs.


memnodes.h
------------------
- What's SmcAllocSetContext? It's not defined anywhere.


params.h
------------------
- Adds noNeedLoadFromMain, loadedFromMain to ParamExecData, but it
doesn't seem to be used anywhere (in either patch).


plannodes.h
------------------
- Adds plan_no to Plan, but it's not set anywhere in core, only in the
VCI extension - which seems wrong (e.g. what if there are two extensions
that want to set it, and they disagree?)

- Either if we want/need this, it should be initialized by core during
planning, in a non-ambiguous way.

- But also how is this different from the existing plan_node_id?


autovacuum.h
------------------
- It seems a bit strage to add "built-in" bgworkers for an extension. If
the extension really needs some sort of workers, it should be defined in
the extension.


bgworker.h
------------------
- So what's the problem with canceling bgworkers? Should this be done
  as a separate patch?


itemptr.h
------------------
- Why do we need this SizeOfIptrData definition? How is it different
from sizeof(ItemPointerData)?


lwlock.h
------------------
- LWLockKind / Vci_lwlock_kind seems unused


numeric.h
------------------
- Why does this expose NumericVar, when the second patch redefines the
struct anyway? also, isn't that fragile? although, we're unlikely to
tweak the NumericVar definition.


rel.h
------------------
- Again, not extensible way to do stuff in an extension.

- We should not have pieces specific to AM in StdRdOptions.

- Doesn't AM have custom reloptions now? Like contrib/bloom?


I also tried to build 0002, but unfortunately that fails with:

executor/vci_executor.c: In function ‘vci_setup_executor_hook’:
executor/vci_executor.c:115:28: error: assignment to
‘ExecutorStart_hook_type’ {aka ‘void (*)(QueryDesc *, int)’} from
incompatible pointer type ‘_Bool (*)(QueryDesc *, int)’
[-Wincompatible-pointer-types]
  115 |         ExecutorStart_hook = vci_executor_start_routine;
      |                            ^
executor/vci_executor.c: In function ‘vci_executor_start_routine’:
executor/vci_executor.c:161:28: error: void value not ignored as it
ought to be
  161 |                 plan_valid = executor_start_prev(queryDesc, eflags);
      |                            ^
executor/vci_executor.c:163:28: error: void value not ignored as it
ought to be
  163 |                 plan_valid = standard_ExecutorStart(queryDesc,
eflags);
      |                            ^
make: *** [../../src/Makefile.global:973: executor/vci_executor.o] Error 1

The extension is not added to contrib/Makefile, so "make world" does not
trigger this failure.



regards

-- 
Tomas Vondra




pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: Why our Valgrind reports suck
Next
From: Chapman Flack
Date:
Subject: tighten generic_option_name, or store more carefully in catalog?