Thread: [WIP]Vertical Clustered Index (columnar store extension) - take2
Hi All,
Suggestions
==========
When analyzing real-time data collected by PostgreSQL,
it can be difficult to tune the current PostgreSQL server for satisfactory performance.
Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory column store function that holds data in a state suitable for business analysis and is also expected to improve analysis performance.
With VCI, you can also expect to run analysis 7.8 times faster. This is achieved by the analytics engine, which optimizes parallel processing of column-oriented data, in addition to the fact that VCI stores data in a columnar format, enabling efficient retrieval of the columns needed for analysis.
Similar Features
============
One column store feature available with postgres is Citus Columnar Table.
If you introduces the citus extension, which allows columnar tables to be used using the columnar access method.
This function is intended to analyze the accumulated data. Therefore, you cannot update or delete data.
VCI supports data updates and deletions. This enables you to analyze not only the accumulated data but also the data that occurs in real time.
Implementing VCI
============
To introduce an updatable column store, we explain how updates to row-oriented data are propagated to column-oriented data.
VCI has two storage areas.
- Write Optimized Storage (WOS)
- Read Optimized Storage (ROS)
Describes WOS.
The WOS stores data for all columns in the VCI in a row-oriented format.
All newly added data is stored in the WOS relation along with the transaction information.
Using WOS to delete and update newly added data has no significant performance impact compared to deleting from columnar storage.
ROS is the storage area where all column data is stored.
When inserting/updating/deleting, data is written synchronously to WOS. It does not compress or index the data.
This avoids the overhead of converting to a columnar while updating the data.
After a certain amount of data accumulates in the WOS, the ROS control daemon converts it to column data asynchronously with updates.
Column data transformation compresses and indexes the data and writes it to ROS.
Describes searching for data.
Since there are two storage formats, the SELECT process needs to convert the WOS data to local ROS to determine whether it is visible or invisible. This conversion cost depends on the number of tuples present in the WOS file. This may introduce some performance overhead.
Obtain search results by referencing the local ROS and referencing the ROS in parallel.
These implementation ideas are also posted on Fujitsu's blog for your reference. [1]
Past discussions
===========
We've proposed features before. [2]
This thread also explains the details, so please check it.
In a previous thread, we suggested implementing a modification to the PostgreSQL backend code.
Based on the FB we received at that time, we think we need to re-implement this feature in pluggable storage using the table access method API.
I also got a FB of the features I needed from a PostgreSQLPro member. We believe it is necessary to deal with these issues in stages.
- Need to provide vector processing for nodes (filter, grand aggregate, aggregation with group by...) to speed up computation
- Requires parallel processing support such as scanning
It is assumed that the re-implementation will also require additional functionality to the current Table Access Method API.
It is useful not only for VCI but also for other access methods.
Therefore, we decided to propose the VCI feature to the community and proceed with development.
Request matter
===========
Are members of the PostgreSQL hackers interested in VCI features?
We welcome your comments and suggestions on this feature.
In particular, if you have any questions, required features, or implementations, please let me know.
Regards,
Aya Iwata
FUJITSU LIMITED
RE: [WIP]Vertical Clustered Index (columnar store extension) - take2
Hi Alvaro san, I am sorry for my late reply. I continue to work on proposing VCI feature to the community. > I think this is definitely an important and welcome development. > I'm looking forward to patches in this area. Thank you! I am currently preparing to share VCI designs with PGConf.dev. I look forward to sharing more about VCI with you. Best regards, Aya Iwata FUJITSU LIMITED
RE: [WIP]Vertical Clustered Index (columnar store extension) - take2
Hi Yura san, > I just don't get, why it should be "in-memory"? All the same things you > describe further, but storing in paged index on-disk with caching > through shared_buffers - why this way it wouldn't work? We make the columnar store resident in memory for maximum search performance. But I'm not very particular about this. Comments are welcome. Best regards, Aya Iwata FUJITSU LIMITED > -----Original Message----- > From: Yura Sokolov <y.sokolov@postgrespro.ru> > Sent: Wednesday, January 15, 2025 11:44 PM > To: pgsql-hackers@lists.postgresql.org > Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 > > 07.10.2024 17:53, Aya Iwata (Fujitsu) wrote: > > Hi All, > > > > Suggestions > > > > ========== > > > > When analyzing real-time data collected by PostgreSQL, > > > > it can be difficult to tune the current PostgreSQL server for > > satisfactory performance. > > > > Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory > > column store function that holds data in a state suitable for business > > analysis and is also expected to improve analysis performance. > > I just don't get, why it should be "in-memory"? All the same things you > describe further, but storing in paged index on-disk with caching > through shared_buffers - why this way it wouldn't work? > >
08.04.2025 13:29, Aya Iwata (Fujitsu) wrote: > Hi Yura san, > > >> I just don't get, why it should be "in-memory"? All the same things you >> describe further, but storing in paged index on-disk with caching >> through shared_buffers - why this way it wouldn't work? > > We make the columnar store resident in memory for maximum search performance. > But I'm not very particular about this. Comments are welcome. I just wanted to say: there is no need to be super fast. There is the need to be remarkably faster than it is now. ClickHouse, DuckDB, Vertica - they are not in-memory, they are disk based. But they are very fast. If PostgreSQL will be just as twice slower as ClickHouse, it will be very great! Most of users will not setup ClickHouse at all then, because twice slower is still very fast. Databases could be very huge. Even when they are in "columnar" format, which usually consumes less space. And memory is still costs more than disk space. Certainly there are users who think they need "in-memory". But the truth is very few of them really need "in-memory". All of this is just my opinion. I could be wrong. >> -----Original Message----- >> From: Yura Sokolov <y.sokolov@postgrespro.ru> >> Sent: Wednesday, January 15, 2025 11:44 PM >> To: pgsql-hackers@lists.postgresql.org >> Subject: Re: [WIP]Vertical Clustered Index (columnar store extension) - take2 >> >> 07.10.2024 17:53, Aya Iwata (Fujitsu) wrote: >>> Hi All, >>> >>> Suggestions >>> >>> ========== >>> >>> When analyzing real-time data collected by PostgreSQL, >>> >>> it can be difficult to tune the current PostgreSQL server for >>> satisfactory performance. >>> >>> Therefore, we propose Vertical Clustered Indexing (VCI), an in-memory >>> column store function that holds data in a state suitable for business >>> analysis and is also expected to improve analysis performance. >> >> I just don't get, why it should be "in-memory"? All the same things you >> describe further, but storing in paged index on-disk with caching >> through shared_buffers - why this way it wouldn't work? -- regards Yura Sokolov aka funny-falcon
RE: [WIP]Vertical Clustered Index (columnar store extension) - take2
Hello, I found some dead codes in 0001 patches. I removed. Here are new patches. Regards, Aya Iwata Fujitsu Limited
Attachment
Re: [WIP]Vertical Clustered Index (columnar store extension) - take2
Hi hackers, I will share the notes on the discussions in PGConf.dev. Thanks all for participation. Feedback on the Advanced Patch Feedback Session; - A Basic idea that allows both OLTP/OLAP workloads on the same table is OK - Buffering mechanism has already been used by existing code GIN, IIUC - IAM seems a more proper approach than TAM - One reason is that VCI can only optimize the data lookup stuff - TAM needs all possible queries; it's too much - Divide the patch more and more - to allow committers to consider pros and cons and push one by one - 15 patches is the maximum amount - Separate features to some committable group Sawada san suggested several points; 1. Find parts that are useful not only for the VCI and dispatching new threads 2. Develop codes incrementally. E.g., VCI has a custom scan, but we may be able to give up on the first version Also, I have questions and advice below on May 16th. - How to handle visibility? - What if transactions that insert tuples are aborted? - VCI vs Index Only Scan - VCI seems the same as Index Only Scan - VCI has compression. So, is the amount of size better than the Index? - How about the efficiency for storage? How much data can we store? - using TAM (Alvalo's suggestion) - Most of the code can be ported from the heap. ROS and WOS can exist as forks of main files, like .vm and .fm. API can be like: ``` CREATE TABLE foo (id, product, price) USING vci WITH (to_be_indexed=price); ``` Again, I really appreciate your efforts. Regards, Aya Iwata Fujitsu Limited
Hello Iwata-san, Thanks for presenting the patch at pgconf.dev last week, and for the discussions at the APFS session. As promised I did a quick review of the patch this week, trying to understand the design, the trade-offs etc. I'm still trying to understand the patch, so please correct me when I got something wrong. Before I get to the review comments (mostly about the first part) and questions, let me comment on a couple items from your earlier response: On 5/17/25 18:31, Aya Iwata (Fujitsu) wrote: > Hi hackers, > > I will share the notes on the discussions in PGConf.dev. Thanks all for participation. > > Feedback on the Advanced Patch Feedback Session; > > - A Basic idea that allows both OLTP/OLAP workloads on the same table is OK > - Buffering mechanism has already been used by existing code GIN, IIUC Yes. GIN has a "fastupdate" feature, which allows accumulating entries in a buffer, and merging them into the index later (as a batch, which makes it more efficient). This is similar to the WOS/ROS architecture. AFAIK some other columnar databases use a similar WOS/ROS architecture, with similar goal (to move the index updates into a background job, and make it more efficient). I expect there to be a trade-off between the WOS size and query performance. Larger WOS allows larger updates / better efficiency, but it likely also means slower queries (that also have to scan the WOS). > - IAM seems a more proper approach than TAM > - One reason is that VCI can only optimize the data lookup stuff > - TAM needs all possible queries; it's too much I believe this (IAM vs. TAM) is very much an open question. At the conference the goal of the patch was explained as a way to improve performance with analytical queries, while maintaining OLTP performance and features. From this point of view, it makes sense to consider VCI to be an "accelerator" feature. Which is what indexes do, i.e. it's a way to speed-up queries, but we can do without them if that's faster. If VCI got implemented as TAM, i.e. serving as the primary storage, we wouldn't have this option - we'd have to do VCI even for OLTP workloads (e.g. random inserts / selects), and that sounds quite inefficient. Not to mention the VCI would need to support everything heap does, and it supports a lot. VCI doesn't even support arbitrary data types, etc. But this hinges on the assumption that the VCI can be implemented as an IAM, right? And that IAM is a good fit for the stuff VCI needs to do. Because if it can't be implemented as IAM, then the simplicity argument is quite moot. I'm not saying it can't be done as IAM, I don't have a clear opinion on that yet. Another reason to think about this is that after the talk, someone in the audience pointed out this patch originally started on Postgres 9.5, and that back then there was no TAM interface. So when implementing it there was no IAM vs. TAM choice, IAM was the only way to do something like this, and thus it was used. Maybe if TAM was available back then, it'd be done differently? Not sure. Also, Alvaro seemed to think TAM is the way to go, and in order to keep the OLTP performance he suggested to use both heap and VCI at the same time, in different "forks". I'm not sure how would that work, or if we can already do that - AFAIK we can't, because ForkNumber does not allow adding custom forks. We'd have to relax that, or invent some sort of federated TAM (that just multiplexes it to two TAMs). Maybe. But it's not like the IAM approach doesn't need to do this. The first patch had to add stuff to a lot of random places to make this work. And some of the places touch stuff that we don't expect indexes to worry about, like ALTER TABLE, etc. > - Divide the patch more and more > - to allow committers to consider pros and cons and push one by one > - 15 patches is the maximum amount > - Separate features to some committable group > Yes. Both parts need to be divided into small (but meaningful) pieces. I don't recall where the "15 patches" came from, I don't think there's some sort of maximum number of patches. All I care about is that the patches need to be small-ish, and still make sense. Ideally, each part should be committable on it's own. This makes it easier to test/review, but also easier to agree on individual parts - in which case we may commit the "preparatory" pieces and not have to rebase them forever. The 0001 part is ~83K, which is not entirely crazy, but I can easily imagine it split into ~10 parts, each addressing a different subset. For 0002 it's even more important, because no one can review a 1.3MB patch. > Sawada san suggested several points; > 1. Find parts that are useful not only for the VCI and dispatching new threads > 2. Develop codes incrementally. E.g., VCI has a custom scan, but we may > be able to give up on the first version > Yes, I think this makes perfect sense. I'd probably keep it in a single thread (essentially as a single patch series), because it's very useful to see the "bigger picture" why the parts are important. But if you find something that can stand on it's own, then sure. But you'll still need to keep the part in the main patch series (it should not be necessary for people to collect patches from different threads, and it would also break the CI testing commitfests patches). Breaking 0002 into smaller parts makes perfect sense. Something like: - minimum working patch, with only serial sequential scan - add parallel sequential scan - add other custom scans, one by one ... - ... > > Also, I have questions and advice below on May 16th. > > - How to handle visibility? > - What if transactions that insert tuples are aborted? > - VCI vs Index Only Scan > - VCI seems the same as Index Only Scan > - VCI has compression. So, is the amount of size better than the Index? > - How about the efficiency for storage? How much data can we store? > - using TAM (Alvalo's suggestion) > - Most of the code can be ported from the heap. ROS and WOS can exist as forks > of main files, like .vm and .fm. API can be like: > > ``` > CREATE TABLE foo (id, product, price) USING vci WITH (to_be_indexed=price); > ``` > Yes, I think those questions are important. Not only for understanding the current patch, but also to understand what trade-offs the solution can make (e.g. about the consistency/visibility model), etc. What I really miss in the current patch is some sort of READMEs with - high-level design of the VCI indexes - description of the consistency/visibility model (does it behave the same way as querying the heap, can it be out of sync for, ...) - WOS/ROS architecture (when are rows inserted into WOS, promoted into ROS, what triggers that, ...) - what's the in-memory / on-disk format - some places in the patch mention "internal VCI tables" but I have no idea what that is - how does the execution work? compression? crucial points to consider for optimal performance, etc. - limitations (temporary - can be relaxed in the future, permanent - inherent to the columnar design) and trade-offs - what are the various custom scan executor nodes - what "background" processes happen (custom workers, ...) - anything else substantial for understanding the design Maybe there's some of this in the 0002 patch, but I haven't stumbled over it so far. ------------------------------------------------------------------------ Finally, some review comments about the 0001 patch - it's mostly in the order as appearing in git diff, file by file. Some of the points repeat, because multiple places made me think about the same thing. And some of it was mentioned above. - It seems the VCI is implemented as an index AM, but it seems to be trying to do some TAM stuff, e.g. with visibility info etc. For example, I don't think indexes generally care about XIDs that deleted rows, etc. Similarly, I can't think of another index AM that'd try to hook into tablecmds.c or execExpr. - Not sure what's the right choice (makes sense to have it as index on top of heap, so that we do regular stuff for OLTP queries, and columnar stuff using the index for analytics). But it seems it does stuff expected from table AM and not index AM. - But making a table AM would make it much more demanding (it'd have to support storing all data types, ...) and would hurt OLTP performance. - I very much support the idea of implementing this as an extension (so minimal infrastructure changes to core + extension in contrib or entirely outside). - But the patch is not there, not even with 0001, because there's a number of places mentioning "vci". There should be no reference to VCI in core, it should be generic extensibility infrastructure. It's a good indication of places that may need some extensibility. relation_open ------------------ - Why is this change (disabling assert enforcing proper locking) needed? There's a comment in lock.c which mentions VCI parallel workers, but it's suspicious. reloptions.c ------------------ - Changes that should be moved to the contrib module Why should in-core reloptions know about this? See for example how "contrib/bloom" defined reloptions. heapam.c ------------------ - add_index_delete_hook, not sure I understand the purpose of this, likely related to the consistency model, needs explanation. Why is it prefixed with "add_"? - If we need such DELETE hook, heapam.c is not where to call it, it should be called from the same "layer" adding tuples to indexes (which certainly does not happen in heap_insert). - Furthermore, if we need this, why should it be a hook and not a new callback in IndexAmRoutine, just like aminsert()? initscan ------------------ - Isn't it using keep_startblock in a wrong way? it certainly is not meant to indicate "keep the old number of blocks" - Why is this even needed? This is in heapam, so why would this even matter for VCI, which is an index? Shouldn't that be mostly independent from which TAM is used by the table? heapam_handler.c ------------------ - Why does this need to preserve the IndexHeapTuple at all? And why should it be done in heapam_handler? Again, shouldn't IAM be mostly independent of this? - Using a global variable does not seem great. We probably can't have concurrent calls (only a single heapam_index_build_range_scan call at a time), but globals are dubious. - If VCI really needs to remember the tuple, there should be a "proper" way to explicitly pass it (adjust the IndexAmRoutine callback to have a new argument or something like that). heapam_visibility.c ------------------- - It's probably sensible to make visibility snapshots extensible, i.e. to allow defining custom snapshot types etc. But the patch still does that by hard-coding stuff in core, even if the actual "visibility" code is in extension. - But this doesn't really allow custom snapshot types, those are still hardcoded SNAPSHOT_VCI_WOS2ROS / SNAPSHOT_VCI_LOCALROS. - The add_snapshot_satisfies_hook does not seem like the way to make this extensible - what if two extensions want to do this. First HeapTupleSatisfiesVisibility would have to know about all the extensions, and it could easily lead to infinite loop. Because we call the first hook, and it calls HeapTupleSatisfiesVisibility again. - This is probably a good example of VCI trying to do TAM stuff. TAM interface has tuple_satisfies_snapshot, which seems to have the same purpose as the hook. Maybe not (it still assumes regular snapshots). - But I still don't understand why should a new index AM require changes to how heap AM does visibility checks (or any visibility checks in general, as indexes don't have visibility info in PG). And why would it make sense for heap to know about this. IAM is the layer *above* TAM, in a way. xlogfuncs.c ------------------ - RECOVERY_VCI_PAUSE_REQUESTED - what's that for? why do we need a separate state from RECOVERY_PAUSE_REQUESTED? xlogrecovery.c ------------------ - I don't see why we should not log that the recovery is paused because of VCI. I sure users would still like to know why it's paused: /* If pause requested by VCI, the log is not output. */ else if (GetRecoveryPauseState() != RECOVERY_VCI_PAUSE_REQUESTED) dependency.c ------------------ - add_drop_relation_hook raises mostly the same questions as add_index_delete_hook - Don't we already have event triggers to do this kind of stuff in an extensible way? - Why should an index even care about the table getting dropped? I mean, it'll get dropped too, before the table. So why this? index.c ------------------ - Why do we need add_reindex_index_hook at all? we already have ambuild callback, right? - It seems really suspicious ConstructTupleDescriptor gains the ability to index system attributes, why is that needed? surely the VCI does not need to index system attributes, right? Or why is that needed? - It seems add_reindex_index_hook is used to "skip" rebuilding the index, but why would it be useful/necessary, and why should it be done here? - Why couldn't ambuild() just return? How do I rebuild the index if something goes wrong (e.g. it gets corrupted). Or to remove bloat, if that applies to VCI. explain.c ------------------ - T_CustomPlanMarkPos - so this introduces a new type of custom plan node? How is that different from CustomScan? We don't even have MarkPos plan in core, right? So what does it do? - It even appears to be exactly the same as CustomScan in most places, but it also introduces some new callbacks (ExplainCustomPlanTargetRel and SetBoundCustomScan). - If useful, this should be a separate patch (custom scan improvements). And it definitely needs to update the CustomScan docs in SGML. - I don't understand why we need ExplainPropertySortGroupKeys, does not align with the rest of the API (ExplainProperty is for different data types - float, text, ...). Maybe expose show_sort_group_keys? Or just copy show_sort_group_keys into the extension ... indexcmds.c ------------------ - Yet another place referencing "vci" explicitly (not extensible, should not be in core) - Not sure I understand the limitations mentioned there. tablecmds.c ------------------ - add_alter_tablespace_hook, add_alter_table_change_owner_hook, add_alter_table_change_schema_hook - Why does VCI this even need to handle these actions? - And we have event triggers, so why have these new hooks? - Not sure what the code in ATExecSetRelOptions does (another example that handles "vci" AM explicitly, in a non-extensible way). - Is it maybe rebuilding some reloptions? Maybe it's about the attnums? vacuum.c ------------------ - add_skip_vacuum_hook is used to "disable" vacuum for internal VCI tables. - Why. And what is "internal table"? execAmi.c ------------------ - what's T_CustomPlanMarkPos? execExpr.c ------------------ - Why does this expose ExecReadyExpr, ExecCreateExprSetupSteps? - Isn't this a bit too fragile? - Probably needed because VciExecInitExpr needs a new argument? But why does it need that? execExprInterp.c ------------------ - Why does this need ExprEvalVar_hook and ExprEvalParam_hook? - Doesn't even check if the hooks are != NULL - CASE_EEOP_VCI_VAR, CASE_EEOP_VCI_PARAM_EXEC - not extensible, should not reference VCI explicitly - VciExprEvalVarHook (for llvmjit?) execGrouping.c ------------------ - LookupTupleHashEntry_internal removes comment, why? execProcnode.c, execScan.c ------------------ - New MarkPos node, apparetly, but why do we need that? Why can't we just build the Materialize when constructing the plan? execUtils.c ------------------ - again, disables locking check, very suspicious nodeCustom.c ------------------ - Isn't it strange that ExecInitCustomScan short-circuits the init, skipping most of the initialization for selected scans? - If anything, it should not hard-code the scan names like this. Why is this necessary? is the rest of the initialization irrelevant, or what? Maybe this is sign it should be table AM rather than index AM? - But IOS scans would also do this initialization, so ... nodeModifytable.c ------------------ - add_should_index_insert_hook another strange hook - Doesn't the hook apply to all relations all the time, not just those with VCI indexes etc.? gen_node_support.c ------------------ - So is CustomPlanMarkPos a proper node, or why do we need it? why is CustomPlan not enough? - What's SmcAllocSetContext for? Doesn't seem to be mentioned anywhere. params.c ------------------ - ExecInitParam_hook seems unused, no? allpaths.c ------------------ - set_plain_rel_pathlist - I don't understand why this checks the VCI vs. parallelism like this, what it's trying to prevent. Maybe if you don't want certain plan shape, don't generate the partial paths in that case from the CustomScan planner callback? - It seems this tries to allow parallel seqscan, but not other more complex parallel scans (like parallel agg/join ...), but that could be done in callback, no? - Again, we should not reference the index AM this explicitly, it's not extensible. createplan.c ------------------ - I don't think we want to make copy_plan_costsize part of API. Just a detail, though. autovacuum.c ------------------ - Why does this need a custom autovacuum launcher, or what's the purpose of this code? is there a special autovacuum launcher for VCI indexes? - Again, explicit references to "VCI", should not be done like this. bgworker.c ------------------ - I have no idea what's the point of these changes. seems to be enabling canceling bgworkers, but that seems like a generic improvement. In which case it should be a separate patch. - No idea if it's correct etc. - Seems to be done to allow CountOtherDBBackends() to cancel backends, so that it can be dropped etc. - There apparently are bgworkers doing cleanup for VCI, and we want to treat them like autovacuum workers (which get interrupted in various cases instead of waiting forever). - But this will cancel all bgworkers, no? Not just the VCI ones ... lock.c ------------------ - this seems a bit strange, surely this should use the same locking approach as other parallel queries /* * Don't lock for VCI parallel workers and other type of workers should go * in normal flow, In case if there is any change in background worker * name for VCI parallel workers, the following code also needs an update. * FIXME: Try to use the community parallelism code, so that we don't need * our own VCI parallel infrastructure. */ if (AmBackgroundWorkerProcess() && strstr(MyBgworkerEntry->bgw_name, "backend=")) - Presumably this is why some of the Asserts on locking were commented out earlier. lwlock.c ------------------ - Vci_lwlock_kind seems unused. timestamp.c ------------------ - I believe the IntervalAggState is internal on purpose, why should it be made external? It also means timestamp.h has to worry about FRONTEND. relcache.c ------------------ - add_skip_vci_index_hook - yet another hook, same questions as for the other hooks earlier - It seems suspicious we have to skip VCI indexes like this during planning in RelationGetIndexAttrBitmap, not sure why that's needed? Why should't we cost the index and maybe reject it based on that? - Again, references VCI explicitly, so not extensible. mctx.c ------------------ - why MemoryContextMethods shouldn't be "const"? common.c pg_dump.c pg_dump.h ------------------ - isvciview breaks extensibility, tied to a particular IAM / extension. Needs to be abstracted / made extensible. - What is "vci view table" for, actually? reloptions.h ------------------ - Why does this need to add an entry into the relopt_kind enum? It should be extensible, see the "contrib/bloom". xlogrecovery.h ------------------ - Why is RECOVERY_VCI_PAUSE_REQUESTED separate from RECOVERY_PAUSE_REQUESTED? timestamp.h ------------------ - The FRONTEND stuff is a sign making IntervalAggState public may not be a good idea. executor.h ------------------ - How come it's suddenly OK to return entry->additional without the if conditions in TupleHashEntryGetAdditional? - Ah, it's now stored as a separate palloc chunk, not right after the tuple? Well, maybe that's a separately useful improvement? should be a separate patch, then. No opinion on whether it's a good idea. - What's with the struct LimitState; ? Why is it needed? extensible.h ------------------ - Adds SetBoundCustomScan / ExplainCustomPlanTargetRel, but then it should also add them to the SGML docs. memnodes.h ------------------ - What's SmcAllocSetContext? It's not defined anywhere. params.h ------------------ - Adds noNeedLoadFromMain, loadedFromMain to ParamExecData, but it doesn't seem to be used anywhere (in either patch). plannodes.h ------------------ - Adds plan_no to Plan, but it's not set anywhere in core, only in the VCI extension - which seems wrong (e.g. what if there are two extensions that want to set it, and they disagree?) - Either if we want/need this, it should be initialized by core during planning, in a non-ambiguous way. - But also how is this different from the existing plan_node_id? autovacuum.h ------------------ - It seems a bit strage to add "built-in" bgworkers for an extension. If the extension really needs some sort of workers, it should be defined in the extension. bgworker.h ------------------ - So what's the problem with canceling bgworkers? Should this be done as a separate patch? itemptr.h ------------------ - Why do we need this SizeOfIptrData definition? How is it different from sizeof(ItemPointerData)? lwlock.h ------------------ - LWLockKind / Vci_lwlock_kind seems unused numeric.h ------------------ - Why does this expose NumericVar, when the second patch redefines the struct anyway? also, isn't that fragile? although, we're unlikely to tweak the NumericVar definition. rel.h ------------------ - Again, not extensible way to do stuff in an extension. - We should not have pieces specific to AM in StdRdOptions. - Doesn't AM have custom reloptions now? Like contrib/bloom? I also tried to build 0002, but unfortunately that fails with: executor/vci_executor.c: In function ‘vci_setup_executor_hook’: executor/vci_executor.c:115:28: error: assignment to ‘ExecutorStart_hook_type’ {aka ‘void (*)(QueryDesc *, int)’} from incompatible pointer type ‘_Bool (*)(QueryDesc *, int)’ [-Wincompatible-pointer-types] 115 | ExecutorStart_hook = vci_executor_start_routine; | ^ executor/vci_executor.c: In function ‘vci_executor_start_routine’: executor/vci_executor.c:161:28: error: void value not ignored as it ought to be 161 | plan_valid = executor_start_prev(queryDesc, eflags); | ^ executor/vci_executor.c:163:28: error: void value not ignored as it ought to be 163 | plan_valid = standard_ExecutorStart(queryDesc, eflags); | ^ make: *** [../../src/Makefile.global:973: executor/vci_executor.o] Error 1 The extension is not added to contrib/Makefile, so "make world" does not trigger this failure. regards -- Tomas Vondra
Also, Alvaro seemed to think TAM is the way to go, and in order to keep
the OLTP performance he suggested to use both heap and VCI at the same
time, in different "forks". I'm not sure how would that work, or if we
can already do that - AFAIK we can't, because ForkNumber does not allow
adding custom forks. We'd have to relax that, or invent some sort of
federated TAM (that just multiplexes it to two TAMs). Maybe.
But it's not like the IAM approach doesn't need to do this. The first
patch had to add stuff to a lot of random places to make this work. And
some of the places touch stuff that we don't expect indexes to worry
about, like ALTER TABLE, etc.
On 6/4/25 19:59, Jim Nasby wrote: > > > On Fri, May 23, 2025 at 4:29 PM Tomas Vondra <tomas@vondra.me > <mailto:tomas@vondra.me>> wrote: > > Also, Alvaro seemed to think TAM is the way to go, and in order to keep > the OLTP performance he suggested to use both heap and VCI at the same > time, in different "forks". I'm not sure how would that work, or if we > can already do that - AFAIK we can't, because ForkNumber does not allow > adding custom forks. We'd have to relax that, or invent some sort of > federated TAM (that just multiplexes it to two TAMs). Maybe. > > But it's not like the IAM approach doesn't need to do this. The first > patch had to add stuff to a lot of random places to make this work. And > some of the places touch stuff that we don't expect indexes to worry > about, like ALTER TABLE, etc. > > > I suspect another option would be to handle this with table inheritance: > have one child that is heap-based, a second that's VCI, and a background > job to move data from heap to VCI (and vice-versa for updates and maybe > deletes). > > Note that you could actually implement all that in user-space. > Personally I'd much rather have a way to do pure VCI / column-store > sooner and manage it myself than have to wait another release (or more) > to get a complete solution... I don't see how could this ever work with the optimizer, which assumes scanning an inheritance hierarchy means scanning all parts. But this would require making planner "smarter" to know it should scan only one of the child relations. And I believe it's not possible to do that while constructing scans for the heap/VCI parts, those places are not aware of what other parts are being scanned etc. Sure, you could do this in "user-space" by constructing queries that reference either the heap or VCI part. But then why put that into inheritance tree at all? It certainly does not help with moving data between the parts. I may be missing something, of course. But it's not clear to me how is this supposed to work ... What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the columnar format in a separate fork. And using either that from custom scans, or the heap as a fallback for cases not supported by VCI. regars -- Tomas Vondra
On 6/4/25 19:59, Jim Nasby wrote:
>
>
> On Fri, May 23, 2025 at 4:29 PM Tomas Vondra <tomas@vondra.me
> <mailto:tomas@vondra.me>> wrote:
>
> Also, Alvaro seemed to think TAM is the way to go, and in order to keep
> the OLTP performance he suggested to use both heap and VCI at the same
> time, in different "forks". I'm not sure how would that work, or if we
> can already do that - AFAIK we can't, because ForkNumber does not allow
> adding custom forks. We'd have to relax that, or invent some sort of
> federated TAM (that just multiplexes it to two TAMs). Maybe.
>
> But it's not like the IAM approach doesn't need to do this. The first
> patch had to add stuff to a lot of random places to make this work. And
> some of the places touch stuff that we don't expect indexes to worry
> about, like ALTER TABLE, etc.
>
>
> I suspect another option would be to handle this with table inheritance:
> have one child that is heap-based, a second that's VCI, and a background
> job to move data from heap to VCI (and vice-versa for updates and maybe
> deletes).
>
> Note that you could actually implement all that in user-space.
> Personally I'd much rather have a way to do pure VCI / column-store
> sooner and manage it myself than have to wait another release (or more)
> to get a complete solution...
I don't see how could this ever work with the optimizer, which assumes
scanning an inheritance hierarchy means scanning all parts. But this
would require making planner "smarter" to know it should scan only one
of the child relations. And I believe it's not possible to do that while
constructing scans for the heap/VCI parts, those places are not aware of
what other parts are being scanned etc.
Sure, you could do this in "user-space" by constructing queries that
reference either the heap or VCI part. But then why put that into
inheritance tree at all? It certainly does not help with moving data
between the parts.
What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the
columnar format in a separate fork. And using either that from custom
scans, or the heap as a fallback for cases not supported by VCI.
What I can imagine is "VCI" as a "proxy" TAM on top of heap, keeping the
columnar format in a separate fork. And using either that from custom
scans, or the heap as a fallback for cases not supported by VCI.Yeah, there'd definitely need to be some kind of proxy... I'm just suggesting that we don't *have* to do that as a separate fork...