Thread: Alternative SAOP support for GiST
Hi All,
During my journey of designing Pg based solution at my work I was severely hit by several shortcomings in GiST.
The most severe one is the lack of support for SAOP filters as it makes it difficult to have partition pruning and index (only) scans working together.
To overcome the difficulties I implemented a simple extension:
Since it provides a separate operator class from btree_gist it requires re-indexing the data.
So I thought maybe it would be a good idea to incorporate it into btree_gist.
I am aware of two patches related to SAOP being discussed at the moment but I am not sure if SAOP support for GiST indexes is planned.
Let me know if you think it is a good idea to work on a patch.
More general remark:
I am wondering if SAOP support in core should not be implemented by mapping SAOP operations to specialised operators and delegating
all the work to index AMs. That way core could be decoupled from particular index AMs implementation details.
Thanks!
—
Michal
Attachment
Hi All,
On 11 Mar 2024, at 18:58, Michał Kłeczek <michal@kleczek.org> wrote:Hi All,During my journey of designing Pg based solution at my work I was severely hit by several shortcomings in GiST.The most severe one is the lack of support for SAOP filters as it makes it difficult to have partition pruning and index (only) scans working together.To overcome the difficulties I implemented a simple extension:Since it provides a separate operator class from btree_gist it requires re-indexing the data.So I thought maybe it would be a good idea to incorporate it into btree_gist.
While working on supporting (sort of) SAOP support for Gist I was stuck with the interplay between Gist consistent function and partition pruning.
Not sure how it applies to SAOP handling in general though.
I’ve implemented an operator class/family that supports Gist index scan for the following query:
SELECT * FROM tbl WHERE col ||= ARRAY[‘a’, ‘b’, ‘c’];
It all works well except cases where tbl is partitioned by “col” column.
In this case index scan unnecessarily scans pages for values that are not in the partition.
I am not sure if it works as expected (ie. no unnecessary scans) in case of ANY = (ARRAY[]) and nbtree.
In case of Gist the only place where pre-processing of queries can be performed is consistent function.
But right now there is no possibility to access any scan related information as it is not passed to consistent function.
The only thing available is GISTENTRY and it does not contain any metadata.
As a workaround I’ve added options to the op family that allows (redundantly) specifying MODULUS/REMAINDER for the index:
CREATE INDEX idx ON tbl_partition_01 USING gist ( col gist_text_extra_ops (modulus=4, remainder=2) )
and use the values to filter out query array passed to consistent function.
This is of course not ideal:
- the information is redundant
- changing partitioning scheme requires re-indexing
I am wondering if it is possible to extend consistent function API so that either ScanKey or even the whole IndexScanDesc is passed as additional parameter.
—
Michal
I realised it might be enough to pass sk_attno to consistent function as that should be enough to lookup metadata if needed.
Attached is a draft patch that does this.
—
Michal
On 18 Mar 2024, at 14:31, Michał Kłeczek <michal@kleczek.org> wrote:Hi All,On 11 Mar 2024, at 18:58, Michał Kłeczek <michal@kleczek.org> wrote:Hi All,During my journey of designing Pg based solution at my work I was severely hit by several shortcomings in GiST.The most severe one is the lack of support for SAOP filters as it makes it difficult to have partition pruning and index (only) scans working together.To overcome the difficulties I implemented a simple extension:Since it provides a separate operator class from btree_gist it requires re-indexing the data.So I thought maybe it would be a good idea to incorporate it into btree_gist.While working on supporting (sort of) SAOP support for Gist I was stuck with the interplay between Gist consistent function and partition pruning.Not sure how it applies to SAOP handling in general though.I’ve implemented an operator class/family that supports Gist index scan for the following query:SELECT * FROM tbl WHERE col ||= ARRAY[‘a’, ‘b’, ‘c’];It all works well except cases where tbl is partitioned by “col” column.In this case index scan unnecessarily scans pages for values that are not in the partition.I am not sure if it works as expected (ie. no unnecessary scans) in case of ANY = (ARRAY[]) and nbtree.In case of Gist the only place where pre-processing of queries can be performed is consistent function.But right now there is no possibility to access any scan related information as it is not passed to consistent function.The only thing available is GISTENTRY and it does not contain any metadata.As a workaround I’ve added options to the op family that allows (redundantly) specifying MODULUS/REMAINDER for the index:CREATE INDEX idx ON tbl_partition_01 USING gist ( col gist_text_extra_ops (modulus=4, remainder=2) )and use the values to filter out query array passed to consistent function.This is of course not ideal:- the information is redundant- changing partitioning scheme requires re-indexingI am wondering if it is possible to extend consistent function API so that either ScanKey or even the whole IndexScanDesc is passed as additional parameter.—Michal
Attachment
Wrong file in the previous message - sorry for the noise.
Attached is a fixed patch.
Thanks,
Michal
On 18 Mar 2024, at 15:14, Michał Kłeczek <michal@kleczek.org> wrote:I realised it might be enough to pass sk_attno to consistent function as that should be enough to lookup metadata if needed.<0001-Pass-key-sk_attno-to-consistent-function.patch>Attached is a draft patch that does this.—MichalOn 18 Mar 2024, at 14:31, Michał Kłeczek <michal@kleczek.org> wrote:Hi All,On 11 Mar 2024, at 18:58, Michał Kłeczek <michal@kleczek.org> wrote:Hi All,During my journey of designing Pg based solution at my work I was severely hit by several shortcomings in GiST.The most severe one is the lack of support for SAOP filters as it makes it difficult to have partition pruning and index (only) scans working together.To overcome the difficulties I implemented a simple extension:Since it provides a separate operator class from btree_gist it requires re-indexing the data.So I thought maybe it would be a good idea to incorporate it into btree_gist.While working on supporting (sort of) SAOP support for Gist I was stuck with the interplay between Gist consistent function and partition pruning.Not sure how it applies to SAOP handling in general though.I’ve implemented an operator class/family that supports Gist index scan for the following query:SELECT * FROM tbl WHERE col ||= ARRAY[‘a’, ‘b’, ‘c’];It all works well except cases where tbl is partitioned by “col” column.In this case index scan unnecessarily scans pages for values that are not in the partition.I am not sure if it works as expected (ie. no unnecessary scans) in case of ANY = (ARRAY[]) and nbtree.In case of Gist the only place where pre-processing of queries can be performed is consistent function.But right now there is no possibility to access any scan related information as it is not passed to consistent function.The only thing available is GISTENTRY and it does not contain any metadata.As a workaround I’ve added options to the op family that allows (redundantly) specifying MODULUS/REMAINDER for the index:CREATE INDEX idx ON tbl_partition_01 USING gist ( col gist_text_extra_ops (modulus=4, remainder=2) )and use the values to filter out query array passed to consistent function.This is of course not ideal:- the information is redundant- changing partitioning scheme requires re-indexingI am wondering if it is possible to extend consistent function API so that either ScanKey or even the whole IndexScanDesc is passed as additional parameter.—Michal
Attachment
Hi All, Since it looks like there is not much interest in the patch I will try to provide some background to explain why I thinkit is needed. We are in the process of migration from an old db platform to PostgreSQL. Our database is around 10TB big and contains around 10 billion financial transactions in a single table. Each transaction is assigned to an account (column acc_number). We have partitioned the table BY HASH (acc_number). A client can query transactions belonging to his accounts using several criteria - among them is te xt search. Queries are of type TOP N (ie ORDER BY … LIMIT ). The list of accounts that we are querying is provided as a parameter to the query. We have decided to use a single Gist index supporting all queries (reasons described in [1]). There are several problems with Gist usage (but I still think we have no other choice) but the most important is that we cannot use SAOP in our queries - since Gist does not support it the planner decides to perform Bitmap Scan which in turn does not support ORDER BY … LIMIT well because requires Sort. So when we use “= ANY (array of account numbers) … ORDER BY ...” the plan requires reading all records meeting search criteria and then sort. As a workaround we have to perform LATERAL joins: unnest(list of account numbers) AS a(n) LATERAL JOIN (SELECT * FROM … WHERE account_number = a.n ORDER BY … LIMIT …) ORDERBY … LIMIT … It is still bad because requires multiple scans of the same partition if account number hashes are the same. What we really need is for Gist to support “= ANY (…)”. As Gist index is extensible in terms of queries it supports it is quite easy to implement an operator class/family with operator: ||= (text, text[]) that has semantics the same as “= ANY (…)” With this operator we can write our queries like: account_number ||= [list of account numbers] AND account_number = ANY ([list of account numbers]) — redundant for partition pruning as it does not understand ||= and have optimal plans: Limit — Merge Append —— Index scan of relevant partitions The problem is that now each partition scan is for the same list of accounts. The “consistent” function cannot assume anything about contents of the table so it has to check all elements of the list and that in turn causes reading unnecessary index pages for accounts not in this partition. What we need is a way for the “consistent” function to be able to pre-process input query array and remove elements that are not relevant for this scan. To do that it is necessary to have enough information to read necessary metadata fromthe catalog. The proposed patch addresses this need and seems (to me) largely uncontroversial as it does not break any existing extensions. Attached is a patch with consolidated changes (I am pretty new to managing patches so previous two were partial and not somethingshareable, I am afraid). — Michal
Attachment
On Tue, 19 Mar 2024 at 17:00, Michał Kłeczek <michal@kleczek.org> wrote: > > Hi All, > > Since it looks like there is not much interest in the patch I will try to provide some background to explain why I thinkit is needed. > [...] > What we really need is for Gist to support “= ANY (…)”. > As Gist index is extensible in terms of queries it supports it is quite easy to implement an operator class/family withoperator: > > ||= (text, text[]) > > that has semantics the same as “= ANY (…)” I've had a similar idea while working on BRIN, and was planning on overloading `@>` for @>(anyarray, anyelement) or using a new `@>>(anyarray, anyelement)` operator. No implementation yet, though. > With this operator we can write our queries like: > > account_number ||= [list of account numbers] AND > account_number = ANY ([list of account numbers]) — redundant for partition pruning as it does not understand ||= > > and have optimal plans: > > Limit > — Merge Append > —— Index scan of relevant partitions > > The problem is that now each partition scan is for the same list of accounts. > The “consistent” function cannot assume anything about contents of the table so it has to check all elements of the list > and that in turn causes reading unnecessary index pages for accounts not in this partition. You seem to already be using your own operator class, so you may want to look into CREATE FUNCTION's support_function parameter; which would handle SupportRequestIndexCondition and/or SupportRequestSimplify. I suspect a support function that emits multiple clauses that each apply to only a single partition at a time should get you quite far if combined with per-partition constraints that filter all but one of those. Also, at plan-time you can get away with much more than at runtime. > What we need is a way for the “consistent” function to be able to pre-process input query array and remove elements > that are not relevant for this scan. To do that it is necessary to have enough information to read necessary metadata fromthe catalog. > The proposed patch addresses this need and seems (to me) largely uncontroversial as it does not break any existing extensions. I don't think "my index operator class will go into the table definition and check if parts of the scankey are consistent with the table constraints" is a good reason to expose the index column to operator classes. Note that operator classes were built specifically so that they're independent from their column position. It doens't really make sense to expose this. Maybe my suggestion up above helps? Kind regards, Matthias van de Meent
> On 21 Mar 2024, at 23:42, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: > > On Tue, 19 Mar 2024 at 17:00, Michał Kłeczek <michal@kleczek.org> wrote: > >> With this operator we can write our queries like: >> >> account_number ||= [list of account numbers] AND >> account_number = ANY ([list of account numbers]) — redundant for partition pruning as it does not understand ||= >> >> and have optimal plans: >> >> Limit >> — Merge Append >> —— Index scan of relevant partitions >> >> The problem is that now each partition scan is for the same list of accounts. >> The “consistent” function cannot assume anything about contents of the table so it has to check all elements of the list >> and that in turn causes reading unnecessary index pages for accounts not in this partition. > > You seem to already be using your own operator class, so you may want > to look into CREATE FUNCTION's support_function parameter; which would > handle SupportRequestIndexCondition and/or SupportRequestSimplify. > I suspect a support function that emits multiple clauses that each > apply to only a single partition at a time should get you quite far if > combined with per-partition constraints that filter all but one of > those. Also, at plan-time you can get away with much more than at > runtime. Thanks for suggestion. I am afraid I don’t understand how it might actually work though: 1) I think planning time is too early for us - we are heavily using planner_mode = force_generic_plan: - we have many partitions and planning times started to dominate execution time - generic plans are not worse than specialised 2) I am not sure how I could transform "col ||= [array]" to multiple criteria to make sure it works well with partition pruning and planner. It looks like what you are suggesting is to generate something like: (part_condition AND col ||= [subarray1]) OR (part_condition AND col ||= [subarray2]) and hope the planner would generate proper Merge Append node (which I doubt would happen and planner would generate Bitmapscan due to lack of OR support in Gist). What’s more - there is no part_condition for hash partitions. > >> What we need is a way for the “consistent” function to be able to pre-process input query array and remove elements >> that are not relevant for this scan. To do that it is necessary to have enough information to read necessary metadatafrom the catalog. > >> The proposed patch addresses this need and seems (to me) largely uncontroversial as it does not break any existing extensions. > > I don't think "my index operator class will go into the table > definition and check if parts of the scankey are consistent with the > table constraints" is a good reason to expose the index column to > operator classes. Quite possibly but I still don’t see any other way to do that TBH. — Michal
> On 22 Mar 2024, at 01:29, Michał Kłeczek <michal@kleczek.org> wrote: > > > >> On 21 Mar 2024, at 23:42, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: >> >>> On Tue, 19 Mar 2024 at 17:00, Michał Kłeczek <michal@kleczek.org> wrote: >>> >>> With this operator we can write our queries like: >>> >>> account_number ||= [list of account numbers] AND >>> account_number = ANY ([list of account numbers]) — redundant for partition pruning as it does not understand ||= >>> >>> and have optimal plans: >>> >>> Limit >>> — Merge Append >>> —— Index scan of relevant partitions >>> >>> The problem is that now each partition scan is for the same list of accounts. >>> The “consistent” function cannot assume anything about contents of the table so it has to check all elements of the list >>> and that in turn causes reading unnecessary index pages for accounts not in this partition. >> >> You seem to already be using your own operator class, so you may want >> to look into CREATE FUNCTION's support_function parameter; which would >> handle SupportRequestIndexCondition and/or SupportRequestSimplify. >> I suspect a support function that emits multiple clauses that each >> apply to only a single partition at a time should get you quite far if >> combined with per-partition constraints that filter all but one of >> those. Also, at plan-time you can get away with much more than at >> runtime. Thinking about it some more - the suggestion goes backwards - ie. there must have been some deep misunderstanding: If I was able to write my query such that the planner generates optimal plan, I would not implement the custom operator inthe first place. And this need of custom operator and operator class triggered the proposal in turn. — Michal
On 22 Mar 2024, at 10:11, Michał Kłeczek <michal@kleczek.org> wrote:
On 22 Mar 2024, at 01:29, Michał Kłeczek <michal@kleczek.org> wrote:
On 21 Mar 2024, at 23:42, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
You seem to already be using your own operator class, so you may want
to look into CREATE FUNCTION's support_function parameter; which would
handle SupportRequestIndexCondition and/or SupportRequestSimplify.
I suspect a support function that emits multiple clauses that each
apply to only a single partition at a time should get you quite far if
combined with per-partition constraints that filter all but one of
those. Also, at plan-time you can get away with much more than at
runtime.
Thinking about it some more - the suggestion goes backwards - ie. there must have been some deep misunderstanding:
If I was able to write my query such that the planner generates optimal plan, I would not implement the custom operator in the first place.
To make it more concrete:
create extension pg_trgm;
create table t ( pk text not null primary key, data text not null ) partition by hash (pk);
create table t_2_0 partition of t for values with ( modulus 2, remainder 0 );
create table t_2_0 partition of t for values with ( modulus 2, remainder 1 );
create index ti on t using gist ( pk, data gist_trgm_ops);
explain select * from t where pk = any (array['1', '2', '4', '5']) and data % 'what' order by data <-> 'what' limit 5;
Limit (cost=41.32..41.33 rows=2 width=21)
-> Sort (cost=41.32..41.33 rows=2 width=21)
Sort Key: ((t.data <-> 'what'::text))
-> Append (cost=16.63..41.31 rows=2 width=21)
-> Bitmap Heap Scan on t_2_0 t_1 (cost=16.63..20.65 rows=1 width=21)
Recheck Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data % 'what'::text))
-> Bitmap Index Scan on t_2_0_pk_data_idx (cost=0.00..16.63 rows=1 width=0)
Index Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data % 'what'::text))
-> Bitmap Heap Scan on t_2_1 t_2 (cost=16.63..20.65 rows=1 width=21)
Recheck Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data % 'what'::text))
-> Bitmap Index Scan on t_2_1_pk_data_idx (cost=0.00..16.63 rows=1 width=0)
Index Cond: ((pk = ANY ('{1,2,4,5}'::text[])) AND (data % 'what'::text))
That’s bad as the number of records to sort might be huge. So:
set enable_bitmapscan to off;
Limit (cost=0.30..242.85 rows=2 width=21)
-> Merge Append (cost=0.30..242.85 rows=2 width=21)
Sort Key: ((t.data <-> 'what'::text))
-> Index Scan using t_2_0_pk_data_idx on t_2_0 t_1 (cost=0.15..121.40 rows=1 width=21)
Index Cond: (data % 'what'::text)
Order By: (data <-> 'what'::text)
Filter: (pk = ANY ('{1,2,4,5}'::text[]))
-> Index Scan using t_2_1_pk_data_idx on t_2_1 t_2 (cost=0.15..121.42 rows=1 width=21)
Index Cond: (data % 'what'::text)
Order By: (data <-> 'what'::text)
Filter: (pk = ANY ('{1,2,4,5}'::text[]))
That’s bad as well since pk = ANY([…]) is not in Index Cond.
Lets use ||= operator then:
drop index ti;
create index ti on t using gist ( pk gist_extra_text_ops, data gist_trgm_ops);
explain select * from t where pk = any (array['1', '2', '4', '5']) and pk ||= array['1', '2', '4', '5'] and data % 'what' order by data <-> 'what' limit 5;
Limit (cost=0.30..153.70 rows=2 width=21)
-> Merge Append (cost=0.30..153.70 rows=2 width=21)
Sort Key: ((t.data <-> 'what'::text))
-> Index Scan using t_2_0_pk_data_idx on t_2_0 t_1 (cost=0.15..76.84 rows=1 width=21)
Index Cond: ((pk ||= '{1,2,4,5}'::text[]) AND (data % 'what'::text))
Order By: (data <-> 'what'::text)
Filter: (pk = ANY ('{1,2,4,5}'::text[]))
-> Index Scan using t_2_1_pk_data_idx on t_2_1 t_2 (cost=0.15..76.84 rows=1 width=21)
Index Cond: ((pk ||= '{1,2,4,5}'::text[]) AND (data % 'what'::text))
Order By: (data <-> 'what'::text)
Filter: (pk = ANY ('{1,2,4,5}'::text[]))
That’s much better. There is still Filter on SAOP but it is almost harmless: always true and does not require heap access.
A few observations:
1. I am not sure why SAOP can be Index Cond in case of Bitmap Index Scan but not in case of Index Scan - don’t know what the interplay between the planner and amsearcharray is.
2. In all cases array passed to (Bitmap) Index Scan is NOT filtered by partition pruning.
The second point means useless index page reads (amplified by the number of elements in input array and the width of the index).
It is the _second_ point I would like to address with this patch - for us it makes a big difference as it means almost order of magnitude (around 5-10 times based on preliminary tests) fewer buffers read.
I’ve done an experiment with adding partitioning bounds information as options when creating index. Table as above (so only two index columns) but with 128 partitions (modulus 128).
Without filtering array elements we get 103 buffers read:
explain (analyse, buffers) select * from t where pk ||= array['1', '5', '10', '2', '3', '4', '23', '5', '7', '0'] and pk = any (array['1', '5', '10', '2', '3', '4', '23', '5', '7', '0']) and data % 'ever 3' order by data <-> '3' limit 5;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2.82..43.07 rows=5 width=32) (actual time=8.093..9.390 rows=4 loops=1)
Buffers: shared hit=103
-> Merge Append (cost=2.82..75.28 rows=9 width=32) (actual time=8.091..9.387 rows=4 loops=1)
Sort Key: ((t.data <-> '3'::text))
Buffers: shared hit=103
-> Index Scan using t_128_9_pk_data_idx on t_128_9 t_1 (cost=0.30..8.33 rows=1 width=32) (actual time=0.741..0.741 rows=0 loops=1)
Index Cond: ((pk ||= '{1,5,10,2,3,4,23,5,7,0}'::text[]) AND (data % 'ever 3'::text))
Order By: (data <-> '3'::text)
Filter: (pk = ANY ('{1,5,10,2,3,4,23,5,7,0}'::text[]))
Buffers: shared hit=10
… more partitions
After creating indexes for all partitions with some metadata used by consistent function to filter out array elements:
create index ti_128_0 on t_128_0 (pk gist_extra_text_ops (modulus=128, remainder=0), data gist_trgm_ops (siglen=128))
The result is only 29 buffers read:
explain (analyse, buffers) select * from t where pk ||= array['1', '5', '10', '2', '3', '4', '23', '5', '7', '0'] and pk = any (array['1', '5', '10', '2', '3', '4', '23', '5', '7', '0']) and data % 'ever 3' order by data <-> '3' limit 5;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=2.82..43.07 rows=5 width=32) (actual time=0.852..0.864 rows=4 loops=1)
Buffers: shared hit=29
-> Merge Append (cost=2.82..75.28 rows=9 width=32) (actual time=0.850..0.862 rows=4 loops=1)
Sort Key: ((t.data <-> '3'::text))
Buffers: shared hit=29
-> Index Scan using ti_128_9 on t_128_9 t_1 (cost=0.30..8.33 rows=1 width=32) (actual time=0.045..0.045 rows=0 loops=1)
Index Cond: ((pk ||= '{1,5,10,2,3,4,23,5,7,0}'::text[]) AND (data % 'ever 3'::text))
Order By: (data <-> '3'::text)
Filter: (pk = ANY ('{1,5,10,2,3,4,23,5,7,0}'::text[]))
Buffers: shared hit=1
…. more partitions
There is 3x fewer buffers read in this case (103 vs 29).
It makes a huge difference in memory pressure in our case (10 billion rows 10 TB data, wide index to support various search criteria - text/similiarity search among them).
I understand extensibility of GIST makes many operators opaque to the planner and it is the “consistent” function that can perform optimisations (or we can come up with some additional mechanism during planning phase).
Providing more information to “consistent” function would make it possible to implement optimisations not only for array scan keys but for ranges and others.
—
Michal
=?utf-8?Q?Micha=C5=82_K=C5=82eczek?= <michal@kleczek.org> writes: > I understand extensibility of GIST makes many operators opaque to the planner and it is the “consistent” function thatcan perform optimisations (or we can come up with some additional mechanism during planning phase). > Providing more information to “consistent” function would make it possible to implement optimisations not only for arrayscan keys but for ranges and others. > What we can do is to provide the index attribute number (reduntantly) as option - it is going to work but is somewhat ugly- especially that this information is already available when calling consistent function. While the proposed change is simple enough and wouldn't break anything, I share Matthias' distaste for it, because the motivation you've given for it is deeply unsatisfying and indeed troubling. A GIST consistent function is surely the wrong place to be optimizing away unnecessary partitions: that consideration is not unique to GIST indexes (or even to indexscans), much less to particular opclasses. Moreover, having a consistent function inspect catalog state seems like a kluge of the first order, not least because there'd be no good place to cache the lookup results, so you'd be doing them over and over again. I like Matthias' suggestion of seeing whether you could use a planner support function to split up the query by partitions during planning. There are bits of what you mentioned that I would gladly take patches for --- for example, I think the reason GIST lacks SAOP support is mostly lack of round tuits, not that it'd be a bad thing. But it's not clear to me that that alone would help you much. The whole design as you've described it seems like multiple layers of kluges, so that getting rid of any one of them doesn't really make it not klugy. (I also have serious doubts about how well it would perform for you, even with this additional kluge in place. I don't think that a multidimensional GIST over zillions of rows will perform very well: the upper tree layers are just not going to be very selective.) regards, tom lane
Hi Tom, Thanks for looking at it. > On 24 Jul 2024, at 22:19, Tom Lane <tgl@sss.pgh.pa.us> wrote: > > =?utf-8?Q?Micha=C5=82_K=C5=82eczek?= <michal@kleczek.org> writes: >> I understand extensibility of GIST makes many operators opaque to the planner and it is the “consistent” function thatcan perform optimisations (or we can come up with some additional mechanism during planning phase). >> Providing more information to “consistent” function would make it possible to implement optimisations not only for arrayscan keys but for ranges and others. > >> What we can do is to provide the index attribute number (reduntantly) as option - it is going to work but is somewhatugly - especially that this information is already available when calling consistent function. > > While the proposed change is simple enough and wouldn't break > anything, I share Matthias' distaste for it, because the motivation > you've given for it is deeply unsatisfying and indeed troubling. > A GIST consistent function is surely the wrong place to be optimizing > away unnecessary partitions: that consideration is not unique to > GIST indexes (or even to indexscans), much less to particular opclasses. > Moreover, having a consistent function inspect catalog state seems > like a kluge of the first order, not least because there'd be no good > place to cache the lookup results, so you'd be doing them over and > over again. Indeed - caching results of such lookups is clumsy. And I agree the whole thing is hackish. But in reality I think this is due to fundamental mismatch between extensibility interface of GIST and the lack of it in partition pruning code. > > I like Matthias' suggestion of seeing whether you could use a > planner support function to split up the query by partitions > during planning. You mean implement custom partition pruning algorithm using query rewriting? As I’ve written in another message in this thread: If it was possible to write the query in such a way then the whole discussion would be moot and I wouldn’t propose this patch :) > > There are bits of what you mentioned that I would gladly take > patches for --- for example, I think the reason GIST lacks SAOP > support is mostly lack of round tuits, not that it'd be a bad > thing. But it's not clear to me that that alone would help you > much. The whole design as you've described it seems like multiple > layers of kluges, so that getting rid of any one of them doesn't > really make it not klugy. Kluges are workarounds for lack of two things in GIST: - missing SAOP support (which I try to simulate using custom operator) - missing ORDER BY support (for which I posted a draft idea in another patch) > (I also have serious doubts about how well > it would perform for you, even with this additional kluge in place. > I don't think that a multidimensional GIST over zillions of rows will > perform very well: the upper tree layers are just not going to be very > selective.) It work surprisingly well in practice as our performance tests show. A single multicolumn GIST index is capable of handling most of our queries. Regards, — Michal
On Fri, 22 Mar 2024, 01:29 Michał Kłeczek, <michal@kleczek.org> wrote: > On 21 Mar 2024, at 23:42, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote: >> On Tue, 19 Mar 2024 at 17:00, Michał Kłeczek <michal@kleczek.org> wrote: >>> With this operator we can write our queries like: >>> >>> account_number ||= [list of account numbers] AND >>> account_number = ANY ([list of account numbers]) — redundant for partition pruning as it does not understand ||= >>> >>> and have optimal plans: >>> >>> Limit >>> — Merge Append >>> —— Index scan of relevant partitions >>> >>> The problem is that now each partition scan is for the same list of accounts. >>> The “consistent” function cannot assume anything about contents of the table so it has to check all elements of the list >>> and that in turn causes reading unnecessary index pages for accounts not in this partition. >> >> You seem to already be using your own operator class, so you may want >> to look into CREATE FUNCTION's support_function parameter; which would >> handle SupportRequestIndexCondition and/or SupportRequestSimplify. >> I suspect a support function that emits multiple clauses that each >> apply to only a single partition at a time should get you quite far if >> combined with per-partition constraints that filter all but one of >> those. Also, at plan-time you can get away with much more than at >> runtime. > > Thanks for suggestion. > > I am afraid I don’t understand how it might actually work though: [...] > 2) I am not sure how I could transform > "col ||= [array]" to multiple criteria to make sure it works well with partition pruning and planner. > > It looks like what you are suggesting is to generate something like: > (part_condition AND col ||= [subarray1]) OR (part_condition AND col ||= [subarray2]) > and hope the planner would generate proper Merge Append node (which I doubt would happen and planner would generate Bitmapscan due to lack of OR support in Gist). > What’s more - there is no part_condition for hash partitions. I would probably (try to) implement something like the following: - Alter each partition by adding a constraint `CHECK (hash(partitioncol) % part_modulus = part_remainder)`, to give the planner the tools to do partition pruning. This solves the partition pruning part, in userspace. I woudln't be opposed to a fix in PostgreSQL if done well - hash partition pruning sounds like a niche, but a valid niche nonetheless. - Add support function that translates e.g. ||=(array, elem) on the base table into a list of OR-ed `(hash(partitioncol) % part_modulus = part_remainder AND col ||= [sublist])`-statements, one for each of the partitions. - Make sure there's a planner facility that pushes down OR branches and removes non-matched qualifiers. Presumably, partition pruning can already take care of that. The planner should be able to deduct that each partition has their own -unique- CHECK constraint, and that this check can't be satisfied by any other partition and thus is ignored for those other partitions' scans. Alternatively, partition the table not using HASH, but using LIST ((hash(partitioncol) % modulus)) - this should enable partition pruning without creating those manual CHECK constraints. However, that'd be at the cost of access to hash partitioning-related features, like different modulos at the same partitioning level. All in all, this still seems like a very (very) specific optimization, of which I'm not sure that it is generalizable. However, array introspection and filtering for SAOP equality checks feel like a relatively easy (?) push-down optimization in (e.g.) runtime partition pruning (or planning); isn't that even better patch potential here? Kind regards, Matthias van de Meent Neon (https://neon.tech)
On 26 Jul 2024, at 01:28, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
All in all, this still seems like a very (very) specific optimization,
of which I'm not sure that it is generalizable. However, array
introspection and filtering for SAOP equality checks feel like a
relatively easy (?) push-down optimization in (e.g.) runtime partition
pruning (or planning); isn't that even better patch potential here?
The issue is not partition pruning for SAOP - it works fine. The issue is lack of SAOP support in GIST.
Because I cannot use SAOP I have two options:
1) LATERAL JOIN (ie. iterate through input array elements, SELECT rows for each, merge results)
2) Implement a custom operator that emulates SAOP and provide consistent function for it. Additionally provide SAOP clause (redundantly) to enable partition pruning.
In case of 1):
- the query becomes convoluted with multiple redundant ORDER BY and LIMIT clauses
- unnecessary sort is performed (because we have to merge results of subqueries)
- some partitions are scanned multiple times (per each element in input array that happens to land in the same partition)
In case of 2):
- the whole input array is passed to consistent function for each partition so we unnecessarily search for non-existent rows
To fix the issue in 2) I need to somehow filter input array per partition - hence this patch.
Regards,
—
Michal
On 26 Jul 2024, at 10:10, Michał Kłeczek <michal@kleczek.org> wrote:On 26 Jul 2024, at 01:28, Matthias van de Meent <boekewurm+postgres@gmail.com> wrote:
All in all, this still seems like a very (very) specific optimization,
of which I'm not sure that it is generalizable. However, array
introspection and filtering for SAOP equality checks feel like a
relatively easy (?) push-down optimization in (e.g.) runtime partition
pruning (or planning); isn't that even better patch potential here?The issue is not partition pruning for SAOP - it works fine. The issue is lack of SAOP support in GIST.Because I cannot use SAOP I have two options:1) LATERAL JOIN (ie. iterate through input array elements, SELECT rows for each, merge results)2) Implement a custom operator that emulates SAOP and provide consistent function for it. Additionally provide SAOP clause (redundantly) to enable partition pruning.In case of 1):- the query becomes convoluted with multiple redundant ORDER BY and LIMIT clauses- unnecessary sort is performed (because we have to merge results of subqueries)- some partitions are scanned multiple times (per each element in input array that happens to land in the same partition)In case of 2):- the whole input array is passed to consistent function for each partition so we unnecessarily search for non-existent rows
Which is especially painful in case of sharding implementation based on partitioning and postgres_fdw as it requires multiple
SELECTS from remote partition.
To fix the issue in 2) I need to somehow filter input array per partition - hence this patch.
Regards,
—
Michal