Thread: Wrapping a where clause to preserve rows with nulls
I have a query where I have to run a where clause generated by another system (i.e., I can't modify that where clause. The where clause may return `null`, but I actually want to keep rows that return `null` (and rows that return `true` but not rows that return `false`).
I thought it would be as simple as wrapping in `(...) is not false` but that seems to prevent index usage.
For example, let's say that given the table:
CREATE TABLE test_index (value INTEGER);
CREATE INDEX idx_value ON test_index(value);
CREATE INDEX idx_value ON test_index(value);
And the predicate `value = 5000`, if I run the original query that excludes rows where the predicate is null it uses the index:
SELECT *
FROM test_index
WHERE value = 5000;
But as soon as I tack on an `IS NOT FALSE` the index is not used:
SELECT *
FROM test_index
WHERE (value = 5000) IS NOT FALSE;
This was surprising to me. I was hoping this might be able to use the index.
Is there any way to include the rows where the predicate evaluates to null while still using an index?
I'll note that the clause is arbitrary in the sense that I don't generate it and cannot edit it but it's basically a bunch of boolean comparisons chained i.e. `col_a >= 1 and col_b <=5 and col_c ...` so I can in general add an index on say col_a and it does get used.
On Wed, Dec 18, 2024 at 9:47 PM Adrian Garcia Badaracco <adrian@adriangb.com> wrote:
I have a query where I have to run a where clause generated by another system (i.e., I can't modify that where clause. The where clause may return `null`, but I actually want to keep rows that return `null` (and rows that return `true` but not rows that return `false`).I thought it would be as simple as wrapping in `(...) is not false` but that seems to prevent index usage.For example, let's say that given the table:CREATE TABLE test_index (value INTEGER);
CREATE INDEX idx_value ON test_index(value);And the predicate `value = 5000`, if I run the original query that excludes rows where the predicate is null it uses the index:SELECT *FROM test_indexWHERE value = 5000;But as soon as I tack on an `IS NOT FALSE` the index is not used:SELECT *FROM test_indexWHERE (value = 5000) IS NOT FALSE;This was surprising to me. I was hoping this might be able to use the index.Is there any way to include the rows where the predicate evaluates to null while still using an index?
On Wednesday, December 18, 2024, Adrian Garcia Badaracco <adrian@adriangb.com> wrote:
Is there any way to include the rows where the predicate evaluates to null while still using an index?
That seems quite unlikely. Your definition of equality is incompatible with the system’s standard definition while requiring that the standard equals operator be used in the query. Null values and non-null values are not considered equal, or, put another way, always considered distinct from each other. A btree index, which handles =, can’t be told to behave differently and so cannot fulfill your desire to produce rows where the stored value is null; it can only produce those equal to 5000.
David J.
"David G. Johnston" <david.g.johnston@gmail.com> writes: > On Wednesday, December 18, 2024, Adrian Garcia Badaracco < > adrian@adriangb.com> wrote: >> Is there any way to include the rows where the predicate evaluates to null >> while still using an index? > ... A btree index, which handles =, can’t be told to behave > differently and so cannot fulfill your desire to produce rows where the > stored value is null; it can only produce those equal to 5000. Not in a single scan, no. But multiple scans are possible: regression=# create table t (id int unique); CREATE TABLE regression=# explain select * from t where id = 5000 or id is null; QUERY PLAN ------------------------------------------------------------------------------ Bitmap Heap Scan on t (cost=8.42..18.98 rows=14 width=4) Recheck Cond: ((id IS NULL) OR (id = 5000)) -> BitmapOr (cost=8.42..8.42 rows=14 width=0) -> Bitmap Index Scan on t_id_key (cost=0.00..4.25 rows=13 width=0) Index Cond: (id IS NULL) -> Bitmap Index Scan on t_id_key (cost=0.00..4.16 rows=1 width=0) Index Cond: (id = 5000) (7 rows) The OP was quite unclear about what semantics he wants for multiple-variable WHERE clauses, but maybe something like this would work: WHERE (original-clause) OR x IS NULL OR y IS NULL OR ... where each variable mentioned in original-clause is allowed to also be NULL. Or perhaps what is wanted is WHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...) ?? regards, tom lane
Thank you for the great idea Tom. While yes I can't modify the original WHERE clause I do think I'll be able to introspect it or get the system generating it to tell me which columns it references and then add an OR x is NULL OR y is NULL ...
For context, just in case it's interesting, I store Parquet statistics in a Postgres table and run the output of this thing on them: https://github.com/apache/datafusion/blob/f92442ea8e8944c78f8e40d6648d049ff8e335ec/datafusion/physical-optimizer/src/pruning.rs#L146-L456
Hence why I can't really control the WHERE clause (at least not without re-implementing a bunch of finicky error prone code).
On Wed, Dec 18, 2024 at 10:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
> adrian@adriangb.com> wrote:
>> Is there any way to include the rows where the predicate evaluates to null
>> while still using an index?
> ... A btree index, which handles =, can’t be told to behave
> differently and so cannot fulfill your desire to produce rows where the
> stored value is null; it can only produce those equal to 5000.
Not in a single scan, no. But multiple scans are possible:
regression=# create table t (id int unique);
CREATE TABLE
regression=# explain select * from t where id = 5000 or id is null;
QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=8.42..18.98 rows=14 width=4)
Recheck Cond: ((id IS NULL) OR (id = 5000))
-> BitmapOr (cost=8.42..8.42 rows=14 width=0)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.25 rows=13 width=0)
Index Cond: (id IS NULL)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.16 rows=1 width=0)
Index Cond: (id = 5000)
(7 rows)
The OP was quite unclear about what semantics he wants for
multiple-variable WHERE clauses, but maybe something like this
would work:
WHERE (original-clause) OR x IS NULL OR y IS NULL OR ...
where each variable mentioned in original-clause is allowed
to also be NULL. Or perhaps what is wanted is
WHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...)
??
regards, tom lane
Well, there is a wrinkle: if the predicate returns `false` but one of the columns is null then the whole thing ends up `true` when I'd want it to be `false`. Say col_a = [1] and col_b = [null]:
WHERE (col_a < 1 AND col_b > 1) OR col_a IS NULL OR col_b IS NULL -> WHERE (false AND null) OR false OR true -> WHERE false OR false OR true -> true.
That's still a pretty good solution for now.
On Wed, Dec 18, 2024 at 10:41 PM Adrian Garcia Badaracco <adrian@adriangb.com> wrote:
Thank you for the great idea Tom. While yes I can't modify the original WHERE clause I do think I'll be able to introspect it or get the system generating it to tell me which columns it references and then add an OR x is NULL OR y is NULL ...For context, just in case it's interesting, I store Parquet statistics in a Postgres table and run the output of this thing on them: https://github.com/apache/datafusion/blob/f92442ea8e8944c78f8e40d6648d049ff8e335ec/datafusion/physical-optimizer/src/pruning.rs#L146-L456Hence why I can't really control the WHERE clause (at least not without re-implementing a bunch of finicky error prone code).On Wed, Dec 18, 2024 at 10:38 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Wednesday, December 18, 2024, Adrian Garcia Badaracco <
> adrian@adriangb.com> wrote:
>> Is there any way to include the rows where the predicate evaluates to null
>> while still using an index?
> ... A btree index, which handles =, can’t be told to behave
> differently and so cannot fulfill your desire to produce rows where the
> stored value is null; it can only produce those equal to 5000.
Not in a single scan, no. But multiple scans are possible:
regression=# create table t (id int unique);
CREATE TABLE
regression=# explain select * from t where id = 5000 or id is null;
QUERY PLAN
------------------------------------------------------------------------------
Bitmap Heap Scan on t (cost=8.42..18.98 rows=14 width=4)
Recheck Cond: ((id IS NULL) OR (id = 5000))
-> BitmapOr (cost=8.42..8.42 rows=14 width=0)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.25 rows=13 width=0)
Index Cond: (id IS NULL)
-> Bitmap Index Scan on t_id_key (cost=0.00..4.16 rows=1 width=0)
Index Cond: (id = 5000)
(7 rows)
The OP was quite unclear about what semantics he wants for
multiple-variable WHERE clauses, but maybe something like this
would work:
WHERE (original-clause) OR x IS NULL OR y IS NULL OR ...
where each variable mentioned in original-clause is allowed
to also be NULL. Or perhaps what is wanted is
WHERE (original-clause) OR (x IS NULL AND y IS NULL AND ...)
??
regards, tom lane