Re: [HACKERS] Improve OR conditions on joined columns (common starschema problem) - Mailing list pgsql-hackers

From Jim Nasby
Subject Re: [HACKERS] Improve OR conditions on joined columns (common starschema problem)
Date
Msg-id dfa004d8-6b58-d006-bf90-f2d854838e2c@BlueTreble.com
Whole thread Raw
In response to Re: [HACKERS] Improve OR conditions on joined columns.  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: [HACKERS] Improve OR conditions on joined columns (common starschema problem)
Re: [HACKERS] Improve OR conditions on joined columns (common star schema problem)
List pgsql-hackers
On 2/8/17 5:54 PM, Tom Lane wrote:
> Although ... actually, that may not be the bottleneck for what you're
> after.  The issue here is not so much discovering a clever plan for
> a union as realizing that the query could be cast as a union in the
> first place.

Right; their current workaround is to manually write a UNION. That's 
*significantly* faster than the OR.

> Maybe it'd be better to imagine this as something closer to planagg.c,
> that is it knows how to apply a specific high-level optimization that
> might or might not be a win, so it builds a path describing that and sees
> if it looks cheaper than a path done the normal way.  The fact that
> we could even build a path describing a union is something that wasn't
> there before 9.6, but maybe there's enough infrastructure for that now.

It's encouraging that there's at least a template to follow there... If 
there is missing infrastructure, is it likely to be major? My uninformed 
guess would be that the pathification was the major hurdle.

>> There's another transform using arrays that's possible as well (see
>> attached example); I believe that would work regardless of uniqueness.
>
> That doesn't look terribly promising for automated application.
> And I think it's really dependent on the exact shape of the OR
> clause, which is an unpleasant limitation.  Considering the amount

Not sure what you mean by shape; do you mean whether the OR conditions 
are rels vs Consts or Vars?

> of work this will take to do at all, you'd want it to be pretty
> general I think.  I'm imagining something that would look for an
> OR in which each clause referenced only one rel, then if it can
> identify a way to re-unique-ify the result, split into a UNION
> with an arm for each rel used in the OR.  The nature of the
> conditions in each OR arm don't seem to matter ... though probably
> you'd have to insist on there not being volatile conditions anywhere
> in the query, because they'd get evaluated too many times.

In my experience, the common use case here is a large fact table that 
joins to a number of dimension tables. If you can filter the large fact 
table down to a fairly small size, those joins don't matter much. But if 
a big chunk of your filter comes from the joins, you're in trouble.

UNION isn't really a great way to solve this problem because you still 
end up doing a full join just to run the filter, but really all that you 
need are the values from the dimension table(s) that you need for the 
filter. IOW, change

WHERE t1 IN ('a','b') OR t2 IN ('c','d')

into

WHERE f1 IN (1,2) OR f2 IN (3,4)

(assuming a,b,c,d maps to 1,2,3,4)

BTW, there's an important caveat here: users generally do NOT want 
duplicate rows from the fact table if the dimension table results aren't 
unique. I thought my array solution was equivalent to what the JOINs 
would do in that case but that's actually wrong. The array solution does 
provide the behavior users generally want here though. JOIN is the 
easiest tool to pick up for this, so it's what people gravitate to, but 
I suspect most users would be happier with a construct that worked like 
the array trick does, but was easier to accomplish.

I wonder if any other databases have come up with non-standard syntax to 
do this.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
855-TREBLE2 (855-873-2532)



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal
Next
From: Stephen Frost
Date:
Subject: Re: [HACKERS] [PATCH] Rename pg_switch_xlog to pg_switch_wal