Thread: Problem with index in OR'd expression
Hello, I've been using PostgreSQL for a few years and mostly love it. Aside from a few (perceived, anyway) annoying limitations in PL/PGSQL (which I almost exclusively am using for db interaction), I'm very satisfied with it. I ran across this problem several months back and decided to blow it off back then, but now I'm really wanting to understand why it's happening, if there's a work around, if I'm doing something wrong, and if it's just a shortcoming / bug if the developers are aware of it and might implement a fix some day. I've already done a fair amount of testing (a total of hours, I'm sure) and Googling around on it. I also read this FAQ on indexes: http://www.postgresql.org/docs/faqs.FAQ.html#item4.6 I think my issue is not explained by any of the resources I reviewed. To simplify my actual problem, I conducted a very simple test. First I defined a table: create table t ( c1 int primary key, c2 int ) with the single index (primary key). Then I filled the table with 100,000 rows using a quick pl/pgsql function (c1 = 1, 2, 3... 100,000). Then I ran these two tests (with EXPLAIN in pgAdmin): select * from t where c1 = 75000; select * from t where ((0 is null) OR (c1 = 75000)); The first one properly uses the index on c1, the second does not. Obviously, a human looking at the second one would realize it's essentially identical to the first and properly evaluate the (0 is null) part once (since it's immutable) and then ignore it for the rest of the searching. Now, I'm sure some of you might ask "why the hell are you doing that in the first place?" I have a good reason. I write a lot of pl/pgsql functions that are search functions with a list of *optional* parameters. I don't know ahead of time whether a user will include on or not. In MSSQL, what I'm able to do (with no obvious index problems that I've seen) is add those all to the WHERE clause like this: ((vC1 IS NULL) OR (C1 = vC1)) AND ((vC2 IS NULL) OR (C2 = vC2)) ... (here vC1 and vC2 represent variables passed into the pl/pgsql function). So my question were basically asked at the beginning of this post: is there another way to get the optimizer to understand what I'm trying to do here? Is this a known problem? Is it working as preferred and unlikely to change any time soon? Is there some setting I can hit somewhere to make it work like I want? The only solution that I've come up with so far is making all of my statements that otherwise wouldn't have to be dynamic and then only including criteria for ones that I really need (based on the tested nullness of the variables), but I find that highly annoying and have run into other problems as a result as well. I'd appreciate any suggestions you might have to help resolve this. Thank, John Lawler
postgresql.org@tgice.com writes: > select * from t where c1 = 75000; > select * from t where ((0 is null) OR (c1 = 75000)); > The first one properly uses the index on c1, the second does not. > Obviously, a human looking at the second one would realize it's > essentially identical to the first and properly evaluate the (0 is null) > part once (since it's immutable) and then ignore it for the rest of the > searching. Well, you could update --- 8.2 contains code to recognize that the IS NULL expression is constant, but prior releases do not. However, if you're hoping to do this: > ((vC1 IS NULL) OR (C1 = vC1)) AND > ((vC2 IS NULL) OR (C2 = vC2)) ... you're still gonna lose because those are variables not constants ... regards, tom lane
Tom Lane wrote: > Well, you could update --- 8.2 contains code to recognize that the IS > NULL expression is constant, but prior releases do not. That's excellent to hear -- I'd missed that in my perusing of the changelogs between 8.0.x and 8.2. That does give me one more reason to upgrade. It appears I did not mention what version I was running -- sorry for that, though you guessed it was < 8.2. It's actually 8.0.x. > However, if you're hoping to do this: > >> ((vC1 IS NULL) OR (C1 = vC1)) AND >> ((vC2 IS NULL) OR (C2 = vC2)) ... > > you're still gonna lose because those are variables not constants ... Well, that *is* what I'm hoping to do. I understand how (0 IS NULL) is different from (variable IS NULL), but isn't it reasonable to expect that PG could evaluate that expression only once (knowing that the variable couldn't change during the course of the query execution) and then treat that expression as constant? I appreciate that you're saying that it won't work even in 8.2, but what I'm getting at is would it be possible to add it in the future? As I mentioned, I'm pretty sure that that must be what MSSQL (6.5, 7, 2000 and 2005 [all of which I've had some experience with]) seem to be doing. Now failing all of this, does any one have a better idea for what I'm trying to do? A simple syntax for optionally including WHERE criteria depending on the null-ness of variables (w/o having to go to dynamic execution)? Thanks for your reply Tom. jl
postgresql.org@tgice.com wrote: > Tom Lane wrote: >> you're still gonna lose because those are variables not constants ... > > Well, that *is* what I'm hoping to do. I understand how (0 IS NULL) is > different from (variable IS NULL), but isn't it reasonable to expect > that PG could evaluate that expression only once (knowing that the > variable couldn't change during the course of the query execution) and > then treat that expression as constant? Also, I should mention that in my case, the "variables" are actually constants either defined in the function param list (and thus non-modifiable within the function body) or defined with: DECLARE vC1 CONSTANT varchar(10) := 'blah' -type syntax. I would submit that in that situation, it would be reasonable for a user to expect my suggested syntax to still use the indicated indexes. Do you agree? It it possible that something other than picking up the 'IS NULL' boolean as constant has changed between 8.0.x & 8.2 that might make this work? Thanks, jl
postgresql.org@tgice.com writes: > I would submit that in that situation, it would be > reasonable for a user to expect my suggested syntax to still use the > indicated indexes. The only thing that will make that work is if "indexed_col IS NULL" were an indexable condition, which it isn't because the PG index API only supports "indexed_col operator something" as an indexable condition (IS NULL is not an operator, and even if it were, there's no "something" on its righthand side). Fixing this has been on the radar screen for awhile, but it's not done, largely for lack of agreement about a reasonably clean way to change that API. regards, tom lane
Tom Lane wrote: > postgresql.org@tgice.com writes: >> I would submit that in that situation, it would be >> reasonable for a user to expect my suggested syntax to still use the >> indicated indexes. > > The only thing that will make that work is if "indexed_col IS NULL" were > an indexable condition, which it isn't because the PG index API only > supports "indexed_col operator something" as an indexable condition > (IS NULL is not an operator, and even if it were, there's no "something" > on its righthand side). Fixing this has been on the radar screen for > awhile, but it's not done, largely for lack of agreement about a > reasonably clean way to change that API. Sorry to keep this issue alive even longer, Tom, but I think I may've been unclear with my example. I was referring to the situation where one has this in a WHERE clause: ((vConstant IS NULL) OR (Table.IndexedCol = vConstant)) where vConstant is a *constant* parameter in a pl/pgsql function. In the latest versions (8.1 *or* 8.2), would you expect this to successfully use the index on Table.IndexedCol and not have PG be confused (into a sequential scan) by the (vConstant IS NULL) expression? As I indicated, I'm currently running 8.0.x, and am wondering whether it would be worth the effort to upgrade to 8.1 or 8.2 (Gentoo doesn't yet have PG at 8.2, and I'm a bit lazy with installing things outside of Portage) to solve this issue or whether I should just enable a workaround for now and keep an eye on future releases for a better solution to this problem. Thanks again, John
On mið, 2006-12-27 at 11:02 -0600, postgresql.org@tgice.com wrote: > > I was referring to the situation where one has this in a WHERE clause: > > ((vConstant IS NULL) OR (Table.IndexedCol = vConstant)) > > where vConstant is a *constant* parameter in a pl/pgsql function. Reguardless of the issue whether pl/pgsql could be expected to optimize this case, I find it difficult to imagine a scenario where this kind of coding makes sense. I understand that in some cases on would like to do this with a *variable* to simplify logic, but what possible gain can be had from doing this with a constant, other that obfuscation? It does not make sense to spend CPU-time on optimizing a case that is not useful or common. gnari
postgresql.org@tgice.com writes: > I was referring to the situation where one has this in a WHERE clause: > ((vConstant IS NULL) OR (Table.IndexedCol = vConstant)) > where vConstant is a *constant* parameter in a pl/pgsql function. My mistake, I was thinking of Table.IndexedCol IS NULL. > In the latest versions (8.1 *or* 8.2), would you expect this to > successfully use the index on Table.IndexedCol and not have PG be > confused (into a sequential scan) by the (vConstant IS NULL) expression? No, and there's no "confusion" about it: with that WHERE clause, the plan might have to return every row in the table. The index is useless. Since you know that the two OR'd conditions are mutually exclusive, perhaps you could transform the query into a UNION operation, as in this example: regression=# prepare foo(int) as select * from tenk1 where $1 is null union all select * from tenk1 where unique2 = $1; PREPARE regression=# explain analyze execute foo(42); QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------- Append (cost=0.00..561.02 rows=10001 width=244) (actual time=0.169..0.201 rows=1 loops=1) -> Result (cost=0.00..458.00 rows=10000 width=244) (actual time=0.012..0.012 rows=0 loops=1) One-Time Filter: ($1 IS NULL) -> Seq Scan on tenk1 (cost=0.00..458.00 rows=10000 width=244) (never executed) -> Index Scan using tenk1_unique2 on tenk1 (cost=0.00..3.01 rows=1 width=244) (actual time=0.142..0.166 rows=1 loops=1) Index Cond: (unique2 = $1) Total runtime: 1.092 ms (7 rows) Because of the one-time filter, the seqscan isn't executed unless needed. However I'm not sure that this sort of approach scales up if you have more than one of these conditions in a query ... regards, tom lane
Ragnar wrote: > Reguardless of the issue whether pl/pgsql could be expected to > optimize this case, I find it difficult to imagine a scenario > where this kind of coding makes sense. > > I understand that in some cases on would like to do this with > a *variable* to simplify logic, but what possible gain can > be had from doing this with a constant, other that obfuscation? Well, in one way it's a variable, but in another a constant. It's a variable in the context of general PG usage... e.g., my application code may call the function with whatever parameters a user chooses, leaving some parameters null and others not. Within the context of the function (after calling), these variables are constant and I'm attempting to use my OR syntax as shorthand to avoid having to use a dynamic statement *only* because of this situation. As I've mentioned, this approach seems to work with MSSQL 6.5+, which I assume we consider as a valid competitor to PG... if this didn't work anywhere else, I probably wouldn't even have brought it up. I'll re-iterate another question I attempted to pose which was: what have other PG application developers done in this situation? Is it most common to just use dynamic statements? Thanks for your response. jl
On mán, 2007-01-01 at 14:21 -0600, postgresql.org@tgice.com wrote: > Within the context of the function (after calling), these variables are > constant and I'm attempting to use my OR syntax as shorthand to avoid > having to use a dynamic statement *only* because of this situation. > As I've mentioned, this approach seems to work with MSSQL 6.5+, which I > assume we consider as a valid competitor to PG... if this didn't work > anywhere else, I probably wouldn't even have brought it up. > > I'll re-iterate another question I attempted to pose which was: what > have other PG application developers done in this situation? Is it most > common to just use dynamic statements? ps/pgsql will prepare the statement caching the plan for subsequent calls, thus making this not easily optimized. to avoid this in ps/pgsql , I believe it is common to to make the statement dynamic by using EXECUTE, thus in effect making sure the query is planned each time. gnari