Thread: Speeding up LIKE with placeholders?
Is there any good way to speed up SQL that uses like and has placeholders? Here's the scoop. I've got a system that uses a lot of pre-generated SQL with placeholders in it. At runtime these SQL statements are fired off (through the C PQexecParams function, if that matters) for execution. No prepares or anything, just bare statements with $1 and friends, with the values passed in as parameters. Straightforward, and no big deal. Unfortunately, performance is horrible. And when I mean horrible, we're talking 6 orders of magnitude (101355.884 ms vs 0.267 ms) when checked out via EXPLAIN ANALYZE. The slow version has the SQL defined as a function with the parameters passed in, while the fast way has the parameters substituted in, and the query plan for the slow version notes that it's doing a sequential scan, while the fast version uses one of the indexes. (And the field being LIKEd has a b-tree index on it) The LIKE condition always has a constant prefix -- it's 'S%' or 'S42343%' -- so it fits the index. Now, I'd not be surprised for a generic function to do this, if the plan is created when the function is created, and I can deal with that. I'd figure, though, that since the parameters are being passed into PQexecParams basically to get them out of band so I don't have to deal with escaping, quoting, and suchlike things, that the optimizer would look at things *after* the substitution was done. Is there anything I can do to speed this up, short of doing the parameter substitution myself and skipping PQexecParams here? (Which I'd rather not, since it's a pain and somewhat error-prone (for me, at least)) -- Dan --------------------------------------it's like this------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
Dan Sugalski <dan@sidhe.org> writes: > I'd figure, though, that since the parameters are being passed > into PQexecParams basically to get them out of band so I don't have > to deal with escaping, quoting, and suchlike things, that the > optimizer would look at things *after* the substitution was done. You'd figure wrong :-(. The present mechanism for the LIKE-to-index optimization requires the LIKE pattern to be a pure, unadulterated constant. regards, tom lane
At 5:19 PM -0400 9/10/04, Tom Lane wrote: >Dan Sugalski <dan@sidhe.org> writes: >> I'd figure, though, that since the parameters are being passed >> into PQexecParams basically to get them out of band so I don't have >> to deal with escaping, quoting, and suchlike things, that the >> optimizer would look at things *after* the substitution was done. > >You'd figure wrong :-(. The present mechanism for the LIKE-to-index >optimization requires the LIKE pattern to be a pure, unadulterated constant. Well. Darn. Would I regret it if I asked where in the source this lies so I could go fix it? -- Dan --------------------------------------it's like this------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
Dan Sugalski <dan@sidhe.org> writes: > Would I regret it if I asked where in the source this lies so I could > go fix it? If it were easy to fix it would have been fixed before now ... I have toyed with the notion of converting "var LIKE pattern" to "var LIKE pattern AND var >= lowbound(pattern) AND var < highbound(pattern)" where lowbound() and highbound() are actual functions that we leave in the generated plan, rather than insisting that the planner derive these bounds before making the plan at all. Then the pattern wouldn't have to be a true constant. However, it falls down on this problem: what shall those functions do if the supplied pattern isn't left-anchored at all? highbound in particular doesn't have a valid result it can give that's guaranteed larger than all possible values of var. Not to mention that a full-table index scan is the very last thing you want --- I think the planner would really be abdicating its responsibilities to generate a plan with that kind of downside risk. You could possibly sidestep this argument by envisioning a query like var LIKE ('^' || $1) but I doubt that anyone actually writes such things. In the end, LIKE is the sort of thing that you really have to run a planning cycle for in order to get a reasonable plan. regards, tom lane
At 5:55 PM -0400 9/10/04, Tom Lane wrote: >Dan Sugalski <dan@sidhe.org> writes: >> Would I regret it if I asked where in the source this lies so I could >> go fix it? > >If it were easy to fix it would have been fixed before now ... Oh, I wasn't expecting it to be an *easy* fix... :) The question is whether it's less work to make the fix in Postgres or in my back-end library code. Worst case I fix it in my code and submit a doc patch, but I'm up for at least investigating the general fix. Since the only difference in this case is that the parameters are pulled out for transport rather than being in band (a properly-escaped string substitution could turn this case from a PQexecParams call into a PQexec call) I was thinking the thing to do would be to either teach the planner to look in the parameter list when it gets handed $xxx variables, or have the back-end do the substitution to the SQL before handing the code to the planner. I'm not sure I like either option (the pre-substitution's got some issues when handed large parameters, while teaching the planner to use a parameter list may require thumping a lot of back-end code) and it may amount to nothing, but I figured it was worth a look, if for no other reason than to find a big mass of code I can safely run screaming from. ;-) -- Dan --------------------------------------it's like this------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
Dan Sugalski <dan@sidhe.org> writes: > Since the only difference in this case is that the parameters are > pulled out for transport rather than being in band (a > properly-escaped string substitution could turn this case from a > PQexecParams call into a PQexec call) I was thinking the thing to do > would be to either teach the planner to look in the parameter list > when it gets handed $xxx variables, or have the back-end do the > substitution to the SQL before handing the code to the planner. This has already been considered and rejected. Oliver Jowett did the part that is safe, which is to use the parameter values for estimation purposes in other contexts, but pre-substituting a parameter value for LIKE calls the mere correctness of the plan into question. What it would take to make it workable is a change in the semantics of the v3 protocol messages, such that there is no re-use of a plan. That, no one is up for quite yet, when we just hacked the protocol last year ... regards, tom lane
At 6:33 PM -0400 9/10/04, Tom Lane wrote: >Dan Sugalski <dan@sidhe.org> writes: >> Since the only difference in this case is that the parameters are >> pulled out for transport rather than being in band (a >> properly-escaped string substitution could turn this case from a >> PQexecParams call into a PQexec call) I was thinking the thing to do >> would be to either teach the planner to look in the parameter list >> when it gets handed $xxx variables, or have the back-end do the >> substitution to the SQL before handing the code to the planner. > >This has already been considered and rejected. Oliver Jowett did the >part that is safe, which is to use the parameter values for estimation >purposes in other contexts, but pre-substituting a parameter value for >LIKE calls the mere correctness of the plan into question. Ouch. Okay, fair 'nuff. (I figured the parameters could be factored in before the plan was made. Wrongly, I see, now that I poke around in the code a bit :) Plan B for me it is. >What it would take to make it workable is a change in the semantics of >the v3 protocol messages, such that there is no re-use of a plan. That, >no one is up for quite yet, when we just hacked the protocol last year ... It might be possible with a backwards-compatible protocol change, but that's more work than I'm up for, and this is the wrong list for it anyway. -- Dan --------------------------------------it's like this------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
Tom Lane <tgl@sss.pgh.pa.us> writes: > I think the planner would really be abdicating its responsibilities to > generate a plan with that kind of downside risk. Sure, but what about the risk of using a sequential scan the other 99% of the time? The downside risk of the index scan is a 5x slowdown or so. The downside risk of the sequential scan is unbounded. > You could possibly sidestep this argument by envisioning a query like > var LIKE ('^' || $1) > but I doubt that anyone actually writes such things. In the end, LIKE > is the sort of thing that you really have to run a planning cycle for > in order to get a reasonable plan. Actually ^ doesn't mean anything to LIKE. There's no way to anchor a LIKE pattern except by ensuring it doesn't start with % or _. I don't know. I wrote code that did "LIKE ?||'%'" on Oracle tons of times and it always used an index scan. I was really impressed when I first checked whether that worked and really happy when it did. And it always ran just fine. In retrospect I would have done something like "LIKE escape(?)||'%'". Except there's no such function. And if I had to write it myself I would do it in the application. String manipulation in SQL always being such a pain. And in any case I would have to check for an empty argument and handle that with some friendly UI message, which can't be done with a simple function in the query. So the database would be none the wiser and I still would have been disappointed if it didn't use the index scan. In the end it's always possible to fool the planner into producing a bad plan. It's just got to pick the plan that's most likely to be the one the user intended and least dangerous. It's hard to picture someone intentionally doing ?||'%' without thinking it would use an index scan. If they didn't check for leading %s and _s or empty parameters then it was their oversight or they were expecting it to be slow. -- greg
On Sat, Sep 11, 2004 at 02:30:35AM -0400, Greg Stark wrote: > I don't know. I wrote code that did "LIKE ?||'%'" on Oracle tons of times and > it always used an index scan. I was really impressed when I first checked > whether that worked and really happy when it did. And it always ran just fine. Note, this would have worked fine in previous versions of postgresql, because there was no such thing as prepared statements so the plan needed to be recreated for each query, thus it could take advantage of all knowledge available in the query itself. Kinda ironic that a seemingly very nice feature (seperating query from arguments) has such a side-effect (less information for planner). -- Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/ > Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a > tool for doing 5% of the work and then sitting around waiting for someone > else to do the other 95% so you can sue them.
Attachment
If I understand correctly your problem is that the plan for your prepared query is bad because the LIKE parameter is not known at prepare time... Immediate solutions : Is it possible to use EXECUTE in your SQL to sidestep this and replan the query with the right values ? Can you put your query in a function so that it's planned on first execution with real parameter values ? There is still the old : SELECT ... WHERE field BETWEEN 'prefix' AND 'prefiy'; but it kinda sucks ;) --------------------------- Maybe this just indicates a lack of user-friendliness in the API ? Suggestions for improvement (in random order): Don't flame me for suggesting impossible things... I'm just wishing. * Maybe a new operator like <column STARTSWITH 'prefix'> could replace <column LIKE 'prefix%'> and tell the planner with absolute certainty that, whatever the value is, use of an index is possible ? It's possible to create user defined operators in Postgres... so... * In a future version, when planning prepared queries, maybe the planner could flag a query as "replan according to real parameter values every time it's executed" when it detects a hard to plan condition like "LIKE $1" or other hard to predict cases ? In this case the cost of planning would occur at each execution, but the query parsing time would still be saved. Or maybe the planner could write itself a note saying : "I planned the query for the generic LIKE, but if $1 does not start with a %, I should replan it when it's executed" * Maybe there should be a keyword like "REPLAN" to allow the user to specify that the query should be replanned every time ? * Maybe there should be a keyword like "DEFER PLAN" to allow the user to specify that the query should be not be planned when PREPARE is called, but rather on its first execution, with real parameters, and the plan then stored and used for subsequent queries ? This has the disadvantage of depending on the next query to specify adequate parameters. Maybe we could tell the planner which parameters to use for preparing the plan : PREPARE myquery( text, integer ) WITH PARAMETERS ('aa%', 5) AS SELECT * FROM mytable WHERE ... so the plan stored for this prepared query would be generated using the specified parameter values instead of just placeholders (so the planner may generate a better plan, in this case it would notice the prefix nature of the LIKE 'aa%'). * Maybe we could give the planner hints about the placeholders in a prepared query. Something like PREPARE myquery( text, integer ) AS SELECT * FROM mytable WHERE ... would become : PREPARE myquery( text, integer ) WHERE expression AS SELECT * FROM mytable WHERE ... where expression would be any boolean expression giving hints to the planner, like : $2 NOT NULL in this case the planner would know that it should take into account only the stats for $2 not null,10 ; and ignore the non-indexed NULLs for example. However this is redundant as the "$2 NOT NULL" can as well be put in the WHERE part of the query. This could be used in the 'LIKE $1' case by specifying that $1 cannot start with a '%'. The planner would have to be pretty smart to take this into account... When the expression is false, the query would be re-planned instead of using the stored plan. * Maybe we could add a clause to this poster's SELECT : PREPARE myquery AS SELECT * FROM mytable WHERE myfield LIKE $1 AND NOT ($1 LIKE '%%%'); In this case the planner would have to recognize the "NOT ($1 LIKE '%%%')" part and take appropriate measures... When executing the query this would not add any cost as it is a constant. However if the user messes up and sends $1='%something', he'll get no results. Anyway, have a nice day !
At 12:30 PM +0200 9/11/04, Pierre-Frédéric Caillaud wrote: > If I understand correctly your problem is >that the plan for your prepared query is bad >because the LIKE parameter is not known at >prepare time... Almost. My problem is that queries made with the PQexecParams C call are automatically split into a prepare and execute pair. So while as far as my code's concerned everything goes over to the server in one big wad, the same as if I'd sent it over with a PQexec call, the server sees it in several pieces. The only way to fix this is to change the way that PQexecParams is handled, which from what I can see of the postgres code would require a protocol change and corresponding extension of the server back end, or a change in the way prepared queries are executed by the server, neither of which is a particularly simple thing to do. (Which I didn't realize initially) -- Dan --------------------------------------it's like this------------------- Dan Sugalski even samurai dan@sidhe.org have teddy bears and even teddy bears get drunk
Martijn van Oosterhout <kleptog@svana.org> writes: > Kinda ironic that a seemingly very nice feature (seperating query from > arguments) has such a side-effect (less information for planner). It was an known and expected downside. But there is an upside too. For me it's actually a benefit, because it means I'm less likely to called in the middle of the night because the web site when someone entered some unusual parameters and caused a bad plan. Whatever plan it's using when I test it will be the same as the plan it uses for others, which may not be ideal but it won't be a surprise and I can anticipate whether it'll be a reasonable plan in the future. I can't anticipate all the potential plans the database might come up with for all the possible parameters. -- greg