Thread: Suboptimal query plan when using expensive BCRYPT functions
Hi there, I've got a relatively simple query that contains expensive BCRYPT functions that gets optimized in a way that causes postgres to compute more bcrypt hashes than necessary, thereby dramatically slowing things down. In a certain part of our application we need to lookup users by their username, email address and password. Now we don't store plaintext passwords and so the query needs to compute bcrypt hashes on the fly: SELECT DISTINCT u.* FROM auth_user u JOIN bb_userprofile p ON p.user_id = u.id JOIN bb_identity i ON i.profile_id = p.id WHERE ( ( u.username ILIKE 'detkin' OR i.email ILIKE 'foo@example.com' ) AND ( SUBSTRING(password FROM 8) = CRYPT( 'detkin', SUBSTRING(password FROM 8)) ) ) These queries are generated by a parser that translates from an external query language to SQL run on the database. This test db contains 12 user records. With a single bcrypt hash taking ~300ms to compute, this is a recipe for disaster and so the app only allows queries that require only a very small number of bcrypt computation. E.g. the user must always "AND" the password lookup with a clause like " username = 'foo' AND password = 'bar'" (username is unique). However, while the query above technically only needs to compute 1 hash (there is a user 'detkin' and email 'foo@example.com' does not exist), it instead creates a query plan that computes hashes *before* filtering on username and email, leading to 12 hash computations and a very slow query. The EXPLAIN (ANALYZE, BUFFERS) is here: http://explain.depesz.com/s/yhE The schemas for the 3 tables involved are here: http://pgsql.privatepaste.com/f72020ad0a As a quick experiment I tried moving the joins and email lookup into a nested IN query, but that still generates a plan that computes hashes for all 12 users, before picking out the 1 whose username matches. Is there any way I can get postgres to perform the hash calculations on the *result* of the other parts of the where clause, instead of the other way around? Or else rewrite the query? Cheers, Erik
On Fri, Mar 21, 2014 at 5:59 PM, Erik van Zijst <erik.van.zijst@gmail.com> wrote:
Hi there,
I've got a relatively simple query that contains expensive BCRYPT
functions that gets optimized in a way that causes postgres to compute
more bcrypt hashes than necessary, thereby dramatically slowing things
down.
In a certain part of our application we need to lookup users by their
username, email address and password. Now we don't store plaintext
passwords and so the query needs to compute bcrypt hashes on the fly:
SELECT DISTINCT u.*
FROM auth_user u
JOIN bb_userprofile p ON p.user_id = u.id
JOIN bb_identity i ON i.profile_id = p.id
WHERE
(
(
u.username ILIKE 'detkin'
OR
i.email ILIKE 'foo@example.com'
)
AND
(
SUBSTRING(password FROM 8) = CRYPT(
'detkin', SUBSTRING(password FROM 8))
)
)
These queries are generated by a parser that translates from an
external query language to SQL run on the database. This test db
contains 12 user records.
With a single bcrypt hash taking ~300ms to compute, this is a recipe
for disaster and so the app only allows queries that require only a
very small number of bcrypt computation.
E.g. the user must always "AND" the password lookup with a clause like
" username = 'foo' AND password = 'bar'" (username is unique).
However, while the query above technically only needs to compute 1
hash (there is a user 'detkin' and email 'foo@example.com' does not
exist), it instead creates a query plan that computes hashes *before*
filtering on username and email, leading to 12 hash computations and a
very slow query.
The EXPLAIN (ANALYZE, BUFFERS) is here: http://explain.depesz.com/s/yhE
The schemas for the 3 tables involved are here:
http://pgsql.privatepaste.com/f72020ad0a
As a quick experiment I tried moving the joins and email lookup into a
nested IN query, but that still generates a plan that computes hashes
for all 12 users, before picking out the 1 whose username matches.
Is there any way I can get postgres to perform the hash calculations
on the *result* of the other parts of the where clause, instead of the
other way around? Or else rewrite the query?
Cheers,
Erik
(untested), but how about something like the following:
WITH au AS (
SELECT DISTINCT u.*
FROM auth_user u
JOIN bb_userprofile p ON p.user_id = u.id
JOIN bb_identity i ON i.profile_id = p.id
WHERE u.username ILIKE 'detkin'
OR i.email ILIKE 'foo@example.com')
SELECT au.*
FROM au
WHERE SUBSTRING(au.password FROM 8) = CRYPT('detkin', SUBSTRING(au.password FROM 8));
WITH au AS (
SELECT DISTINCT u.*
FROM auth_user u
JOIN bb_userprofile p ON p.user_id = u.id
JOIN bb_identity i ON i.profile_id = p.id
WHERE u.username ILIKE 'detkin'
OR i.email ILIKE 'foo@example.com')
SELECT au.*
FROM au
WHERE SUBSTRING(au.password FROM 8) = CRYPT('detkin', SUBSTRING(au.password FROM 8));
Yes, that works (it does at least on my small test database). However, these queries are generated by a parser that translates complex parse trees from a higher level DSL that doesn't lend itself well to logically isolating the crypt checks from the remaining conditions, as password checks might be present at arbitrary depths. For example: ( active eq true AND ( password eq "foo" OR password eq "bar" ) ) AND ( username eq "erik" OR email contains "bar" ) Currently the SQL generator translates each AST node into individual predicates that straightforwardly concatenate into a single SQL WHERE clause. For this to work, the individual nodes should compose well. I don't immediately see how the above query could be automatically translated into SQL when taking the WITH-AS approach. I could nonetheless take a stab at it, but life would certainly be easier if I could translate each component independently and leave optimization to the query planner. Cheers, Erik On Sat, Mar 22, 2014 at 1:51 PM, bricklen <bricklen@gmail.com> wrote: > > On Fri, Mar 21, 2014 at 5:59 PM, Erik van Zijst <erik.van.zijst@gmail.com> > wrote: >> >> Hi there, >> >> I've got a relatively simple query that contains expensive BCRYPT >> functions that gets optimized in a way that causes postgres to compute >> more bcrypt hashes than necessary, thereby dramatically slowing things >> down. >> >> In a certain part of our application we need to lookup users by their >> username, email address and password. Now we don't store plaintext >> passwords and so the query needs to compute bcrypt hashes on the fly: >> >> SELECT DISTINCT u.* >> FROM auth_user u >> JOIN bb_userprofile p ON p.user_id = u.id >> JOIN bb_identity i ON i.profile_id = p.id >> WHERE >> ( >> ( >> u.username ILIKE 'detkin' >> OR >> i.email ILIKE 'foo@example.com' >> ) >> AND >> ( >> SUBSTRING(password FROM 8) = CRYPT( >> 'detkin', SUBSTRING(password FROM 8)) >> ) >> ) >> >> These queries are generated by a parser that translates from an >> external query language to SQL run on the database. This test db >> contains 12 user records. >> >> With a single bcrypt hash taking ~300ms to compute, this is a recipe >> for disaster and so the app only allows queries that require only a >> very small number of bcrypt computation. >> >> E.g. the user must always "AND" the password lookup with a clause like >> " username = 'foo' AND password = 'bar'" (username is unique). >> >> However, while the query above technically only needs to compute 1 >> hash (there is a user 'detkin' and email 'foo@example.com' does not >> exist), it instead creates a query plan that computes hashes *before* >> filtering on username and email, leading to 12 hash computations and a >> very slow query. >> >> The EXPLAIN (ANALYZE, BUFFERS) is here: http://explain.depesz.com/s/yhE >> >> The schemas for the 3 tables involved are here: >> http://pgsql.privatepaste.com/f72020ad0a >> >> As a quick experiment I tried moving the joins and email lookup into a >> nested IN query, but that still generates a plan that computes hashes >> for all 12 users, before picking out the 1 whose username matches. >> >> Is there any way I can get postgres to perform the hash calculations >> on the *result* of the other parts of the where clause, instead of the >> other way around? Or else rewrite the query? >> >> Cheers, >> Erik > > > (untested), but how about something like the following: > > WITH au AS ( > > SELECT DISTINCT u.* > FROM auth_user u > JOIN bb_userprofile p ON p.user_id = u.id > JOIN bb_identity i ON i.profile_id = p.id > WHERE u.username ILIKE 'detkin' > > OR i.email ILIKE 'foo@example.com') > > SELECT au.* > FROM au > WHERE SUBSTRING(au.password FROM 8) = CRYPT('detkin', SUBSTRING(au.password > FROM 8)); >
On Sat, Mar 22, 2014 at 3:27 PM, Erik van Zijst <erik.van.zijst@gmail.com> wrote:
Yes, that works (it does at least on my small test database).
However, these queries are generated by a parser that translates
complex parse trees from a higher level DSL that doesn't lend itself
well to logically isolating the crypt checks from the remaining
conditions, as password checks might be present at arbitrary depths.
For example:
(
active eq true
AND
(
password eq "foo"
OR
password eq "bar"
)
)
AND
(
username eq "erik"
OR
email contains "bar"
)
Currently the SQL generator translates each AST node into individual
predicates that straightforwardly concatenate into a single SQL WHERE
clause. For this to work, the individual nodes should compose well. I
don't immediately see how the above query could be automatically
translated into SQL when taking the WITH-AS approach.
I could nonetheless take a stab at it, but life would certainly be
easier if I could translate each component independently and leave
optimization to the query planner.
How about encapsulating the revised query inside a db function? That simplifies the query for your query generator to something like "select x,y,z from your_func(p_user,p_email,p_crypt)"
On Sat, Mar 22, 2014 at 3:56 PM, bricklen <bricklen@gmail.com> wrote: > On Sat, Mar 22, 2014 at 3:27 PM, Erik van Zijst <erik.van.zijst@gmail.com> >> I could nonetheless take a stab at it, but life would certainly be >> easier if I could translate each component independently and leave >> optimization to the query planner. > > How about encapsulating the revised query inside a db function? That > simplifies the query for your query generator to something like "select > x,y,z from your_func(p_user,p_email,p_crypt)" I'm not really sure I understand how a db function would make things easier. What would the implementation for your_func() be and what would the SQL look like for the DSL example which contains multiple password checks? Cheers, Erik
On Sat, Mar 22, 2014 at 8:37 PM, Erik van Zijst <erik.van.zijst@gmail.com> wrote:
On Sat, Mar 22, 2014 at 3:56 PM, bricklen <bricklen@gmail.com> wrote:
> On Sat, Mar 22, 2014 at 3:27 PM, Erik van Zijst <erik.van.zijst@gmail.com>>> I could nonetheless take a stab at it, but life would certainly beI'm not really sure I understand how a db function would make things
>> easier if I could translate each component independently and leave
>> optimization to the query planner.
>
> How about encapsulating the revised query inside a db function? That
> simplifies the query for your query generator to something like "select
> x,y,z from your_func(p_user,p_email,p_crypt)"
easier. What would the implementation for your_func() be and what
would the SQL look like for the DSL example which contains multiple
password checks?
I just reread your previous post about the checks being at potentially arbitrary depths. In that case, the function may or may not help. Without a representative database to test with I can't say one way or the other. Perhaps someone else will have some other ideas of what could be useful here.
bricklen <bricklen@gmail.com> writes: > Perhaps someone else will have some other ideas of what could be useful > here. Maybe I'm missing something ... but isn't the OP's query completely bogus? SELECT DISTINCT u.* FROM auth_user u JOIN bb_userprofile p ON p.user_id = u.id JOIN bb_identity i ON i.profile_id = p.id WHERE ( ( u.username ILIKE 'detkin' OR i.email ILIKE 'foo(at)example(dot)com' ) AND ( SUBSTRING(password FROM 8) = CRYPT( 'detkin', SUBSTRING(password FROM 8)) ) ) Granting that there are not chance collisions of password hashes (which would surely be a bad thing if there were), success of the second AND arm means that we are on user detkin's row of auth_user. Therefore the OR business is entirely nonfunctional: if the password test passes, then the u.username ILIKE 'detkin' clause succeeds a fortiori, while if the password test fails, it hardly matters what i.email is, because the WHERE clause as a whole fails. Ergo, the whole WHERE clause might as well just be written "u.username = 'detkin'". If it were a RIGHT JOIN rather than just a JOIN, this argument would fail ... but as written, the query makes little sense. I'll pass gently over the question of whether the password test as shown could ever succeed at all. I suppose we've been shown a lobotomized version of the real logic, but it's hard to give advice in such situations. regards, tom lane
On Sat, Mar 22, 2014 at 11:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Maybe I'm missing something ... but isn't the OP's query completely bogus? > > SELECT DISTINCT u.* > FROM auth_user u > JOIN bb_userprofile p ON p.user_id = u.id > JOIN bb_identity i ON i.profile_id = p.id > WHERE > ( > ( > u.username ILIKE 'detkin' > OR > i.email ILIKE 'foo(at)example(dot)com' > ) > AND > ( > SUBSTRING(password FROM 8) = CRYPT( > 'detkin', SUBSTRING(password FROM 8)) > ) > ) > > Granting that there are not chance collisions of password hashes (which > would surely be a bad thing if there were), Would it? Any hashing system is inherently open to collision (although you're more likely to find 2 identical snowflakes), but how does that affect our situation? It means you simply would have found another password for that user that is just as valid. The system will accept it. > success of the second AND arm > means that we are on user detkin's row of auth_user. My password could be 'detkin' too, but my username is 'erik'. > Therefore the OR > business is entirely nonfunctional: if the password test passes, then > the u.username ILIKE 'detkin' clause succeeds a fortiori, while if the > password test fails, it hardly matters what i.email is, because the WHERE > clause as a whole fails. My email could be 'foo@example.com', my username 'erik' and my password 'detkin'. Users are identified through their unique username or email address. Passwords are not unique. > I suppose we've been shown a lobotomized version of the real logic, > but it's hard to give advice in such situations. This is an actual query taken from the system. Cheers, Erik
On 03/22/2014 02:59 AM, Erik van Zijst wrote: > Is there any way I can get postgres to perform the hash calculations > on the *result* of the other parts of the where clause, instead of the > other way around? Or else rewrite the query? The planner doesn't know that the crypt function is expensive. That can be fixed with "ALTER FUNCTION crypt(text, text) COST <high value>". Even with that, I'm not sure if the planner is smart enough to optimize the query the way you'd want, but it's worth a try. - Heikki
On Mon, Mar 24, 2014 at 12:08 AM, Heikki Linnakangas <hlinnakangas@vmware.com> wrote: > On 03/22/2014 02:59 AM, Erik van Zijst wrote: >> >> Is there any way I can get postgres to perform the hash calculations >> on the *result* of the other parts of the where clause, instead of the >> other way around? Or else rewrite the query? > > > The planner doesn't know that the crypt function is expensive. That can be > fixed with "ALTER FUNCTION crypt(text, text) COST <high value>". Even with > that, I'm not sure if the planner is smart enough to optimize the query the > way you'd want, but it's worth a try. Thanks. That's the kind of hint I was looking for. I just gave it a go, but unfortunately it's not enough to get the optimizer to change the plan, regardless of the cost value. Cheers, Erik