Thread: Suboptimal query plan when using expensive BCRYPT functions

Suboptimal query plan when using expensive BCRYPT functions

From
Erik van Zijst
Date:
Hi there,

I've got a relatively simple query that contains expensive BCRYPT
functions that gets optimized in a way that causes postgres to compute
more bcrypt hashes than necessary, thereby dramatically slowing things
down.

In a certain part of our application we need to lookup users by their
username, email address and password. Now we don't store plaintext
passwords and so the query needs to compute bcrypt hashes on the fly:

    SELECT DISTINCT u.*
    FROM auth_user u
    JOIN bb_userprofile p ON p.user_id = u.id
    JOIN bb_identity i ON i.profile_id = p.id
    WHERE
    (
      (
        u.username ILIKE 'detkin'
        OR
        i.email ILIKE 'foo@example.com'
      )
      AND
      (
        SUBSTRING(password FROM 8) = CRYPT(
          'detkin', SUBSTRING(password FROM 8))
      )
    )

These queries are generated by a parser that translates from an
external query language to SQL run on the database. This test db
contains 12 user records.

With a single bcrypt hash taking ~300ms to compute, this is a recipe
for disaster and so the app only allows queries that require only a
very small number of bcrypt computation.

E.g. the user must always "AND" the password lookup with a clause like
" username = 'foo' AND password = 'bar'" (username is unique).

However, while the query above technically only needs to compute 1
hash (there is a user 'detkin' and email 'foo@example.com' does not
exist), it instead creates a query plan that computes hashes *before*
filtering on username and email, leading to 12 hash computations and a
very slow query.

The EXPLAIN (ANALYZE, BUFFERS) is here: http://explain.depesz.com/s/yhE

The schemas for the 3 tables involved are here:
http://pgsql.privatepaste.com/f72020ad0a

As a quick experiment I tried moving the joins and email lookup into a
nested IN query, but that still generates a plan that computes hashes
for all 12 users, before picking out the 1 whose username matches.

Is there any way I can get postgres to perform the hash calculations
on the *result* of the other parts of the where clause, instead of the
other way around? Or else rewrite the query?

Cheers,
Erik


Re: Suboptimal query plan when using expensive BCRYPT functions

From
bricklen
Date:

On Fri, Mar 21, 2014 at 5:59 PM, Erik van Zijst <erik.van.zijst@gmail.com> wrote:
Hi there,

I've got a relatively simple query that contains expensive BCRYPT
functions that gets optimized in a way that causes postgres to compute
more bcrypt hashes than necessary, thereby dramatically slowing things
down.

In a certain part of our application we need to lookup users by their
username, email address and password. Now we don't store plaintext
passwords and so the query needs to compute bcrypt hashes on the fly:

    SELECT DISTINCT u.*
    FROM auth_user u
    JOIN bb_userprofile p ON p.user_id = u.id
    JOIN bb_identity i ON i.profile_id = p.id
    WHERE
    (
      (
        u.username ILIKE 'detkin'
        OR
        i.email ILIKE 'foo@example.com'
      )
      AND
      (
        SUBSTRING(password FROM 8) = CRYPT(
          'detkin', SUBSTRING(password FROM 8))
      )
    )

These queries are generated by a parser that translates from an
external query language to SQL run on the database. This test db
contains 12 user records.

With a single bcrypt hash taking ~300ms to compute, this is a recipe
for disaster and so the app only allows queries that require only a
very small number of bcrypt computation.

E.g. the user must always "AND" the password lookup with a clause like
" username = 'foo' AND password = 'bar'" (username is unique).

However, while the query above technically only needs to compute 1
hash (there is a user 'detkin' and email 'foo@example.com' does not
exist), it instead creates a query plan that computes hashes *before*
filtering on username and email, leading to 12 hash computations and a
very slow query.

The EXPLAIN (ANALYZE, BUFFERS) is here: http://explain.depesz.com/s/yhE

The schemas for the 3 tables involved are here:
http://pgsql.privatepaste.com/f72020ad0a

As a quick experiment I tried moving the joins and email lookup into a
nested IN query, but that still generates a plan that computes hashes
for all 12 users, before picking out the 1 whose username matches.

Is there any way I can get postgres to perform the hash calculations
on the *result* of the other parts of the where clause, instead of the
other way around? Or else rewrite the query?

Cheers,
Erik

(untested), but how about something like the following:

WITH au AS (
    SELECT DISTINCT u.*
    FROM auth_user u
    JOIN bb_userprofile p ON p.user_id = u.id
    JOIN bb_identity i ON i.profile_id = p.id
    WHERE u.username ILIKE 'detkin'
    OR i.email ILIKE 'foo@example.com')

SELECT au.*
FROM au
WHERE SUBSTRING(au.password FROM 8) = CRYPT('detkin', SUBSTRING(au.password FROM 8));

Re: Suboptimal query plan when using expensive BCRYPT functions

From
Erik van Zijst
Date:
Yes, that works (it does at least on my small test database).

However, these queries are generated by a parser that translates
complex parse trees from a higher level DSL that doesn't lend itself
well to logically isolating the crypt checks from the remaining
conditions, as password checks might be present at arbitrary depths.

For example:

    (
      active eq true
      AND
      (
        password eq "foo"
        OR
        password eq "bar"
      )
    )
    AND
    (
      username eq "erik"
      OR
      email contains "bar"
    )

Currently the SQL generator translates each AST node into individual
predicates that straightforwardly concatenate into a single SQL WHERE
clause. For this to work, the individual nodes should compose well. I
don't immediately see how the above query could be automatically
translated into SQL when taking the WITH-AS approach.

I could nonetheless take a stab at it, but life would certainly be
easier if I could translate each component independently and leave
optimization to the query planner.

Cheers,
Erik


On Sat, Mar 22, 2014 at 1:51 PM, bricklen <bricklen@gmail.com> wrote:
>
> On Fri, Mar 21, 2014 at 5:59 PM, Erik van Zijst <erik.van.zijst@gmail.com>
> wrote:
>>
>> Hi there,
>>
>> I've got a relatively simple query that contains expensive BCRYPT
>> functions that gets optimized in a way that causes postgres to compute
>> more bcrypt hashes than necessary, thereby dramatically slowing things
>> down.
>>
>> In a certain part of our application we need to lookup users by their
>> username, email address and password. Now we don't store plaintext
>> passwords and so the query needs to compute bcrypt hashes on the fly:
>>
>>     SELECT DISTINCT u.*
>>     FROM auth_user u
>>     JOIN bb_userprofile p ON p.user_id = u.id
>>     JOIN bb_identity i ON i.profile_id = p.id
>>     WHERE
>>     (
>>       (
>>         u.username ILIKE 'detkin'
>>         OR
>>         i.email ILIKE 'foo@example.com'
>>       )
>>       AND
>>       (
>>         SUBSTRING(password FROM 8) = CRYPT(
>>           'detkin', SUBSTRING(password FROM 8))
>>       )
>>     )
>>
>> These queries are generated by a parser that translates from an
>> external query language to SQL run on the database. This test db
>> contains 12 user records.
>>
>> With a single bcrypt hash taking ~300ms to compute, this is a recipe
>> for disaster and so the app only allows queries that require only a
>> very small number of bcrypt computation.
>>
>> E.g. the user must always "AND" the password lookup with a clause like
>> " username = 'foo' AND password = 'bar'" (username is unique).
>>
>> However, while the query above technically only needs to compute 1
>> hash (there is a user 'detkin' and email 'foo@example.com' does not
>> exist), it instead creates a query plan that computes hashes *before*
>> filtering on username and email, leading to 12 hash computations and a
>> very slow query.
>>
>> The EXPLAIN (ANALYZE, BUFFERS) is here: http://explain.depesz.com/s/yhE
>>
>> The schemas for the 3 tables involved are here:
>> http://pgsql.privatepaste.com/f72020ad0a
>>
>> As a quick experiment I tried moving the joins and email lookup into a
>> nested IN query, but that still generates a plan that computes hashes
>> for all 12 users, before picking out the 1 whose username matches.
>>
>> Is there any way I can get postgres to perform the hash calculations
>> on the *result* of the other parts of the where clause, instead of the
>> other way around? Or else rewrite the query?
>>
>> Cheers,
>> Erik
>
>
> (untested), but how about something like the following:
>
> WITH au AS (
>
>     SELECT DISTINCT u.*
>     FROM auth_user u
>     JOIN bb_userprofile p ON p.user_id = u.id
>     JOIN bb_identity i ON i.profile_id = p.id
>     WHERE u.username ILIKE 'detkin'
>
>     OR i.email ILIKE 'foo@example.com')
>
> SELECT au.*
> FROM au
> WHERE SUBSTRING(au.password FROM 8) = CRYPT('detkin', SUBSTRING(au.password
> FROM 8));
>


Re: Suboptimal query plan when using expensive BCRYPT functions

From
bricklen
Date:

On Sat, Mar 22, 2014 at 3:27 PM, Erik van Zijst <erik.van.zijst@gmail.com> wrote:
Yes, that works (it does at least on my small test database).

However, these queries are generated by a parser that translates
complex parse trees from a higher level DSL that doesn't lend itself
well to logically isolating the crypt checks from the remaining
conditions, as password checks might be present at arbitrary depths.

For example:

    (
      active eq true
      AND
      (
        password eq "foo"
        OR
        password eq "bar"
      )
    )
    AND
    (
      username eq "erik"
      OR
      email contains "bar"
    )

Currently the SQL generator translates each AST node into individual
predicates that straightforwardly concatenate into a single SQL WHERE
clause. For this to work, the individual nodes should compose well. I
don't immediately see how the above query could be automatically
translated into SQL when taking the WITH-AS approach.

I could nonetheless take a stab at it, but life would certainly be
easier if I could translate each component independently and leave
optimization to the query planner.


How about encapsulating the revised query inside a db function? That simplifies the query for your query generator to something like "select x,y,z from your_func(p_user,p_email,p_crypt)"

Re: Suboptimal query plan when using expensive BCRYPT functions

From
Erik van Zijst
Date:
On Sat, Mar 22, 2014 at 3:56 PM, bricklen <bricklen@gmail.com> wrote:
> On Sat, Mar 22, 2014 at 3:27 PM, Erik van Zijst <erik.van.zijst@gmail.com>
>> I could nonetheless take a stab at it, but life would certainly be
>> easier if I could translate each component independently and leave
>> optimization to the query planner.
>
> How about encapsulating the revised query inside a db function? That
> simplifies the query for your query generator to something like "select
> x,y,z from your_func(p_user,p_email,p_crypt)"

I'm not really sure I understand how a db function would make things
easier. What would the implementation for your_func() be and what
would the SQL look like for the DSL example which contains multiple
password checks?

Cheers,
Erik


Re: Suboptimal query plan when using expensive BCRYPT functions

From
bricklen
Date:

On Sat, Mar 22, 2014 at 8:37 PM, Erik van Zijst <erik.van.zijst@gmail.com> wrote:
On Sat, Mar 22, 2014 at 3:56 PM, bricklen <bricklen@gmail.com> wrote:
> On Sat, Mar 22, 2014 at 3:27 PM, Erik van Zijst <erik.van.zijst@gmail.com>
>> I could nonetheless take a stab at it, but life would certainly be
>> easier if I could translate each component independently and leave
>> optimization to the query planner.
>
> How about encapsulating the revised query inside a db function? That
> simplifies the query for your query generator to something like "select
> x,y,z from your_func(p_user,p_email,p_crypt)"

I'm not really sure I understand how a db function would make things
easier. What would the implementation for your_func() be and what
would the SQL look like for the DSL example which contains multiple
password checks?

I just reread your previous post about the checks being at potentially arbitrary depths. In that case, the function may or may not help. Without a representative database to test with I can't say one way or the other. Perhaps someone else will have some other ideas of what could be useful here.

 

Re: Suboptimal query plan when using expensive BCRYPT functions

From
Tom Lane
Date:
bricklen <bricklen@gmail.com> writes:
> Perhaps someone else will have some other ideas of what could be useful
> here.

Maybe I'm missing something ... but isn't the OP's query completely bogus?

    SELECT DISTINCT u.*
    FROM auth_user u
    JOIN bb_userprofile p ON p.user_id = u.id
    JOIN bb_identity i ON i.profile_id = p.id
    WHERE
    (
      (
        u.username ILIKE 'detkin'
        OR
        i.email ILIKE 'foo(at)example(dot)com'
      )
      AND
      (
        SUBSTRING(password FROM 8) = CRYPT(
          'detkin', SUBSTRING(password FROM 8))
      )
    )

Granting that there are not chance collisions of password hashes (which
would surely be a bad thing if there were), success of the second AND arm
means that we are on user detkin's row of auth_user.  Therefore the OR
business is entirely nonfunctional: if the password test passes, then
the u.username ILIKE 'detkin' clause succeeds a fortiori, while if the
password test fails, it hardly matters what i.email is, because the WHERE
clause as a whole fails.  Ergo, the whole WHERE clause might as well just
be written "u.username = 'detkin'".  If it were a RIGHT JOIN rather than
just a JOIN, this argument would fail ... but as written, the query
makes little sense.

I'll pass gently over the question of whether the password test as shown
could ever succeed at all.

I suppose we've been shown a lobotomized version of the real logic,
but it's hard to give advice in such situations.

            regards, tom lane


Re: Suboptimal query plan when using expensive BCRYPT functions

From
Erik van Zijst
Date:
On Sat, Mar 22, 2014 at 11:40 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe I'm missing something ... but isn't the OP's query completely bogus?
>
>     SELECT DISTINCT u.*
>     FROM auth_user u
>     JOIN bb_userprofile p ON p.user_id = u.id
>     JOIN bb_identity i ON i.profile_id = p.id
>     WHERE
>     (
>       (
>         u.username ILIKE 'detkin'
>         OR
>         i.email ILIKE 'foo(at)example(dot)com'
>       )
>       AND
>       (
>         SUBSTRING(password FROM 8) = CRYPT(
>           'detkin', SUBSTRING(password FROM 8))
>       )
>     )
>
> Granting that there are not chance collisions of password hashes (which
> would surely be a bad thing if there were),

Would it?

Any hashing system is inherently open to collision (although you're
more likely to find 2 identical snowflakes), but how does that affect
our situation? It means you simply would have found another password
for that user that is just as valid. The system will accept it.

> success of the second AND arm
> means that we are on user detkin's row of auth_user.

My password could be 'detkin' too, but my username is 'erik'.

> Therefore the OR
> business is entirely nonfunctional: if the password test passes, then
> the u.username ILIKE 'detkin' clause succeeds a fortiori, while if the
> password test fails, it hardly matters what i.email is, because the WHERE
> clause as a whole fails.

My email could be 'foo@example.com', my username 'erik' and my
password 'detkin'.

Users are identified through their unique username or email address.
Passwords are not unique.

> I suppose we've been shown a lobotomized version of the real logic,
> but it's hard to give advice in such situations.

This is an actual query taken from the system.

Cheers,
Erik


Re: Suboptimal query plan when using expensive BCRYPT functions

From
Heikki Linnakangas
Date:
On 03/22/2014 02:59 AM, Erik van Zijst wrote:
> Is there any way I can get postgres to perform the hash calculations
> on the *result* of the other parts of the where clause, instead of the
> other way around? Or else rewrite the query?

The planner doesn't know that the crypt function is expensive. That can
be fixed with "ALTER FUNCTION crypt(text, text) COST <high value>". Even
with that, I'm not sure if the planner is smart enough to optimize the
query the way you'd want, but it's worth a try.

- Heikki


Re: Suboptimal query plan when using expensive BCRYPT functions

From
Erik van Zijst
Date:
On Mon, Mar 24, 2014 at 12:08 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 03/22/2014 02:59 AM, Erik van Zijst wrote:
>>
>> Is there any way I can get postgres to perform the hash calculations
>> on the *result* of the other parts of the where clause, instead of the
>> other way around? Or else rewrite the query?
>
>
> The planner doesn't know that the crypt function is expensive. That can be
> fixed with "ALTER FUNCTION crypt(text, text) COST <high value>". Even with
> that, I'm not sure if the planner is smart enough to optimize the query the
> way you'd want, but it's worth a try.

Thanks. That's the kind of hint I was looking for.

I just gave it a go, but unfortunately it's not enough to get the
optimizer to change the plan, regardless of the cost value.

Cheers,
Erik