Thread: wildcard text filter switched to boolean column, performance is way worse

wildcard text filter switched to boolean column, performance is way worse

From: Mike Broers
Date:
I had a query that was filtering with a wildcard search of a text field for %SUCCESS%. The query took about 5 seconds and was running often, so I wanted to improve it.  I suggested that the engineers include a new boolean column for successful status.  They implemented the requested field, but the query that filters on that new column runs for a very long time (I kill it after letting it run for about an hour).  Can someone help me understand why that is the case and how to resolve it?

First query: 
SELECT *
FROM "lead" 
WHERE ( NOT ( ( "lead"."id" IN
                 ( SELECT U1."lead_id" AS "lead_id"
                  FROM "event" U1
                  WHERE U1."event_type" = 'type_1' )
               OR ( "lead"."id" IN
                     ( SELECT U1."lead_id" AS "lead_id"
                      FROM "event" U1
                      WHERE U1."event_type" = 'type_2' )
                   AND "lead"."id" IN
                     ( SELECT U1."lead_id" AS "lead_id"
                      FROM "event" U1
                      WHERE UPPER(U1."response"::text) LIKE UPPER('%success%') ) ) ) )
       AND NOT ("lead"."ReferenceNumber" = '') ) ;

explain/analyze result:

Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369) (actual time=4699.619..4699.869 rows=1 loops=1)                   
    Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3))))  
    Rows Removed by Filter: 375369                                                                                                             
    SubPlan 1                                                                                                                                  
      ->  Seq Scan on event u1  (cost=0.00..42408.62 rows=7748 width=4) (actual time=0.005..171.350 rows=7414 loops=1)                  
            Filter: ((event_type)::text = 'type_1'::text)                                                                                  
            Rows Removed by Filter: 1099436                                                                                                    
    SubPlan 2                                                                                                                                  
      ->  Seq Scan on event u1_1  (cost=0.00..42408.62 rows=375665 width=4) (actual time=0.006..219.092 rows=373298 loops=1)            
            Filter: ((event_type)::text = 'type_2'::text)                                                                                  
            Rows Removed by Filter: 733552                                                                                                     
    SubPlan 3                                                                                                                                  
      ->  Seq Scan on event u1_2  (cost=0.00..45175.75 rows=111 width=4) (actual time=0.040..3389.550 rows=712952 loops=1)              
            Filter: (upper(response) ~~ '%SUCCESS%'::text)                                                                                     
            Rows Removed by Filter: 393898   

The main thing that sticks out to me in this plan is the low row estimate for the %SUCCESS% filter (the planner expects 111 rows, but the scan actually returns 712952).
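
(For reference, this is roughly how to peek at what the planner knows about that column. It's just a sketch; with a leading wildcard the LIKE estimate is largely a guess no matter what the statistics say.)

-- sampled column statistics the planner uses for selectivity estimates
SELECT null_frac, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'event' AND attname = 'response';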

Here is the second query with explain:

SELECT *
FROM "lead" 
WHERE
(
 NOT 
 (
  ("lead"."id" IN 
   (
    SELECT U1."lead_id" AS "lead_id" 
    FROM "event" U1 
    WHERE U1."event_type" ='type_1'
   ) 
   OR
   ("lead"."id" IN 
    (
     SELECT U1."lead_id" AS "lead_id" 
     FROM "event" U1 
     WHERE U1."event_type" = 'type_2'
     ) 
   AND "lead"."id" IN 
    (
     SELECT U1."lead_id" AS "lead_id" 
     FROM "event" U1 
     WHERE successful
    )
   )
  )
 ) 
 AND NOT  ("lead"."ReferenceNumber" = '')
) ;

explain result:
  Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957 width=369)                                                         
    Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3))))  
    SubPlan 1                                                                                                                           
      ->  Seq Scan on event u1  (cost=0.00..42408.62 rows=7748 width=4)                                                          
            Filter: ((event_type)::text = 'type_1'::text)                                                                           
    SubPlan 2                                                                                                                           
      ->  Seq Scan on event u1_1  (cost=0.00..42408.62 rows=375665 width=4)                                                      
            Filter: ((event_type)::text = 'type_2'::text)                                                                           
    SubPlan 3                                                                                                                           
      ->  Materialize  (cost=0.00..46154.43 rows=731185 width=4)                                                                        
            ->  Seq Scan on event u1_2  (cost=0.00..39641.50 rows=731185 width=4)                                                
                  Filter: successful             

Here it does a Materialize and estimates the rows properly, but as stated this query just hangs and pegs the load.  There are no locks and it's in an active state the whole time.  I am running these queries in a test environment on a recently exported full schema from production, after a reindex and a vacuum/analyze.  This is Postgres 9.3.6 on RHEL 6.
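
(On 9.3 that kind of check looks something like this, as a sketch:)

-- is the backend actually working, or blocked waiting?
SELECT pid, state, waiting, query_start, left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle';

-- any ungranted lock requests?
SELECT pid, mode, granted, relation::regclass
FROM pg_locks
WHERE NOT granted;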

When I run just the changed subquery element on its own:

SELECT U1."lead_id" AS "lead_id"
FROM "event" U1
WHERE successful;

it returns in about 250ms; with the text field %SUCCESS% filter it runs in about 4 seconds.  This seemed like a low-hanging-fruit query improvement, so I'm surprised it's not working.  It seems like we were just lucky that the planner was estimating that filter incorrectly in the original form.

I'm sure the query just needs to be completely overhauled, and I'm starting to pull it apart and work with the engineers to get something more efficient set up overall, but I'm not sure how to answer the question of why this first attempt at improving the query was not successful.

Any guidance is greatly appreciated, thanks!  

Mike Broers <mbroers@gmail.com> writes:
> I had a query that was filtering with a wildcard search of a text field for
> %SUCCESS%. The query took about 5 seconds and was running often so I wanted
> to improve it.  I suggested that the engineers include a new boolean column
> for successful status.  They implemented the requested field, but the query
> that filters on that new column runs very long (i kill it after letting it
> run for about an hour).  Can someone help me understand why that is the
> case and how to resolve it?

It's hashing the subplan output in the first case and not the second:

> Seq Scan on lead  (cost=130951.81..158059.21 rows=139957 width=369) (actual
> time=4699.619..4699.869 rows=1 loops=1)
>     Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <>
> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (hashed SubPlan 3))))
                                                   ^^^^^^^^^^^^^^^^
vs

>   Seq Scan on lead  (cost=85775.78..9005687281.12 rows=139957 width=369)
>     Filter: ((NOT (hashed SubPlan 1)) AND (("ReferenceNumber")::text <>
> ''::text) AND ((NOT (hashed SubPlan 2)) OR (NOT (SubPlan 3))))
                                                   ^^^^^^^^^

Presumably, the new more-accurate rows count causes the planner to realize
that the hash table will exceed work_mem so it doesn't choose to hash ...
but for your situation, you'd rather it did, because what you're getting
instead is a Materialize node that spills to disk (again, because the data
involved exceeds work_mem) and that's a killer for this query.  You should
be able to get back the old behavior if you raise work_mem enough.
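
(A session-level setting is enough to test that theory; the value below is only a placeholder, not a recommendation:)

SET work_mem = '64MB';   -- affects only this session, no config edit needed
-- re-run EXPLAIN ANALYZE on the boolean-column query here
RESET work_mem;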

Another idea you might think about is changing the OR'd IN conditions
to a single IN over a UNION ALL of the subselects.  I'm not really sure if
that would produce a better plan, but it's worth trying if it wouldn't
require too much app-side contortion.
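
(A rough sketch of that idea against the tables above, assuming event.lead_id is never NULL; note that the AND'd pair of IN conditions still needs an INTERSECT or a join, so it is not a pure UNION ALL:)

SELECT *
FROM "lead"
WHERE "lead"."id" NOT IN (
        SELECT U1."lead_id" FROM "event" U1 WHERE U1."event_type" = 'type_1'
        UNION ALL
        ( SELECT U1."lead_id" FROM "event" U1 WHERE U1."event_type" = 'type_2'
          INTERSECT
          SELECT U1."lead_id" FROM "event" U1 WHERE U1.successful )
      )
  AND "lead"."ReferenceNumber" <> '';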

            regards, tom lane


Thanks, very informative! I'll experiment with work_mem settings and report back.


After bumping up work_mem from 12MB to 25MB, that last subplan is indeed hashed now, and this cut the query time by about 60%.  Thanks, this was very helpful and gives me something else to look for when troubleshooting EXPLAIN output.
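
(If raising it globally is a concern, the bump can also be scoped to just this statement, e.g.:)

BEGIN;
SET LOCAL work_mem = '25MB';   -- reverts automatically at COMMIT/ROLLBACK
-- run the reporting query here
COMMIT;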



Hello,
you might try to use a CTE to first collect the IDs to exclude, and join them to your main table.
This should result in an anti join pattern.

Something like:

WITH IDS as (
    SELECT U1."lead_id" AS "lead_id" 
    FROM "event" U1 
    WHERE U1."event_type" ='type_1'
    UNION (
       SELECT U1."lead_id" AS "lead_id" 
       FROM "event" U1 
       WHERE U1."event_type" = 'type_2'
       INTERSECT
       SELECT U1."lead_id" AS "lead_id" 
       FROM "event" U1 
       WHERE successful
    )
)
SELECT * FROM lead LEFT OUTER JOIN IDS ON (lead.id=IDS.lead_id)
WHERE IDS.lead_id IS NULL;
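
(One detail to carry over if anyone adapts this: the original query also excluded empty reference numbers, so the outer query would still need that predicate, e.g.:)

SELECT lead.*
FROM lead LEFT OUTER JOIN IDS ON (lead.id = IDS.lead_id)
WHERE IDS.lead_id IS NULL
  AND lead."ReferenceNumber" <> '';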

regards,

Marc Mamin