Thread: Query performance on ILIKE with AND operator...

Query performance on ILIKE with AND operator...

From
"Federico Simonetti (Liveye)"
Date:
Hello all,

I'm encountering a quite strange performance problem. Look at the
following two queries and their execution times. The only difference is
the first query has OR operator and the second query has AND operator.
Any ideas?

Thank you in advance,
Federico


[FIRST QUERY: EXEC TIME 0.015 SECS]

    explain analyze SELECT * FROM ViewHttp
    WHERE txContentType ilike '%html%' OR vchost ilike '%www.%'
    ORDER BY iDStart DESC, iS_ID DESC, iF_ID DESC, iSubID DESC
    OFFSET 0 LIMIT 201

"Limit  (cost=12.75..3736.66 rows=201 width=1250) (actual
time=0.000..15.000 rows=201 loops=1)"
"  ->  Nested Loop  (cost=12.75..1996879.04 rows=107782 width=1250)
(actual time=0.000..15.000 rows=201 loops=1)"
"        ->  Nested Loop  (cost=12.75..1524883.03 rows=6334 width=1106)
(actual time=0.000..15.000 rows=201 loops=1)"
"              Join Filter: ("outer".isensorid = "inner".isensorid)"
"              ->  Index Scan Backward using idx_0009_ord4 on
detail0009  (cost=0.00..1489241.53 rows=6334 width=1005) (actual
time=0.000..15.000 rows=201 loops=1)"
"                    Filter: ((txcontenttype ~~* '%html%'::text) OR
((vchost)::text ~~* '%www.%'::text))"
"              ->  Materialize  (cost=12.75..15.25 rows=250 width=101)
(actual time=0.000..0.000 rows=1 loops=201)"
"                    ->  Seq Scan on sensors  (cost=0.00..12.50 rows=250
width=101) (actual time=0.000..0.000 rows=1 loops=1)"
"        ->  Index Scan using connections_pkey on connections
(cost=0.00..74.25 rows=18 width=168) (actual time=0.000..0.000 rows=1
loops=201)"
"              Index Cond: (("outer".is_id = connections.is_id) AND
("outer".if_id = connections.if_id))"
"Total runtime: 15.000 ms"




[SECOND QUERY: EXEC TIME 13.844 SECS]

    explain analyze SELECT * FROM ViewHttp
    WHERE txContentType ilike '%html%' AND vchost ilike '%www.%'
    ORDER BY iDStart DESC, iS_ID DESC, iF_ID DESC, iSubID DESC
    OFFSET 0 LIMIT 201

"Limit  (cost=22476.81..22477.31 rows=201 width=1250) (actual
time=13187.000..13187.000 rows=201 loops=1)"
"  ->  Sort  (cost=22476.81..22477.92 rows=443 width=1250) (actual
time=13187.000..13187.000 rows=201 loops=1)"
"        Sort Key: detail0009.idstart, detail0009.isensorid,
detail0009.iforensicid, detail0009.isubid"
"        ->  Hash Join  (cost=13.13..22457.34 rows=443 width=1250)
(actual time=469.000..10966.000 rows=53559 loops=1)"
"              Hash Cond: ("outer".isensorid = "inner".isensorid)"
"              ->  Nested Loop  (cost=0.00..22437.57 rows=443
width=1165) (actual time=469.000..10201.000 rows=53559 loops=1)"
"                    ->  Seq Scan on detail0009  (cost=0.00..20500.11
rows=26 width=1005) (actual time=453.000..5983.000 rows=53588 loops=1)"
"                          Filter: ((txcontenttype ~~* '%html%'::text)
AND ((vchost)::text ~~* '%www.%'::text))"
"                    ->  Index Scan using connections_pkey on
connections  (cost=0.00..74.25 rows=18 width=168) (actual
time=0.063..0.065 rows=1 loops=53588)"
"                          Index Cond: (("outer".isensorid =
connections.isensorid) AND ("outer".iforensicid = connections.iforensicid))"
"              ->  Hash  (cost=12.50..12.50 rows=250 width=101) (actual
time=0.000..0.000 rows=0 loops=1)"
"                    ->  Seq Scan on sensors  (cost=0.00..12.50 rows=250
width=101) (actual time=0.000..0.000 rows=1 loops=1)"
"Total runtime: 13844.000 ms"



Re: Query performance on ILIKE with AND operator...

From
Tom Lane
Date:
"Federico Simonetti (Liveye)" <federico@liveye.net> writes:
> I'm encountering a quite strange performance problem.

The problem stems from the horrid misestimation of the number of rows
fetched from detail0009:

> "                    ->  Seq Scan on detail0009  (cost=0.00..20500.11
> rows=26 width=1005) (actual time=453.000..5983.000 rows=53588 loops=1)"
> "                          Filter: ((txcontenttype ~~* '%html%'::text)
> AND ((vchost)::text ~~* '%www.%'::text))"

When the planner is off by a factor of two thousand about the number of
rows involved, it's not very likely to produce a good plan :-(

In the OR case the rowcount estimate is 6334, which is somewhat closer
to reality (only about a factor of 10 off, looks like), and that changes
the plan to something that works acceptably well.

Assuming that this is web-log data, the prevalence of www and html
together is hardly surprising, but PG's statistical mechanisms will
never realize it.  Not sure about a good workaround.  Does it make
sense to combine the two conditions into one?
    (vchost || txcontenttype) ilike '%www.%html%'

            regards, tom lane

Re: Query performance on ILIKE with AND operator...

From
"Federico Simonetti (Liveye)"
Date:
Sorry but this does not seem to improve performance, it takes even more time, have a look at these data:

explain analyze SELECT * FROM ViewHttp
WHERE (vchost || txcontenttype) ilike '%www.%html%'
ORDER BY iDStart DESC, iSensorID DESC, iForensicID DESC, iSubID DESC
OFFSET 0 LIMIT 201


"Limit  (cost=22740.77..22741.28 rows=201 width=1250) (actual time=14234.000..14234.000 rows=201 loops=1)"
"  ->  Sort  (cost=22740.77..22741.89 rows=447 width=1250) (actual time=14234.000..14234.000 rows=201 loops=1)"
"        Sort Key: detail0009.idstart, detail0009.isensorid, detail0009.iforensicid, detail0009.isubid"
"        ->  Hash Join  (cost=13.13..22721.10 rows=447 width=1250) (actual time=469.000..12140.000 rows=54035 loops=1)"
"              Hash Cond: ("outer".isensorid = "inner".isensorid)"
"              ->  Nested Loop  (cost=0.00..22701.27 rows=447 width=1165) (actual time=469.000..11428.000 rows=54035 loops=1)"
"                    ->  Seq Scan on detail0009  (cost=0.00..20763.77 rows=26 width=1005) (actual time=453.000..6345.000 rows=54064 loops=1)"
"                          Filter: (((vchost)::text || txcontenttype) ~~* '%www.%html%'::text)"
"                    ->  Index Scan using connections_pkey on connections  (cost=0.00..74.25 rows=18 width=168) (actual time=0.073..0.077 rows=1 loops=54064)"
"                          Index Cond: (("outer".isensorid = connections.isensorid) AND ("outer".iforensicid = connections.iforensicid))"
"              ->  Hash  (cost=12.50..12.50 rows=250 width=101) (actual time=0.000..0.000 rows=0 loops=1)"
"                    ->  Seq Scan on sensors  (cost=0.00..12.50 rows=250 width=101) (actual time=0.000..0.000 rows=1 loops=1)"
"Total runtime: 14234.000 ms"


Thanks for your help anyway...

Federico






Tom Lane ha scritto:
"Federico Simonetti (Liveye)" <federico@liveye.net> writes: 
I'm encountering a quite strange performance problem.   
The problem stems from the horrid misestimation of the number of rows
fetched from detail0009:
 
"                    ->  Seq Scan on detail0009  (cost=0.00..20500.11
rows=26 width=1005) (actual time=453.000..5983.000 rows=53588 loops=1)"
"                          Filter: ((txcontenttype ~~* '%html%'::text)
AND ((vchost)::text ~~* '%www.%'::text))"   
When the planner is off by a factor of two thousand about the number of
rows involved, it's not very likely to produce a good plan :-(

In the OR case the rowcount estimate is 6334, which is somewhat closer
to reality (only about a factor of 10 off, looks like), and that changes
the plan to something that works acceptably well.

Assuming that this is web-log data, the prevalence of www and html
together is hardly surprising, but PG's statistical mechanisms will
never realize it.  Not sure about a good workaround.  Does it make
sense to combine the two conditions into one?(vchost || txcontenttype) ilike '%www.%html%'
		regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend