Thread: Seeking help with a query that takes too long
[I originally posted this using the wrong e-mail account, so a double posting may occur if the first message gets released by the moderator later- sorry!]

Hi-

I have a query that I'm trying to speed up. I haven't been able to come up with any workable ideas for speeding it up, so I'm seeking some input from the list.

I'm using version 7.3.2.

I have three tables:

  case_data              (1,947,386 rows)
  actor                  (3,385,669 rows)
  actor_case_assignment  (8,668,650 rows)

As the names imply, actor_case_assignment contains records that assign an actor to a case. Actors such as attorneys or judges may have many cases, while the average actor (we hope) only has one.

What I'm trying to do is link these tables to get back a single row per actor that shows the actor's name, the number of cases that actor is assigned to, and if they only have one case, I want the public_id for that case. This means I have to do a group by to get the case count, but I'm then forced to use an aggregate function like max on the other fields.

All of the fields ending in "_id" have unique indexes, and actor_full_name_uppercase is indexed. An analyze is done every night & the database is fairly stable in its composition.

Here's the select:

  select
    actor.actor_id,
    max(actor.actor_full_name),
    max(case_data.case_public_id),
    max(case_data.case_id),
    count(case_data.case_id) as case_count
  from
    actor,
    actor_case_assignment,
    case_data
  where
    actor.actor_full_name_uppercase like upper('sanders%')
    and actor.actor_id = actor_case_assignment.actor_id
    and case_data.case_id = actor_case_assignment.case_id
  group by
    actor.actor_id
  order by
    max(actor.actor_full_name),
    case_count desc
  limit 1000;

Here's the explain analyze:

                                    QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=2214.71..2214.72 rows=1 width=115) (actual time=120034.61..120035.67 rows=1000 loops=1)
   ->  Sort  (cost=2214.71..2214.72 rows=1 width=115) (actual time=120034.60..120034.98 rows=1001 loops=1)
         Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
         ->  Aggregate  (cost=2214.67..2214.70 rows=1 width=115) (actual time=119962.80..120011.49 rows=3456 loops=1)
               ->  Group  (cost=2214.67..2214.68 rows=2 width=115) (actual time=119962.76..119987.04 rows=5879 loops=1)
                     ->  Sort  (cost=2214.67..2214.68 rows=2 width=115) (actual time=119962.74..119965.09 rows=5879 loops=1)
                           Sort Key: actor.actor_id
                           ->  Nested Loop  (cost=0.00..2214.66 rows=2 width=115) (actual time=59.05..119929.71 rows=5879 loops=1)
                                 ->  Nested Loop  (cost=0.00..2205.26 rows=3 width=76) (actual time=51.46..66089.04 rows=5882 loops=1)
                                       ->  Index Scan using actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42) (actual time=37.62..677.44 rows=3501 loops=1)
                                             Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                                             Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
                                       ->  Index Scan using actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2165.93 rows=2666 width=34) (actual time=16.37..18.67 rows=2 loops=3501)
                                             Index Cond: ("outer".actor_id = actor_case_assignment.actor_id)
                                 ->  Index Scan using case_data_case_id on case_data  (cost=0.00..3.66 rows=1 width=39) (actual time=9.14..9.15 rows=1 loops=5882)
                                       Index Cond: (case_data.case_id = "outer".case_id)
 Total runtime: 120038.60 msec
(17 rows)

Any ideas?

Thanks!
-Nick

---------------------------------------------------------------------
Nick Fankhauser

nickf@doxpop.com  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/
On Wed, 12 Nov 2003 08:34:50 -0500, "Nick Fankhauser" <nickf@ontko.com> wrote:
>  ->  Index Scan using actor_full_name_uppercase on actor
>      (cost=0.00..6.01 rows=1 width=42)
                        ^^^^^^
>      (actual time=37.62..677.44 rows=3501 loops=1)
                                  ^^^^^^^^^
>        Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
>        Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)

Nick, can you find out why this row count estimation is so far off?

        \x
        SELECT * FROM pg_stats
         WHERE tablename='actor' AND attname='actor_full_name_uppercase';

BTW, there seem to be missing cases:

>  ->  Nested Loop  (cost=0.00..2214.66 rows=2 width=115)
>      (actual time=59.05..119929.71 rows=5879 loops=1)
                                          ^^^^
>    ->  Nested Loop  (cost=0.00..2205.26 rows=3 width=76)
>        (actual time=51.46..66089.04 rows=5882 loops=1)
                                           ^^^^

Servus
 Manfred
> (actual time=37.62..677.44 rows=3501 loops=1)
>                            ^^^^^^^^^
>
> Nick, can you find out why this row count estimation is so far off?

It's actually correct:

prod1=# select count(actor_id) from actor
prod1-#  where actor_full_name_uppercase like 'SANDERS%';
 count
-------
  3501
(1 row)

Of course, I merely chose "SANDERS" arbitrarily as a name that falls somewhere near the middle of the frequency range for names. SMITH or JONES would represent a worst case, and something like KOIZAR would probably be unique.

Here are the stats:

prod1=# SELECT * FROM pg_stats
prod1-#  WHERE tablename='actor' AND attname='actor_full_name_uppercase';
-[ RECORD 1 ]-----+-----------------------------------------------------------
schemaname        | public
tablename         | actor
attname           | actor_full_name_uppercase
null_frac         | 0.000333333
avg_width         | 21
n_distinct        | 24215
most_common_vals  | {"STATE OF INDIANA","INDIANA DEPARTMENT OF REVENUE","BARTH CONS SCHOOL CORP","HOWARD COUNTY CLERK","ADVANCED RECOVERY SERVICES","STATE OF INDIANA-DEPT OF REVENUE","ALLIED COLLECTION SERVICE INC","CREDIT BUREAU OF LAPORTE","MIDWEST COLLECTION SVC INC","NCO FINANCIAL SYSTEMS INC"}
most_common_freqs | {0.0153333,0.0143333,0.00433333,0.00433333,0.004,0.00366667,0.00333333,0.00333333,0.00266667,0.00266667}
histogram_bounds  | {"(POE) ESTELLE, DENISE","BRIEN, LIISI","COTTRELL, CAROL","FAMILY RENTALS","HAYNES, TAMIKA","KESSLER, VICTORIA","MEFFORD, VERNON L","PHILLIPS, GERALD L","SHELTON, ANTOINETTE","TRICARICO, MELISSA SUE","ZUEHLKE, THOMAS L"}
correlation       | -0.00147395

I think this means that the average is about 357 case assignments per actor name (8,668,650 assignments / 24,215 distinct values). As you can see, the range of assignments varies from people with a single parking ticket to "State of Indiana", which is party to many thousands of cases.

> BTW, there seem to be missing cases:
> >  ->  Nested Loop  (cost=0.00..2214.66 rows=2 width=115)
> >      (actual time=59.05..119929.71 rows=5879 loops=1)
>                                           ^^^^
> >    ->  Nested Loop  (cost=0.00..2205.26 rows=3 width=76)
> >        (actual time=51.46..66089.04 rows=5882 loops=1)

This is expected- we actually aggregate data from many county court databases, with varying levels of data "cleanliness".

Regards,
-Nick
> > actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42)
>                                                        ^^^^^^
> > (actual time=37.62..677.44 rows=3501 loops=1)
>                              ^^^^^^^^^
>
> Nick, can you find out why this row count estimation is so far off?
                                            ^^^^^^^^^

Oops- I read this backward- I see what you mean now. That's a good question. I'm not sure what part of the stats this estimate might be pulled from. The average is 357, but the most common frequency may be around 1.

-Nick
"Nick Fankhauser" <nickf@ontko.com> writes: >> Nick, can you find out why this row count estimation is so far off? > It's actually correct: Sure, the 3501 was the "actual". The estimate was 1 row, which was pretty far off :-( > Here are the stats: It looks like you are running with the default statistics target (10). Try boosting it to 100 or even more for this column (see ALTER TABLE SET STATISTICS, then re-ANALYZE) and see if the estimate gets better. I think the major problem is likely here: > n_distinct | 24215 which is no doubt much too small (do you have an idea of the number of distinct actor_full_name_uppercase values?) regards, tom lane
> It looks like you are running with the default statistics target (10).
> Try boosting it to 100 or even more for this column (see ALTER TABLE
> SET STATISTICS, then re-ANALYZE) and see if the estimate gets better.

Here are the results & a few more clues:

prod1=# alter table actor alter column actor_full_name_uppercase set statistics 1000;
ALTER TABLE
prod1=# analyze actor;
ANALYZE

prod1=# select count(distinct actor_full_name_uppercase) from actor;
  count
---------
 1453371
(1 row)

prod1=# select count(actor_id) from actor;
  count
---------
 3386359
(1 row)

This indicates to me that 1 isn't too shabby as an estimate if the whole name is specified, but I'm not sure how this gets altered in the case of a "LIKE".

prod1=# \x
Expanded display is on.
prod1=# SELECT * FROM pg_stats
prod1-#  WHERE tablename='actor' AND attname='actor_full_name_uppercase';

<Header boilerplate snipped out>
schemaname  | public
tablename   | actor
attname     | actor_full_name_uppercase
null_frac   | 0.000586667
avg_width   | 21
n_distinct  | -0.14701
<Long list of values and frequencies snipped out>
correlation | -0.00211291

Question: What does it mean when n_distinct is negative?

New results of explain analyze:

                                    QUERY PLAN
--------------------------------------------------------------------------------
 Limit  (cost=252683.61..252683.68 rows=28 width=116) (actual time=169377.32..169378.39 rows=1000 loops=1)
   ->  Sort  (cost=252683.61..252683.68 rows=29 width=116) (actual time=169377.31..169377.69 rows=1001 loops=1)
         Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
         ->  Aggregate  (cost=252678.57..252682.91 rows=29 width=116) (actual time=169305.79..169354.50 rows=3456 loops=1)
               ->  Group  (cost=252678.57..252680.01 rows=289 width=116) (actual time=169305.76..169330.00 rows=5879 loops=1)
                     ->  Sort  (cost=252678.57..252679.29 rows=289 width=116) (actual time=169305.75..169308.15 rows=5879 loops=1)
                           Sort Key: actor.actor_id
                           ->  Nested Loop  (cost=0.00..252666.74 rows=289 width=116) (actual time=89.27..169273.51 rows=5879 loops=1)
                                 ->  Nested Loop  (cost=0.00..251608.11 rows=289 width=77) (actual time=57.73..92753.49 rows=5882 loops=1)
                                       ->  Index Scan using actor_full_name_uppercase on actor  (cost=0.00..456.88 rows=113 width=42) (actual time=32.80..3197.28 rows=3501 loops=1)
                                             Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                                             Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
                                       ->  Index Scan using actor_case_assignment_actor_id on actor_case_assignment  (cost=0.00..2181.29 rows=2616 width=35) (actual time=22.26..25.57 rows=2 loops=3501)
                                             Index Cond: ("outer".actor_id = actor_case_assignment.actor_id)
                                 ->  Index Scan using case_data_case_id on case_data  (cost=0.00..3.65 rows=1 width=39) (actual time=13.00..13.00 rows=1 loops=5882)
                                       Index Cond: (case_data.case_id = "outer".case_id)
 Total runtime: 169381.38 msec
(17 rows)
"Nick Fankhauser" <nickf@ontko.com> writes: > This indicates to me that 1 isn't too shabby as an estimate if the whole > name is specified, but I'm not sure how this gets altered in the case of a > "LIKE" For a pattern like "SANDERS%", the estimate is basically a range estimate for this condition: > ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND > (actor_full_name_uppercase < 'SANDERT'::character varying)) > n_distinct | -0.14701 > Question: What does it mean when n_distinct is negative? It means that the number of distinct values is estimated as a fraction of the table size, rather than an absolute number. In this case 14.7% of the table size, which is a bit off compared to the correct value of 43% (1453371/3386359), but at least it's of the right order of magnitude now ... > -> Index Scan using > actor_full_name_uppercase on actor (cost=0.00..456.88 rows=113 width=42) > (actual time=32.80..3197.28 rows=3501 loops=1) Hmm. Better, but not enough better to force a different plan choice. You might have to resort to brute force, like "set enable_nestloop=false". Just out of curiosity, what do you get if you do that? regards, tom lane
> You might have to resort to brute force, like "set enable_nestloop=false".
> Just out of curiosity, what do you get if you do that?

I get a different plan, but similar execution time:

 Limit  (cost=323437.13..323437.13 rows=1 width=115) (actual time=170921.89..170922.95 rows=1000 loops=1)
   ->  Sort  (cost=323437.13..323437.13 rows=1 width=115) (actual time=170921.89..170922.26 rows=1001 loops=1)
         Sort Key: max((actor.actor_full_name)::text), count(case_data.case_id)
         ->  Aggregate  (cost=323437.08..323437.12 rows=1 width=115) (actual time=170849.94..170898.06 rows=3457 loops=1)
               ->  Group  (cost=323437.08..323437.09 rows=3 width=115) (actual time=170849.90..170873.60 rows=5880 loops=1)
                     ->  Sort  (cost=323437.08..323437.08 rows=3 width=115) (actual time=170847.97..170850.21 rows=5880 loops=1)
                           Sort Key: actor.actor_id
                           ->  Hash Join  (cost=253333.29..323437.06 rows=3 width=115) (actual time=122873.80..170814.27 rows=5880 loops=1)
                                 Hash Cond: ("outer".case_id = "inner".case_id)
                                 ->  Seq Scan on case_data  (cost=0.00..60368.16 rows=1947116 width=39) (actual time=12.95..43542.25 rows=1947377 loops=1)
                                 ->  Hash  (cost=253333.28..253333.28 rows=3 width=76) (actual time=122844.40..122844.40 rows=0 loops=1)
                                       ->  Hash Join  (cost=6.02..253333.28 rows=3 width=76) (actual time=24992.70..122810.32 rows=5883 loops=1)
                                             Hash Cond: ("outer".actor_id = "inner".actor_id)
                                             ->  Seq Scan on actor_case_assignment  (cost=0.00..209980.49 rows=8669349 width=34) (actual time=9.13..85504.05 rows=8670467 loops=1)
                                             ->  Hash  (cost=6.01..6.01 rows=1 width=42) (actual time=24926.56..24926.56 rows=0 loops=1)
                                                   ->  Index Scan using actor_full_name_uppercase on actor  (cost=0.00..6.01 rows=1 width=42) (actual time=51.67..24900.53 rows=3502 loops=1)
                                                         Index Cond: ((actor_full_name_uppercase >= 'SANDERS'::character varying) AND (actor_full_name_uppercase < 'SANDERT'::character varying))
                                                         Filter: (actor_full_name_uppercase ~~ 'SANDERS%'::text)
 Total runtime: 170925.93 msec
(19 rows)

-Nick
On Wed, 12 Nov 2003 13:27:53 -0500, "Nick Fankhauser" <nickf@ontko.com> wrote:
>> You might have to resort to brute force, like "set enable_nestloop=false".

>  ->  Seq Scan on actor_case_assignment
>      (cost=0.00..209980.49 rows=8669349 width=34)
>      (actual time=9.13..85504.05 rows=8670467 loops=1)

Does actor_case_assignment contain more columns than just the two ids? If yes, do these additional fields account for ca. 70 bytes per tuple?  If not, try

        VACUUM FULL ANALYSE actor_case_assignment;

>  ->  Index Scan using actor_full_name_uppercase on actor
>      (cost=0.00..6.01 rows=1 width=42)
>      (actual time=51.67..24900.53 rows=3502 loops=1)

This same index scan on actor has been much faster in your previous postings (677 ms, 3200 ms), probably due to caching effects.  7 ms per tuple returned looks like a lot of disk seeks are involved.  Is clustering actor on actor_full_name_uppercase an option, or would this slow down other queries?

Servus
 Manfred
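P.S.: One quick way to estimate the actual on-disk size per tuple (assuming the default 8 kB block size) is to look at pg_class:

        SELECT relname, relpages, reltuples,
               relpages * 8192.0 / reltuples AS bytes_per_tuple
          FROM pg_class
         WHERE relname = 'actor_case_assignment';

If bytes_per_tuple comes out much larger than the columns should need, the table is probably bloated and VACUUM FULL should shrink it considerably.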
> Does actor_case_assignment contain more columns than just the two ids?
> If yes, do these additional fields account for ca. 70 bytes per tuple?
> If not, try
>        VACUUM FULL ANALYSE actor_case_assignment;

actor_case_assignment has its own primary key and a "role" field in addition to the ids you've seen, so 70 bytes sounds reasonable. (The PK is to allow a remote mirroring application to update these records- otherwise it would be unnecessary.)

> 7 ms per tuple returned looks like a lot of disk seeks are involved.
> Is clustering actor on actor_full_name_uppercase an option or would
> this slow down other queries?

Good question... I've never used clustering in PostgreSQL before, so I'm unsure. I presume this is like clustering in Oracle, where the table is ordered to match the index? If so, I think you may be onto something, because the only other field we regularly query on is actor_id. actor_id has a unique index with no clustering currently, so I don't think I'd lose a thing by clustering on actor_full_name_uppercase.

I'll give this a try & let you know how it changes.

BTW, you are correct that caching has a big effect on the actual time figures in this case- I'm working on my development DB, so cached info doesn't get trampled as quickly by other users. Is there a way to flush out the cache in a testing situation like this in order to start from a consistent base?

Thanks!
-Nick
On Fri, 14 Nov 2003 11:00:38 -0500, "Nick Fankhauser" <nickf@ontko.com> wrote:
> Good question... I've never used clustering in PostgreSQL before, so I'm
> unsure. I presume this is like clustering in Oracle, where the table is
> ordered to match the index?

Yes, something like that, with the exception that Postgres loses the clustered status while you INSERT and UPDATE tuples.  So you have to re-CLUSTER from time to time.  Look at pg_stats.correlation to see if it's necessary.

> Is there a way to flush out the cache in a testing situation like this
> in order to start from a consistent base?

To flush Postgres shared buffers:

        SELECT count(*) FROM another_large_table;

To flush your database pages from the OS cache:

        tar cf /dev/null /some/large/directory

And run each of your tests at least twice to get a feeling for how caching affects your specific queries.

Servus
 Manfred
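P.S.: In 7.3 the CLUSTER command takes the index name first.  Assuming the index shown in your plans (actor_full_name_uppercase) is the one you want to cluster on, it would look roughly like this:

        CLUSTER actor_full_name_uppercase ON actor;
        ANALYZE actor;

Note that CLUSTER rewrites the whole table, so it needs roughly the table's size in free disk space and holds an exclusive lock on the table while it runs.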