FW: Slow query performance - Mailing list pgsql-general

From Dann Corbit
Subject FW: Slow query performance
Msg-id D425483C2C5C9F49B5B7A41F89441547010011E7@postal.corporate.connx.com
List pgsql-general

From: Kevin Galligan [mailto:kgalligan@gmail.com]
Sent: Wednesday, October 29, 2008 3:16 PM
To: Dann Corbit
Subject: Re: [GENERAL] Slow query performance

 

Columns are as follows:

account  | integer               |
city     | character varying(20) |
zip      | character(5)          |
dincome  | character(1)          |
sex      | character(1)          |
mob      | boolean               |
religion | character(1)          |
groupcd  | character(1)          |
lastdata | character varying(4)  |
countycd | character varying(3)  |
state    | character varying(2)  |
dutype   | character varying(1)  |
orders   | integer               |
countysz | character varying(1)  |
ethnic   | character varying(2)  |
language | character varying(2)  |
cbsacode | character varying(5)  |
cbsatype | character varying(1)  |
age      | smallint              |
dob      | date                  |
ccard    | boolean               |
lor      | integer               |
bankcard | boolean               |
lastord  | date                  |
total    | integer               |
lmob     | boolean               |
homeown  | character varying(1)  |
ord1st   | date                  |
ordlast  | date                  |
married  | boolean               |
deptcard | boolean               |

>>

You did not show us the indexes.

If you have pgadmin III, go to the table and copy/paste the actual definition, including indexes.
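
For example, assuming the table is the bigdatatable referenced below, something along these lines will list the index definitions (a sketch, not output from this thread):

  -- in psql, \d bigdatatable shows the columns and indexes together;
  -- from any client, the pg_indexes view gives just the index definitions:
  SELECT indexname, indexdef
    FROM pg_indexes
   WHERE tablename = 'bigdatatable';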

<<
From here it's about another 100 columns, either booleans or smallints, mostly null values.

I eventually killed the vacuum.  I will run it again, but was just going through the indexes.  All were of this format...

"INFO:  index "idx_jordlast" now contains 265658026 row versions in 728409 pages
DETAIL:  0 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.88s/0.13u sec elapsed 90.38 sec."

An example of a slow query is...

select count(*) from bigdatatable where age between 22 and 40 and state = 'NY';
>>

Is there an index on state and age?

Is there an index on state?

Is there an index on age?

That is important missing information.

If there is no index on either column, then you will do a table scan.

If all of your slow queries look like the one above, then create a clustered index on (state, age).
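
In PostgreSQL terms that is a two-column index followed by a one-time CLUSTER; a minimal sketch (the index name here is only illustrative):

  -- composite index covering both predicates
  CREATE INDEX idx_state_age ON bigdatatable (state, age);
  -- physically reorder the table by that index (one-time; not maintained automatically)
  CLUSTER bigdatatable USING idx_state_age;
  -- refresh planner statistics after the reorder
  ANALYZE bigdatatable;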

<<
I know count(*) is not optimized in PostgreSQL the way it is in MySQL, due to transaction isolation (at least that's what I've read; makes sense to me).  I understand it will take time to actually count the rows.  However, here's the output of explain analyze select count(*) from bigdatatable where age between 22 and 40 and state = 'NY';

 Aggregate  (cost=5179639.55..5179639.56 rows=1 width=0) (actual time=389529.895..389529.897 rows=1 loops=1)
   ->  Bitmap Heap Scan on bigdatatable  (cost=285410.65..5172649.63 rows=2795968 width=0) (actual time=6727.848..387159.175 rows=2553273 loops=1)
         Recheck Cond: ((state)::text = 'NY'::text)
         Filter: ((age >= 22) AND (age <= 40))
         ->  Bitmap Index Scan on idx_jstate  (cost=0.00..284711.66 rows=15425370 width=0) (actual time=6298.950..6298.950 rows=16821828 loops=1)
               Index Cond: ((state)::text = 'NY'::text)
 Total runtime: 389544.088 ms

It looks like the index scan itself is around 6 seconds or so, which is fine.  Then it looks like "Aggregate" suddenly jumps up to over 6 minutes, with nearly all of that time spent in the Bitmap Heap Scan.
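
In that plan only the state condition appears as an index condition; the age range is applied as a filter while the heap is visited.  With a (state, age) index like the one suggested above in place, the same query could be re-checked to see whether both conditions move into the index condition (a sketch only):

  -- compare the new plan against the one shown above
  EXPLAIN ANALYZE
  SELECT count(*)
    FROM bigdatatable
   WHERE age BETWEEN 22 AND 40
     AND state = 'NY';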

I know the database design is crude.  It's really just a big flat table.  I didn't put too much into weeding out which columns should be indexed and which shouldn't (just slapped an index on each).  Happy to do that work, but right now I'm in panic mode and just need to figure out which way to start going.  I had a design on MySQL which worked pretty well at 10 to 20% of full size, but degraded quite a bit at full size.  Compounding this, there is another implementation we've seen that uses the full size of similar data and returns actual results in seconds.  (I originally planned to use a 5% size db for estimated results, then the full size for getting the actual data.  That plan was rejected :(

Any thoughts?  It seemed to work OK when I had a table with 10 cols but about the same data length.  That may have been an artificial test, though.

Again, this is read-only once the data is set up.  The client wants to run pretty much arbitrary queries, so it's hard to isolate certain things for optimization, although there are some "knowns".

Will start the full vacuum process again.
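
A sketch of that run, assuming it is acceptable to raise maintenance_work_mem for the session (memory permitting) to shorten the index-cleanup phase:

  -- per-session only; does not change postgresql.conf
  SET maintenance_work_mem = '1GB';
  -- verbose output includes per-index row counts like the idx_jordlast lines quoted earlier
  VACUUM VERBOSE ANALYZE bigdatatable;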

Thanks in advance,
-Kevin

On Wed, Oct 29, 2008 at 4:52 PM, Dann Corbit <DCorbit@connx.com> wrote:
>> -----Original Message-----
>> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-
>> owner@postgresql.org] On Behalf Of Kevin Galligan
>> Sent: Wednesday, October 29, 2008 1:18 PM
>> To: pgsql-general@postgresql.org
>> Subject: [GENERAL] Slow query performance
>>
>> I'm approaching the end of my rope here.  I have a large database.
>> 250 million rows (ish).  Each row has potentially about 500 pieces of
>> data, although most of the columns are sparsely populated.
>>
>> What I'm trying to do is, essentially, search for sub-sets of that
>> data based on arbitrary queries of those data columns.  The queries
>> would be relatively simple ("dirbtl is not null and qqrq between 20
>> and 40").  After the database is built, it is read only.
>>
>> So, I started with maybe 10-15 fields in a main table, as most records
>> have values for those fields.  Then I had individual tables for the
>> other values.  The idea is that the percentage of rows with values
>> drops off significantly after those main tables.  That, and each
>> individual query looks at probably 3 or 4 fields in total.  The
>> performance of those queries was pretty bad.  It's got to join large
>> numbers of values, which didn't really work out well.
>>
>> So, went the other direction completely.  I rebuilt the database with
>> a much larger main table.  Any column filled in for 5% or more of the
>> rows was added to this table.  Maybe 130 columns.  Indexes applied to
>> most of these.  Some limited testing with a smaller table seemed to
>> indicate that queries on a single table without a join would work much
>> faster.
>>
>> So, built that huge table.  Now query time is terrible.  Maybe a
>> minute or more for simple queries.
>>
>> I'm running vacuum/analyze right now (which is also taking forever,
>> BTW).
>>
>> The box has 15 GB of RAM.  I set shared_buffers to 8 or 9
>> GB.  My first question: what would be better to bump up to increase
>> the performance?  I thought that was the setting to jack up to improve
>> query time or index caching, but I've read conflicting data.  The full
>> 15 GB of RAM is available.
>>
>> I originally had this in MySQL.  Also bad performance.  I understand
>> how to optimize that much better, but it just wasn't cutting it.
>>
>> Anyway, help with tuning the settings would be greatly appreciated.
>> Advice on how best to lay this out would also be helpful (I know it's
>> difficult without serious detail).
>
> Show the full schema for your table and the output of:
> VACUUM VERBOSE ANALYZE <your_table>
>
> Show the query that is slow.
>
>
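
Regarding the shared_buffers question in the quoted message above: a commonly cited starting point for a machine of this size (a rough sketch, with assumed values, not settings from this thread) is to keep shared_buffers to roughly a quarter of RAM and let effective_cache_size describe what the OS cache is expected to hold, e.g. in postgresql.conf:

  # rough starting points for a 15 GB box; assumed values, not taken from this thread
  shared_buffers = 4GB            # ~25% of RAM rather than 8-9 GB
  effective_cache_size = 10GB     # what the OS file cache is expected to hold
  work_mem = 64MB                 # per-sort/hash memory for each query step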
