Thread: Slow query performance
I'm approaching the end of my rope here. I have a large database: 250 million rows (ish). Each row has potentially about 500 pieces of data, although most of the columns are sparsely populated.

What I'm trying to do is, essentially, search for sub-sets of that data based on arbitrary queries of those data columns. The queries would be relatively simple ("dirbtl is not null and qqrq between 20 and 40"). After the database is built, it is read only.

So, I started with maybe 10-15 fields in a main table, as most records have values for those fields. Then I had individual tables for the other values. The idea is that the percentage of rows with values drops off significantly after those main fields, and each individual query looks at probably 3 or 4 fields in total. The performance of those queries was pretty bad; they had to join large numbers of values, which didn't really work out well.

So, I went the other direction completely. I rebuilt the database with a much larger main table. Any column with values in 5% or more of the rows was added to this table. Maybe 130 columns. Indexes were applied to most of these. Some limited testing with a smaller table seemed to indicate that queries on a single table without a join would work much faster.

So, I built that huge table. Now query time is terrible. Maybe a minute or more for simple queries.

I'm running vacuum/analyze right now (which is also taking forever, BTW).

The box has 15 GB of RAM. I set shared_buffers to 8 or 9 GB. My first question: what would be better to bump up to increase performance? I thought that was the setting to jack up to improve query time or index caching, but I've read conflicting advice. The full 15 GB of RAM is available.

I originally had this in MySQL. Also bad performance. I understand how to optimize that much better, but it just wasn't cutting it.

Anyway, help with tuning the settings would be greatly appreciated. Advice on how best to lay this out would also be helpful (I know it's difficult without serious detail).

Thanks in advance,
-Kevin
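P.S. For concreteness, the wide layout is roughly along these lines. Treat this as a sketch: the types and most of the names are placeholders rather than the real schema, and the actual table has ~130 columns, most of them individually indexed.

-- Single wide table, mostly sparse columns, one single-column index per field.
CREATE TABLE bigdatatable (
    id     integer,
    state  varchar(2),
    age    integer,
    dirbtl integer,
    qqrq   integer
    -- ... roughly 125 more, mostly sparse, columns ...
);
CREATE INDEX idx_jstate ON bigdatatable (state);
CREATE INDEX idx_age    ON bigdatatable (age);

-- The queries look like this:
SELECT count(*)
FROM bigdatatable
WHERE dirbtl IS NOT NULL
  AND qqrq BETWEEN 20 AND 40;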
On Wed, Oct 29, 2008 at 2:18 PM, Kevin Galligan <kgalligan@gmail.com> wrote:
> I'm approaching the end of my rope here. I have a large database.
> 250 million rows (ish). Each row has potentially about 500 pieces of
> data, although most of the columns are sparsely populated.

A couple of notes here. PostgreSQL stores null values as a single bit in a bit field, making sparsely populated tables quite efficient as long as you store the non-existent values as null and not '' or some other real value.

Have you run explain analyze on your queries yet? Pick a slow one, run explain analyze on it and post it and we'll see what we can do.
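If you want to see the difference for yourself, here's a quick throwaway sketch (the table and values are made up, and pg_column_size only gives a rough per-row size, but the all-NULL row should come out smaller):

-- A row of NULLs versus a row of empty strings: the NULLs only occupy bits
-- in the row's null bitmap, while each '' is stored as a real (empty) value.
-- With ~500 mostly-empty columns the gap is of course much bigger.
CREATE TEMP TABLE null_test (label text, a text, b text, c text, d text);
INSERT INTO null_test VALUES ('nulls',   NULL, NULL, NULL, NULL),
                             ('empties', '',   '',   '',   '');
SELECT label, pg_column_size(null_test.*) AS approx_row_bytes FROM null_test;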
An example of a slow query is...
select count(*) from bigdatatable where age between 22 and 40 and state = 'NY';
explain analyze returned the following...
Aggregate  (cost=5179639.55..5179639.56 rows=1 width=0) (actual time=389529.895..389529.897 rows=1 loops=1)
  ->  Bitmap Heap Scan on bigdatatable  (cost=285410.65..5172649.63 rows=2795968 width=0) (actual time=6727.848..387159.175 rows=2553273 loops=1)
        Recheck Cond: ((state)::text = 'NY'::text)
        Filter: ((age >= 22) AND (age <= 40))
        ->  Bitmap Index Scan on idx_jstate  (cost=0.00..284711.66 rows=15425370 width=0) (actual time=6298.950..6298.950 rows=16821828 loops=1)
              Index Cond: ((state)::text = 'NY'::text)
Total runtime: 389544.088 ms
It looks like the index scan is around 6 seconds or so, but then the bitmap heap scan and aggregate jump up to about 6 minutes.
More detail on the table design and other stuff in a bit...
On Wed, Oct 29, 2008 at 6:18 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Wed, Oct 29, 2008 at 2:18 PM, Kevin Galligan <kgalligan@gmail.com> wrote:
> > I'm approaching the end of my rope here. I have a large database.
> > 250 million rows (ish). Each row has potentially about 500 pieces of
> > data, although most of the columns are sparsely populated.
>
> A couple of notes here. PostgreSQL stores null values as a single bit
> in a bit field, making sparsely populated tables quite efficient as
> long as you store the non-existent values as null and not '' or some
> other real value.
>
> Have you run explain analyze on your queries yet? Pick a slow one,
> run explain analyze on it and post it and we'll see what we can do.
Hi Kevin,

I'm not deeply knowledgeable about PostgreSQL, but my guess is that two things are doing you in:

(1) scanning all those nulls during SELECTs (even though PostgreSQL is efficient at nulls, there are still tens or hundreds of billions of them)

(2) all those single-field indexes, and the aggregation between them.

Both (1) and (2) make it very difficult for PG to cache effectively. You have the advantage that your data is read-only at query time, so I'd suggest this (fairly lengthy) experiment:

Go back to a split-table format, where you have:

- one rectangular table containing those fields which are always, or nearly always, filled (the "Full" table). You talk about there being "some knowns". Does that mean that you know some combinations of two or more fields will be selected on very frequently? Optimise any of those combinations in the "Full" table using multi-field indexes (e.g. (state, age)). Put single-field indexes on all fields in the "Full" table.

- one or more tables (one initially, see below) containing the fields that are mostly null (the "Sparse" table). Store the sparse data in a "depivoted" form. Explanation follows.

"Sparse" table
---
Instead of the original "sparse" layout:

Id, field1, field2, field3, field4, ... field500
Rec1, ...
Rec2, ...
...
RecM, ...

store the sparse data as:

Id, fieldname, value
Rec1, field1name, field1value(Rec1)
Rec1, field2name, field2value(Rec1)
...
Rec1, field500name, field500value(Rec1)
Rec2, field1name, field1value(Rec2)
...
RecM, field500name, field500value(RecM)

I.e. one row per non-null cell of the "sparse" table, plus an Id to link back to the "Full" table. For null values, don't store a depivoted record at all. I'd estimate this would give you a few billion rows at most in this table. (If anyone has a better name for this process than "depivoting", please pass it along!)

In the depivoted table, put single-field and multi-field indexes on every combination of Id, fieldname and value:

(Id)
(fieldname)
(value)
(Id, fieldname)
(Id, value)
(fieldname, value)

You might eventually have to keep a different depivoted table for each type of field value (boolean, integer, character varying, etc.), but you could do a dirty experiment by converting all values to CHARACTER VARYING and having a look at query performance with the new structure before doing further work.

Rationale
---
"Depivoting" makes sense to me because your data is so sparse. The huge number of nulls may be causing severe aggregation bottlenecks and many cache misses. All those indexes will absolutely kill the query performance of any database in terms of hard disk seeks, data cache limits and aggregation time; I'm not surprised that both MySQL and PostgreSQL struggle. Too many indexes probably gave you diminishing returns on a table this big, because 15 GB of RAM on your server is way too small a cache for so many fields and indexes.

"Depivoting" eliminates the need to scan the huge number of nulls in the dataset. Even if nulls are handled very efficiently in PostgreSQL, you're talking about hundreds of billions of them, and hundreds of billions of anything is never going to be quick. Better not to process them at all.

"Depivoting" will (of course) eventually mean rewriting all your querying code, and you'll probably need to "rebuild" the sparse data into a wide table format at some point. But if the result set is small, this should be a small price to pay for better SELECT query performance.
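To make that concrete, here is a rough PostgreSQL sketch of the two tables and the kind of query they would support. All of the names and types are placeholders (and it takes the "everything as CHARACTER VARYING" shortcut mentioned above), so treat it as an illustration rather than a ready-made schema:

-- "Full" table: one row per record, only the densely populated fields.
CREATE TABLE full_data (
    id     integer PRIMARY KEY,
    state  varchar(2),
    age    integer
    -- ... the other always- or nearly-always-populated fields ...
);
-- Multi-field index for a known frequent combination, plus single-field indexes.
CREATE INDEX idx_full_state_age ON full_data (state, age);
CREATE INDEX idx_full_age       ON full_data (age);

-- "Sparse" table: one row per non-null cell of the old wide table.
CREATE TABLE sparse_data (
    id        integer NOT NULL REFERENCES full_data (id),
    fieldname varchar(64) NOT NULL,
    value     varchar NOT NULL      -- dirty shortcut: everything stored as text
);
CREATE INDEX idx_sparse_field_value ON sparse_data (fieldname, value);
CREATE INDEX idx_sparse_id_field    ON sparse_data (id, fieldname);

-- "dirbtl is not null and qqrq between 20 and 40" then becomes something like:
SELECT count(*)
FROM full_data f
JOIN sparse_data s1 ON s1.id = f.id AND s1.fieldname = 'dirbtl'
JOIN sparse_data s2 ON s2.id = f.id AND s2.fieldname = 'qqrq'
WHERE s2.value::integer BETWEEN 20 AND 40;   -- cast needed with the text shortcut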
If you want to do a quick and dirty experiment, I have an MS Access app that depivots arbitrarily wide tables. I'd be glad to pass it along, although there is a risk it's too small a gun for the job. But if Access manages okay with the depivoted table, it might be worth a try. Based on 500 fields, 250M records, 2% filled, it looks like it might depivot your table overnight, or better. You'd finish with about 2.5 billion rows.

Best wishes,
Nick

> -----Original Message-----
> From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of Kevin Galligan
> Sent: Thursday, 30 October 2008 7:18 AM
> To: pgsql-general@postgresql.org
> Subject: [GENERAL] Slow query performance
>
> I'm approaching the end of my rope here. I have a large database.
> 250 million rows (ish). Each row has potentially about 500 pieces of
> data, although most of the columns are sparsely populated.
> *snip*
On Wed, Oct 29, 2008 at 9:18 PM, Kevin Galligan <kgalligan@gmail.com> wrote:
> I'm approaching the end of my rope here. I have a large database.
> 250 million rows (ish). Each row has potentially about 500 pieces of
> data, although most of the columns are sparsely populated.
> *snip*
>
> So, went the other direction completely. I rebuilt the database with
> a much larger main table. Any values with 5% or greater filled in
> rows were added to this table. Maybe 130 columns. Indexes applied to
> most of these. Some limited testing with a smaller table seemed to
> indicate that queries on a single table without a join would work much
> faster.
>
> So, built that huge table. now query time is terrible. Maybe a
> minute or more for simple queries.

Are indexes on sparsely populated columns already handled efficiently, or could partial indexes with only non-null values improve things?

Isak
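Something along these lines, say (just a sketch; the column names are borrowed from the earlier example and the index name is made up):

-- Index only the rows that actually have a value in the sparse column,
-- so the index stays small relative to the 250M-row table.
CREATE INDEX idx_dirbtl_notnull ON bigdatatable (dirbtl)
    WHERE dirbtl IS NOT NULL;

-- For the planner to even consider the partial index, the query has to
-- repeat its predicate:
SELECT count(*)
FROM bigdatatable
WHERE dirbtl IS NOT NULL
  AND qqrq BETWEEN 20 AND 40;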
Kevin Galligan wrote, On 29-10-08 23:35:
> An example of a slow query is...
>
> select count(*) from bigdatatable where age between 22 and 40 and state
> = 'NY';
>
> explain analyze returned the following...
>
> Aggregate  (cost=5179639.55..5179639.56 rows=1 width=0) (actual
> time=389529.895..389529.897 rows=1 loops=1)
>   ->  Bitmap Heap Scan on bigdatatable  (cost=285410.65..5172649.63
>       rows=2795968 width=0) (actual time=6727.848..387159.175
>       rows=2553273 loops=1)
>         Recheck Cond: ((state)::text = 'NY'::text)
>         Filter: ((age >= 22) AND (age <= 40))
>         ->  Bitmap Index Scan on idx_jstate  (cost=0.00..284711.66
>             rows=15425370 width=0) (actual time=6298.950..6298.950
>             rows=16821828 loops=1)
>               Index Cond: ((state)::text = 'NY'::text)
> Total runtime: 389544.088 ms
>
> It looks like the index scan is around 6 seconds or so, but then
> the bitmap heap scan and aggregate jump up to about 6 minutes.

Indeed. The cause is that PostgreSQL must visit the table data to verify that each row is actually visible to the transaction, which means A LOT of data must be retrieved from disk.

The only real thing you can do is reduce the I/O load, by reducing the amount of data that must be traversed (or by ensuring that related data is stored closely together, but that's really hard to get right). That requires optimizing your database design for that single goal. It will not make the query scale any better than it currently does, however: it will still be O(N) in the size of your table, and you need other techniques to do better than that. The other option is spending extra money on hardware that can sustain higher I/O seek rates (more and/or faster spindles).

- Joris

> More detail on the table design and other stuff in a bit...
>
> On Wed, Oct 29, 2008 at 6:18 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> *snip*
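One way to act on the "stored closely together" point, since the table is read-only after it's built, is to physically reorder it by the index the common queries use. A rough sketch only: the index name is the one from the EXPLAIN output, and the CLUSTER syntax shown is the old style (newer PostgreSQL releases write it as CLUSTER bigdatatable USING idx_jstate):

-- Rewrites the whole table in idx_jstate order, so rows for one state end up
-- on adjacent pages; heavy and one-off, but the data never changes afterwards.
CLUSTER idx_jstate ON bigdatatable;
-- Refresh planner statistics after the rewrite.
ANALYZE bigdatatable;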