Re: perf problem with huge table - Mailing list pgsql-performance
From | Dave Crooke |
---|---|
Subject | Re: perf problem with huge table |
Date | |
Msg-id | ca24673e1002101516wa2545aek80cbe88df935698@mail.gmail.com |
In response to | perf problem with huge table (rama <rama.rama@tiscali.it>) |
Responses | Re: perf problem with huge table |
List | pgsql-performance |
Hi Rama
I'm actually looking at going in the other direction ....
I have an app using PG where we have a single table where we just added a lot of data, and I'm ending up with many millions of rows, and I'm finding that the single table schema simply doesn't scale.
In PG, table partitioning is only handled by the database for reads; for insert/update you need to do quite a lot of DIY (setting up triggers, etc.), so I am planning to just use named tables and generate the necessary DDL / DML in vanilla SQL, the same way that your older code does.
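To make the "named tables" idea concrete, here is a minimal sketch of generating the per-period table names and statements from application code. Everything in it is an assumption for illustration: the base name `foo`, the one-table-per-month granularity, the column list (`ts`, `uid`, `amt`), and the `%s` placeholder style (as used by drivers such as psycopg2) are hypothetical, not taken from my actual schema.

```python
from datetime import date


def table_name(base: str, day: date) -> str:
    """Return the name of the per-month table holding rows dated `day`."""
    # `base` must be a trusted identifier, never user input,
    # because it is interpolated directly into the SQL text.
    return f"{base}_{day.year:04d}_{day.month:02d}"


def create_table_sql(base: str, day: date) -> str:
    """Build the DDL for one monthly table (hypothetical column list)."""
    return (
        f"CREATE TABLE {table_name(base, day)} "
        "(ts timestamp NOT NULL, uid integer NOT NULL, amt numeric NOT NULL)"
    )


def insert_sql(base: str, day: date) -> str:
    """Build a parameterized INSERT routed to the right monthly table."""
    return (
        f"INSERT INTO {table_name(base, day)} (ts, uid, amt) "
        "VALUES (%s, %s, %s)"
    )


if __name__ == "__main__":
    d = date(2010, 2, 10)
    print(create_table_sql("foo", d))
    print(insert_sql("foo", d))
```

The routing decision lives in the application rather than in triggers, which is the trade-off I'm describing: more code on my side, but nothing beyond plain SQL on the database side.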
My experience is mostly with Oracle, which implements MVCC differently (old row versions go to undo segments rather than staying in the table), so I've had to relearn some stuff:
- Oracle often answers simple queries (e.g. counts and max / min) using only the index, which is of course pre-sorted. PG has to go out and fetch the rows to see if they are still visible to the transaction, and if they are stored all over the place on disk it means an 8K random page fetch for each row. This means that adding an index to PG is not nearly the silver bullet that it can be with databases that don't keep old row versions in the table.
- PG's indexes seem to be quite a bit larger than Oracle's, but that's gut feel, I haven't been doing true comparisons ... however, for my app I have limited myself to only two indexes on that table, and each index is larger (in disk space) than the table itself ... I have 60GB of data and 140GB of indexes :-)
- There is a lot of row turnover in my big table (I age out data) .... a big delete (millions of rows) in PG seems a bit more expensive to process than in Oracle, however PG is not nearly as sensitive to transaction sizes as Oracle is, so you can cheerfully throw out one big "DELETE from FOO where ..." and let the database chew on it
I am interested to hear about your progress.
Cheers
Dave
On Wed, Feb 10, 2010 at 4:13 PM, rama <rama.rama@tiscali.it> wrote:
Hi all,
I am trying to move my app from M$sql to PGsql, but I need a bit of help :)
On M$sql, I had certain tables that were laid out as follows (sorry, pseudo code):
contab_y
    date
    amt
    uid
contab_ym
    date
    amt
    uid
contab_ymd
    date
    amt
    uid
and so on..
This was used to "solidify" (aggregate.. btw sorry for my terrible English) the data in them,
so basically, I get
contab_y
    date = 2010
    amt = 100
    uid = 1
contab_ym
    date = 2010-01
    amt = 10
    uid = 1
    ----
    date = 2010-02
    amt = 90
    uid = 1
contab_ymd
    date = 2010-01-01
    amt = 1
    uid = 1
    ----
    blabla
In that way, when I need to run a query over a long range (i.e. 1 year), I just take the rows contained in contab_y.
If I need to query a couple of days, I can go to contab_ymd; if I need data for some other timeframe, I can do some cool intersection between the different tables using some huge (but fast) queries.
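The "intersection between the different tables" can be sketched as splitting a date range into the coarsest pieces that fit: whole years from contab_y, whole months from contab_ym, and leftover days from contab_ymd. This is only an illustration under assumptions: the function name `cover_range` is hypothetical, the range is taken as inclusive on both ends, and the key formats simply mirror the sample values above.

```python
from datetime import date, timedelta


def cover_range(start: date, end: date) -> list[tuple[str, str]]:
    """Split [start, end] (inclusive) into (table, key) pieces,
    preferring whole years, then whole months, then single days."""
    pieces = []
    day = start
    one_day = timedelta(days=1)
    while day <= end:
        year_end = date(day.year, 12, 31)
        # First day of the following month, minus one day.
        next_month = date(day.year + (day.month == 12), day.month % 12 + 1, 1)
        month_end = next_month - one_day
        if day.month == 1 and day.day == 1 and year_end <= end:
            pieces.append(("contab_y", f"{day.year:04d}"))
            day = year_end + one_day
        elif day.day == 1 and month_end <= end:
            pieces.append(("contab_ym", f"{day.year:04d}-{day.month:02d}"))
            day = month_end + one_day
        else:
            pieces.append(("contab_ymd", day.isoformat()))
            day += one_day
    return pieces


if __name__ == "__main__":
    for table, key in cover_range(date(2010, 1, 30), date(2010, 3, 1)):
        print(table, key)
```

Each piece then becomes one lookup against the named table, and the amounts are summed on the client or with a UNION ALL.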
Now, the problem is that this design is hard to maintain, and the tables are difficult to check.
What I have tried is to go for a "normal" approach, using just one table that contains all the data, with some proper indexing.
The issue is that this table can easily contain 100M rows :)
That's why the other guys did all this work to speed up queries by splitting the data across different tables and precalculating the sums.
I am here to ask the PGsql experts for advice:
What do you think I can do to better manage this situation?
Are there other cases I can take a look at? Maybe some documentation, or some technique that I don't know?
Any advice is really appreciated!