Thread: perf problem with huge table
Hi all,

I am trying to move my app from MS SQL to PostgreSQL, but I need a bit of help :)

On MS SQL I had a set of tables laid out like this (sorry, pseudo code):

    contab_y    date, amt, uid
    contab_ym   date, amt, uid
    contab_ymd  date, amt, uid
    and so on...

These were used to "solidify" (aggregate, sorry for my terrible English) the data, so basically I get:

    contab_y     date = 2010        amt = 100   uid = 1
    contab_ym    date = 2010-01     amt = 10    uid = 1
                 date = 2010-02     amt = 90    uid = 1
    contab_ymd   date = 2010-01-01  amt = 1     uid = 1
                 ...

That way, when I need to run a query over a long range (e.g. one year) I just take the rows from contab_y; if I need a query for a couple of days I can go to contab_ymd; and if I need data for some other timeframe I can do an intersection between the different tables using some huge (but fast) queries.

The problem is that this design is hard to maintain and the tables are difficult to check.

What I have tried is a "normal" approach, using just one table that contains all the data, with proper indexing. The issue is that this table can easily contain 100M rows :) That's why the other guys did all this work to speed up queries by splitting the data across different tables and precalculating the sums.

I am here to ask the PostgreSQL experts for advice: what do you think I can do to better manage this situation? Are there other cases I can look at, some documentation, or some technique I don't know about? Any advice is really appreciated!
On 2/10/2010 5:13 PM, rama wrote:
> That way, when I need to run a query over a long range (e.g. one year) I just take the rows from contab_y; if I need a query for a couple of days I can go to contab_ymd; and if I need data for some other timeframe I can do an intersection between the different tables using some huge (but fast) queries.
>
> The problem is that this design is hard to maintain and the tables are difficult to check.
>
> What I have tried is a "normal" approach, using just one table that contains all the data, with proper indexing. The issue is that this table can easily contain 100M rows :) That's why the other guys did all this work to speed up queries by splitting the data across different tables and precalculating the sums.
>
> I am here to ask the PostgreSQL experts for advice: what do you think I can do to better manage this situation? Are there other cases I can look at, some documentation, or some technique I don't know about? Any advice is really appreciated!

Look into table partitioning:

http://www.postgresql.org/docs/current/static/ddl-partitioning.html

It is similar to what you are doing, but it simplifies the queries and the logic needed to access large data sets.
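To make that concrete, here is a minimal sketch of the inheritance-based setup those docs describe, reusing the column names from rama's pseudo code. The monthly split and the table names are just assumptions for illustration:

-- parent table; it holds no data itself
CREATE TABLE contab (
    date date    NOT NULL,
    amt  numeric NOT NULL,
    uid  integer NOT NULL
);

-- one child table per month, each with a CHECK constraint describing its range
CREATE TABLE contab_2010_01 (
    CHECK (date >= DATE '2010-01-01' AND date < DATE '2010-02-01')
) INHERITS (contab);

CREATE TABLE contab_2010_02 (
    CHECK (date >= DATE '2010-02-01' AND date < DATE '2010-03-01')
) INHERITS (contab);

CREATE INDEX contab_2010_01_date_idx ON contab_2010_01 (date);
CREATE INDEX contab_2010_02_date_idx ON contab_2010_02 (date);

-- with constraint exclusion enabled, this query only scans the January child
SET constraint_exclusion = on;
SELECT uid, sum(amt)
  FROM contab
 WHERE date >= DATE '2010-01-01' AND date < DATE '2010-02-01'
 GROUP BY uid;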
Hi Rama
I'm actually looking at going in the other direction ....
I have an app using PG with a single table that we just keep adding data to; I'm ending up with many millions of rows, and I'm finding that the single-table schema simply doesn't scale.
In PG, table partitioning is only handled by the database for reads; for insert/update you need to do quite a lot of DIY (setting up triggers, etc.), so I am planning to just use named tables and generate the necessary DDL / DML in vanilla SQL, the same way your older code does.
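For reference, the DIY part I mean is roughly a redirect trigger like the sketch below; it assumes monthly child tables such as the contab_2010_01 / contab_2010_02 ones sketched earlier in the thread:

-- redirect inserts on the parent table into the right child
CREATE OR REPLACE FUNCTION contab_insert_trigger() RETURNS trigger AS $$
BEGIN
    IF NEW.date >= DATE '2010-01-01' AND NEW.date < DATE '2010-02-01' THEN
        INSERT INTO contab_2010_01 VALUES (NEW.*);
    ELSIF NEW.date >= DATE '2010-02-01' AND NEW.date < DATE '2010-03-01' THEN
        INSERT INTO contab_2010_02 VALUES (NEW.*);
    ELSE
        RAISE EXCEPTION 'contab: no partition for date %', NEW.date;
    END IF;
    RETURN NULL;  -- the row is already stored in a child, skip the parent
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER contab_insert
    BEFORE INSERT ON contab
    FOR EACH ROW EXECUTE PROCEDURE contab_insert_trigger();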
My experience is mostly with Oracle, which is not MVCC, so I've had to relearn some stuff:
- Oracle often answers simple queries (e.g. counts and max / min) using only the index, which is of course pre-sorted. PG has to go out and fetch the rows to see if they are still in scope, and if they are stored all over the place on disk it means an 8K random page fetch for each row. This means that adding an index to PG is not nearly the silver bullet that it can be with some non-MVCC databases.
- PG's indexes seem to be quite a bit larger than Oracle's, but that's gut feel, I haven't been doing true comparisons ... however, for my app I have limited myself to only two indexes on that table, and each index is larger (in disk space) than the table itself ... I have 60GB of data and 140GB of indexes :-)
- There is a lot of row turnover in my big table (I age out data) .... a big delete (millions of rows) in PG seems a bit more expensive to process than in Oracle, however PG is not nearly as sensitive to transaction sizes as Oracle is, so you can cheerfully throw out one big "DELETE from FOO where ..." and let the database chew on it
I am interested to hear about your progress.
Cheers
Dave
On Wed, Feb 10, 2010 at 4:16 PM, Dave Crooke <dcrooke@gmail.com> wrote:
> My experience is mostly with Oracle, which is not MVCC, so I've had to relearn some stuff:

Just a nit, but Oracle implements MVCC. 90% of the databases out there do.

> - There is a lot of row turnover in my big table (I age out data) .... a big delete (millions of rows) in PG seems a bit more expensive to process than in Oracle, however PG is not nearly as sensitive to transaction sizes as Oracle is, so you can cheerfully throw out one big "DELETE from FOO where ..." and let the database chew on it

I find partitioning pretty useful in this scenario if the data allows it. Aging out data just means dropping a partition rather than running a delete statement.
On Wed, Feb 10, 2010 at 5:30 PM, Jon Lewison <jlewison1@gmail.com> wrote:
> Just a nit, but Oracle implements MVCC. 90% of the databases out there do.
Sorry, I spoke imprecisely. What I meant was the difference in how the rows are stored internally .... in Oracle, the main tablespace contains only the newest version of a row, which is (where possible) updated in place - queries in a transaction that can still "see" an older version have to pull it from the UNDO tablespace (rollback segments in Oracle 8 and older).
In Postgres, all versions of all rows are in the main table, and have validity ranges associated with them ("this version of this row existed between transaction ids x and y"). Once a version goes out of scope, it has to be garbage collected by the vacuuming process so the space can be re-used.
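You can actually peek at those validity ranges from SQL via the xmin / xmax system columns; a throwaway example (table name made up):

CREATE TABLE t (id int, val text);
INSERT INTO t VALUES (1, 'first');
UPDATE t SET val = 'second' WHERE id = 1;

-- xmin is the transaction that created this row version,
-- xmax is the one that deleted or superseded it (0 while it is live)
SELECT xmin, xmax, ctid, * FROM t;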
In general, this means Oracle is faster *if* you're only doing lots of small transactions (consider how these different models handle an update to a single field in a single row), but it is more sensitive to the scale of transactions .... doing a really big transaction against a database with an OLTP workload can upset Oracle's digestion, as it causes a lot of UNDO lookups; PG's performance is a lot more predictable in this regard.
Both models have benefits and drawbacks ... when designing a schema for performance it's important to understand these differences.
> I find partitioning pretty useful in this scenario if the data allows it. Aging out data just means dropping a partition rather than running a delete statement.
Forgot to say this - yes, absolutely agree .... dropping a table is a lot cheaper than a transactional delete.
In general, I think partitioning is more important / beneficial with PG's style of MVCC than with Oracle or SQL-Server (which I think is closer to Oracle than PG).
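With the inheritance-style children sketched earlier in the thread, aging out a month is roughly just (child and parent names taken from that sketch):

-- detach the old child from the parent, then drop it:
-- no row-by-row delete, and nothing left over for vacuum to clean up
ALTER TABLE contab_2010_01 NO INHERIT contab;
DROP TABLE contab_2010_01;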
Cheers
Dave
On Wed, Feb 10, 2010 at 4:48 PM, Dave Crooke <dcrooke@gmail.com> wrote:
> [...] doing a really big transaction against a database with an OLTP workload can upset Oracle's digestion, as it causes a lot of UNDO lookups; PG's performance is a lot more predictable in this regard.

Yes, absolutely. It's not unusual to see the UNDO tablespace increase in size by several gigs for a large bulk load.

Speaking of rollback segments, I'm assuming that since all storage for non-visible row versions is in the main table, PostgreSQL has no equivalent of ORA-01555.

- Jon
On Wed, Feb 10, 2010 at 6:09 PM, Jon Lewison <jlewison1@gmail.com> wrote:
> Speaking of rollback segments, I'm assuming that since all storage for non-visible row versions is in the main table, PostgreSQL has no equivalent of ORA-01555.

Actually, in a way it does .... "No space left on device" or similar ;-)
Cheers
Dave
P.S. For those not familiar with Oracle, ORA-01555 translates to "your query / transaction is kinda old and I've forgotten the data, so I'm just going to throw an error at you now". If you're reading, your SELECT randomly fails; if you're writing, it forces a rollback of your transaction.
rama wrote:
> That way, when I need to run a query over a long range (e.g. one year) I just take the rows from contab_y; if I need a query for a couple of days I can go to contab_ymd; and if I need data for some other timeframe I can do an intersection between the different tables using some huge (but fast) queries.
>
> The problem is that this design is hard to maintain and the tables are difficult to check.

You sound like you're trying to implement something like materialized views one at a time; have you considered adopting the more general techniques used to maintain those, so that you're not doing custom development each time for the design?

http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
http://www.pgcon.org/2008/schedule/events/69.en.html

I think that sort of approach is more practical here than it would have been for you in MySQL, so maybe this wasn't on your list of possibilities before.

--
Greg Smith  2ndQuadrant  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com  www.2ndQuadrant.com
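The core of the technique behind those links is a summary table kept current by a trigger; here is a minimal sketch (all names are invented, and a real version also has to deal with UPDATE/DELETE on the detail table and with concurrent inserts of the same key):

-- raw detail table
CREATE TABLE contab_detail (
    date date    NOT NULL,
    amt  numeric NOT NULL,
    uid  integer NOT NULL
);

-- summary table playing the role of contab_ymd
CREATE TABLE contab_day_summary (
    day date    NOT NULL,
    uid integer NOT NULL,
    amt numeric NOT NULL DEFAULT 0,
    PRIMARY KEY (day, uid)
);

-- keep the summary current on every insert into the detail table
CREATE OR REPLACE FUNCTION contab_summarize() RETURNS trigger AS $$
BEGIN
    UPDATE contab_day_summary
       SET amt = amt + NEW.amt
     WHERE day = NEW.date AND uid = NEW.uid;
    IF NOT FOUND THEN
        INSERT INTO contab_day_summary (day, uid, amt)
        VALUES (NEW.date, NEW.uid, NEW.amt);
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER contab_detail_summarize
    AFTER INSERT ON contab_detail
    FOR EACH ROW EXECUTE PROCEDURE contab_summarize();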
> Hi all,
>
> I am trying to move my app from MS SQL to PostgreSQL, but I need a bit of help :)

Apart from all the other good advice about partitioning the dataset and such, there is another aspect to keep in mind. When you have a large dataset and your queries become I/O-bound, the "tuple density" is going to hit you in two ways. Your dataset seems to have a natural clustering around time, which is also what you would use for the partitioning. That also means that if you more or less have that clustering of the data on disk, the tuples you need to satisfy a query will sit on the same page, or on pages close by.

The cost of checking visibility for a tuple is to some degree a function of the tuple size, so if you can do anything to increase the tuple density it will most likely benefit speed in two ways:

* You increase the likelihood that the next tuple is in the same page, so it doesn't cost a random I/O seek.
* You increase the total number of tuples sitting in your system cache in the same amount of pages (memory), so they don't cost a random I/O seek either.

So if you are carrying around columns you don't really need, throw them away (that could be columns that can be trivially computed from other columns), but you need to do your own testing here.

To stress the first point, here is a sample run on a fairly old desktop with one SATA drive. testtable has an "id integer" column and a "data" column holding 486 bytes of text; testtable2 has "id integer" and an integer "data" column. Both were filled with 10M tuples, and PG was restarted and the OS caches dropped beforehand to simulate a totally disk-bound system.

testdb=# select count(id) from testtable where id > 8000000 and id < 8500000;
 count
--------
 499999
(1 row)

Time: 7909.464 ms

testdb=# select count(id) from testtable2 where id > 8000000 and id < 8500000;
 count
--------
 499999
(1 row)

Time: 2149.509 ms

In this sample the narrow table is about four times faster, and the query does not even touch the "data" column. (On better hardware you'll most likely see better results.)

If the columns are needed, you can push the less frequently used ones out to a 1:1 relation. That gives you some administrative overhead, but you can then decide at query time whether you want to pay the extra random seeks to access that data. You get the same picture the other way around: if your queries access data sitting in TOAST, you'll be paying a double random I/O cost to get the tuple. So it is definitely a tradeoff that should be made with care.

I've monkeypatched my own PG using this patch to toy around with criteria for sending the less frequently used data to a TOAST table:

http://article.gmane.org/gmane.comp.db.postgresql.devel.general/135158/match=

Google "vertical partition" for more; this is basically what it is. (I believe this could benefit my own application, so I'm also trying to stir up some interest in the area.)

--
Jesper
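A bare-bones sketch of that 1:1 split (all names are invented):

-- hot table: only the columns the range queries actually need
CREATE TABLE events_hot (
    id   bigint  PRIMARY KEY,
    date date    NOT NULL,
    amt  numeric NOT NULL,
    uid  integer NOT NULL
);

-- cold table: the wide, rarely used payload, 1:1 with events_hot
CREATE TABLE events_cold (
    id      bigint PRIMARY KEY REFERENCES events_hot (id),
    payload text
);

-- fast path: aggregate over the narrow table only
SELECT uid, sum(amt)
  FROM events_hot
 WHERE date >= DATE '2010-01-01' AND date < DATE '2010-02-01'
 GROUP BY uid;

-- slow path: pay the extra seeks only when the payload is really needed
SELECT h.id, h.date, c.payload
  FROM events_hot h
  JOIN events_cold c ON c.id = h.id
 WHERE h.date = DATE '2010-01-15';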
Dave Crooke wrote:
> Forgot to say this - yes, absolutely agree .... dropping a table is a lot cheaper than a transactional delete.
>
> In general, I think partitioning is more important / beneficial with PG's style of MVCC than with Oracle or SQL-Server (which I think is closer to Oracle than PG).

I would like to disagree here a little bit.

Where Oracle's table partitioning comes in very handy is, for example, when you have to replace the data of a big (read-only) table on a regular basis (typically data replicated from another system). In this case, you create a partitioned table with exactly the same columns, indexes and so on as the data table. To load, you then load the data into the partitioned table, i.e.:

- truncate the partitioned table, disable constraints, drop indexes
- load the data into the partitioned table
- rebuild all indexes etc. on the partitioned table

During all this time (even if it takes hours) the application can still access the data in the data table without the bulk load interfering. Once you have prepared the data in the partitioned table, you exchange the partition with the data table, which is a dictionary operation. That means the application is (if at all) only blocked during this operation, which is in the sub-second range.

If you had to do this with conventional updates or deletes/inserts instead, it might not even be possible in the given timeframe.

Just as an example,
Leo

P.S. Just to make it a little bit clearer about the famous ORA-01555: Oracle is not "forgetting" the data, as the Oracle RDBMS is of course also ACID-compliant. The ORA-01555 can happen:

- when the rollback tablespace is really too small to hold all the data changed in the transaction (which I consider a configuration error)
- when a long-running (read) transaction needs a record which has already been updated AND COMMITTED by another transaction.
The key here is that a second transaction has changed a record which is also needed by the first transaction, and the second transaction has committed its work. Committing the change means the data in the rollback segment is no longer needed, as it can be read directly from the data block (after all, it is committed, which means it is valid and visible to other transactions). If the first transaction now tries to read the data from the rollback segment to see the unchanged state, it will still succeed (it is still there; nothing has happened to the rollback segment so far).

The ORA-01555 problem only shows up if a third transaction now needs space in the rollback segment. As the entry from the first/second transaction is marked committed (and therefore no longer needed), it is perfectly valid for transaction #3 to grab this rollback segment and store its own old values there. If THEN (and only then) transaction #1 comes back, asking for the old, unchanged value as of when its transaction started, THEN the famous ORA-01555 is raised, because that value has now been overwritten by transaction #3.

That's why in newer versions you have to set the retention time of the rollback blocks/segments to a value bigger than your expected longest transaction. This decreases the likelihood of an ORA-01555 drastically, but it is still not zero: you can easily construct an example where it still fails with ORA-01555, because a transaction can run longer than you expected, or the changed data is bigger than the whole rollback tablespace.
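Coming back to the partition exchange for a moment, the swap Leo describes looks roughly like this on the Oracle side. This is only a sketch, with all object names invented, and clauses like INCLUDING INDEXES depend on how the indexes are set up:

-- load and index the staging table while the application keeps reading big_table
TRUNCATE TABLE staging_load;
INSERT /*+ APPEND */ INTO staging_load SELECT * FROM replicated_source;
CREATE INDEX staging_load_date_idx ON staging_load (load_date);

-- the swap itself is the sub-second dictionary operation Leo mentions
ALTER TABLE big_table
  EXCHANGE PARTITION p_current WITH TABLE staging_load
  INCLUDING INDEXES WITHOUT VALIDATION;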