Performance on large data transformations - Mailing list pgsql-performance

From Josh Berkus
Subject Performance on large data transformations
Date
Msg-id 200303170938.38365.josh@agliodbs.com
Whole thread Raw
Responses Re: Performance on large data transformations  ("Ron Mayer" <ron@intervideo.com>)
List pgsql-performance
Folks,

On one database, I have an overnight data transformation procedure that goes
like:

TableA has about 125,000 records.

Begin Transaction:
1) Update 80% of records in TableA
2) Update 10% of records in TableA
3) Update 65% of records in TableA
4) Update 55% of records in TableA
5) Update 15% or records in TableA with references to other records in TableA
6) Flag what hasn't been updated.
Commit

I've found that, no matter what my FSM settings (I've gone as high as
1,000,000) by the time I reach step 4 execution has slowed down considerably,
and for step 5 it takes the server more than 10 minutes to complete the
update statement. During this period, CPU, RAM and disk I/O are almost idle
... the system seems to spend all of its time doing lengthy seeks.  There is,
for that matter, no kernel swap activity, but I'm not sure how to measure
Postgres temp file activity.

(FYI:  Dual Athalon 1600mhz/1gb/Hardware Raid 1 with xlog on seperate SCSI
drive/Red Hat Linux 8.0/PostgreSQL 7.2.4)

The only way around this I've found is to break up the above into seperate
transactions with VACUUMs in between, and "simulate" a transaction by making
a back-up copy of the table and restoring from it if something goes wrong.
I've tried enough different methods to be reasonably certain that there is no
way around this in 7.2.4.

The reason I bring this up is that PostgreSQL's dramatic plunge in performance
in large serial updates is really problematic for us in the OLAP database
market, where large data transformations, as well as extensive use of
calculated temporary tables, is common.    I was particularly distressed when
I had to tell a client considering switching from MSSQL to Postgres for an
OLAP database that they might just be trading one set of problems for
another.

Is there any way we can improve on this kind of operation in future versions
of PostgreSQL?

--
Josh Berkus
Aglio Database Solutions
San Francisco

pgsql-performance by date:

Previous
From: Chris Sutton
Date:
Subject: Re: postgresql meltdown on PlanetMath.org
Next
From: Max Baker
Date:
Subject: Re: postgresql meltdown on PlanetMath.org