Thread: Performance on large data transformations

Performance on large data transformations

From: Josh Berkus
Folks,

On one database, I have an overnight data transformation procedure that goes
like:

TableA has about 125,000 records.

Begin Transaction:
1) Update 80% of records in TableA
2) Update 10% of records in TableA
3) Update 65% of records in TableA
4) Update 55% of records in TableA
5) Update 15% of records in TableA with references to other records in TableA
6) Flag what hasn't been updated.
Commit
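
Schematically, with invented table and column names (the real schema is more
involved than this), the procedure looks like:

   BEGIN;
   -- Step 1 (~80% of rows).  Under MVCC every UPDATE writes a complete
   -- new row version; the old versions sit in the table as dead tuples
   -- and can't be reclaimed until a VACUUM, which can't run inside the
   -- transaction.
   UPDATE tablea SET region = upper(region) WHERE region IS NOT NULL;

   -- Steps 2-4 (condensed here): more passes over largely the same rows,
   -- each stacking another generation of dead tuples on top of the last.
   UPDATE tablea SET status = 'dup' WHERE source_id IS NOT NULL;

   -- Step 5: the self-referencing pass.  By now the table is several
   -- times its live size on disk, so this join seeks all over the drive.
   UPDATE tablea SET parent_key = b.pkey
     FROM tablea b
    WHERE tablea.parent_code = b.code;

   -- Step 6: flag whatever didn't match.
   UPDATE tablea SET status = 'unmatched' WHERE status IS NULL;
   COMMIT;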

I've found that, no matter what my FSM settings are (I've gone as high as
1,000,000), by the time I reach step 4 execution has slowed down considerably,
and for step 5 it takes the server more than 10 minutes to complete the
update statement. During this period, CPU, RAM and disk I/O are almost idle
... the system seems to spend all of its time doing lengthy seeks.  There is,
for that matter, no kernel swap activity, but I'm not sure how to measure
Postgres temp file activity.
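
(The dead-tuple buildup itself is easy enough to see between runs -- though
not, of course, inside the transaction -- with something like:

   -- Reports how many dead row versions VACUUM found and reclaimed,
   -- and how many pages the table occupies.
   VACUUM VERBOSE tablea;

using the same invented table name as in the sketch above.)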

(FYI:  Dual Athlon 1600MHz/1GB/hardware RAID 1 with xlog on a separate SCSI
drive/Red Hat Linux 8.0/PostgreSQL 7.2.4)

The only way around this I've found is to break the above up into separate
transactions with VACUUMs in between, and to "simulate" a transaction by making
a backup copy of the table and restoring from it if something goes wrong.
I've tried enough different methods to be reasonably certain that there is no
way around this in 7.2.4.
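
In rough outline, using the same invented names as above (each statement runs
as its own transaction from the driving script):

   -- "Simulate" the transaction by snapshotting the table first.
   CREATE TABLE tablea_backup AS SELECT * FROM tablea;

   -- Then one step at a time, reclaiming dead tuples between passes.
   UPDATE tablea SET region = upper(region) WHERE region IS NOT NULL;
   VACUUM ANALYZE tablea;

   UPDATE tablea SET status = 'dup' WHERE source_id IS NOT NULL;
   VACUUM ANALYZE tablea;
   -- ... and so on through step 6 ...

   -- If any step fails, put the original data back by hand:
   --   DELETE FROM tablea;
   --   INSERT INTO tablea SELECT * FROM tablea_backup;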

The reason I bring this up is that PostgreSQL's dramatic plunge in performance
on large serial updates is really problematic for us in the OLAP database
market, where large data transformations, as well as extensive use of
calculated temporary tables, are common.  I was particularly distressed when
I had to tell a client considering switching from MSSQL to Postgres for an
OLAP database that they might just be trading one set of problems for
another.

Is there any way we can improve on this kind of operation in future versions
of PostgreSQL?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Performance on large data transformations

From: "Ron Mayer"
Josh Berkus wrote:
>
>There is, for that matter, no kernel swap activity, but I'm not
>sure how to measure Postgres temp file activity.

Of course you could:

   mv /wherever/data/base/16556/pgsql_tmp /some_other_disk/
   ln -s /some_other_disk/pgsql_tmp /wherever/data/base/16556

and use "iostat" from the "sysstat" package to watch how heavily you're
using the disk the temp directory is on.


In fact, for OLAP stuff I've had this help performance, because quite a few
data warehousing operations look like:
   first read from the main database,
   do a big hash-or-sort (which gets written to pgsql_tmp),
   then read from this temporary result and write it back to the main database.
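
A schematic example of that pattern (made-up table and column names; the
point is that the middle step spills to pgsql_tmp once it outgrows sort_mem):

   -- 1. Read from the main database and do a big sort/aggregate.
   --    Anything bigger than sort_mem gets written to pgsql_tmp.
   SELECT customer_id, sum(amount) AS total
     INTO TEMP daily_totals
     FROM sales
    GROUP BY customer_id;

   -- 2. Read the intermediate result back and write it to the
   --    main database.
   UPDATE customers
      SET lifetime_total = daily_totals.total
     FROM daily_totals
    WHERE customers.customer_id = daily_totals.customer_id;

So moving pgsql_tmp onto its own spindle takes that middle step's I/O off
the main data disk.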

PS: Yes, I know this doesn't help the FSM stuff you asked about.


    Ron