Help with extracting large volumes of records across related tables - Mailing list pgsql-performance

From Damien Dougan
Subject Help with extracting large volumes of records across related tables
Msg-id 004b01c49986$228c3d60$6e01a8c0@a31p005
Responses Re: Help with extracting large volumes of records across related tables  (Paul Thomas <paul@tmsl.demon.co.uk>)
Re: Help with extracting large volumes of records across related  (Mischa Sandberg <ischamay.andbergsay@activestateway.com>)
List pgsql-performance
Hi All,

I am having a performance problem extracting a large volume of data from
Postgres 7.4.2, and was wondering if there was a more cunning way to get
the data out of the DB...

This isn't a performance problem with any particular PgSQL operation;
it's more about finding a strategy for getting large volumes of related
tables out of the DB whilst preserving the relations between them.


Basically we have a number of tables, which are exposed as 2 public
views (say PvA and PvB). For each row in PvA, there are a number of
related rows in PvB (this number is arbitrary, which is one of the
reasons why it can't be expressed as additional columns in PvA - so we
really need 2 sets of tables - which leads to two sets of extract calls
- interwoven to associate PvA with PvB).
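
To make the shape concrete, simplified stand-ins for the schema would
look something like this (all names here are invented for the example;
the real views sit over several underlying tables):

    -- one row per entity
    CREATE TABLE pva_base (
        pva_key   integer PRIMARY KEY
        -- plus the attribute columns exposed through PvA
    );

    -- arbitrary number of rows per PvA row, linked back by pva_key
    CREATE TABLE pvb_base (
        pvb_key   integer PRIMARY KEY,
        pva_key   integer REFERENCES pva_base
        -- plus the attribute columns exposed through PvB
    );

    CREATE VIEW PvA AS SELECT * FROM pva_base;
    CREATE VIEW PvB AS SELECT * FROM pvb_base;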


The problem is that for extraction, we ultimately want to grab a row
from PvA, and then all the related rows from PvB and store them together
offline (e.g. in XML).

However, the number of rows at any time on the DB is likely to be in the
millions, with perhaps 25% of them being suitable for extraction at any
given batch run (ie several hundred thousand to several million).


Currently, the proposal is to grab several hundred rows from PvA (thus
avoiding issues with the resultset being very large), and then process
each of them by selecting the related rows in PvB (again, several
hundred rows at a time to avoid problems with large result sets).

So the algorithm is basically:


    Do

        Select the next 200 rows from PvA

        For each PvA row Do
            Write current PvA row as XML

            Do
                Select the next 200 rows from PvB

                For each PvB row Do
                    Write current PvB row as XML within the parent PvA XML Element
                End For
            While More Rows
        End For
    While More Rows
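
To make the "next 200 rows" steps concrete, the selects involved are
roughly along these lines (key names and the "suitable for extraction"
filter are simplified/invented here):

    -- next batch of parent rows; OFFSET advances by 200 each pass
    SELECT *
      FROM PvA
     WHERE extract_flag = true
     ORDER BY pva_key
     LIMIT 200 OFFSET 0;

    -- related rows for the current PvA row, likewise batched
    SELECT *
      FROM PvB
     WHERE pva_key = 12345
     ORDER BY pvb_key
     LIMIT 200 OFFSET 0;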


However, this has a fairly significant performance impact, and I was
wondering if there is a faster way to do it (something like taking a
dump of the tables so they can be worked on offline - but a basic dump
means we lose the 1:M relationships between PvA and PvB).
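
By "taking a dump" I mean something like the following (reusing the
invented names from above) - but then re-associating each PvB row with
its parent PvA row would still have to be done offline:

    -- all parent rows due for extraction
    SELECT * FROM PvA WHERE extract_flag = true ORDER BY pva_key;

    -- all of their child rows, in the same key order
    SELECT b.*
      FROM PvB b
      JOIN PvA a USING (pva_key)
     WHERE a.extract_flag = true
     ORDER BY pva_key, pvb_key;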


Are there any tools/tricks/tips with regard to extracting large volumes
of data across related tables from Postgres? It doesn't have to export
into XML; we can do post-processing on the extracted data as needed -
the important thing is to keep the relationship between PvA and PvB on a
row-by-row basis.


Many thanks,

Damien

