Thread: Help with extracting large volumes of records across related tables
Hi All,

I am having a performance problem extracting a large volume of data from Postgres 7.4.2, and was wondering if there was a more cunning way to get the data out of the DB...

This isn't a performance problem with any particular PgSQL operation; it's more a strategy for getting large volumes of related tables out of the DB whilst preserving the relations between them.

Basically we have a number of tables, which are exposed as 2 public views (say PvA and PvB). For each row in PvA, there are a number of related rows in PvB (this number is arbitrary, which is one of the reasons why it can't be expressed as additional columns in PvA - so we really need 2 sets of tables - which leads to two sets of extract calls - interwoven to associate PvA with PvB).

The problem is that for extraction, we ultimately want to grab a row from PvA, then all the related rows from PvB, and store them together offline (e.g. in XML).

However, the number of rows at any time on the DB is likely to be in the millions, with perhaps 25% of them being suitable for extraction at any given batch run (i.e. several hundred thousand to several million).

Currently, the proposal is to grab several hundred rows from PvA (thus avoiding issues with the result set being very large), and then process each of them by selecting the related rows in PvB (again, several hundred rows at a time to avoid problems with large result sets).

So the algorithm is basically:

  Do
    Select the next 200 rows from PvA
    For each PvA row Do
      Write current PvA row as XML
      Do
        Select the next 200 rows from PvB
        For each PvB row Do
          Write current PvB row as XML within the parent PvA XML Element
        End For
      While More Rows
    End For
  While More Rows

However, this has a fairly significant performance impact, and I was wondering if there was a faster way to do it (something like taking a dump of the tables so they can be worked on offline - but a basic dump means we have lost the 1:M relationships between PvA and PvB).

Are there any tools/tricks/tips with regards to extracting large volumes of data across related tables from Postgres? It doesn't have to export into XML; we can do post-processing on the extracted data as needed - the important thing is to keep the relationship between PvA and PvB on a row-by-row basis.

Many thanks,

Damien
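For illustration, here is a minimal sketch of the batched approach described above, assuming psycopg2, made-up key columns (PvA.id, PvB.pva_id) and LIMIT/OFFSET paging for "select the next 200 rows"; it makes the cost visible - one query per 200 PvA rows, plus at least one extra round trip per PvA row for its PvB children:

import psycopg2

conn = psycopg2.connect("host=localhost dbname=mydb user=me")   # placeholder DSN
cur = conn.cursor()

offset_a = 0
while True:
    cur.execute("SELECT id FROM PvA ORDER BY id LIMIT 200 OFFSET %s", (offset_a,))
    pva_rows = cur.fetchall()
    if not pva_rows:
        break
    for (pva_id,) in pva_rows:
        print('<pva id="%d">' % pva_id)                 # write PvA row as XML
        offset_b = 0
        while True:
            cur.execute("SELECT id FROM PvB WHERE pva_id = %s ORDER BY id "
                        "LIMIT 200 OFFSET %s", (pva_id, offset_b))
            pvb_rows = cur.fetchall()
            if not pvb_rows:
                break
            for (pvb_id,) in pvb_rows:
                print('  <pvb id="%d"/>' % pvb_id)      # write PvB row inside parent
            offset_b += len(pvb_rows)
        print('</pva>')
    offset_a += len(pva_rows)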
On 13/09/2004 12:38 Damien Dougan wrote:
> [snip]
> Are there any tools/tricks/tips with regards to extracting large volumes
> of data across related tables from Postgres? It doesn't have to export
> into XML; we can do post-processing on the extracted data as needed -
> the important thing is to keep the relationship between PvA and PvB on a
> row-by-row basis.

Have you considered using cursors?

--
Paul Thomas
+------------------------------+-------------------------------------------+
| Thomas Micro Systems Limited | Software Solutions for Business           |
| Computer Consultants         | http://www.thomas-micro-systems-ltd.co.uk |
+------------------------------+-------------------------------------------+
Re: Help with extracting large volumes of records across related tables
From: Pierre-Frédéric Caillaud
There's a very simple solution using cursors. As an example:

create table categories ( id serial primary key, name text );
create table items ( id serial primary key, cat_id integer references categories(id), name text );
create index items_cat_idx on items( cat_id );

insert stuff...

select * from categories;
 id |   name
----+----------
  1 | tools
  2 | supplies
  3 | food
(3 rows)

select * from items;
 id | cat_id |     name
----+--------+--------------
  1 |      1 | hammer
  2 |      1 | screwdriver
  3 |      2 | nails
  4 |      2 | screws
  5 |      1 | wrench
  6 |      2 | bolts
  7 |      2 | cement
  8 |      3 | beer
  9 |      3 | burgers
 10 |      3 | french fries
(10 rows)

Now (supposing you use Python) you use the extremely simple sample program below:

import psycopg

db = psycopg.connect("host=localhost dbname=rencontres user=rencontres password=.........")

# Simple. Let's make some cursors.
cursor = db.cursor()
cursor.execute( "BEGIN;" )
cursor.execute( "declare cat_cursor no scroll cursor without hold for select * from categories order by id for read only;" )
cursor.execute( "declare items_cursor no scroll cursor without hold for select * from items order by cat_id for read only;" )

# set up some generators
def qcursor( cursor, psql_cursor_name ):
    while True:
        cursor.execute( "fetch 2 from %s;" % psql_cursor_name )
        if not cursor.rowcount:
            break
        # print "%s fetched %d rows." % (psql_cursor_name, cursor.rowcount)
        for row in cursor.dictfetchall():
            yield row
    print "%s exhausted." % psql_cursor_name

# use the generators
categories = qcursor( cursor, "cat_cursor" )
items = qcursor( cursor, "items_cursor" )

current_item = items.next()
for cat in categories:
    print "Category : ", cat

    # if no items (or all items in category are done) skip to next category
    if cat['id'] < current_item['cat_id']:
        continue

    # case of items without category (should not happen)
    while cat['id'] > current_item['cat_id']:
        current_item = items.next()

    while current_item['cat_id'] == cat['id']:
        print "\t", current_item
        current_item = items.next()

It produces the following output:

Category :  {'id': 1, 'name': 'tools'}
        {'cat_id': 1, 'id': 1, 'name': 'hammer'}
        {'cat_id': 1, 'id': 2, 'name': 'screwdriver'}
        {'cat_id': 1, 'id': 5, 'name': 'wrench'}
Category :  {'id': 2, 'name': 'supplies'}
        {'cat_id': 2, 'id': 3, 'name': 'nails'}
        {'cat_id': 2, 'id': 4, 'name': 'screws'}
        {'cat_id': 2, 'id': 6, 'name': 'bolts'}
        {'cat_id': 2, 'id': 7, 'name': 'cement'}
Category :  {'id': 3, 'name': 'food'}
        {'cat_id': 3, 'id': 8, 'name': 'beer'}
        {'cat_id': 3, 'id': 9, 'name': 'burgers'}
        {'cat_id': 3, 'id': 10, 'name': 'french fries'}

This simple code, with "fetch 1000" instead of "fetch 2", dumps a database of several million rows, where each category generally contains 1 but often 2-4 items, at a speed of about 10,000 items/s.

Satisfied?
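As an aside for anyone adapting this today: the program above is Python 2 with the old psycopg module. Below is a rough equivalent sketch using Python 3 and psycopg2 named (server-side) cursors, which issue the DECLARE/FETCH behind the scenes; the connection string is a placeholder, and the end-of-stream handling uses a None sentinel instead of letting StopIteration escape when the items run out:

import psycopg2
import psycopg2.extras

conn = psycopg2.connect("host=localhost dbname=mydb user=me")   # placeholder DSN

def stream(name, query, batch=1000):
    # A named cursor is a server-side cursor; iterating it fetches
    # `itersize` rows per network round trip, so memory use stays flat.
    cur = conn.cursor(name=name, cursor_factory=psycopg2.extras.RealDictCursor)
    cur.itersize = batch
    cur.execute(query)
    for row in cur:
        yield row

categories = stream("cat_cursor", "SELECT * FROM categories ORDER BY id")
items = stream("items_cursor", "SELECT * FROM items ORDER BY cat_id")

current_item = next(items, None)          # None marks end of the items stream
for cat in categories:
    print("Category:", cat)
    # skip orphan items that sort before this category (should not happen with the FK)
    while current_item is not None and current_item["cat_id"] < cat["id"]:
        current_item = next(items, None)
    # emit every item belonging to this category
    while current_item is not None and current_item["cat_id"] == cat["id"]:
        print("\t", current_item)
        current_item = next(items, None)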
Re: Help with extracting large volumes of records across related tables
From: Pierre-Frédéric Caillaud
My simple Python program dumps 1653992 items in 1654000 categories in:

real    3m12.029s
user    1m36.720s
sys     0m2.220s

It was running on the same machine as PostgreSQL (Athlon XP 2500). I Ctrl-C'd it before it had dumped the whole database, but you get the idea.

If you don't know Python and generators, have a look!
Pierre-Frederic, Paul,

Thanks for your fast response (especially for the Python code and performance figure) - I'll chase this up as a solution - looks most promising!

Cheers,

Damien
Re: Help with extracting large volumes of records across related tables
From: Pierre-Frédéric Caillaud
Thanks for the thanks!

Generally, when grouping stuff together, it is a good idea to have two sorted lists and to scan them simultaneously. I have already used this solution several times outside of Postgres, and it worked very well (it was with Berkeley DB and there were 3 lists to scan in order). The fact that Python can very easily virtualize these lists using generators makes it possible to do it without consuming too much memory.

> Pierre-Frederic, Paul,
>
> Thanks for your fast response (especially for the Python code and
> performance figure) - I'll chase this up as a solution - looks most
> promising!
>
> Cheers,
>
> Damien
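The "two sorted lists scanned simultaneously" idea above is essentially a merge join done on the client. As a generic sketch (plain Python, made-up data, no Postgres specifics), the grouping step can be written once and reused for any pair of streams sorted by the join key:

def merge_groups(parents, children, parent_key, child_key):
    """Yield (parent, [children]) pairs; both inputs must be sorted by key."""
    children = iter(children)
    child = next(children, None)
    for parent in parents:
        pk = parent_key(parent)
        group = []
        # discard orphans that sort before this parent (should not happen with a FK)
        while child is not None and child_key(child) < pk:
            child = next(children, None)
        # collect all children matching this parent
        while child is not None and child_key(child) == pk:
            group.append(child)
            child = next(children, None)
        yield parent, group

# Example with plain lists standing in for the two sorted cursors:
cats = [{"id": 1, "name": "tools"}, {"id": 2, "name": "supplies"}]
its = [{"cat_id": 1, "name": "hammer"}, {"cat_id": 2, "name": "nails"}]
for cat, group in merge_groups(cats, its,
                               parent_key=lambda c: c["id"],
                               child_key=lambda i: i["cat_id"]):
    print(cat["name"], [i["name"] for i in group])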
Damien Dougan wrote:
> Basically we have a number of tables, which are exposed as 2 public
> views (say PvA and PvB). For each row in PvA, there are a number of
> related rows in PvB (this number is arbitrary, which is one of the
> reasons why it can't be expressed as additional columns in PvA - so we
> really need 2 sets of tables - which leads to two sets of extract calls
> - interwoven to associate PvA with PvB).
>
> Are there any tools/tricks/tips with regards to extracting large volumes
> of data across related tables from Postgres? It doesn't have to export
> into XML; we can do post-processing on the extracted data as needed -
> the important thing is to keep the relationship between PvA and PvB on a
> row-by-row basis.

Just recently I had to come up with an alternative to MSSQL's "SELECT ... FOR XML" for some five-level nested docs, and it turned out to be faster (!) and easier to understand:

Use SQL to organize each of the row types into a single text field, plus a single key field, as well as any filter fields you need. Sort the union, and have the reading process break the rows into documents.

For example, if PvA has key (account_id, order_id) and fields (order_date, ship_date), and PvB has key (order_id, product_id) and fields (order_qty, back_order):

CREATE VIEW PvABxml AS
SELECT account_id::text || order_id::text AS quay
      ,'order_date="' || order_date::text
        || '" ship_date="' || ship_date::text || '"' AS info
      ,ship_date
  FROM PvA

UNION ALL

SELECT account_id::text || order_id::text || product_id::text
      ,'order_qty="' || order_qty::text || '"'
      ,ship_date
  FROM PvA JOIN PvB USING (order_id)

Then:

SELECT quay, info
  FROM PvABxml
 WHERE ship_date = '...'
 ORDER BY quay

gives you a stream of info in the (parent, child, child, ... parent, child, child, ...) order you want, which assembles very easily into XML documents. If you need to pick out, say, orders where there are backordered items, you probably need to work with a temp table with which to prefilter.
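For completeness, here is a rough sketch of what the reading process might look like (plain Python; the tag names, and the assumption that the key components are fixed width so that string-prefix matching is unambiguous, are illustrative additions rather than part of the post above):

def stream_to_xml(rows):
    """rows: iterable of (quay, info) tuples in ORDER BY quay order,
    where each parent row's quay is a prefix of its children's quays."""
    parent_quay = None
    for quay, info in rows:
        if parent_quay is None or not quay.startswith(parent_quay):
            # a key not prefixed by the current parent key starts a new document
            if parent_quay is not None:
                yield "</order>"
            parent_quay = quay
            yield "<order %s>" % info
        else:
            # child row: its quay is the parent quay plus the product_id
            yield "  <item %s/>" % info
    if parent_quay is not None:
        yield "</order>"

# Example with literal rows standing in for the query result:
rows = [
    ("1001",   'order_date="2004-09-13" ship_date="2004-09-15"'),
    ("100177", 'order_qty="3"'),
    ("100182", 'order_qty="1"'),
    ("1002",   'order_date="2004-09-14" ship_date="2004-09-16"'),
]
for line in stream_to_xml(rows):
    print(line)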