Re: Get master-detail relationship metadata - Mailing list pgsql-performance

From: Andy Colson
Subject: Re: Get master-detail relationship metadata
Date:
Msg-id: 4D4AC922.8050904@squeakycode.net
In response to: Get master-detail relationship metadata (Laszlo Nagy <gandalf@shopzeus.com>)
List: pgsql-performance
On 2/3/2011 5:40 AM, Laszlo Nagy wrote:
>
> Hi All,
>
> I'm working on a client program that iterates over master-detail
> relationships in a chain of nested loops.
>
> Pseudocode:
>
> for row_1 in table_1:
>     table_2 = get_details(row_1, "table2")
>     for row_2 in table_2:
>         row_3 = get_details(row_2, "table3")
>         .... etc.
>         process_data(row_1, row_2, row_3, ....)
>
> My task is to write the "get_details" iterator efficiently. The obvious
> way is to run a query in every get_details() call, but that is not
> efficient: we have relationships where one master has only a few
> details, so for 1 million master rows that would mean executing
> millions of SQL SELECT commands, degrading performance by orders of
> magnitude. My idea was that the iterator should pre-fetch and cache
> data for many master records at once. get_details() would then use the
> cached rows, reducing the number of SQL SELECT statements needed. I
> actually wrote the iterator, and it works well in some cases. For
> example:
>
> producers = get_rows("producer")
> for producer in producers:
>     products = get_details(producer, "product")
>     for product in products:
>         prices = get_details(product, "prices")
>         for price in prices:
>             process_product_price(producer, product, price)
>
> This works fine as long as one producer has no more than 1000 products
> and one product has no more than 10 prices; I can easily keep 10 000
> records in memory. The actual code executes about 15 SQL queries while
> iterating over 1 million rows. Compared to the original "obvious"
> method, that is roughly a 15-fold performance improvement.
>
> But sometimes it just doesn't work. If a producer has 1 million
> products, and one product has 100 prices, then it fails, because I
> cannot keep 100 million prices in memory. My program should somehow
> figure out how many rows it will get for one master, and choose
> between the cached and uncached methods accordingly.
>
> So here is the question: is there a way to get this information from
> PostgreSQL itself? I know that the query plan contains this
> information, but I'm not sure how to extract it. Should I run an
> EXPLAIN command of some kind, and parse the result as a string? For
> example:
>
> EXPLAIN select * from product where producer_id=1008;
> QUERY PLAN
> ----------------------------------------------------------------------
> Seq Scan on product (cost=0.00..1018914.74 rows=4727498 width=1400)
>    Filter: (producer_id = 1008)
> (2 rows)
>
>
> Then I could extract "rows=4727498" to get an idea of how many detail
> rows I'll get for the master.
>
> Is there any better way to do it? And how reliable is this?
>
>
> Thanks,
>
> Laszlo
>
>
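For reference, the batched prefetch iterator Laszlo describes might look
like the following minimal sketch. It assumes psycopg2 and the thread's
illustrative table names; get_details_batched() and its signature are
hypothetical, not code from the thread:

import psycopg2  # assumption: any DB-API driver would work similarly

def get_details_batched(conn, master_ids, detail_table, fk_column):
    # Fetch the detail rows for a whole batch of master ids in a single
    # SELECT, and return {master_id: [rows]} so the inner loops can be
    # served from memory instead of issuing one query per master.
    cache = dict((mid, []) for mid in master_ids)
    cur = conn.cursor()
    # Identifiers cannot be passed as SQL parameters, so they are
    # formatted in; fine here because they come from trusted code.
    # psycopg2 adapts the Python list to a PostgreSQL array for ANY().
    cur.execute(
        "SELECT * FROM {0} WHERE {1} = ANY(%s)".format(detail_table, fk_column),
        (list(master_ids),),
    )
    fk_pos = [d[0] for d in cur.description].index(fk_column)
    for row in cur:
        cache[row[fk_pos]].append(row)
    cur.close()
    return cache
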

One way would be to join the master to the detail, and write your code
expecting duplicates.

q = get_rows("select * from product inner join price ... "
             "order by productid, priceid")

lastprodid = None
for x in q:
    prodid = x.prodid            # read the id from the row, not the cursor
    if prodid != lastprodid:
        # we just saw the last row of the previous product;
        # prepare to move on to the next one
        lastprodid = prodid

... etc
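
Fleshed out, that pattern might look like the sketch below. It assumes
psycopg2 (a named cursor streams rows server-side, so memory use stays
flat however many prices come back); the DSN, column names, and the
process_product_price() body are placeholders, not from the thread:

import itertools
import psycopg2

def process_product_price(productid, priceid, amount):
    pass  # stand-in for the real per-row work

conn = psycopg2.connect("dbname=shop")  # hypothetical DSN
cur = conn.cursor(name="joined_scan")   # named => server-side cursor
cur.itersize = 5000                     # rows fetched per network round trip
cur.execute(
    "SELECT p.productid, pr.priceid, pr.amount "
    "FROM product p JOIN price pr ON pr.productid = p.productid "
    "ORDER BY p.productid, pr.priceid"
)

# The ORDER BY guarantees each product's rows arrive consecutively, so
# itertools.groupby does the "did the id change?" bookkeeping for us.
for productid, rows in itertools.groupby(cur, key=lambda r: r[0]):
    for _, priceid, amount in rows:
        process_product_price(productid, priceid, amount)

cur.close()
conn.close()

The same sort-then-group trick extends to deeper master-detail chains by
ordering on all the keys and nesting the groupby calls.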

 > Is there any better way to do it? And how reliable is this?

It makes the SQL really easy, but the code complex... so pick your poison.
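
As for pulling the row estimate out of the planner (the approach quoted
above): instead of parsing the text plan, EXPLAIN (FORMAT JSON),
available since PostgreSQL 9.0, is much easier to parse. A minimal
sketch, again assuming psycopg2 and the thread's table names:

import json
import psycopg2

def estimate_detail_rows(conn, producer_id):
    # Ask the planner for its row estimate without executing the query.
    cur = conn.cursor()
    cur.execute(
        "EXPLAIN (FORMAT JSON) SELECT * FROM product WHERE producer_id = %s",
        (producer_id,),
    )
    plan = cur.fetchone()[0]
    if isinstance(plan, str):  # some server/driver combos return plain text
        plan = json.loads(plan)
    cur.close()
    return plan[0]["Plan"]["Plan Rows"]

# e.g. prefetch only when the estimate fits comfortably in memory:
# if estimate_detail_rows(conn, 1008) <= 10000: use the cached method

Keep in mind the estimate comes from the statistics gathered by ANALYZE,
so it can be far off for skewed or stale data; treat it as a hint for
picking a strategy, not a guarantee.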

-Andy
