Thread: Get master-detail relationship metadata

From: Laszlo Nagy
Date:

Hi All,

I'm working on a client program that iterates over master-detail
relationships in a loop chain.

Pseudo code:

for row_1 in table_1:
    table_2 = get_details(row_1, "table2")
    for row_2 in table_2:
        table_3 = get_details(row_2, "table3")
        for row_3 in table_3:
            # .... etc.
            process_data(row_1, row_2, row_3, ....)

My task is to write the "get_details" iterator efficiently. The obvious
way to do it is to query the details in every get_details() call, but
that is not efficient: we have relationships where one master has only a
few details, so for 1 million master rows this would execute millions of
SQL SELECT commands, degrading performance by orders of magnitude. My
idea was that the iterator should pre-fetch and cache data for many
master records at once; get_details() would then use the cached rows,
reducing the number of SQL SELECT statements needed. I wrote the
iterator, and it works fine in some cases. For example:

producers = get_rows("producer")
for producer in producers:
     products = get_details(producer,"product")
     for product in products:
         prices = get_details(product,"prices")
         for price in prices:
             process_product_price(producer,product,price)

This works fine if one producer has no more than 1000 products and one
product has no more than 10 prices: I can easily keep 10 000 records in
memory. The actual code executes about 15 SQL queries while iterating
over 1 million rows. Compared to the original "obvious" method,
performance is roughly 15 times better.
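For concreteness, the batched pre-fetch can be sketched roughly like this. This is an illustrative sketch, not the actual code: build_detail_cache, fetch_details_for, and the "id"/"master_id" column names are placeholders.

```python
from collections import defaultdict

def build_detail_cache(master_rows, fetch_details_for, key="id", fkey="master_id"):
    """Group the details for a whole batch of masters by their master key.

    fetch_details_for(ids) stands in for one bulk query such as
        SELECT * FROM detail WHERE master_id = ANY(%s)
    so the batch costs one SELECT instead of one SELECT per master row.
    """
    cache = defaultdict(list)
    ids = [row[key] for row in master_rows]
    for detail in fetch_details_for(ids):
        cache[detail[fkey]].append(detail)
    return cache

def get_details(master_row, cache, key="id"):
    # Serve the details from the cache; no SQL is issued here.
    return cache.get(master_row[key], [])
```

With a batch of, say, 1000 masters per bulk query, iterating a million masters costs on the order of a thousand SELECTs instead of a million.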

But sometimes it just doesn't work. If a producer has 1 million
products, and one product has 100 prices, then it fails, because I
cannot keep 100 million prices in memory. My program should somehow
figure out how many rows it will get for one master, and choose between
the cached and uncached methods accordingly.

So here is the question: is there a way to get this information from
PostgreSQL itself? I know that the query plan contains information about
this, but I'm not sure how to extract it. Should I run an EXPLAIN
command and parse the result as a string? For example:

EXPLAIN select * from product where producer_id=1008;
                               QUERY PLAN
----------------------------------------------------------------------
  Seq Scan on product  (cost=0.00..1018914.74 rows=4727498 width=1400)
    Filter: (producer_id = 1008)
(2 rows)


Then I could extract "rows=4727498" to get an idea of how many detail
rows I'll get for the master.
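If you go the string-scraping route, a small regex on the top plan line is enough. A sketch (and note that with PostgreSQL 9.0 or later you could instead run EXPLAIN (FORMAT JSON) and read the "Plan Rows" field without any text parsing):

```python
import re

def estimated_rows(explain_output):
    """Pull the planner's row estimate (rows=N) out of EXPLAIN text output.

    This is only the planner's estimate, derived from ANALYZE statistics;
    it can be far off for skewed data or stale statistics.
    """
    m = re.search(r"\brows=(\d+)", explain_output)
    return int(m.group(1)) if m else None

plan = "Seq Scan on product  (cost=0.00..1018914.74 rows=4727498 width=1400)"
```

Here estimated_rows(plan) yields the 4727498 from the sample plan above.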

Is there any better way to do it? And how reliable is this?


Thanks,

    Laszlo


Re: Get master-detail relationship metadata

From: Andy Colson
Date:
On 2/3/2011 5:40 AM, Laszlo Nagy wrote:

One way would be to join the master to the detail, and write your code
expecting duplicates.

q = get_rows("select * from product inner join price ... "
             "order by productid, priceid")

lastprodid = None
for x in q:
    prodid = x.prodid
    if prodid != lastprodid:
        # we saw the last row of the previous product;
        # prepare to move to the next product
        lastprodid = prodid

... etc

 > Is there any better way to do it? And how reliable is this?

It makes the SQL really easy, but the code complex... so pick your poison.
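For what it's worth, Python's itertools.groupby can hide most of that last-prodid bookkeeping, assuming the joined rows really do arrive sorted by the master key (which the ORDER BY guarantees). A sketch with made-up in-memory rows standing in for the query result:

```python
from itertools import groupby
from operator import itemgetter

# Joined product/price rows, already ordered by productid as in the query above.
rows = [
    {"productid": 1, "priceid": 10},
    {"productid": 1, "priceid": 11},
    {"productid": 2, "priceid": 20},
]

# One (productid, prices) pair per product; the duplicate master columns
# produced by the join are collapsed by groupby.
grouped = [(prodid, list(prices))
           for prodid, prices in groupby(rows, key=itemgetter("productid"))]
```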

-Andy