Hi, I have two tables each with nearly 300M rows. There is a 1:1 relationship between the two tables and they are almost always joined together in queries. The first table has many columns, the second has a foreign key to the primary key of the first table and one more column. It is expected that for every row in table1, there is a corresponding row in table2. We would like to just add the one column to the first table and drop the second table to allow us to index this extra column. This query would work after adding the column to the first table: UPDATE table1 SET new_column = table2.new_column FROM table2 WHERE table1.row_id = table2.row_id; However, this will take much too long, I have not successfully completed this on our staging server after running it for 3+ days. Any extended down time is not really an option. Further, there are many other tables with foreign keys to table1 so dropping it is fairly complicated and time consuming as the indexes and foreign keys would all have to be regenerated. Does anyone have any other ideas on how this can be done in the most efficient way possible? Thanks, -- Joshua Rubin -- Sent via pgsql-general mailing list (pgsql-general@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general
On Jul 13, 2010, at 1:46 PM, Joshua Rubin wrote:
> Hi,
>
> I have two tables each with nearly 300M rows. There is a 1:1
> relationship between the two tables and they are almost always joined
> together in queries. The first table has many columns, the second has
> a foreign key to the primary key of the first table and one more
> column. It is expected that for every row in table1, there is a
> corresponding row in table2. We would like to just add the one column
> to the first table and drop the second table to allow us to index this
> extra column.

Stupid question before you do this: is there a reason the design was split like this? For instance, if the table with the id and the single field gets updated a lot while the other table almost never changes, maybe this design isn't so bad.
Hi Ben,

> Stupid question before you do this: is there a reason the design was
> split like this? For instance, if the table with the id and the single
> field gets updated a lot while the other table almost never changes,
> maybe this design isn't so bad.

We just wanted to minimize changes to table1; that's why table2 was added as a separate table rather than as a new column. Now we need to index items in table1 by the column that lives in table2, because some queries are much too slow and would be sped up greatly by this change. In retrospect, we should have just added a column to table1. Now we have all the data; we just need to move it from table2 to table1.

Thanks,
--
Joshua Rubin
Joshua Rubin wrote:
> I have two tables each with nearly 300M rows. There is a 1:1
> relationship between the two tables and they are almost always joined
> together in queries. The first table has many columns, the second has
> a foreign key to the primary key of the first table and one more
> column. It is expected that for every row in table1, there is a
> corresponding row in table2. We would like to just add the one column
> to the first table and drop the second table to allow us to index this
> extra column.
>
> This query would work after adding the column to the first table:
> UPDATE table1 SET new_column = table2.new_column FROM table2 WHERE
> table1.row_id = table2.row_id;
>
> However, this will take much too long, I have not successfully
> completed this on our staging server after running it for 3+ days.

Can you get the query plan (EXPLAIN) of the update query? My guess is that the join cost scales superlinearly.

You might be able to chop this up into smaller UPDATEs by limiting the rows to be updated in each round by the primary key, e.g.:

UPDATE table1 SET new_column = table2.new_column FROM table2
WHERE
  table1.row_id = table2.row_id and
  table1.row_id >= 0e6 and table1.row_id < 1e6 and
  table2.row_id >= 0e6 and table2.row_id < 1e6;

for a moving row_id window. This has helped me in the past with a similar scenario (there, both tables were partitioned by the PK, but it should still work in the unpartitioned case).

-Julian
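[Editorial note: Julian's moving-window idea generalizes to a driver loop that walks the PK range in fixed-size batches, committing after each round. A minimal, self-contained sketch follows, using Python and SQLite purely for illustration; the table1/table2/row_id/new_column names are the thread's placeholders, and a real run against PostgreSQL would use a client library and the UPDATE ... FROM form quoted above.]

```python
import sqlite3

# Sketch of a moving-window batch update, with SQLite standing in for
# PostgreSQL. The tiny window size of 3 is only for illustration; on a
# 300M-row table a window of ~1e6 rows would be more realistic.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (row_id INTEGER PRIMARY KEY, new_column TEXT)")
cur.execute("CREATE TABLE table2 (row_id INTEGER PRIMARY KEY, new_column TEXT)")
cur.executemany("INSERT INTO table1 (row_id) VALUES (?)", [(i,) for i in range(10)])
cur.executemany("INSERT INTO table2 VALUES (?, ?)", [(i, f"val{i}") for i in range(10)])
conn.commit()

window = 3
max_id = cur.execute("SELECT max(row_id) FROM table1").fetchone()[0]
for lo in range(0, max_id + 1, window):
    hi = lo + window
    # Each round touches only one PK window, so the join stays small
    # and each transaction commits quickly.
    cur.execute(
        """UPDATE table1
           SET new_column = (SELECT t2.new_column FROM table2 t2
                             WHERE t2.row_id = table1.row_id)
           WHERE table1.row_id >= ? AND table1.row_id < ?""",
        (lo, hi),
    )
    conn.commit()
```

The commit per window is the point of the exercise: each batch is a short transaction, so progress is incremental and a failure loses only the current window.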
Hi Julian,
Sorry for the slow response. I think I will need to chop up the query somehow, but I have not yet found an efficient way to do that. row_id is the primary key in both tables, so that might work.
Here is the explain:
urls_jrubin_merged=# EXPLAIN UPDATE table1 SET row_id = table2.row_id FROM table2 WHERE table1.row_id = table2.row_id;
QUERY PLAN
--------------------------------------------------------------------------------------
 Merge Join  (cost=57257969.62..12983795937.97 rows=4308749788074 width=121)
   Merge Cond: (table2.row_id = table1.row_id)
   ->  Sort  (cost=15885110.79..16029412.85 rows=288604128 width=8)
         Sort Key: table2.row_id
         ->  Seq Scan on table2  (cost=0.00..2137231.26 rows=288604128 width=8)
   ->  Materialize  (cost=41372858.83..42105903.14 rows=293217725 width=121)
         ->  Sort  (cost=41372858.83..41519467.69 rows=293217725 width=121)
               Sort Key: table1.row_id
               ->  Seq Scan on todo  (cost=0.00..5922587.45 rows=293217725 width=121)
(9 rows)
Thanks,
--
Joshua Rubin
On Tue, Jul 13, 2010 at 5:08 PM, Julian Mehnle <julian@mehnle.net> wrote:
> Joshua Rubin wrote:
> > I have two tables each with nearly 300M rows. There is a 1:1
> > relationship between the two tables and they are almost always joined
> > together in queries. The first table has many columns, the second has
> > a foreign key to the primary key of the first table and one more
> > column. It is expected that for every row in table1, there is a
> > corresponding row in table2. We would like to just add the one column
> > to the first table and drop the second table to allow us to index this
> > extra column.
> >
> > This query would work after adding the column to the first table:
> > UPDATE table1 SET new_column = table2.new_column FROM table2 WHERE
> > table1.row_id = table2.row_id;
> >
> > However, this will take much too long, I have not successfully
> > completed this on our staging server after running it for 3+ days.
>
> Can you get the query plan (EXPLAIN) of the update query? My guess is the
> join cost scales superlinearly.
>
> You might be able to chop this up into smaller UPDATEs by limiting the
> rows to be updated in each round by the primary key. E.g.:
>
> UPDATE table1 SET new_column = table2.new_column FROM table2
> WHERE
>   table1.row_id = table2.row_id and
>   table1.row_id >= 0e6 and table1.row_id < 1e6 and
>   table2.row_id >= 0e6 and table2.row_id < 1e6;
>
> for a moving row_id window.
>
> This has helped me in the past with a similar scenario (where both tables
> were partitioned by the PK, but it would presumably still work in the
> unpartitioned case).
>
> -Julian
Hi Julian,
Breaking up the queries this way, I am able to update about 1,500 rows per minute, which would take over 100 days to complete, so I need to figure out why this is slow and whether there is a faster way.
Here is the explain from that:
QUERY PLAN
------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.00..39.78 rows=1 width=121)
   ->  Index Scan using table2_pkey on table2  (cost=0.00..19.88 rows=1 width=12)
         Index Cond: ((row_id >= $1) AND (row_id < $2))
   ->  Index Scan using table1_pkey on table1  (cost=0.00..19.90 rows=1 width=113)
         Index Cond: (table1.row_id = table2.row_id)
(5 rows)
Thanks,
--
Joshua Rubin
On Fri, Jul 16, 2010 at 1:05 PM, Joshua Rubin <jrubin@esoft.com> wrote:
> [quoted message and EXPLAIN output trimmed; see above]
Joshua Rubin wrote:
> I need to figure out why this is slow, and if there is any faster way.

Have you considered INSERTing into a third table that would replace both source tables when it's over? The target table would initially have no index.

Best regards,
--
Daniel

PostgreSQL-powered mail user agent and storage: http://www.manitou-mail.org
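[Editorial note: Daniel's third-table approach amounts to one sequential join into a fresh, index-free table, with indexes created only after the bulk load. A minimal sketch follows, again using Python and SQLite with the thread's placeholder names; in PostgreSQL this would be a CREATE TABLE ... AS SELECT or INSERT ... SELECT, and, as Joshua noted earlier, the many foreign keys referencing table1 would still have to be repointed at the new table.]

```python
import sqlite3

# Sketch of the "build a third table from the join" approach.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE table1 (row_id INTEGER PRIMARY KEY, payload TEXT)")
cur.execute("CREATE TABLE table2 (row_id INTEGER PRIMARY KEY, new_column TEXT)")
cur.executemany("INSERT INTO table1 VALUES (?, ?)", [(i, f"p{i}") for i in range(5)])
cur.executemany("INSERT INTO table2 VALUES (?, ?)", [(i, f"v{i}") for i in range(5)])

# One pass over both tables into a brand-new table. Because "merged" has
# no indexes yet, there is no per-row index maintenance during the load.
cur.execute(
    """CREATE TABLE merged AS
       SELECT t1.row_id, t1.payload, t2.new_column
       FROM table1 t1 JOIN table2 t2 ON t1.row_id = t2.row_id"""
)

# Indexes are built only after the bulk load completes.
cur.execute("CREATE UNIQUE INDEX merged_pkey ON merged (row_id)")
cur.execute("CREATE INDEX merged_new_column_idx ON merged (new_column)")
conn.commit()
```

Deferring index creation is what makes this faster than a row-by-row UPDATE: building an index once over the finished table is a single sort rather than hundreds of millions of incremental insertions.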