Thread: JOIN order, 15K, 15K, 7MM rows

JOIN order, 15K, 15K, 7MM rows

From
"Andrew Lazarus"
Date:
All three tables have the same integer key, and it's indexed.
 
I parenthesized the joins to do the two small tables first.
 
I'm running and INSERT INTO ... SELECT query with this join (one record added per record in join), 4 hours down and all I have to show for it is 100 recycled transaction logs. ?
 
If it ever returns, I can post the Explain output.

Re: JOIN order, 15K, 15K, 7MM rows

From
Josh Berkus
Date:
Andrew,

> I'm running and INSERT INTO ... SELECT query with this join (one record
> added per record in join), 4 hours down and all I have to show for it is
> 100 recycled transaction logs. ?
>
> If it ever returns, I can post the Explain output.

How about giving us the query and the regular EXPLAIN output right now?  I've
a feeling that you have an unconstrained join in the query somewhere.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: JOIN order, 15K, 15K, 7MM rows

From
andrew@pillette.com
Date:
EXPLAIN
INSERT INTO public.historical_price ( security_serial_id, [7 fields of proprietary data])
SELECT public.security_series.security_serial_id, [7 fields of data],
FROM obsolete.datadb_fix INNER JOIN (obsolete.calcdb INNER JOIN public.security_series ON
obsolete.calcdb.serial=public.security_series.legacy_calcdb_id)ON
obsolete.datadb_fix.id=public.security_series.legacy_calcdb_id;

datadb_fix is about 5.5MM records. The other two tables are about 15K records.

 Hash Join  (cost=1151.63..225863.54 rows=5535794 width=53)
   Hash Cond: ("outer".id = "inner".serial)
   ->  Seq Scan on datadb_fix  (cost=0.00..121867.99 rows=6729299 width=28)
   ->  Hash  (cost=1115.54..1115.54 rows=14438 width=25)
         ->  Hash Join  (cost=609.96..1115.54 rows=14438 width=25)
               Hash Cond: ("outer".legacy_calcdb_id = "inner".serial)
               ->  Seq Scan on security_series  (cost=0.00..247.40 rows=15540 width=13)
               ->  Hash  (cost=572.37..572.37 rows=15037 width=12)
                     ->  Seq Scan on calcdb  (cost=0.00..572.37 rows=15037 width=12)

pim_new-#         Table "obsolete.datadb_fix"
pim_new-#    Column    |       Type       | Modifiers
pim_new-# -------------+------------------+-----------
pim_new-#  serial      | integer          |
pim_new-#  id          | integer          |
pim_new-#  date        | date             |
[4 fields deleted]
pim_new-# Indexes: sb_data_pkey unique btree (id, date),
pim_new-#          datadb1_id btree (id),


pim_new=# \d obsolete.calcdb
                                      Table "obsolete.calcdb"
       Column       |         Type         |                       Modifiers

--------------------+----------------------+------------------------------------
-------------------
 serial             | integer              | not null default nextval('"calcdb_s
erial_seq"'::text)
[...30 proprietary fields]

Indexes: calcdb_serial_key unique btree (serial),
[...5 other indexes]

pim_new=# \d security_series
        Table "public.security_series"
       Column       |     Type     | Modifiers
--------------------+--------------+-----------
 security_serial_id | integer      | not null
 period             | character(1) | not null
 legacy_calcdb_id   | integer      |
Indexes: security_series_pkey primary key btree (security_serial_id, period),
         secseries_legacy_id_idx1 btree (legacy_calcdb_id)

The target table has three indexes on it, so I suppose that accounts for SOME extra time. I ended up cancelling the
query,running the select on a faster machine into an unindexed temp table, then using COPY out and in. That process
tookabout 2.5 hours total. Machine: Linux, PG 7.3.4, 1.1GHz, 768MB RAM, unfortunately running other stuff. The first
trywhich didn't finish in 24 hours was on Mac OS X Jaguar, PG 7.3.3, 1GHz, 256MB (please don't laugh). Yes, hardware
upgradesare coming, but I need to estimate how much more I have to squeeze out of the DB and client applications.