Home > mailing lists

Re: bulk loading table via join of 2 large staging tables - Mailing list pgsql-general

From	Andrew Sullivan
Subject	Re: bulk loading table via join of 2 large staging tables
Date	December 31, 2013 02:20:49
Msg-id	20131231022035.GA94061@crankycanuck.ca Whole thread Raw
In response to	bulk loading table via join of 2 large staging tables (Seb <spluque@gmail.com>)
List	pgsql-general

Tree view

On Mon, Dec 30, 2013 at 07:53:06PM -0600, Seb wrote:
> Given that the process involves a full join, I'm not sure I can do this
> in chunks (say breaking down the files into smaller pieces). Any
> suggestions would be greatly appreciated.

First, what I would probably do is merge the two files outside
Postgres, run sort|uniq > outfile over the whole thing, and then just
bulk import that.  It's probably going to be faster.  But if the files
are too big, you'll run out of memory.

Are there possibly duplicates _within_ each file, or just between them?

If not within, then load the first file into the target table (well,
probably with the staging table just so you can get the timestamp
sorted out), then create the staging table as you suggest for the
second file, but create some indexes and do a WHERE NOT EXISTS to get
just the subset from that second table.  (This might be faster if you
update the staging table with the timestamp first, then create the
relevant multicolumn index.)

If there are dupes within the file, you can do the same thing except
that in the first step, you do SELECT DISTINCT instead.  Again, I
suspect a multicolumn index is going to be your friend.

Anyway, those are two ways I've done this sort of thing in the past.

Best regards,

A

--
Andrew Sullivan
ajs@crankycanuck.ca

pgsql-general by date:

From: David Johnston
Date: 31 December 2013, 02:04:50
Subject: Re: bulk loading table via join of 2 large staging tables

From: Joe Van Dyk
Date: 31 December 2013, 04:57:13
Subject: Re: Replication failed after stalling

Re: bulk loading table via join of 2 large staging tables - Mailing list pgsql-general

Previous

Next