bulk loading table via join of 2 large staging tables - Mailing list pgsql-general

From Seb
Subject bulk loading table via join of 2 large staging tables
Date
Msg-id 87mwjhokfx.fsf@kolob.subpolar.dyndns.org
Responses Re: bulk loading table via join of 2 large staging tables  (David Johnston <polobo@yahoo.com>)
Re: bulk loading table via join of 2 large staging tables  (Andrew Sullivan <ajs@crankycanuck.ca>)
Re: bulk loading table via join of 2 large staging tables  (Brent Wood <Brent.Wood@niwa.co.nz>)
List pgsql-general
Hi,

I have two large CSV files that need to be merged and loaded into a
single table in a PostgreSQL 9.3 database.  I thought I'd do this by
first staging the data from these files in two temporary tables:

---<--------------------cut here---------------start------------------->---
CREATE TEMPORARY TABLE rmc (
utc_year character varying(6),
utc_month character varying(4),
utc_day character varying(4),
utc_hour character varying(4),
utc_minute character varying(4),
utc_second character varying(8),
latitude numeric,
longitude numeric,
sog numeric,
cog numeric);

CREATE TEMPORARY TABLE gyro (
utc_year character varying(6),
utc_month character varying(4),
utc_day character varying(4),
utc_hour character varying(4),
utc_minute character varying(4),
utc_second character varying(8),
heading numeric);
---<--------------------cut here---------------end--------------------->---

And the target table in the database looks like this:

---<--------------------cut here---------------start------------------->---
                                                   Table "public.navigation_series"
        Column        |            Type             |                                    Modifiers
               

----------------------+-----------------------------+----------------------------------------------------------------------------------
 navigation_record_id | integer                     | not null default
nextval('navigation_series_navigation_record_id_seq'::regclass)
 project_id           | integer                     |
 platform_id          | integer                     |
 supplier_id          | integer                     |
 time                 | timestamp without time zone | not null
 longitude            | numeric                     |
 latitude             | numeric                     |
 speed_over_ground    | numeric                     |
 course_over_ground   | numeric                     |
 heading              | numeric                     |
Indexes:
    "navigation_series_pkey" PRIMARY KEY, btree (navigation_record_id)
    "navigation_series_project_id_platform_id_supplier_id_time_key" UNIQUE CONSTRAINT, btree (project_id, platform_id,
supplier_id,"time") 
Foreign-key constraints:
    "navigation_project_id_fkey" FOREIGN KEY (project_id) REFERENCES projects(project_id) ON UPDATE CASCADE ON DELETE
RESTRICT
    "navigation_series_platform_id_fkey" FOREIGN KEY (platform_id) REFERENCES platforms(platform_id) ON UPDATE CASCADE
ONDELETE RESTRICT 
    "navigation_series_supplier_id_fkey" FOREIGN KEY (supplier_id) REFERENCES suppliers(supplier_id) ON UPDATE CASCADE
ONDELETE RESTRICT 
---<--------------------cut here---------------end--------------------->---

Loading the temporary tables in psql was very quick (about 3 min; the
input files were 580 MB and 3.5 GB), using:

\copy gyro FROM 'gyro.csv' CSV
\copy rmc FROM 'rmc.csv' CSV
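
The staging tables are unindexed after the load.  If indexing the join
keys would help, I assume it would look something like this (untested):

-- untested sketch: index both staging tables on the shared time columns
CREATE INDEX ON rmc (utc_year, utc_month, utc_day, utc_hour, utc_minute, utc_second);
CREATE INDEX ON gyro (utc_year, utc_month, utc_day, utc_hour, utc_minute, utc_second);
ANALYZE rmc;
ANALYZE gyro;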

I then created a temporary view with:

CREATE TEMPORARY VIEW rmc_gyro AS
SELECT DISTINCT ON (project_id, platform_id, supplier_id, "time")
    2 AS project_id,
    1 AS platform_id,
    6 AS supplier_id,
    (utc_year || '-' || utc_month || '-' || utc_day || ' ' ||
    utc_hour || ':' || utc_minute || ':' || utc_second)::timestamp AS "time",
    longitude, latitude,
    sog AS speed_over_ground,
    cog AS course_over_ground,
    heading
FROM rmc
FULL JOIN gyro USING (utc_year, utc_month, utc_day, utc_hour, utc_minute, utc_second)
ORDER BY project_id, platform_id, supplier_id, "time";
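
The eventual load into the target table would then be roughly this
(a sketch; I have not been able to run it yet):

INSERT INTO navigation_series
    (project_id, platform_id, supplier_id, "time", longitude, latitude,
     speed_over_ground, course_over_ground, heading)
SELECT project_id, platform_id, supplier_id, "time", longitude, latitude,
       speed_over_ground, course_over_ground, heading
FROM rmc_gyro;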

But at this point even just selecting a few rows of data from the view
is too slow (I haven't seen the output after many hours).
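Plain EXPLAIN (without ANALYZE) does at least show the plan without
executing the query, e.g.:

EXPLAIN SELECT * FROM rmc_gyro LIMIT 10;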

Given that the process involves a full join, I'm not sure I can do this
in chunks (say, by breaking the files into smaller pieces).  Any
suggestions would be greatly appreciated.

Cheers,

--
Seb

