BUG #15115: temporary AND unlogged tables for parallel db import - Mailing list pgsql-bugs

From PG Bug reporting form
Subject BUG #15115: temporary AND unlogged tables for parallel db import
Date
Msg-id 152120772359.1215.16174625790404871093@wrigleys.postgresql.org
Whole thread Raw
Responses Re: BUG #15115: temporary AND unlogged tables for parallel db import
List pgsql-bugs
The following bug has been logged on the website:

Bug reference:      15115
Logged by:           ew
Email address:      ennowein@googlemail.com
PostgreSQL version: 10.3
Operating system:   Centos 7
Description:

For fresh (large-) database import, we would like to be able to use
temporary tables which are also unlogged, here is why:
- Source data is organized in files per month (several million entries each)
of which we have multiple years
- Each source (-month) is independent of all others
- Each source month is imported as its own session with multiple preparation
operations (4 tables + 2 indexes) 
- Each session is completely independent of all others including the main db
status until the final insertion/commit step
- All session data is deleted immediately after insertion into the main db

A nice way to implement this is to use gnu parallel to start worker sessions
per month-input-file on a parallel machine (or several, actually). For this,
we implemented an sql script that uses the fact that temp tables are visible
only locally (otherwise, we'd have to name them all specifically).
We can tune the system quite well by the number of workers/cpu/RAM/disk
performance. 

Utilizing this parallel capability, we can reduce the import time by an
order of magnitude - from several days to a few hours. 
The "safety" aspect here is irrelevant in this stage as in case of a crash,
we'd simply restart the process from scratch.

Specifically WAL logging and checkpointing of the temporary data is very
cumbersome, slows the process, makes it actually less "reliable" (as it can
lead to a quasi-deadlock by causing traffic jams and swapping which in turn
practically stops the process).

Ironically, we can gain a little bit of optimization by unlogging the main
db - which is the only data we are actually interested in (we turn it back
on right after import of course).

There are some forum notes that we found which claim that "temporary" and
"unlogged" share the functionality of disabling WAL but according to what we
see that doesn't seem to be the case, the parallel import sessions generate
a huge (and process limiting) amount of logging data despite being
"temporary".

We would gladly share all data with the developers.



pgsql-bugs by date:

Previous
From: akshunj
Date:
Subject: Re: BUG #15018: yum install postgis24_96 failure
Next
From: Andrew Gierth
Date:
Subject: Re: BUG #15115: temporary AND unlogged tables for parallel db import