RE: Disable WAL logging to speed up data loading - Mailing list pgsql-hackers

From tsunakawa.takay@fujitsu.com
Subject RE: Disable WAL logging to speed up data loading
Date
Msg-id OSAPR01MB29771A38CEFAD90E6F3CA2D0FE639@OSAPR01MB2977.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Disable WAL logging to speed up data loading  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Disable WAL logging to speed up data loading  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
From: Stephen Frost <sfrost@snowman.net>
> * tsunakawa.takay@fujitsu.com (tsunakawa.takay@fujitsu.com) wrote:
> > As Laurenz-san kindly replied, the database server refuses to start with a
> clear message.  So, it's similarly very clear what happens.  The user will never
> unknowingly resume operation with possibly corrupt data.
>
> No, instead they'll just have a completely broken system that has to be rebuilt or
> restored from a backup.  That doesn't strike me as a good result.

Your understanding is correct.  What I wanted to answer to confirm is that the behavior is clear: the server refuses to
start,and the user knows he/she should restore the database from backup. 


> > So, I understood the point boils down to elegance.  Could I ask what makes
> you feel ALTER TABLE UNLOGGED/LOGGED is (more) elegant?  I'm purely
> asking as a user.
>
> The impact is localized to those specific tables.  The rest of the system should
> come up cleanly and there won't be corruption, instead merely the lack of data
> in UNLOGGED tables.

So, I took your point as the ease and fast time of restore after a crash: the user just has to restore the lost table
datausing COPY FROM from files that was saved before the data loading job using COPY TO. 

In that sense, the backup and restoration of the whole database is an option for users when they have some instant
backupand restore feature available. 


> > (I don't want to digress, but if we consider the number of options for
> > wal_level as an issue, I feel it's not elegant to have separate
> > "replica" and "logical".)
>
> Do you know of a way to avoid having those two distinct levels while still writing
> only the WAL needed depending on if a system is doing logical replication or
> not..?  If you've got suggestions on how to eliminate one of those levels, I'm
> sure there would be interest in doing so.  I don't see the fact that we have
> those two levels as justification for adding another spelling of 'minimal'.

Sorry, I have almost no knowledge of logical replication implementation.  So, being ignorant of its intricacies, I have
feltlike as a user "Why do I have to set wal_level = logical, because streaming replication and logical replication are
bothreplication features?  If the implementation needs some additional WAL for logical replication, why doesn't the
serverautomatically emit the WAL when the target table of DML statements is in a publication?" 


> > The elegance of wal_level = none is that the user doesn't have to
> > remember to add ALTER TABLE to the data loading job when they add load
> > target tables/partitions.  If they build and use their own (shell)
> > scripts to load data, that won't be burdon or forgotten.  But what
> > would they have to do when they use ETL tools like Talend, Pentaho,
> > and Informatica Power Center?  Do those tools allow users to add
> > custom processing like ALTER TABLE to the data loading job steps for
> > each table?  (AFAIK, not.)
>
> I don't buy the argument that having to 'remember' to do an ALTER TABLE is
> such a burden when it means that the database will still be consistent and
> operational after a crash.

That depends on whether an instant backup and restore feature is at hand.  If the user is comfortable with it,
wal_level= none is easier and more attractive.  At least, I don't want the feature to be denied. 


> As for data loading tools, surely they support loading data into UNLOGGED
> tables and it's certainly not hard to have a script run around and flip those
> tables to LOGGED after they're loaded, and I do actually believe some of those
> tools support building processes of which one step could be such a command
> (I'm fairly confident Pentaho, in particular, does as I remember building such
> pipelines myself...).

Oh, Pentaho has such a feature, doesn't it?  But isn't it a separate step from the data output step?  Here, I assume
ETLtools allow users to compose a data loading job from multiple steps: data input, transformation, data output, etc.
Iguess the user can't directly incorporate ALTER TABLE into the data output step, and has to add separate custom steps
forALTER TABLE.  That's burdonsome and forgettable, I think. 


    Regards
Takayuki     Tsunakawa





pgsql-hackers by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: PoC/WIP: Extended statistics on expressions
Next
From: Alvaro Herrera
Date:
Subject: Re: shared memory stats: high level design decisions: consistency, dropping