Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling - Mailing list pgsql-hackers

From: Robert Haas
Subject: Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling
Msg-id: CA+TgmoZtiWK4zD76hXD8Pw0CuwShYEW2jtGA6G9iT3d8rfSoiw@mail.gmail.com
In response to: Re: [HACKERS] GSOC'17 project introduction: Parallel COPY execution with errors handling (Nicolas Barbier <nicolas.barbier@gmail.com>)
List: pgsql-hackers
On Wed, Apr 12, 2017 at 1:18 PM, Nicolas Barbier
<nicolas.barbier@gmail.com> wrote:
> 2017-04-11 Robert Haas <robertmhaas@gmail.com>:
>> There's a nasty trade-off here between XID consumption (and the
>> aggressive vacuums it eventually causes) and preserving performance in
>> the face of errors - e.g. if you make k = 100,000 you consume 100x
>> fewer XIDs than if you make k = 1000, but you also have 100x the work
>> to redo (on average) every time you hit an error.
>
> You could make it dynamic: Commit the subtransaction even when not
> encountering any error after N lines (N starts out at 1), then double
> N and continue. When encountering an error, roll back the current
> subtransaction and re-insert all the known good rows that have
> been rolled back (plus maybe the erroneous row into a separate table
> or whatever) in one new subtransaction and commit; then reset N to 1
> and continue processing the rest of the file.
>
> That would work reasonably well whenever the ratio of erroneous rows
> is not extremely high: whether the erroneous rows are all clumped
> together, entirely randomly spread out over the file, or a combination
> of both.

Right.  I wouldn't suggest the exact algorithm you proposed; I think
you ought to vary between some lower limit >1, maybe 10, and some
upper limit, maybe 1,000,000, ratcheting up and down based on how
often you hit errors in some way that might not be as simple as
doubling.  But something along those lines.
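
To make that concrete, here's a rough sketch of that kind of ratcheting
loop.  None of these names (begin_subxact, insert_row, and so on) are
real PostgreSQL APIs -- they're hypothetical stand-ins for the actual
subtransaction and insertion machinery -- and the grow/shrink factors
are just placeholders for whatever policy gets chosen:

/*
 * Sketch of adaptive sub-batch sizing for COPY error handling.
 * begin_subxact(), commit_subxact(), rollback_subxact() and
 * insert_row() are hypothetical stand-ins, not PostgreSQL APIs.
 */
#include <stdbool.h>
#include <stdio.h>

#define MIN_BATCH 10        /* lower limit, per the numbers above */
#define MAX_BATCH 1000000   /* upper limit, per the numbers above */

typedef struct { bool bad; } Row;   /* bad = "this line fails to insert" */

static void begin_subxact(void)    { /* start a subtransaction */ }
static void commit_subxact(void)   { /* commit it; this consumes an XID */ }
static void rollback_subxact(void) { /* abort it */ }

/* Pretend insertion: fails only for rows flagged bad. */
static bool
insert_row(const Row *row)
{
    return !row->bad;
}

static void
copy_with_error_handling(const Row *rows, int nrows)
{
    int batch = MIN_BATCH;
    int i = 0;

    while (i < nrows)
    {
        int start = i;
        int bad = -1;

        begin_subxact();
        while (i < nrows && i - start < batch)
        {
            if (!insert_row(&rows[i]))
            {
                bad = i;
                break;
            }
            i++;
        }

        if (bad < 0)
        {
            /* Whole batch went in cleanly: commit and ratchet up. */
            commit_subxact();
            batch = (batch < MAX_BATCH / 2) ? batch * 2 : MAX_BATCH;
        }
        else
        {
            /*
             * Hit an error: throw the subtransaction away, redo the rows
             * we already know are good in a fresh subtransaction, skip
             * the bad line (log it, or route it to an error table), and
             * shrink the batch so further errors cost less rework.
             */
            rollback_subxact();

            begin_subxact();
            for (int j = start; j < bad; j++)
                (void) insert_row(&rows[j]);   /* known good, can't fail */
            commit_subxact();

            fprintf(stderr, "skipping bad line %d\n", bad + 1);
            i = bad + 1;

            batch = (batch / 10 < MIN_BATCH) ? MIN_BATCH : batch / 10;
        }
    }
}

int
main(void)
{
    Row rows[100] = {{false}};

    rows[7].bad = true;     /* a couple of bad lines for the demo */
    rows[42].bad = true;
    copy_with_error_handling(rows, 100);
    return 0;
}

Shrinking faster than we grow keeps the rework bounded when errors come
in bursts, while clean stretches of the file still ratchet back up
quickly toward the big batches that keep XID consumption low.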

>> If the data quality is poor (say, 50% of lines have errors) it's
>> almost impossible to avoid runaway XID consumption.
>
> Yup, that seems difficult to work around with anything similar to the
> proposed approach. So the docs might need to suggest not to insert a
> 300 GB file with 50% erroneous lines :-).

Yep.  But it does seem reasonably likely that someone might shoot
themselves in the foot anyway.  Maybe we just live with that.
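
To put a rough number on that 50% case (back-of-the-envelope, with
assumed figures): when every other line fails, the batch size can never
ratchet up, so between the aborted attempts and the retry
subtransactions we end up burning roughly one XID for every line or two
of input.  A 300 GB file at ~100 bytes per line is on the order of 3
billion lines, which already blows past the ~2 billion XIDs we can
consume before anti-wraparound vacuuming becomes mandatory.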

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


