Home > mailing lists

Re: Parallel copy - Mailing list pgsql-hackers

From	Ants Aasma
Subject	Re: Parallel copy
Date	February 26, 2020 18:17:33
Msg-id	CANwKhkM+PvRTYjyUgc2YETbpB10cFkDazGhTqf65MVfZ+tGOyw@mail.gmail.com Whole thread Raw
In response to	Re: Parallel copy (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Responses	Re: Parallel copy Re: Parallel copy
List	pgsql-hackers

Tree view

On Tue, 25 Feb 2020 at 18:00, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Perhaps. I guess it'll depend on the CSV file (number of fields, ...),
> so I still think we need to do some measurements first. I'm willing to
> do that, but (a) I doubt I'll have time for that until after 2020-03,
> and (b) it'd be good to agree on some set of typical CSV files.

I agree that getting a nice varied dataset would be nice. Including
things like narrow integer only tables, strings with newlines and
escapes in them, extremely wide rows.

I tried to capture a quick profile just to see what it looks like.
Grabbed a random open data set from the web, about 800MB of narrow
rows CSV [1].

Script:
CREATE TABLE census (year int,age int,ethnic int,sex int,area text,count text);
COPY census FROM '.../Data8277.csv' WITH (FORMAT 'csv', HEADER true);

Profile:
# Samples: 59K of event 'cycles:u'
# Event count (approx.): 57644269486
#
# Overhead  Command   Shared Object       Symbol
# ........  ........  ..................
.......................................
#
    18.24%  postgres  postgres            [.] CopyReadLine
     9.23%  postgres  postgres            [.] NextCopyFrom
     8.87%  postgres  postgres            [.] NextCopyFromRawFields
     5.82%  postgres  postgres            [.] pg_verify_mbstr_len
     5.45%  postgres  postgres            [.] pg_strtoint32
     4.16%  postgres  postgres            [.] heap_fill_tuple
     4.03%  postgres  postgres            [.] heap_compute_data_size
     3.83%  postgres  postgres            [.] CopyFrom
     3.78%  postgres  postgres            [.] AllocSetAlloc
     3.53%  postgres  postgres            [.] heap_form_tuple
     2.96%  postgres  postgres            [.] InputFunctionCall
     2.89%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned_erms
     1.82%  postgres  libc-2.30.so        [.] __strlen_avx2
     1.72%  postgres  postgres            [.] AllocSetReset
     1.72%  postgres  postgres            [.] RelationPutHeapTuple
     1.47%  postgres  postgres            [.] heap_prepare_insert
     1.31%  postgres  postgres            [.] heap_multi_insert
     1.25%  postgres  postgres            [.] textin
     1.24%  postgres  postgres            [.] int4in
     1.05%  postgres  postgres            [.] tts_buffer_heap_clear
     0.85%  postgres  postgres            [.] pg_any_to_server
     0.80%  postgres  postgres            [.] pg_comp_crc32c_sse42
     0.77%  postgres  postgres            [.] cstring_to_text_with_len
     0.69%  postgres  postgres            [.] AllocSetFree
     0.60%  postgres  postgres            [.] appendBinaryStringInfo
     0.55%  postgres  postgres            [.] tts_buffer_heap_materialize.part.0
     0.54%  postgres  postgres            [.] palloc
     0.54%  postgres  libc-2.30.so        [.] __memmove_avx_unaligned
     0.51%  postgres  postgres            [.] palloc0
     0.51%  postgres  postgres            [.] pg_encoding_max_length
     0.48%  postgres  postgres            [.] enlargeStringInfo
     0.47%  postgres  postgres            [.] ExecStoreVirtualTuple
     0.45%  postgres  postgres            [.] PageAddItemExtended

So that confirms that the parsing is a huge chunk of overhead with
current splitting into lines being the largest portion. Amdahl's law
says that splitting into tuples needs to be made fast before
parallelizing makes any sense.

Regards,
Ants Aasma

[1]
https://www3.stats.govt.nz/2018census/Age-sex-by-ethnic-group-grouped-total-responses-census-usually-resident-population-counts-2006-2013-2018-Censuses-RC-TA-SA2-DHB.zip

pgsql-hackers by date:

From: Alastair Turner
Date: 26 February 2020, 18:16:11
Subject: Re: Parallel copy

From: Alex Malek
Date: 26 February 2020, 18:18:30
Subject: Re: bad wal on replica / incorrect resource manager data checksum inrecord / zfs

Re: Parallel copy - Mailing list pgsql-hackers

Previous

Next