Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers
From | Masahiko Sawada |
---|---|
Subject | Re: Make COPY format extendable: Extract COPY TO format implementations |
Date | |
Msg-id | CAD21AoALxEZz33NpcSk99ad_DT3A2oFNMa2KNjGBCMVFeCiUaA@mail.gmail.com Whole thread Raw |
In response to | Re: Make COPY format extendable: Extract COPY TO format implementations (Sutou Kouhei <kou@clear-code.com>) |
Responses |
Re: Make COPY format extendable: Extract COPY TO format implementations
Re: Make COPY format extendable: Extract COPY TO format implementations |
List | pgsql-hackers |
On Wed, Jan 24, 2024 at 11:17 PM Sutou Kouhei <kou@clear-code.com> wrote: > > Hi, > > In <10025bac-158c-ffe7-fbec-32b42629121f@dunslane.net> > "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 24 Jan 2024 07:15:55 -0500, > Andrew Dunstan <andrew@dunslane.net> wrote: > > > > > On 2024-01-24 We 03:11, Michael Paquier wrote: > >> On Wed, Jan 24, 2024 at 02:49:36PM +0900, Sutou Kouhei wrote: > >>> For COPY TO: > >>> > >>> 0001: This adds CopyToRoutine and use it for text/csv/binary > >>> formats. No implementation change. This just move codes. > >> 10M without this change: > >> > >> format,elapsed time (ms) > >> text,1090.763 > >> csv,1136.103 > >> binary,1137.141 > >> > >> 10M with this change: > >> > >> format,elapsed time (ms) > >> text,1082.654 > >> csv,1196.991 > >> binary,1069.697 > >> > >> These numbers point out that binary is faster by 6%, csv is slower by > >> 5%, while text stays around what looks like noise range. That's not > >> negligible. Are these numbers reproducible? If they are, that could > >> be a problem for anybody doing bulk-loading of large data sets. I am > >> not sure to understand where the improvement for binary comes from by > >> reading the patch, but perhaps perf would tell more for each format? > >> The loss with csv could be blamed on the extra manipulations of the > >> function pointers, likely. > > > > > > I don't think that's at all acceptable. > > > > We've spent quite a lot of blood sweat and tears over the years to make COPY > > fast, and we should not sacrifice any of that lightly. > > These numbers aren't reproducible. Because these benchmarks > executed on my normal machine not a machine only for > benchmarking. The machine runs another processes such as > editor and Web browser. > > For example, here are some results with master > (94edfe250c6a200d2067b0debfe00b4122e9b11e): > > Format,N records,Elapsed time (ms) > csv,10000000,1073.715 > csv,10000000,1022.830 > csv,10000000,1073.584 > csv,10000000,1090.651 > csv,10000000,1052.259 > > Here are some results with master + the 0001 patch: > > Format,N records,Elapsed time (ms) > csv,10000000,1025.356 > csv,10000000,1067.202 > csv,10000000,1014.563 > csv,10000000,1032.088 > csv,10000000,1058.110 > > > I uploaded my benchmark script so that you can run the same > benchmark on your machine: > > https://gist.github.com/kou/be02e02e5072c91969469dbf137b5de5 > > Could anyone try the benchmark with master and master+0001? > I've run a similar scenario: create unlogged table test (a int); insert into test select c from generate_series(1, 25000000) c; copy test to '/tmp/result.csv' with (format csv); -- generates 230MB file I've run it on HEAD and HEAD+0001 patch and here are the medians of 10 executions for each format: HEAD: binary 2930.353 ms text 2754.852 ms csv 2890.012 ms HEAD w/ 0001 patch: binary 2814.838 ms text 2900.845 ms csv 3015.210 ms Hmm I can see a similar trend that Suto-san had; the binary format got slightly faster whereas both text and csv format has small regression (4%~5%). I think that the improvement for binary came from the fact that we removed "if (cstate->opts.binary)" branches from the original CopyOneRowTo(). I've experimented with a similar optimization for csv and text format; have different callbacks for text and csv format and remove "if (cstate->opts.csv_mode)" branches. I've attached a patch for that. Here are results: HEAD w/ 0001 patch + remove branches: binary 2824.502 ms text 2715.264 ms csv 2803.381 ms The numbers look better now. I'm not sure these are within a noise range but it might be worth considering having different callbacks for text and csv formats. Regards, -- Masahiko Sawada Amazon Web Services: https://aws.amazon.com
Attachment
pgsql-hackers by date: