Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers
From | Michael Paquier |
---|---|
Subject | Re: Make COPY format extendable: Extract COPY TO format implementations |
Date | |
Msg-id | ZbHS439y-Bs6HIAR@paquier.xyz Whole thread Raw |
In response to | Re: Make COPY format extendable: Extract COPY TO format implementations (Sutou Kouhei <kou@clear-code.com>) |
Responses |
Re: Make COPY format extendable: Extract COPY TO format implementations
|
List | pgsql-hackers |
On Wed, Jan 24, 2024 at 11:17:26PM +0900, Sutou Kouhei wrote: > In <10025bac-158c-ffe7-fbec-32b42629121f@dunslane.net> > "Re: Make COPY format extendable: Extract COPY TO format implementations" on Wed, 24 Jan 2024 07:15:55 -0500, > Andrew Dunstan <andrew@dunslane.net> wrote: >> We've spent quite a lot of blood sweat and tears over the years to make COPY >> fast, and we should not sacrifice any of that lightly. Clearly. > I uploaded my benchmark script so that you can run the same > benchmark on your machine: > > https://gist.github.com/kou/be02e02e5072c91969469dbf137b5de5 Thanks, that saves time. I am attaching it to this email as well, for the sake of the archives if this link is removed in the future. > Could anyone try the benchmark with master and master+0001? Yep. It is one point we need to settle before deciding what to do with this patch set, and I've done so to reach my own conclusion. I have a rather good machine at my disposal in the cloud, so I did a few runs with HEAD and HEAD+0001, with PGDATA mounted on a tmpfs. Here are some results for the 10M row case, as these should be the least prone to noise, 5 runs each: master text 10M 1732.570 1684.542 1693.430 1687.696 1714.845 csv 10M 1729.113 1724.926 1727.414 1726.237 1728.865 bin 10M 1679.097 1677.887 1676.764 1677.554 1678.120 master+0001 text 10M 1702.207 1654.818 1647.069 1690.568 1654.446 csv 10M 1764.939 1714.313 1712.444 1712.323 1716.952 bin 10M 1703.061 1702.719 1702.234 1703.346 1704.137 Hmm. The point of contention in the patch is the change to use the CopyToOneRow callback in CopyOneRowTo(), as we go through it for each row and we should habe a worst-case scenario with a relation that has a small attribute size. The more rows, the more effect it would have. The memory context switches and the StringInfo manipulations are equally important, and there are a bunch of the latter, actually, with optimizations around fe_msgbuf. I've repeated a few runs across these two builds, and there is some variance and noise, but I am going to agree with your point that the effect 0001 cannot be seen. Even HEAD is showing some noise. So I am discarding the concerns I had after seeing the numbers you posted upthread. +typedef bool (*CopyToProcessOption_function) (CopyToState cstate, DefElem *defel); +typedef int16 (*CopyToGetFormat_function) (CopyToState cstate); +typedef void (*CopyToStart_function) (CopyToState cstate, TupleDesc tupDesc); +typedef void (*CopyToOneRow_function) (CopyToState cstate, TupleTableSlot *slot); +typedef void (*CopyToEnd_function) (CopyToState cstate); We don't really need a set of typedefs here, let's put the definitions in the CopyToRoutine struct instead. +extern CopyToRoutine CopyToRoutineText; +extern CopyToRoutine CopyToRoutineCSV; +extern CopyToRoutine CopyToRoutineBinary; All that should IMO remain in copyto.c and copyfrom.c in the initial patch doing the refactoring. Why not using a fetch function instead that uses a string in input? Then you can call that once after parsing the List of options in ProcessCopyOptions(). Introducing copyapi.h in the initial patch makes sense here for the TO and FROM routines. +/* All "text" and "csv" options are parsed in ProcessCopyOptions(). We may + * move the code to here later. */ Some areas, like this comment, are written in an incorrect format. + if (cstate->opts.csv_mode) + CopyAttributeOutCSV(cstate, colname, false, + list_length(cstate->attnumlist) == 1); + else + CopyAttributeOutText(cstate, colname); You are right that this is not worth the trouble of creating a different set of callbacks for CSV. This makes the result cleaner. + getTypeBinaryOutputInfo(attr->atttypid, &out_func_oid, &isvarlena); + fmgr_info(out_func_oid, &cstate->out_functions[attnum - 1]); Actually, this split is interesting. It is possible for a custom format to plug in a custom set of out functions. Did you make use of something custom for your own stuff? Actually, could it make sense to split the assignment of cstate->out_functions into its own callback? Sure, that's part of the start phase, but at least it would make clear that a custom method *has* to assign these OIDs to work. The patch implies that as a rule, without a comment that CopyToStart *must* set up these OIDs. I think that 0001 and 0005 should be handled first, as pieces independent of the rest. Then we could move on with 0002~0004 and 0006~0008. -- Michael
Attachment
pgsql-hackers by date: