Thread: copy vs. C function
I was experimenting with a few different methods of taking a line of text, parsing it into a set of fields, and then getting that info into a table.

The first method involved writing a C program to parse a file, parse the lines, and output newly-formatted lines in a format that postgresql's COPY function can use. End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse, output new data to new file -- 4 seconds; COPY new file -- 10 seconds).

The next approach I took was to write a C function in postgresql to parse a single TEXT datum into an array of C strings, and then use BuildTupleFromCStrings. There are 8 columns involved. Eliding the time it takes to COPY the (raw) file into a temporary table, this method took 120 seconds, give or take.

The difference was /quite/ a surprise to me. What is the probability that I am doing something very, very wrong?

NOTE: the code that does the parsing is actually the same, line-for-line; the only difference is whether the routine is called by a postgresql function or by a C program via main, so obviously the overhead is elsewhere.
NOTE #2: We are talking about approximately 2.6 million lines.

I was testing:

\copy some_table from 'some_file.csv' with csv
vs.
insert into some_table select (some_func(line)).* from some_temp_table;

where some_func had been defined with (one) IN TEXT and (8) OUT params of varying types.

PostgreSQL 9.1.1 on Linux, x86_64

--
Jon
On 12/11/2011 09:27 AM, Jon Nelson wrote:
> The first method involved writing a C program to parse a file, parse
> the lines and output newly-formatted lines in a format that
> postgresql's COPY function can use.
> End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
> output new data to new file -- 4 seconds, COPY new file -- 10
> seconds).

Why not `COPY tablename FROM '/tmp/myfifo'`? Just connect your import program up to a named pipe (fifo) created with `mknod myfifo p`, either by redirecting stdout or by open()ing the fifo for write. Then have Pg read from the fifo. You'll save a round of disk writes and reads.

> The next approach I took was to write a C function in postgresql to
> parse a single TEXT datum into an array of C strings, and then use
> BuildTupleFromCStrings. There are 8 columns involved.
> Eliding the time it takes to COPY the (raw) file into a temporary
> table, this method took 120 seconds, give or take.
>
> The difference was /quite/ a surprise to me. What is the probability
> that I am doing something very, very wrong?

Have a look at how COPY does it within the Pg sources, see if that's any help. I don't know enough about Pg's innards to answer this one beyond that suggestion, sorry.

--
Craig Ringer
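For the curious, a minimal sketch of the named-pipe variant, assuming the parser is a standalone C program. The path, the placeholder row, and the commented-out emit_csv_lines() helper are invented for illustration; the matching server-side command would be something like COPY some_table FROM '/tmp/myfifo' WITH CSV.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int
main(void)
{
    const char *path = "/tmp/myfifo";   /* assumed path */
    FILE       *out;

    /* create the fifo; ignore failure if it already exists */
    mkfifo(path, 0600);

    /* opening for write blocks until the COPY opens the read end */
    out = fopen(path, "w");
    if (out == NULL)
        return 1;

    /* emit_csv_lines(out);  <- the existing parse-and-print code would go here */
    fprintf(out, "1,192.168.0.1,example\n");   /* placeholder row */

    /* closing the write end delivers EOF, which ends the COPY */
    fclose(out);
    return 0;
}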
Start a transaction before the first insert and commit it after the last one and it will be much better, but I believe that the copy code path is optimized to perform better than any set of queries can, even in a single transaction.

Sent from my iPhone

On Dec 10, 2011, at 5:27 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> [...]
On Sat, Dec 10, 2011 at 8:32 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> Why not `COPY tablename FROM '/tmp/myfifo'`?

If I were to do this in any sort of production environment, that's exactly what I would do. I was much more concerned about the /huge/ difference -- 10 seconds for COPY and 120 seconds for (INSERT INTO / CREATE TABLE AS / whatever).

> Have a look at how COPY does it within the Pg sources, see if that's any
> help. I don't know enough about Pg's innards to answer this one beyond
> that suggestion, sorry.

Ack. Regarding a subsequent email, I was using full transactions.

--
Jon
On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> [...]

Let me throw out an interesting third method I've been using to parse delimited text files that might be useful in your case. This is useful when parsing text that is bad csv: where values are not escaped, there are incomplete and/or missing records, or there is a huge number of columns that you want to rotate into a more normalized structure based on column position.

1. Import the data into a single-column (containing the entire line) staging table, feeding the COPY parser a bogus delimiter.
2. 'Parse' the record with regexp_split_to_array (maybe in a plpgsql function).
3. Either loop the array (in 9.1 use the FOR-IN-ARRAY construct), or, if you can work it into your problem, INSERT/SELECT, expanding the array with a trick like the one used in information_schema._pg_expandarray so you can hook logic on the array (column position).

merlin
On Mon, Dec 12, 2011 at 10:38 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Let me throw out an interesting third method I've been using to parse
> delimited text files that might be useful in your case. [...]
>
> 1. Import the data into a single-column (containing the entire line)
> staging table, feeding the COPY parser a bogus delimiter.
> 2. 'Parse' the record with regexp_split_to_array (maybe in a plpgsql function).
> 3. Either loop the array (in 9.1 use the FOR-IN-ARRAY construct), or, if
> you can work it into your problem, INSERT/SELECT, expanding the array
> with a trick like the one used in information_schema._pg_expandarray so
> you can hook logic on the array (column position).

If you replace [2] with my C function (which can process all of the data, *postgresql overhead not included*, in about 1 second) then that's what I did. It returns a composite type, making [3] unnecessary.

I know it's not the parsing, so I started a time-honored debugging approach: I returned early. Is the function-call overhead that high? That's outrageously high. What else could it be? Is returning a composite type outrageously expensive?

So here is what I did: I modified the code so that it immediately returns NULL. Result: 2 seconds.
Extract arguments, allocate temporary work buffer: another 0.5 seconds.
Add parsing: another 1.5 seconds [total: 4.1 seconds].
And so on...

Two of the items require base conversion, so calling strtol (twice) and snprintf (twice) adds 6 seconds, and formatting one of the items as an array (a strcpy and a strcat) adds another 1.5 seconds, for a total of 11.5.

The only thing I have left are these statements:

get_call_result_type
TupleDescGetAttInMetadata
BuildTupleFromCStrings
HeapTupleGetDatum
and finally PG_RETURN_DATUM

It turns out that:
get_call_result_type adds 43 seconds [total: 54],
TupleDescGetAttInMetadata adds 19 seconds [total: 73],
BuildTupleFromCStrings accounts for 43 seconds [total: 116].
So those three functions account for 90% of the total time spent. What alternatives exist? Do I have to call get_call_result_type /every time/ through the function? -- Jon
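For readers following along, here is a rough sketch (not Jon's actual code) of the call pattern being timed: one TEXT argument in, an eight-column composite out, with the result-type lookup and the per-row C-string conversion repeated on every call. The '|' delimiter and the strtok() stand-in parser are assumptions.

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "utils/builtins.h"

#include <string.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(parse_line);

#define NCOLS 8

Datum
parse_line(PG_FUNCTION_ARGS)
{
    char          *raw = text_to_cstring(PG_GETARG_TEXT_PP(0));
    char          *values[NCOLS];
    TupleDesc      tupdesc;
    AttInMetadata *attinmeta;
    HeapTuple      tuple;
    char          *tok;
    int            i;

    /* the calls being timed above, executed once per row */
    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "function must return a composite type");
    attinmeta = TupleDescGetAttInMetadata(tupdesc);

    /* stand-in for the real parser: split on '|' into up to 8 fields */
    for (i = 0; i < NCOLS; i++)
        values[i] = NULL;
    for (i = 0, tok = strtok(raw, "|"); tok != NULL && i < NCOLS;
         i++, tok = strtok(NULL, "|"))
        values[i] = tok;

    tuple = BuildTupleFromCStrings(attinmeta, values);
    PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
}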
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> The only thing I have left are these statements:
> get_call_result_type
> TupleDescGetAttInMetadata
> BuildTupleFromCStrings
> HeapTupleGetDatum
> and finally PG_RETURN_DATUM

> It turns out that:
> get_call_result_type adds 43 seconds [total: 54],
> TupleDescGetAttInMetadata adds 19 seconds [total: 73],
> BuildTupleFromCStrings accounts for 43 seconds [total: 116].

> So those three functions account for 90% of the total time spent.
> What alternatives exist? Do I have to call get_call_result_type /every
> time/ through the function?

Well, if you're concerned about performance then I think you're going about this in entirely the wrong way, because as far as I can tell from this you're converting all the field values to text and back again. You should be trying to keep the values in Datum format and then invoking heap_form_tuple. And yeah, you probably could cache the type information across calls.

regards, tom lane
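A minimal sketch of the heap_form_tuple approach Tom describes might look like this (not code from the thread; the two-column layout and the constant values are invented, and it assumes the same module boilerplate as the earlier sketch).

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "access/htup.h"        /* heap_form_tuple (9.1 header layout) */
#include "utils/builtins.h"     /* CStringGetTextDatum */

PG_FUNCTION_INFO_V1(parse_line_datums);

Datum
parse_line_datums(PG_FUNCTION_ARGS)
{
    TupleDesc  tupdesc;
    Datum      values[2];
    bool       nulls[2] = {false, false};
    HeapTuple  tuple;

    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "function must return a composite type");
    tupdesc = BlessTupleDesc(tupdesc);

    /* values stay in binary form: no snprintf, no input-function re-parsing */
    values[0] = Int32GetDatum(42);                    /* e.g. a strtol() result */
    values[1] = CStringGetTextDatum("parsed field");  /* a text column */

    tuple = heap_form_tuple(tupdesc, values, nulls);
    PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
}

The catch, as the rest of the thread shows, is getting a valid TupleDesc cheaply on every call, which is where caching comes in.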
Hi guys,
A noob question here since I could not figure it out on my own:
I'm using Hamachi to connect different sites into a VPN and their address always starts with 5.*.*.* - the problem I'm facing is that I cannot make the access restricted to that particular range only.
Currently I got :
host all all 0.0.0.0/32 md5
which allows all the IP's, and the try:
host all all 5.0.0.0/32 md5
does not work.
So what am I supposed to add in "pg_hba.conf" in order to achieve my restriction? Please help me,
Thank you,
Danny
try
host all all 5.0.0.0/8 md5
--
Kevin P Martyn, CISSP
Principal Sales Engineer
On Wed, Dec 14, 2011 at 12:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, if you're concerned about performance then I think you're going
> about this in entirely the wrong way, because as far as I can tell from
> this you're converting all the field values to text and back again.
> You should be trying to keep the values in Datum format and then
> invoking heap_form_tuple. And yeah, you probably could cache the
> type information across calls.

The parsing/conversion (except BuildTupleFromCStrings) is only a small fraction of the overall time spent in the function and could probably be made slightly faster. It's the overhead that's killing me.

Remember: I'm not converting multiple field values to text and back again, I'm turning a *single* TEXT into 8 columns of varying types (INET, INTEGER, and one INTEGER array, among others).

I'll re-write the code to use Tuples, but given that 53% of the time is spent in just two functions (the two I'd like to cache) I'm not sure how much of a gain it's likely to be.

Regarding caching, I tried caching it across calls by making the TupleDesc static and only initializing it once. When I tried that, I got:

ERROR: number of columns (6769856) exceeds limit (1664)

I tried to find some documentation or examples that cache the information, but couldn't find any.

--
Jon
Ah, that did the trick, thank you Kevin,
Danny
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> Regarding caching, I tried caching it across calls by making the
> TupleDesc static and only initializing it once.
> When I tried that, I got:
> ERROR: number of columns (6769856) exceeds limit (1664)
> I tried to find some documentation or examples that cache the
> information, but couldn't find any.

You might find reading record_in to be helpful. What it caches is not exactly what you need to, I think, but it shows the general principles. There are lots of other functions that use fn_extra to cache info, too.

regards, tom lane
On Wed, Dec 14, 2011 at 9:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You might find reading record_in to be helpful. What it caches is not
> exactly what you need to, I think, but it shows the general principles.
> There are lots of other functions that use fn_extra to cache info, too.

I will definitely look into those. I'm probably doing it wrong, but in the meantime, I allocated enough space (by way of MemoryContextAlloc) in TopMemoryContext for an AttInMetadata pointer, switched to that memory context (just the first time through), used CreateTupleDescCopy + TupleDescGetAttInMetadata to duplicate (in the new memory context) the TupleDesc, and then switched back. This approach seems to have dropped the total run time to about 54 seconds, the bulk of which is BuildTupleFromCStrings, a rather significant improvement.

....

Looking at record_in, I think I see what I could be doing better. Again, thanks for the pointers.

--
Jon
On Wed, Dec 14, 2011 at 9:40 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Looking at record_in, I think I see what I could be doing better.

Indeed. I revised the code to make use of fcinfo->flinfo->fn_extra for storage and fcinfo->flinfo->fn_mcxt for the MemoryContext, and everything seemed to work just fine.

Assuming one *starts* with a char *some_var[8], would building the Datums myself be faster than using BuildTupleFromCStrings?

--
Jon
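A sketch of the fn_extra/fn_mcxt caching described here (again not Jon's actual code; storing the AttInMetadata pointer directly in fn_extra, rather than in a wrapper struct, is an assumption). Copying the descriptor into fn_mcxt is what lets it survive across calls; a bare static pointer to the per-call descriptor is presumably what produced the garbage column count reported earlier.

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "access/tupdesc.h"     /* CreateTupleDescCopy */

PG_FUNCTION_INFO_V1(parse_line_cached);

Datum
parse_line_cached(PG_FUNCTION_ARGS)
{
    AttInMetadata *attinmeta = (AttInMetadata *) fcinfo->flinfo->fn_extra;

    if (attinmeta == NULL)
    {
        TupleDesc     tupdesc;
        MemoryContext oldcxt;

        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
            elog(ERROR, "function must return a composite type");

        /* build the metadata once, in memory that outlives this call */
        oldcxt = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
        attinmeta = TupleDescGetAttInMetadata(CreateTupleDescCopy(tupdesc));
        MemoryContextSwitchTo(oldcxt);

        fcinfo->flinfo->fn_extra = (void *) attinmeta;
    }

    /* parsing and BuildTupleFromCStrings(attinmeta, values) would follow here */
    PG_RETURN_NULL();   /* placeholder so the sketch stands on its own */
}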
On Wed, Dec 14, 2011 at 9:51 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Assuming one *starts* with a char *some_var[8], would building the Datums
> myself be faster than using BuildTupleFromCStrings?

The answer is: yes. At least, in my case it is. The total run time is now down to about 32 seconds, versus the BuildTupleFromCStrings version, which takes about 54 seconds. 32 seconds is more than 10-15 seconds, but it's livable.

This experiment has been very worthwhile - thank you all for the help.

--
Jon
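A hedged sketch of what "building the Datums myself" could look like for the column types mentioned (INTEGER, INET, one INTEGER array, plus a text field). The field positions, array length, and helper shape are invented; only the API calls themselves (DirectFunctionCall1 on inet_in, construct_array, CStringGetTextDatum) are real. The resulting values[]/nulls[] arrays would then feed heap_form_tuple() with the cached, blessed TupleDesc.

#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_type.h"    /* INT4OID */
#include "utils/array.h"        /* construct_array */
#include "utils/builtins.h"     /* CStringGetTextDatum, inet_in (9.1 header layout) */

#include <stdlib.h>

/* hypothetical helper: column order, array length, and NULL handling are invented */
static void
fields_to_datums(char *fields[8], Datum *values, bool *nulls)
{
    Datum elems[4];
    int   i;

    /* integer column: parse in C, wrap the binary value */
    values[0] = Int32GetDatum((int32) strtol(fields[0], NULL, 10));

    /* inet column: call the type's input function directly */
    values[1] = DirectFunctionCall1(inet_in, CStringGetDatum(fields[1]));

    /* text column */
    values[2] = CStringGetTextDatum(fields[2]);

    /* integer[] column: build the array value instead of formatting a
     * '{...}' string with strcpy/strcat for array_in to re-parse */
    for (i = 0; i < 4; i++)
        elems[i] = Int32GetDatum((int32) strtol(fields[3 + i], NULL, 10));
    values[3] = PointerGetDatum(construct_array(elems, 4, INT4OID,
                                                sizeof(int32), true, 'i'));

    /* remaining columns left NULL in this sketch */
    for (i = 0; i < 8; i++)
        nulls[i] = (i > 3);
}

heap_form_tuple then copies these binary values straight into the output tuple, so the snprintf/strcpy formatting step disappears entirely.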
Hi Jon,

This is exactly what I was looking for. I need to read data from a delimited file with no header, do a few transformations as described below using a Postgres C function, and load it using the pg_bulkload utility.

The transformations below can be handled with a query after loading all the data as varchar and nullable, but we need to handle this before loading, as we do in Oracle. I'm converting the code from Oracle to Postgres, and both versions of the code (Oracle & Postgres) will be available for different users. In Oracle, these kinds of transformations are done in SQL*Loader, and I need to follow the same kind of approach in Postgres. The SQL filter approach was very easy in terms of coding, but from the documentation I found, a C filter is very much faster than SQL. I'm very new to C. Looking for your options as you mentioned here in the post. Some standard syntax for writing these functionalities would be greatly helpful. Kindly help me.

Sample data:

ABC|20170101|DEF ||GHIJ|KLM

Target table definition:

COLA numeric(5,0)
COLB date
COLC text
COLD text
COLE text

First column should be mapped to COLA
Second column should be mapped to COLB
Third column should be mapped to COLD
Fourth column should be mapped to COLC
Fifth column should be mapped to some default value (column is not present in source)

Transformations:

a) First column should be mapped to COLA. It is numeric in the target table. If any alpha characters are present, default this column to '0'; otherwise, the source value should be moved to the table.
b) Second column should be mapped to COLB, using a TO_DATE-style conversion from text. The file will have the date in YYYYMMDD format; it should be converted to a date.
c) Third column should be mapped to COLD. Need to trim both leading and trailing spaces.
d) Fourth column should be mapped to COLC. If it is NULL, some value should be defaulted.
e) Only a few columns from the source file should be loaded. In this case, only the first four columns should be loaded.
f) Different ordering in source file & target columns. In this case, the third column should be mapped to COLD and the fourth column to COLC.
g) COLE should be loaded with a default value. This column is not present in the source file.

Thanks
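This is not pg_bulkload's filter interface, just a hedged sketch of one of the transformations above (c/d: trim both ends, then substitute a default when the value is empty or NULL) written as an ordinary SQL-callable C function. The function name and the 'N/A' default are assumptions; the SQL wrapper would need to be declared CALLED ON NULL INPUT for the NULL branch to be reachable.

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"     /* text_to_cstring, cstring_to_text */

#include <ctype.h>
#include <string.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(trim_or_default);

Datum
trim_or_default(PG_FUNCTION_ARGS)
{
    char   *s;
    char   *end;

    if (PG_ARGISNULL(0))
        PG_RETURN_TEXT_P(cstring_to_text("N/A"));        /* assumed default */

    s = text_to_cstring(PG_GETARG_TEXT_PP(0));

    while (*s && isspace((unsigned char) *s))             /* leading spaces */
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char) end[-1]))   /* trailing spaces */
        end--;
    *end = '\0';

    if (*s == '\0')
        PG_RETURN_TEXT_P(cstring_to_text("N/A"));         /* empty -> default */

    PG_RETURN_TEXT_P(cstring_to_text(s));
}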