Thread: copy vs. C function
I was experimenting with a few different methods of taking a line of text, parsing it into a set of fields, and then getting that info into a table.

The first method involved writing a C program to parse a file, parse the lines, and output newly-formatted lines in a format that postgresql's COPY function can use. End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse, output new data to new file -- 4 seconds; COPY new file -- 10 seconds).

The next approach I took was to write a C function in postgresql to parse a single TEXT datum into an array of C strings, and then use BuildTupleFromCStrings. There are 8 columns involved. Eliding the time it takes to COPY the (raw) file into a temporary table, this method took 120 seconds, give or take.

The difference was /quite/ a surprise to me. What is the probability that I am doing something very, very wrong?

NOTE: the code that does the parsing is actually the same, line-for-line; the only difference is whether the routine is called by a postgresql function or by a C program via main, so obviously the overhead is elsewhere.
NOTE #2: We are talking about approximately 2.6 million lines.

I was testing:

\copy some_table from 'some_file.csv' with csv
vs.
insert into some_table select (some_func(line)).* from some_temp_table;

where some_func had been defined with (one) IN TEXT and (8) OUT params of varying types.

PostgreSQL 9.1.1 on Linux, x86_64

--
Jon
On 12/11/2011 09:27 AM, Jon Nelson wrote:
> The first method involved writing a C program to parse a file, parse
> the lines and output newly-formatted lines in a format that
> postgresql's COPY function can use.
> End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
> output new data to new file -- 4 seconds, COPY new file -- 10
> seconds).

Why not `COPY tablename FROM '/tmp/myfifo'`? Just connect your import program up to a named pipe (fifo) created with `mknod myfifo p`, either by redirecting stdout or by open()ing the fifo for write. Then have Pg read from the fifo. You'll save a round of disk writes and reads.

> The next approach I took was to write a C function in postgresql to
> parse a single TEXT datum into an array of C strings, and then use
> BuildTupleFromCStrings. There are 8 columns involved.
> Eliding the time it takes to COPY the (raw) file into a temporary
> table, this method took 120 seconds, give or take.
>
> The difference was /quite/ a surprise to me. What is the probability
> that I am doing something very, very wrong?

Have a look at how COPY does it within the Pg sources, see if that's any help. I don't know enough about Pg's innards to answer this one beyond that suggestion, sorry.

--
Craig Ringer
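For the curious, a minimal sketch of the named-pipe variant, assuming the parser is a standalone C program. The path, the placeholder row, and the commented-out emit_csv_lines() helper are invented for illustration; the matching server-side command would be something like COPY some_table FROM '/tmp/myfifo' WITH CSV.

#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>

int
main(void)
{
    const char *path = "/tmp/myfifo";   /* assumed path */
    FILE       *out;

    /* create the fifo; ignore failure if it already exists */
    mkfifo(path, 0600);

    /* opening for write blocks until the COPY opens the read end */
    out = fopen(path, "w");
    if (out == NULL)
        return 1;

    /* emit_csv_lines(out);  <- the existing parse-and-print code would go here */
    fprintf(out, "1,192.168.0.1,example\n");   /* placeholder row */

    /* closing the write end delivers EOF, which ends the COPY */
    fclose(out);
    return 0;
}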
Start a transaction before the first insert and commit it after the last one and it will be much better, but I believe that the copy code path is optimized to perform better than any set of queries can, even in a single transaction.

Sent from my iPhone

On Dec 10, 2011, at 5:27 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> [...]
On Sat, Dec 10, 2011 at 8:32 PM, Craig Ringer <ringerc@ringerc.id.au> wrote:
> Why not `COPY tablename FROM '/tmp/myfifo'`?

If I were to do this in any sort of production environment, that's exactly what I would do. I was much more concerned about the /huge/ difference -- 10 seconds for COPY and 120 seconds for (INSERT INTO / CREATE TABLE AS / whatever).

> Have a look at how COPY does it within the Pg sources, see if that's any
> help. I don't know enough about Pg's innards to answer this one beyond
> that suggestion, sorry.

Ack. Regarding a subsequent email, I was using full transactions.

--
Jon
On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> [...]

Let me throw out an interesting third method I've been using to parse delimited text files that might be useful in your case. This is useful when parsing text that is bad csv: where values are not escaped, there are incomplete and/or missing records, or there is a huge number of columns that you want to rotate into a more normalized structure based on column position.

1. Import the data into a single-column (containing the entire line) staging table, feeding the COPY parser a bogus delimiter.
2. 'Parse' the record with regexp_split_to_array (maybe in a plpgsql function).
3. Either loop the array (in 9.1 use the FOR-IN-ARRAY construct), or, if you can work it into your problem, INSERT/SELECT, expanding the array with a trick like the one used in information_schema._pg_expandarray so you can hook logic on the array (column position).

merlin
On Mon, Dec 12, 2011 at 10:38 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> Let me throw out an interesting third method I've been using to parse
> delimited text files that might be useful in your case. [...]
>
> 1. Import the data into a single-column (containing the entire line)
> staging table, feeding the COPY parser a bogus delimiter.
> 2. 'Parse' the record with regexp_split_to_array (maybe in a plpgsql function).
> 3. Either loop the array (in 9.1 use the FOR-IN-ARRAY construct), or, if
> you can work it into your problem, INSERT/SELECT, expanding the array
> with a trick like the one used in information_schema._pg_expandarray so
> you can hook logic on the array (column position).

If you replace [2] with my C function (which can process all of the data, *postgresql overhead not included*, in about 1 second) then that's what I did. It returns a composite type, making [3] unnecessary.

I know it's not the parsing, so I started a time-honored debugging approach: I returned early. Is the function-call overhead that high? That's outrageously high. What else could it be? Is returning a composite type outrageously expensive?

So here is what I did: I modified the code so that it immediately returns NULL. Result: 2 seconds.
Extract arguments, allocate temporary work buffer: another 0.5 seconds.
Add parsing: another 1.5 seconds [total: 4.1 seconds].
And so on...

Two of the items require base conversion, so calling strtol (twice) and snprintf (twice) adds 6 seconds, and formatting one of the items as an array (a strcpy and a strcat) adds another 1.5 seconds, for a total of 11.5.

The only thing I have left are these statements:

get_call_result_type
TupleDescGetAttInMetadata
BuildTupleFromCStrings
HeapTupleGetDatum
and finally PG_RETURN_DATUM

It turns out that:
get_call_result_type adds 43 seconds [total: 54],
TupleDescGetAttInMetadata adds 19 seconds [total: 73],
BuildTupleFromCStrings accounts for 43 seconds [total: 116].
So those three functions account for 90% of the total time spent. What alternatives exist? Do I have to call get_call_result_type /every time/ through the function? -- Jon
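For readers following along, here is a rough sketch (not Jon's actual code) of the call pattern being timed: one TEXT argument in, an eight-column composite out, with the result-type lookup and the per-row C-string conversion repeated on every call. The '|' delimiter and the strtok() stand-in parser are assumptions.

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "utils/builtins.h"

#include <string.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(parse_line);

#define NCOLS 8

Datum
parse_line(PG_FUNCTION_ARGS)
{
    char          *raw = text_to_cstring(PG_GETARG_TEXT_PP(0));
    char          *values[NCOLS];
    TupleDesc      tupdesc;
    AttInMetadata *attinmeta;
    HeapTuple      tuple;
    char          *tok;
    int            i;

    /* the calls being timed above, executed once per row */
    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "function must return a composite type");
    attinmeta = TupleDescGetAttInMetadata(tupdesc);

    /* stand-in for the real parser: split on '|' into up to 8 fields */
    for (i = 0; i < NCOLS; i++)
        values[i] = NULL;
    for (i = 0, tok = strtok(raw, "|"); tok != NULL && i < NCOLS;
         i++, tok = strtok(NULL, "|"))
        values[i] = tok;

    tuple = BuildTupleFromCStrings(attinmeta, values);
    PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
}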
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> The only thing I have left are these statements:
> get_call_result_type
> TupleDescGetAttInMetadata
> BuildTupleFromCStrings
> HeapTupleGetDatum
> and finally PG_RETURN_DATUM

> It turns out that:
> get_call_result_type adds 43 seconds [total: 54],
> TupleDescGetAttInMetadata adds 19 seconds [total: 73],
> BuildTupleFromCStrings accounts for 43 seconds [total: 116].

> So those three functions account for 90% of the total time spent.
> What alternatives exist? Do I have to call get_call_result_type /every
> time/ through the function?

Well, if you're concerned about performance then I think you're going about this in entirely the wrong way, because as far as I can tell from this you're converting all the field values to text and back again. You should be trying to keep the values in Datum format and then invoking heap_form_tuple. And yeah, you probably could cache the type information across calls.

regards, tom lane
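A minimal sketch of the heap_form_tuple approach Tom describes might look like this (not code from the thread; the two-column layout and the constant values are invented, and it assumes the same module boilerplate as the earlier sketch).

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "access/htup.h"        /* heap_form_tuple (9.1 header layout) */
#include "utils/builtins.h"     /* CStringGetTextDatum */

PG_FUNCTION_INFO_V1(parse_line_datums);

Datum
parse_line_datums(PG_FUNCTION_ARGS)
{
    TupleDesc  tupdesc;
    Datum      values[2];
    bool       nulls[2] = {false, false};
    HeapTuple  tuple;

    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "function must return a composite type");
    tupdesc = BlessTupleDesc(tupdesc);

    /* values stay in binary form: no snprintf, no input-function re-parsing */
    values[0] = Int32GetDatum(42);                    /* e.g. a strtol() result */
    values[1] = CStringGetTextDatum("parsed field");  /* a text column */

    tuple = heap_form_tuple(tupdesc, values, nulls);
    PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
}

The catch, as the rest of the thread shows, is getting a valid TupleDesc cheaply on every call, which is where caching comes in.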
Hi guys,
A noob question here since I could not figure it out on my own:
I'm using Hamachi to connect different sites into a VPN and their address always starts with 5.*.*.* - the problem I'm facing is that I cannot make the access restricted to that particular range only.
Currently I got :
host all all 0.0.0.0/32 md5
which allows all the IP's, and the try:
host all all 5.0.0.0/32 md5
does not work.
So what am I supposed to add in "pg_hba.conf" in order to achieve my restriction? Please help me,
Thank you,
Danny
try
host all all 5.0.0.0/8 md5
--
Kevin P Martyn, CISSP
Principal Sales Engineer
On Wed, Dec 14, 2011 at 12:18 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Well, if you're concerned about performance then I think you're going
> about this in entirely the wrong way, because as far as I can tell from
> this you're converting all the field values to text and back again.
> You should be trying to keep the values in Datum format and then
> invoking heap_form_tuple. And yeah, you probably could cache the
> type information across calls.

The parsing/conversion (except BuildTupleFromCStrings) is only a small fraction of the overall time spent in the function and could probably be made slightly faster. It's the overhead that's killing me.

Remember: I'm not converting multiple field values to text and back again, I'm turning a *single* TEXT into 8 columns of varying types (INET, INTEGER, and one INTEGER array, among others).

I'll re-write the code to use Tuples, but given that 53% of the time is spent in just two functions (the two I'd like to cache) I'm not sure how much of a gain it's likely to be.

Regarding caching, I tried caching it across calls by making the TupleDesc static and only initializing it once. When I tried that, I got:

ERROR: number of columns (6769856) exceeds limit (1664)

I tried to find some documentation or examples that cache the information, but couldn't find any.

--
Jon
Ah, that did the trick, thank you Kevin,
Danny
Jon Nelson <jnelson+pgsql@jamponi.net> writes:
> Regarding caching, I tried caching it across calls by making the
> TupleDesc static and only initializing it once.
> When I tried that, I got:
> ERROR: number of columns (6769856) exceeds limit (1664)
> I tried to find some documentation or examples that cache the
> information, but couldn't find any.

You might find reading record_in to be helpful. What it caches is not exactly what you need to, I think, but it shows the general principles. There are lots of other functions that use fn_extra to cache info, too.

regards, tom lane
On Wed, Dec 14, 2011 at 9:25 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> You might find reading record_in to be helpful. What it caches is not
> exactly what you need to, I think, but it shows the general principles.
> There are lots of other functions that use fn_extra to cache info, too.

I will definitely look into those. I'm probably doing it wrong, but in the meantime, I allocated enough space (by way of MemoryContextAlloc) in TopMemoryContext for an AttInMetadata pointer, switched to that memory context (just the first time through), used CreateTupleDescCopy + TupleDescGetAttInMetadata to duplicate (in the new memory context) the TupleDesc, and then switched back. This approach seems to have dropped the total run time to about 54 seconds, the bulk of which is BuildTupleFromCStrings, a rather significant improvement.

....

Looking at record_in, I think I see what I could be doing better. Again, thanks for the pointers.

--
Jon
On Wed, Dec 14, 2011 at 9:40 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Looking at record_in, I think I see what I could be doing better.

Indeed. I revised the code to make use of fcinfo->flinfo->fn_extra for storage and fcinfo->flinfo->fn_mcxt for the MemoryContext, and everything seemed to work just fine.

Assuming one *starts* with a char *some_var[8], would building the Datums myself be faster than using BuildTupleFromCStrings?

--
Jon
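A sketch of the fn_extra/fn_mcxt caching described here (again not Jon's actual code; storing the AttInMetadata pointer directly in fn_extra, rather than in a wrapper struct, is an assumption). Copying the descriptor into fn_mcxt is what lets it survive across calls; a bare static pointer to the per-call descriptor is presumably what produced the garbage column count reported earlier.

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "access/tupdesc.h"     /* CreateTupleDescCopy */

PG_FUNCTION_INFO_V1(parse_line_cached);

Datum
parse_line_cached(PG_FUNCTION_ARGS)
{
    AttInMetadata *attinmeta = (AttInMetadata *) fcinfo->flinfo->fn_extra;

    if (attinmeta == NULL)
    {
        TupleDesc     tupdesc;
        MemoryContext oldcxt;

        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
            elog(ERROR, "function must return a composite type");

        /* build the metadata once, in memory that outlives this call */
        oldcxt = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
        attinmeta = TupleDescGetAttInMetadata(CreateTupleDescCopy(tupdesc));
        MemoryContextSwitchTo(oldcxt);

        fcinfo->flinfo->fn_extra = (void *) attinmeta;
    }

    /* parsing and BuildTupleFromCStrings(attinmeta, values) would follow here */
    PG_RETURN_NULL();   /* placeholder so the sketch stands on its own */
}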
On Wed, Dec 14, 2011 at 9:51 AM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
> Assuming one *starts* with a char *some_var[8], would building the Datums
> myself be faster than using BuildTupleFromCStrings?

The answer is: yes. At least, in my case it is. The total run time is now down to about 32 seconds, versus the BuildTupleFromCStrings version, which takes about 54 seconds. 32 seconds is more than 10-15 seconds, but it's livable.

This experiment has been very worthwhile - thank you all for the help.

--
Jon
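A hedged sketch of what "building the Datums myself" could look like for the column types mentioned (INTEGER, INET, one INTEGER array, plus a text field). The field positions, array length, and helper shape are invented; only the API calls themselves (DirectFunctionCall1 on inet_in, construct_array, CStringGetTextDatum) are real. The resulting values[]/nulls[] arrays would then feed heap_form_tuple() with the cached, blessed TupleDesc.

#include "postgres.h"
#include "fmgr.h"
#include "catalog/pg_type.h"    /* INT4OID */
#include "utils/array.h"        /* construct_array */
#include "utils/builtins.h"     /* CStringGetTextDatum, inet_in (9.1 header layout) */

#include <stdlib.h>

/* hypothetical helper: column order, array length, and NULL handling are invented */
static void
fields_to_datums(char *fields[8], Datum *values, bool *nulls)
{
    Datum elems[4];
    int   i;

    /* integer column: parse in C, wrap the binary value */
    values[0] = Int32GetDatum((int32) strtol(fields[0], NULL, 10));

    /* inet column: call the type's input function directly */
    values[1] = DirectFunctionCall1(inet_in, CStringGetDatum(fields[1]));

    /* text column */
    values[2] = CStringGetTextDatum(fields[2]);

    /* integer[] column: build the array value instead of formatting a
     * '{...}' string with strcpy/strcat for array_in to re-parse */
    for (i = 0; i < 4; i++)
        elems[i] = Int32GetDatum((int32) strtol(fields[3 + i], NULL, 10));
    values[3] = PointerGetDatum(construct_array(elems, 4, INT4OID,
                                                sizeof(int32), true, 'i'));

    /* remaining columns left NULL in this sketch */
    for (i = 0; i < 8; i++)
        nulls[i] = (i > 3);
}

heap_form_tuple then copies these binary values straight into the output tuple, so the snprintf/strcpy formatting step disappears entirely.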
Hi Jon,

This is exactly what I was looking for. I need to read data from a delimited file with no header, do a few transformations as described below using a Postgres C function, and load it using the pg_bulkload utility.

The transformations below can be handled with a query after loading all the data as varchar and nullable, but we need to handle this before loading, as we do in Oracle. I'm converting the code from Oracle to Postgres, and both versions of the code (Oracle & Postgres) will be available for different users. In Oracle, these kinds of transformations are done in SQL*Loader, and I need to follow the same kind of approach in Postgres. The SQL filter approach was very easy in terms of coding, but from the documentation I found, a C filter is very much faster than SQL. I'm very new to C. Looking for your options as you mentioned here in the post. Some standard syntax for writing these functionalities would be greatly helpful. Kindly help me.

Sample data:

ABC|20170101|DEF ||GHIJ|KLM

Target table definition:

COLA numeric(5,0)
COLB date
COLC text
COLD text
COLE text

First column should be mapped to COLA
Second column should be mapped to COLB
Third column should be mapped to COLD
Fourth column should be mapped to COLC
Fifth column should be mapped to some default value (column is not present in source)

Transformations:

a) First column should be mapped to COLA. It is numeric in the target table. If any alpha characters are present, default this column to '0'; otherwise, the source value should be moved to the table.
b) Second column should be mapped to COLB, using a TO_DATE-style conversion from text. The file will have the date in YYYYMMDD format; it should be converted to a date.
c) Third column should be mapped to COLD. Need to trim both leading and trailing spaces.
d) Fourth column should be mapped to COLC. If it is NULL, some value should be defaulted.
e) Only a few columns from the source file should be loaded. In this case, only the first four columns should be loaded.
f) Different ordering in source file & target columns. In this case, the third column should be mapped to COLD and the fourth column to COLC.
g) COLE should be loaded with a default value. This column is not present in the source file.

Thanks
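This is not pg_bulkload's filter interface, just a hedged sketch of one of the transformations above (c/d: trim both ends, then substitute a default when the value is empty or NULL) written as an ordinary SQL-callable C function. The function name and the 'N/A' default are assumptions; the SQL wrapper would need to be declared CALLED ON NULL INPUT for the NULL branch to be reachable.

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"     /* text_to_cstring, cstring_to_text */

#include <ctype.h>
#include <string.h>

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(trim_or_default);

Datum
trim_or_default(PG_FUNCTION_ARGS)
{
    char   *s;
    char   *end;

    if (PG_ARGISNULL(0))
        PG_RETURN_TEXT_P(cstring_to_text("N/A"));        /* assumed default */

    s = text_to_cstring(PG_GETARG_TEXT_PP(0));

    while (*s && isspace((unsigned char) *s))             /* leading spaces */
        s++;
    end = s + strlen(s);
    while (end > s && isspace((unsigned char) end[-1]))   /* trailing spaces */
        end--;
    *end = '\0';

    if (*s == '\0')
        PG_RETURN_TEXT_P(cstring_to_text("N/A"));         /* empty -> default */

    PG_RETURN_TEXT_P(cstring_to_text(s));
}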