Re: copy vs. C function - Mailing list pgsql-performance
From | Jon Nelson
---|---
Subject | Re: copy vs. C function
Date |
Msg-id | CAKuK5J0EzKnvAfNk=JKd2Mw3q25fguHKYwTQCkYvxPYA+TDKPw@mail.gmail.com
In response to | Re: copy vs. C function (Merlin Moncure <mmoncure@gmail.com>)
Responses | Re: copy vs. C function
List | pgsql-performance
On Mon, Dec 12, 2011 at 10:38 AM, Merlin Moncure <mmoncure@gmail.com> wrote:
> On Sat, Dec 10, 2011 at 7:27 PM, Jon Nelson <jnelson+pgsql@jamponi.net> wrote:
>> I was experimenting with a few different methods of taking a line of
>> text, parsing it into a set of fields, and then getting that info
>> into a table.
>>
>> The first method involved writing a C program to parse a file, parse
>> the lines, and output newly-formatted lines in a format that
>> postgresql's COPY function can use.
>> End-to-end, this takes 15 seconds for about 250MB (read 250MB, parse,
>> output new data to a new file -- 4 seconds; COPY the new file -- 10
>> seconds).
>>
>> The next approach I took was to write a C function in postgresql to
>> parse a single TEXT datum into an array of C strings, and then use
>> BuildTupleFromCStrings. There are 8 columns involved.
>> Excluding the time it takes to COPY the (raw) file into a temporary
>> table, this method took 120 seconds, give or take.
>>
>> The difference was /quite/ a surprise to me. What is the probability
>> that I am doing something very, very wrong?
>>
>> NOTE: the code that does the parsing is actually the same,
>> line-for-line; the only difference is whether the routine is called by
>> a postgresql function or by a C program via main, so obviously the
>> overhead is elsewhere.
>> NOTE #2: We are talking about approximately 2.6 million lines.
>
> Let me throw out an interesting third method I've been using to parse
> delimited text files that might be useful in your case. This is
> useful when parsing text that is bad csv, where values are not escaped,
> there are lines with incomplete and/or missing records, or there is a
> huge number of columns that you want to rotate into a more normalized
> structure based on column position.
>
> 1. Import the data into a single-column staging table (the column
> containing the entire line), feeding the COPY parser a bogus delimiter.
> 2. 'Parse' the record with regexp_split_to_array (maybe in a plpgsql function).
> 3. Either loop over the array (in 9.1 use the FOR-IN-ARRAY construct), or, if
> you can work it into your problem, INSERT/SELECT, expanding the array
> with a trick like the one used in information_schema._pg_expandarray so you
> can hook logic on the array (column position).

If you replace [2] with my C function (which can process all of the data,
*postgresql overhead not included*, in about 1 second), then that's what
I did. It returns a composite type, making [3] unnecessary.

I knew the parsing itself wasn't the problem, so I turned to a
time-honored debugging approach: returning early. Is the function-call
overhead really that high? That would be outrageously high. What else
could it be? Is returning a composite type outrageously expensive?

So here is what I did: I modified the code so that it immediately returns
NULL. Result: 2 seconds. Extract the arguments and allocate a temporary
work buffer: another 0.5 seconds. Add the parsing: another 1.5 seconds
[total: 4.1 seconds], and so on. Two of the items require base conversion,
so calling strtol (twice) and snprintf (twice) adds ~6 seconds, and
formatting one of the items as an array (a strcpy and a strcat) adds
1.5 seconds, for a total of 11.5.

The only things I have left are these statements:

get_call_result_type
TupleDescGetAttInMetadata
BuildTupleFromCStrings
HeapTupleGetDatum
and finally PG_RETURN_DATUM

It turns out that get_call_result_type adds 43 seconds [total: 54],
TupleDescGetAttInMetadata adds 19 seconds [total: 73], and
BuildTupleFromCStrings accounts for 43 seconds [total: 116].
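For reference, here is a minimal sketch of that call path (the function
name, column types, and the parsing itself are placeholders -- dummy field
values stand in for the real parser -- but the calls are the ones timed
above):

```c
#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(parse_line);

/* Sketch of the per-call sequence whose pieces were timed above. */
Datum
parse_line(PG_FUNCTION_ARGS)
{
    text           *line = PG_GETARG_TEXT_PP(0);   /* raw input line */
    TupleDesc       tupdesc;
    AttInMetadata  *attinmeta;
    HeapTuple       tuple;
    char           *values[8];
    int             i;

    /* Look up the composite result type expected at this call site. */
    if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
        elog(ERROR, "function must be called in a composite-returning context");

    /* Gather the per-column input-function metadata from the TupleDesc. */
    attinmeta = TupleDescGetAttInMetadata(tupdesc);

    /* The real code parses 'line' into values[0..7]; dummy values here. */
    (void) line;
    for (i = 0; i < 8; i++)
        values[i] = "0";

    /* Convert the C strings into a heap tuple and hand it back. */
    tuple = BuildTupleFromCStrings(attinmeta, values);
    PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
}
```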
So those three functions account for 90% of the total time spent. What
alternatives exist? Do I have to call get_call_result_type /every time/
through the function?

--
Jon
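One possible answer to that last question, sketched under the assumption
that the call's result type stays the same across calls: do the type
lookups once and cache the AttInMetadata in fcinfo->flinfo->fn_extra
(allocated in fn_mcxt so it survives between calls), leaving only the
parsing and the tuple construction in the per-row path. The names and
dummy values below are placeholders, not code from this thread:

```c
#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(parse_line_cached);

Datum
parse_line_cached(PG_FUNCTION_ARGS)
{
    /* fn_extra persists across calls within one query, so the result-type
     * lookup and AttInMetadata construction only happen on the first call. */
    AttInMetadata  *attinmeta = (AttInMetadata *) fcinfo->flinfo->fn_extra;
    HeapTuple       tuple;
    char           *values[8];
    int             i;

    if (attinmeta == NULL)
    {
        /* First call: allocate in fn_mcxt so the metadata outlives this call. */
        MemoryContext   oldcxt = MemoryContextSwitchTo(fcinfo->flinfo->fn_mcxt);
        TupleDesc       tupdesc;

        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
            elog(ERROR, "function must be called in a composite-returning context");

        attinmeta = TupleDescGetAttInMetadata(tupdesc);
        fcinfo->flinfo->fn_extra = (void *) attinmeta;
        MemoryContextSwitchTo(oldcxt);
    }

    /* Per-row work: parse the line (dummy values here) and build the tuple. */
    for (i = 0; i < 8; i++)
        values[i] = "0";

    tuple = BuildTupleFromCStrings(attinmeta, values);
    PG_RETURN_DATUM(HeapTupleGetDatum(tuple));
}
```

If that behaves as hoped, get_call_result_type and TupleDescGetAttInMetadata
would run once per query instead of once per line; whether that recovers
most of the 62 seconds they account for above would need measuring.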