Thread: [GENERAL] postgres pg_bulkload c filter function in c programming

[GENERAL] postgres pg_bulkload c filter function in c programming

From
rajmhn
Date:
Gurus,

Reading the data from file and loading it using pg_bulkload- C filter. As
per documentation, C filter is much faster than SQL filter.

I'm new to C. Gone through this documentation. Not clear, how to start.
https://www.postgresql.org/docs/current/static/xfunc-c.html.

Can someone kindly guide me to create C code, so that it can be called in
postgres function?

Here is the sample data.File has no header.

    Sample Data:
    ABC|20170101|DEF ||GHIJ|KLM

    Target Table Definition:
    COLA numeric(5,0)
    COLB date
    COLC text
    COLD text
    COLE text

    First column should be mapped to COLA
    Second column should be mapped to COLB
    Third column should be mapped to COLD
    Fourth column should be mapped to COLC
    Fifth column should be mapped to Some default value(column is not
present in source)

    Transformation:
    a)First column should be mapped to COLA. It is numeric in target table.
If any alpha-characters were present, default this column with '0'.
Otherwise, source value should be moved to table.
    b)Second column should be mapped to COLB. TO_DATE function from text
format. File will have date format as YYYYMMDD. It should be converted to
date.
    c)Third column should be mapped to COLD.Need to Trim both leading and
trailing spaces.
    d)Fourth column should be mapped to COLC. If it NULL, some value should
be defaulted.
    e)Only few columns from source file should be loaded. In this case, only
first four columns should be loaded.
    f)Different ordering in source files & target columns.In this case,
            Third column should be mapped to COLD
            Fourth column should be mapped to COLC
    g)COLE should be loaded with default value. This column is not present
in source file.

These transformations, can be handled with query after loading all the data
as varchar and nullable. But we need to handle this before loading as like
we do in Oracle.

Do you think, these functionalities can be accomplished using C programming?
Some standard syntax for writing these functionalities would be greatly
helpful.

Thanks




--
View this message in context:
http://postgresql.nabble.com/postgres-pg-bulkload-c-filter-function-in-c-programming-tp5936651.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: [GENERAL] postgres pg_bulkload c filter function in c programming

From
Jan de Visser
Date:
On Thursday, December 29, 2016 5:10:08 AM EST rajmhn wrote:
> Gurus,
>
> Reading the data from file and loading it using pg_bulkload- C filter. As
> per documentation, C filter is much faster than SQL filter.
>
> I'm new to C. Gone through this documentation. Not clear, how to start.
> https://www.postgresql.org/docs/current/static/xfunc-c.html.
>
> Can someone kindly guide me to create C code, so that it can be called in
> postgres function?
>

It seems to me it would be much easier to load the data into a temporary
table, and from there transform into the form you need it to be. If you're not
experienced in C (and probably even if you are) you can import *a lot* of data
in the time it would take you to develop that custom filter.

Besides, your requirements don't appear to be that difficult.



Re: [GENERAL] postgres pg_bulkload c filter function in c programming

From
rajmhn
Date:
Thanks Jan

I'm converting the code from Oracle to Postgres. Both version of code will
be available for different users.

In Oracle, doing these kind of transformation in SQL loader. Need to follow
the same kind of approach in Postgres. SQL filter approach was very easy in
terms of coding. From documentation found, C filter was very much faster
than SQL.

I'm very new to C. Tried few samples, as mentioned in this link.

https://www.postgresql.org/docs/current/static/xfunc-c.html

It worked.

But, how this can be accomplished when it have 100's of columns from source.
Need to apply transformations only for few columns as mentioned above.

Totally struck, nobody is here to help. If you can guide me with some syntax
with the sample data provided above, will be grateful. I can manage it from
there.

Thanks



--
View this message in context:
http://postgresql.nabble.com/postgres-pg-bulkload-c-filter-function-in-c-programming-tp5936651p5936658.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: [GENERAL] postgres pg_bulkload c filter function in c programming

From
Francisco Olarte
Date:
Hi:

On Thu, Dec 29, 2016 at 3:01 PM, Jan de Visser <jan@de-visser.net> wrote:
> On Thursday, December 29, 2016 5:10:08 AM EST rajmhn wrote:
...
>> I'm new to C. Gone through this documentation. Not clear, how to start.
...
> It seems to me it would be much easier to load the data into a temporary
> table, and from there transform into the form you need it to be. If you're not
> experienced in C (and probably even if you are) you can import *a lot* of data
> in the time it would take you to develop that custom filter.
> Besides, your requirements don't appear to be that difficult.

for his type of requirements I would recommend going the perl ( any
similar language will do, but after using it for 25 years I find it is
the king for that ) + pipe route. I mean, write a script which spits
"copy xxx from stdin\n" plus the transformed rows, pipe it trough
psql.

Total time is difficult to beat, as debugging is very easy, write the
filtering function using <> - chomp - split - s/// - join - print and
debug it by feeding it some lines with head.

Untested code from which I remembter of the specs, could be something like:

print "Copy my_table from stdin;\n";
my $e = "The magic col-e default value";
while(<>) {
  chomp; # get rid of OPTIONAl line terminator
  # Get the columns and do the c-d column swap
  my ($a,$b,$d,$c)=split /\|/, $_;
  # Add dashed to the date in column b:
  $b =~ s/^(\d\d\d\d)(\d\d)(\d\d)$/$1-$2-$3/;
  # zap not numerics a to 0:
  ($a=~/^\d+$/) or ($a = 0);
  # And send it with the default separators ( scaping left as an
exercise to the reader, ask if needed, I've wrtten and tested it
several times ).
  print join("\t", $a,$b,$c,$d,$e),"\n";
}

Test it with "head the-datafile | perl the_script.pl |
my_favourite_pager" until correct, the beauty of this approache  is
you do not touch the db in debug, feed it to psql when done. In my
experience the perl script overhead is unnoticeable in any 2k+ machine
(and perl was specifically dessigned to be good at this kind of things
).

Francisco Olarte.


Re: [GENERAL] postgres pg_bulkload c filter function in c programming

From
rajmhn
Date:
Thanks Francis.That seems to be a good solution.

Thought to use pg_bulkload, a third party library instead of copy, where
reject handling can be done in efficient way. Transformation(FILTER)
functions can be implemented with any languages in pg_bulkload before it was
loaded to table. SQL, C, PLs are ok, but you should write functions as fast
as possible because they are called many times.

In this case, function should be written in Perl and called inside the
Postgressql function. Do you think that will work it out? But pg_bulkload is
preferring C function over SQL function for performance.

I will try this option as you suggested.

Thanks



--
View this message in context:
http://postgresql.nabble.com/postgres-pg-bulkload-c-filter-function-in-c-programming-tp5936651p5936695.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: [GENERAL] postgres pg_bulkload c filter function in c programming

From
Francisco Olarte
Date:
On Thu, Dec 29, 2016 at 8:41 PM, rajmhn <rajmhn.ram@gmail.com> wrote:
> Thanks Francis.That seems to be a good solution.

Yep, but not for your problem as ...

>
> Thought to use pg_bulkload, a third party library instead of copy, where
> reject handling can be done in efficient way.

Mine was just an idea to do the part of the load you described
assuming pg_bulkload usage was optional. Not being it, it will not
work. MAYBE you can use the technique to preprocess the files for
pg_bulkload ( if possible this is nice, as the goood thing of
preprocessing them is you repeat until you get them right, no
DB-touchy ).

> Transformation(FILTER)
> functions can be implemented with any languages in pg_bulkload before it was
> loaded to table. SQL, C, PLs are ok, but you should write functions as fast
> as possible because they are called many times.



> In this case, function should be written in Perl and called inside the
> Postgressql function. Do you think that will work it out? But pg_bulkload is
> preferring C function over SQL function for performance.

I'm not familiar with pg_bulkload usage. I've read about it but all my
loading problemas have been solved better by using copy ( especially
factoring total time, I already know to use copy and a couple dozen
languages in which to write filters to preclean data for copy. In the
time I learn enough of pg_bulkload I can load filter and load a lot of
data ).

Regarding C vs perl, it seems pg_bulkload does server side processing.
In the server the funcion calling overhead is HUGE, specially when
transitioning between different languages. IMO the time spent doing
the data processing in perl would be 0 when compared with the time to
pass the data around to perl. C will be faster because the calling
barrier is smaller inside the server.

Just for data processing of things like you I've normally found
filters like the one I described can easily saturate an SSD array, and
the difference in time for processing is dwarfed by the difference in
time for developing the filter. In fact in any modern OS with write
through and readahead disk management the normal difference between
filtering in perl or C is perl may use 10% of 1 core, C 1%, perl
filter is developed in 15 minutes, C in an hour, and perl filter takes
some extra milliseconds to start. AND, if you are not familiar with
processing data in C you can easily code a slower solution than in
perl ( as perl was dessigned for this ).


> I will try this option as you suggested.

Just remember my option is not using pg_bulkload with perl stored
procedures. I cannot recommend anything if you use pg_bulkload.

I suggested using copy and perl to preclean the data. It just seemed
to me from the description of your problem you were using a too
complex tool. Now that you are introducing new terms, like reject
handling, I'll step out until I can make a sugestion ( don't bother to
define it for me, it seems a bulkload related term and I'm not able to
study that tool ).


FrancisCO Olarte.


Re: [GENERAL] postgres pg_bulkload c filter function in c programming

From
Alvaro Herrera
Date:
rajmhn wrote:

> But, how this can be accomplished when it have 100's of columns from source.
> Need to apply transformations only for few columns as mentioned above.

Did you try pgloader?  See http://pgloader.io/

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [GENERAL] postgres pg_bulkload c filter function in c programming

From
rajmhn
Date:
Yes.

I can able to apply those mentioned transformation in pgloader &
pg_bulkload-SQL filter. Yet to take performance stats. Meanwhile, I'm trying
to figure out the other best possible option. We are counting more on
performance, error & audit handling.

Thanks








--
View this message in context:
http://postgresql.nabble.com/postgres-pg-bulkload-c-filter-function-in-c-programming-tp5936651p5936705.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: [GENERAL] postgres pg_bulkload c filter function in c programming

From
rajmhn
Date:
Thanks Francis.

Meanwhile just gone through usage of perl in postgres function. In turn,
this Postgres function can be called in pg_bulkload. I think, this task can
be done. Let me give a try.



--
View this message in context:
http://postgresql.nabble.com/postgres-pg-bulkload-c-filter-function-in-c-programming-tp5936651p5936707.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.