Thread: exposing COPY API

exposing COPY API

From
Andrew Dunstan
Date:
Revisiting this, it occurred to me that I could achieve what I need of 
we extend the proposed API a bit. Currently, it has:
   extern CopyState BeginCopyFrom(Relation rel, const char *filename,                                   List
*attnamelist,List *options);
 


I'd like to be able to add a callback function to construct the values 
for the tuple. So it would become something like:
   typedef void (*copy_make_values) (CopyState cstate, NumFieldsRead int);
   extern CopyState BeginCopyFrom(Relation rel, const char *filename,                                   List
*attnamelist,List *options,                                    copy_make_values custom_values_func);
 


If custom_values_func were NULL (as it would be if using the builtin 
COPY), then the builtin code would be run to construct the values for 
making tuple. If not null, the function would be called.

Of course, I want this so I could construct a text array from the read 
in data, but I could also imagine a foreign data wrapper wanting to 
mangle the data before handing it to postgres, say by filling in a field 
or hashing it.

The intrusiveness of this would be very small, I think.

Thoughts?

cheers

andrew


Re: exposing COPY API

From
Itagaki Takahiro
Date:
On Fri, Feb 4, 2011 at 09:48, Andrew Dunstan <andrew@dunslane.net> wrote:
> I'd like to be able to add a callback function to construct the values for
> the tuple. So it would become something like:
>   typedef void (*copy_make_values) (CopyState cstate, NumFieldsRead int);

You can do nothing interesting in the callback probably
because the details of CopyState is not exported yet.
Also, we should pass through user context for such kind of callback.
The prototype of would have "void *userdata".

> Of course, I want this so I could construct a text array from the read in
> data, but I could also imagine a foreign data wrapper wanting to mangle the
> data before handing it to postgres, say by filling in a field or hashing it.

Could you explain the actual use-cases and examples?  I think we need to have
SQL-level extensibility if we provide such flexibility. I guess typical users
don't want to write functions with C for each kind of input files.

Note that pg_bulkload has a similar feature like as: CREATE FUNCTION my_function(...) RETURNS record AS ...; COPY tbl
FROM'file' WITH (make_record_from_line = my_function) 

--
Itagaki Takahiro


Re: exposing COPY API

From
Andrew Dunstan
Date:

>> Of course, I want this so I could construct a text array from the read in
>> data, but I could also imagine a foreign data wrapper wanting to mangle the
>> data before handing it to postgres, say by filling in a field or hashing it.
> Could you explain the actual use-cases and examples?  I think we need to have
> SQL-level extensibility if we provide such flexibility. I guess typical users
> don't want to write functions with C for each kind of input files.
>
> Note that pg_bulkload has a similar feature like as:
>    CREATE FUNCTION my_function(...) RETURNS record AS ...;
>    COPY tbl FROM 'file' WITH (make_record_from_line = my_function)


Umm, where? I can't find this in the documentation 
<http://pgbulkload.projects.postgresql.org/pg_bulkload.html> nor in the 
source code. And how would  module like that provide an extra copy option?

The object, as I have explained previously, is to have a FDW that 
returns a text array from a (possibly irregularly shaped) file.


So, given this file:

1,,2,3
4,5,6

select t[4] as a,t[2] as b from my_fdw_table;

would return

a | b
-----
3 |  | 5



cheers

andrew




Re: exposing COPY API

From
Itagaki Takahiro
Date:
On Fri, Feb 4, 2011 at 11:32, Andrew Dunstan <andrew@dunslane.net> wrote:
> Umm, where? I can't find this in the documentation
> <http://pgbulkload.projects.postgresql.org/pg_bulkload.html>

Here:
http://pgbulkload.projects.postgresql.org/pg_bulkload.html#filter

> The object, as I have explained previously, is to have a FDW that returns a
> text array from a (possibly irregularly shaped) file.

I remember the text array proposal, but if the extension is written in C,
it can only handle one kind of input files. If another file is broken
in a different way, you need to rewrite the C code, no?

-- 
Itagaki Takahiro


Re: exposing COPY API

From
Andrew Dunstan
Date:

On 02/03/2011 09:43 PM, Itagaki Takahiro wrote:
> On Fri, Feb 4, 2011 at 11:32, Andrew Dunstan<andrew@dunslane.net>  wrote:
>> Umm, where? I can't find this in the documentation
>> <http://pgbulkload.projects.postgresql.org/pg_bulkload.html>
> Here:
> http://pgbulkload.projects.postgresql.org/pg_bulkload.html#filter
>
>> The object, as I have explained previously, is to have a FDW that returns a
>> text array from a (possibly irregularly shaped) file.
> I remember the text array proposal, but if the extension is written in C,
> it can only handle one kind of input files. If another file is broken
> in a different way, you need to rewrite the C code, no?


AFAICT, this doesn't support ragged tables with too many columns, which 
is my prime use case. If it supported variadic arguments in filter 
functions it might come closer.

But where does the COPY syntax you showed come in?

Also, a FDW allows the COPY to be used as a FROM target, giving it great 
flexibility. AFAICT this does not.

cheers

andrew


Re: exposing COPY API

From
Itagaki Takahiro
Date:
On Fri, Feb 4, 2011 at 12:17, Andrew Dunstan <andrew@dunslane.net> wrote:
>> http://pgbulkload.projects.postgresql.org/pg_bulkload.html#filter
> AFAICT, this doesn't support ragged tables with too many columns, which is
> my prime use case. If it supported variadic arguments in filter functions it
> might come closer.

It will be good improvement for pg_bulkload, but it's off-topic ;-)

> Also, a FDW allows the COPY to be used as a FROM target, giving it great
> flexibility. AFAICT this does not.

BTW, which do you want to improve, FDW or COPY FROM?  If FDW, the better
API would be "raw" version of NextCopyFrom(). For example: bool NextRawFields(CopyState cstate, char **fields, int
*nfields)
The caller FDW has responsibility to form tuples from the raw fields.
If you need to customize how to form the tuples from various fields,
the FDW also need to have such extensibility.

If COPY FROM, we should implement all the features in copy.c
rather than exported APIs.

-- 
Itagaki Takahiro


Re: exposing COPY API

From
Andrew Dunstan
Date:

On 02/03/2011 10:43 PM, Itagaki Takahiro wrote:
>
>> Also, a FDW allows the COPY to be used as a FROM target, giving it great
>> flexibility. AFAICT this does not.
> BTW, which do you want to improve, FDW or COPY FROM?  If FDW, the better
> API would be "raw" version of NextCopyFrom(). For example:
>    bool NextRawFields(CopyState cstate, char **fields, int *nfields)
> The caller FDW has responsibility to form tuples from the raw fields.
> If you need to customize how to form the tuples from various fields,
> the FDW also need to have such extensibility.
>
> If COPY FROM, we should implement all the features in copy.c
> rather than exported APIs.



The problem with COPY FROM is that nobody's come up with a good syntax 
for allowing it as a FROM target. Doing what I want via FDW neatly gets 
us around that problem. But I'm quite OK with doing the hard work inside 
the COPY code - that's what my working prototype does in fact.

One thing I'd like is to to have file_fdw do something we can't do 
another way. currently it doesn't, so it's nice but uninteresting.

cheers

andrew


Re: exposing COPY API

From
Itagaki Takahiro
Date:
Here is a demonstration to support jagged input files. It's a patch
on the latest patch. The new added API is:

  bool NextLineCopyFrom(
        [IN] CopyState cstate,
        [OUT] char ***fields, [OUT] int *nfields, [OUT] Oid *tupleOid)

It just returns separated fields in the next line. Fortunately, I need
no extra code for it because it is just extracted from NextCopyFrom().

I'm willing to include the change into copy APIs,
but we still have a few issues. See below.

On Fri, Feb 4, 2011 at 16:53, Andrew Dunstan <andrew@dunslane.net> wrote:
> The problem with COPY FROM is that nobody's come up with a good syntax for
> allowing it as a FROM target. Doing what I want via FDW neatly gets us
> around that problem. But I'm quite OK with doing the hard work inside the
> COPY code - that's what my working prototype does in fact.

I think it is not only syntax issue. I found an issue that we hard to
support FORCE_NOT_NULL option for extra fields. See FIXME in the patch.
It is a fundamental problem to support jagged fields.

> One thing I'd like is to to have file_fdw do something we can't do another
> way. currently it doesn't, so it's nice but uninteresting.

BTW, how do you determine which field is shifted in your broken CSV file?
For example, the case you find "AB,CD,EF" for 2 columns tables.
I could provide a raw CSV reader for jagged files, but you still have to
cook the returned fields into a proper tuple...

--
Itagaki Takahiro

Attachment

Re: exposing COPY API

From
Andrew Dunstan
Date:

On 02/04/2011 05:49 AM, Itagaki Takahiro wrote:
> Here is a demonstration to support jagged input files. It's a patch
> on the latest patch. The new added API is:
>
>    bool NextLineCopyFrom(
>          [IN] CopyState cstate,
>          [OUT] char ***fields, [OUT] int *nfields, [OUT] Oid *tupleOid)
>
> It just returns separated fields in the next line. Fortunately, I need
> no extra code for it because it is just extracted from NextCopyFrom().

Thanks, I'll have a look at it, after an emergency job I need to attend 
to. But the API looks weird. Why are fields and nfields OUT params. The 
issue isn't decomposing the line into raw fields. The code for doing 
that works fine as is, including on jagged files. See commit 
af1a614ec6d074fdea46de2e1c462f23fc7ddc6f which was done for exactly this 
purpose. The issue is taking those and composing them into the expected 
tuple.

> I'm willing to include the change into copy APIs,
> but we still have a few issues. See below.
>
> On Fri, Feb 4, 2011 at 16:53, Andrew Dunstan<andrew@dunslane.net>  wrote:
>> The problem with COPY FROM is that nobody's come up with a good syntax for
>> allowing it as a FROM target. Doing what I want via FDW neatly gets us
>> around that problem. But I'm quite OK with doing the hard work inside the
>> COPY code - that's what my working prototype does in fact.
> I think it is not only syntax issue. I found an issue that we hard to
> support FORCE_NOT_NULL option for extra fields. See FIXME in the patch.
> It is a fundamental problem to support jagged fields.

It's not a problem at all if you turn the line into a text array. That's 
exactly why we've been proposing it for this. The array has however many 
elements are on the line.

>> One thing I'd like is to to have file_fdw do something we can't do another
>> way. currently it doesn't, so it's nice but uninteresting.
> BTW, how do you determine which field is shifted in your broken CSV file?
> For example, the case you find "AB,CD,EF" for 2 columns tables.
> I could provide a raw CSV reader for jagged files, but you still have to
> cook the returned fields into a proper tuple...
>

See above. My client who deals with this situation and has been doing so 
for years treats underflowing fields as null and ignores overflowing 
fields. They would do he same if the data were delivered with a text 
array. It works very well for them.


See <https://github.com/adunstan/postgresql-dev/tree/sqlmed2> for my dev 
branch on this.


cheers

andrew




Re: exposing COPY API

From
Andrew Dunstan
Date:

On 02/04/2011 05:49 AM, Itagaki Takahiro wrote:
> Here is a demonstration to support jagged input files. It's a patch
> on the latest patch. The new added API is:
>
>    bool NextLineCopyFrom(
>          [IN] CopyState cstate,
>          [OUT] char ***fields, [OUT] int *nfields, [OUT] Oid *tupleOid)
>
> It just returns separated fields in the next line. Fortunately, I need
> no extra code for it because it is just extracted from NextCopyFrom().
>
> I'm willing to include the change into copy APIs,
> but we still have a few issues. See below.


I've looked at this, and I think it will do what I want. I haven't had 
time to play with it, although I hope to soon.  AIUI, it basically hands 
back the raw parsed strings to the user, who then has the responsibility 
of constructing Datum and Nulls arrays to form the tuple.  That should 
be all I need. So +1 from me for including it. In fact, +10. And many 
thanks.


I think we need a better name though. NextCopyFromRawFields maybe.

> On Fri, Feb 4, 2011 at 16:53, Andrew Dunstan<andrew@dunslane.net>  wrote:
>> The problem with COPY FROM is that nobody's come up with a good syntax for
>> allowing it as a FROM target. Doing what I want via FDW neatly gets us
>> around that problem. But I'm quite OK with doing the hard work inside the
>> COPY code - that's what my working prototype does in fact.
> I think it is not only syntax issue. I found an issue that we hard to
> support FORCE_NOT_NULL option for extra fields. See FIXME in the patch.
> It is a fundamental problem to support jagged fields.

I don't think we need to worry about it. The caller will have access to 
the raw strings so they can handle it. In fact, I'd take out that bit of 
code in NextCopyLine_From, and replace it with a comment about how it's 
the caller's responsibility to handle.

>> One thing I'd like is to to have file_fdw do something we can't do another
>> way. currently it doesn't, so it's nice but uninteresting.
> BTW, how do you determine which field is shifted in your broken CSV file?
> For example, the case you find "AB,CD,EF" for 2 columns tables.
> I could provide a raw CSV reader for jagged files, but you still have to
> cook the returned fields into a proper tuple...
>


I answered this previously, but in the case of a text array it won't 
even arise - the array will have however many fields have been read.


cheers

andrew



Re: exposing COPY API

From
Andrew Dunstan
Date:

On 02/07/2011 11:34 AM, Andrew Dunstan wrote:
>
>
> On 02/04/2011 05:49 AM, Itagaki Takahiro wrote:
>> Here is a demonstration to support jagged input files. It's a patch
>> on the latest patch. The new added API is:
>>
>>    bool NextLineCopyFrom(
>>          [IN] CopyState cstate,
>>          [OUT] char ***fields, [OUT] int *nfields, [OUT] Oid *tupleOid)
>>
>> It just returns separated fields in the next line. Fortunately, I need
>> no extra code for it because it is just extracted from NextCopyFrom().
>>
>> I'm willing to include the change into copy APIs,
>> but we still have a few issues. See below.
>
>
> I've looked at this, and I think it will do what I want. I haven't had
> time to play with it, although I hope to soon.  AIUI, it basically
> hands back the raw parsed strings to the user, who then has the
> responsibility of constructing Datum and Nulls arrays to form the
> tuple.  That should be all I need. So +1 from me for including it. In
> fact, +10. And many thanks.
>
>
> I think we need a better name though. NextCopyFromRawFields maybe.


Here is a patch against the latest revision of file_fdw to exercise this
API. It includes some regression tests, and I think apart from one or
two small details plus a requirement for documentation, is complete.

This work is also published at
<https://github.com/adunstan/postgresql-dev/tree/sqlmed3>

Here's an excerpt from the regression tests:

    CREATE FOREIGN TABLE jagged_text (
         t   text[]
    ) SERVER file_server
    OPTIONS (format 'csv', filename
    '/home/andrew/pgl/pg_head/contrib/file_fdw/data/jagged.csv', header
    'true', textarray 'true');
    SELECT * FROM jagged_text;
                          t
    --------------------------------------------
      {"one field"}
      {two,fields}
      {three,NULL,"fields of which one is null"}
      {"next line has no fields"}
      {}
    (5 rows)

    SELECT t[3] AS a, t[1] AS b, t[99] AS c  FROM jagged_text;
                   a              |            b            | c
    -----------------------------+-------------------------+---
                                  | one field               |
                                  | two                     |
      fields of which one is null | three                   |
                                  | next line has no fields |
                                  |                         |
    (5 rows)

Overall, this API works quite nicely, and does exactly what I want, so
again many thanks.

cheers

andrew



Attachment

Re: exposing COPY API

From
Itagaki Takahiro
Date:
On Tue, Feb 8, 2011 at 09:38, Andrew Dunstan <andrew@dunslane.net> wrote:
> Here is a patch against the latest revision of file_fdw to exercise this
> API. It includes some regression tests, and I think apart from one or two
> small details plus a requirement for documentation, is complete.

The patch contains a few fixes for typo in the original patch.
Hanada-san, could you take them into the core file_fdw patch?

>   CREATE FOREIGN TABLE jagged_text (
>        t   text[]
>   ) SERVER file_server
>   OPTIONS (format 'csv', filename
>   '/home/andrew/pgl/pg_head/contrib/file_fdw/data/jagged.csv', header
>   'true', textarray 'true');

There might be another approach -- we could have jagged_file_fdw aside
from file_fdw, because we cannot support some features in textarray mode
like force_not_null option and multiple non-text[] columns.

I'll include NextCopyFromRawFields() in COPY API patch to complete
raw_fields support in CopyState. (Or, we should also revert changes
related to raw_fields.)  However, we'd better postpone jagged csv
support to 9.2. The design is still under discussion.

--
Itagaki Takahiro


Re: exposing COPY API

From
Andrew Dunstan
Date:

On 02/08/2011 03:49 AM, Itagaki Takahiro wrote:
> On Tue, Feb 8, 2011 at 09:38, Andrew Dunstan<andrew@dunslane.net>  wrote:
>> Here is a patch against the latest revision of file_fdw to exercise this
>> API. It includes some regression tests, and I think apart from one or two
>> small details plus a requirement for documentation, is complete.
> The patch contains a few fixes for typo in the original patch.
> Hanada-san, could you take them into the core file_fdw patch?
>
>>    CREATE FOREIGN TABLE jagged_text (
>>         t   text[]
>>    ) SERVER file_server
>>    OPTIONS (format 'csv', filename
>>    '/home/andrew/pgl/pg_head/contrib/file_fdw/data/jagged.csv', header
>>    'true', textarray 'true');
> There might be another approach -- we could have jagged_file_fdw aside
> from file_fdw, because we cannot support some features in textarray mode
> like force_not_null option and multiple non-text[] columns.
>
> I'll include NextCopyFromRawFields() in COPY API patch to complete
> raw_fields support in CopyState. (Or, we should also revert changes
> related to raw_fields.)  However, we'd better postpone jagged csv
> support to 9.2. The design is still under discussion.


Please do include NextCopyFromRawFields(). I think the API is very 
limiting without it, but very flexible with it.

I also think that it's better to have contrib examples of the use of an 
API than not.

FORCE NOT NULL is much more of an issue for the *non* raw fields case 
than the reverse. In the raw fields case the caller can manage it 
themselves.

Multiple non-text[] columns strikes me as a red herring. This isn't the 
only possible use of NextCopyFromRawFields(), but it is one significant 
(and very useful as it stands) use.

cheers

andrew






Re: exposing COPY API

From
Shigeru HANADA
Date:
On Tue, 8 Feb 2011 17:49:09 +0900
Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote:

> On Tue, Feb 8, 2011 at 09:38, Andrew Dunstan <andrew@dunslane.net> wrote:
> > Here is a patch against the latest revision of file_fdw to exercise this
> > API. It includes some regression tests, and I think apart from one or two
> > small details plus a requirement for documentation, is complete.
> 
> The patch contains a few fixes for typo in the original patch.
> Hanada-san, could you take them into the core file_fdw patch?

Thanks, I've applied to my local repo.

s/Centinel/Sentinel/
s/Vaidate/Validate/
s/reaind/reading/

Recent file_fdw is available here:
http://git.postgresql.org/gitweb?p=users/hanada/postgres.git;a=summary

I'll submit revised file_fdw patch after removing IsForeignTable()
catalog lookup along Heikki's proposal.

Regards,
--
Shigeru Hanada




Re: exposing COPY API

From
Robert Haas
Date:
On Tue, Feb 8, 2011 at 4:42 AM, Shigeru HANADA
<hanada@metrosystems.co.jp> wrote:
> On Tue, 8 Feb 2011 17:49:09 +0900
> Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote:
>
>> On Tue, Feb 8, 2011 at 09:38, Andrew Dunstan <andrew@dunslane.net> wrote:
>> > Here is a patch against the latest revision of file_fdw to exercise this
>> > API. It includes some regression tests, and I think apart from one or two
>> > small details plus a requirement for documentation, is complete.
>>
>> The patch contains a few fixes for typo in the original patch.
>> Hanada-san, could you take them into the core file_fdw patch?
>
> Thanks, I've applied to my local repo.
>
> s/Centinel/Sentinel/
> s/Vaidate/Validate/
> s/reaind/reading/
>
> Recent file_fdw is available here:
> http://git.postgresql.org/gitweb?p=users/hanada/postgres.git;a=summary
>
> I'll submit revised file_fdw patch after removing IsForeignTable()
> catalog lookup along Heikki's proposal.

So I'm a bit confused.  I don't see the actual copy API change patch
anywhere here.  Are we close to getting something committed there?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: exposing COPY API

From
Shigeru HANADA
Date:
On Tue, 8 Feb 2011 08:49:36 -0500
Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 8, 2011 at 4:42 AM, Shigeru HANADA
> <hanada@metrosystems.co.jp> wrote:
> > I'll submit revised file_fdw patch after removing IsForeignTable()
> > catalog lookup along Heikki's proposal.
> 
> So I'm a bit confused.  I don't see the actual copy API change patch
> anywhere here.  Are we close to getting something committed there?

I'm sorry but I might have missed your point...

I replied here to answer to Itagaki-san's mention about typos in
file_fdw patch.

Or, would you mean that file_fdw should not depend on "copy API change"
patch?

Regards,
--
Shigeru Hanada




Re: exposing COPY API

From
Robert Haas
Date:
On Wed, Feb 9, 2011 at 7:38 AM, Shigeru HANADA
<hanada@metrosystems.co.jp> wrote:
> On Tue, 8 Feb 2011 08:49:36 -0500
> Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Feb 8, 2011 at 4:42 AM, Shigeru HANADA
>> <hanada@metrosystems.co.jp> wrote:
>> > I'll submit revised file_fdw patch after removing IsForeignTable()
>> > catalog lookup along Heikki's proposal.
>>
>> So I'm a bit confused.  I don't see the actual copy API change patch
>> anywhere here.  Are we close to getting something committed there?
>
> I'm sorry but I might have missed your point...
>
> I replied here to answer to Itagaki-san's mention about typos in
> file_fdw patch.
>
> Or, would you mean that file_fdw should not depend on "copy API change"
> patch?

I mean that this thread is entitled "exposing copy API", and I'm
wondering when and if the patch to expose the COPY API is going to be
committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: exposing COPY API

From
Andrew Dunstan
Date:

On 02/09/2011 12:26 PM, Robert Haas wrote:
> On Wed, Feb 9, 2011 at 7:38 AM, Shigeru HANADA
> <hanada@metrosystems.co.jp>  wrote:
>> On Tue, 8 Feb 2011 08:49:36 -0500
>> Robert Haas<robertmhaas@gmail.com>  wrote:
>>> On Tue, Feb 8, 2011 at 4:42 AM, Shigeru HANADA
>>> <hanada@metrosystems.co.jp>  wrote:
>>>> I'll submit revised file_fdw patch after removing IsForeignTable()
>>>> catalog lookup along Heikki's proposal.
>>> So I'm a bit confused.  I don't see the actual copy API change patch
>>> anywhere here.  Are we close to getting something committed there?
>> I'm sorry but I might have missed your point...
>>
>> I replied here to answer to Itagaki-san's mention about typos in
>> file_fdw patch.
>>
>> Or, would you mean that file_fdw should not depend on "copy API change"
>> patch?
> I mean that this thread is entitled "exposing copy API", and I'm
> wondering when and if the patch to expose the COPY API is going to be
> committed.


Itagaki-san published a patch for this about about 12 hours ago in the 
file_fdw thread that looks pretty committable to me.

This whole API thing is a breakout from file_fdw, because the original 
file_fdw submission copied huge chunks of copy.c instead of trying to 
leverage it.

cheers

andrew


Re: exposing COPY API

From
Robert Haas
Date:
On Wed, Feb 9, 2011 at 12:45 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Itagaki-san published a patch for this about about 12 hours ago in the
> file_fdw thread that looks pretty committable to me.

OK, excellent.

> This whole API thing is a breakout from file_fdw, because the original
> file_fdw submission copied huge chunks of copy.c instead of trying to
> leverage it.

Yeah, I remembered that, I just got mixed up because the two patches
were on the same thread, and the one that is the topic of this thread
was posted elsewhere.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company