Thread: exposing COPY API
Revisiting this, it occurred to me that I could achieve what I need if we extend the proposed API a bit. Currently, it has:

    extern CopyState BeginCopyFrom(Relation rel, const char *filename,
                                   List *attnamelist, List *options);

I'd like to be able to add a callback function to construct the values for the tuple. So it would become something like:

    typedef void (*copy_make_values) (CopyState cstate, int num_fields_read);

    extern CopyState BeginCopyFrom(Relation rel, const char *filename,
                                   List *attnamelist, List *options,
                                   copy_make_values custom_values_func);

If custom_values_func were NULL (as it would be if using the builtin COPY), then the builtin code would be run to construct the values for making the tuple. If not NULL, the function would be called.

Of course, I want this so I could construct a text array from the read-in data, but I could also imagine a foreign data wrapper wanting to mangle the data before handing it to postgres, say by filling in a field or hashing it.

The intrusiveness of this would be very small, I think.

Thoughts?

cheers

andrew
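For illustration only, a rough sketch of how the proposed hook might be wired up, assuming the five-argument BeginCopyFrom() above. The names make_text_array_values and begin_custom_copy are hypothetical, and the callback body is deliberately empty, since (as noted downthread) CopyState's internals are not currently exposed to callers:

    #include "postgres.h"
    #include "commands/copy.h"
    #include "nodes/pg_list.h"
    #include "utils/rel.h"

    /*
     * Hypothetical callback matching the proposed copy_make_values typedef.
     * What it could usefully do depends on what CopyState ends up exposing;
     * the intent would be to build the values/nulls arrays for the tuple,
     * e.g. packing the parsed fields into a single text[] column.
     */
    static void
    make_text_array_values(CopyState cstate, int num_fields_read)
    {
        /* build values/nulls for the tuple here */
    }

    /* Passing NULL as the last argument would keep the builtin behavior. */
    static CopyState
    begin_custom_copy(Relation rel, const char *filename, List *options)
    {
        return BeginCopyFrom(rel, filename, NIL, options,
                             make_text_array_values);
    }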
On Fri, Feb 4, 2011 at 09:48, Andrew Dunstan <andrew@dunslane.net> wrote:
> I'd like to be able to add a callback function to construct the values for
> the tuple. So it would become something like:
> typedef void (*copy_make_values) (CopyState cstate, int num_fields_read);

You probably cannot do anything interesting in the callback, because the details of CopyState are not exported yet. Also, we should pass a user context through to such a callback; the prototype would have a "void *userdata" parameter.

> Of course, I want this so I could construct a text array from the read-in
> data, but I could also imagine a foreign data wrapper wanting to mangle the
> data before handing it to postgres, say by filling in a field or hashing it.

Could you explain the actual use-cases and examples? I think we need to have SQL-level extensibility if we provide such flexibility. I guess typical users don't want to write functions in C for each kind of input file.

Note that pg_bulkload has a similar feature, like:

    CREATE FUNCTION my_function(...) RETURNS record AS ...;
    COPY tbl FROM 'file' WITH (make_record_from_line = my_function)

--
Itagaki Takahiro
>> Of course, I want this so I could construct a text array from the read-in
>> data, but I could also imagine a foreign data wrapper wanting to mangle the
>> data before handing it to postgres, say by filling in a field or hashing it.
> Could you explain the actual use-cases and examples? I think we need to have
> SQL-level extensibility if we provide such flexibility. I guess typical users
> don't want to write functions in C for each kind of input file.
>
> Note that pg_bulkload has a similar feature, like:
>     CREATE FUNCTION my_function(...) RETURNS record AS ...;
>     COPY tbl FROM 'file' WITH (make_record_from_line = my_function)

Umm, where? I can't find this in the documentation
<http://pgbulkload.projects.postgresql.org/pg_bulkload.html> nor in the source code. And how would a module like that provide an extra copy option?

The object, as I have explained previously, is to have an FDW that returns a text array from a (possibly irregularly shaped) file. So, given this file:

    1,,2,3
    4,5,6

the query

    select t[4] as a, t[2] as b from my_fdw_table;

would return

     a | b
    ---+---
     3 |
       | 5

cheers

andrew
On Fri, Feb 4, 2011 at 11:32, Andrew Dunstan <andrew@dunslane.net> wrote:
> Umm, where? I can't find this in the documentation
> <http://pgbulkload.projects.postgresql.org/pg_bulkload.html>

Here:
http://pgbulkload.projects.postgresql.org/pg_bulkload.html#filter

> The object, as I have explained previously, is to have an FDW that returns a
> text array from a (possibly irregularly shaped) file.

I remember the text array proposal, but if the extension is written in C, it can only handle one kind of input file. If another file is broken in a different way, you need to rewrite the C code, no?

--
Itagaki Takahiro
On 02/03/2011 09:43 PM, Itagaki Takahiro wrote:
> On Fri, Feb 4, 2011 at 11:32, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Umm, where? I can't find this in the documentation
>> <http://pgbulkload.projects.postgresql.org/pg_bulkload.html>
> Here:
> http://pgbulkload.projects.postgresql.org/pg_bulkload.html#filter
>
>> The object, as I have explained previously, is to have an FDW that returns a
>> text array from a (possibly irregularly shaped) file.
> I remember the text array proposal, but if the extension is written in C,
> it can only handle one kind of input file. If another file is broken
> in a different way, you need to rewrite the C code, no?

AFAICT, this doesn't support ragged tables with too many columns, which is my prime use case. If it supported variadic arguments in filter functions it might come closer. But where does the COPY syntax you showed come in?

Also, an FDW allows COPY to be used as a FROM target, giving it great flexibility. AFAICT this does not.

cheers

andrew
On Fri, Feb 4, 2011 at 12:17, Andrew Dunstan <andrew@dunslane.net> wrote:
>> http://pgbulkload.projects.postgresql.org/pg_bulkload.html#filter
> AFAICT, this doesn't support ragged tables with too many columns, which is
> my prime use case. If it supported variadic arguments in filter functions it
> might come closer.

That would be a good improvement for pg_bulkload, but it's off-topic ;-)

> Also, an FDW allows COPY to be used as a FROM target, giving it great
> flexibility. AFAICT this does not.

BTW, which do you want to improve, FDW or COPY FROM? If FDW, the better API would be a "raw" version of NextCopyFrom(). For example:

    bool NextRawFields(CopyState cstate, char **fields, int *nfields)

The calling FDW has the responsibility to form tuples from the raw fields. If you need to customize how tuples are formed from various fields, the FDW also needs to have such extensibility.

If COPY FROM, we should implement all the features in copy.c rather than in exported APIs.

--
Itagaki Takahiro
On 02/03/2011 10:43 PM, Itagaki Takahiro wrote:
>
>> Also, an FDW allows COPY to be used as a FROM target, giving it great
>> flexibility. AFAICT this does not.
> BTW, which do you want to improve, FDW or COPY FROM? If FDW, the better
> API would be a "raw" version of NextCopyFrom(). For example:
>     bool NextRawFields(CopyState cstate, char **fields, int *nfields)
> The calling FDW has the responsibility to form tuples from the raw fields.
> If you need to customize how tuples are formed from various fields,
> the FDW also needs to have such extensibility.
>
> If COPY FROM, we should implement all the features in copy.c
> rather than in exported APIs.

The problem with COPY FROM is that nobody's come up with a good syntax for allowing it as a FROM target. Doing what I want via an FDW neatly gets us around that problem. But I'm quite OK with doing the hard work inside the COPY code - that's what my working prototype does, in fact.

One thing I'd like is to have file_fdw do something we can't do another way. Currently it doesn't, so it's nice but uninteresting.

cheers

andrew
Here is a demonstration of support for jagged input files. It's a patch on top of the latest patch. The newly added API is:

    bool NextLineCopyFrom(
        [IN] CopyState cstate,
        [OUT] char ***fields, [OUT] int *nfields, [OUT] Oid *tupleOid)

It just returns the separated fields of the next line. Fortunately, I need no extra code for it because it is just extracted from NextCopyFrom(). I'm willing to include the change in the copy APIs, but we still have a few issues. See below.

On Fri, Feb 4, 2011 at 16:53, Andrew Dunstan <andrew@dunslane.net> wrote:
> The problem with COPY FROM is that nobody's come up with a good syntax for
> allowing it as a FROM target. Doing what I want via an FDW neatly gets us
> around that problem. But I'm quite OK with doing the hard work inside the
> COPY code - that's what my working prototype does, in fact.

I think it is not only a syntax issue. I found that it is hard to support the FORCE_NOT_NULL option for extra fields. See FIXME in the patch. It is a fundamental problem with supporting jagged fields.

> One thing I'd like is to have file_fdw do something we can't do another
> way. Currently it doesn't, so it's nice but uninteresting.

BTW, how do you determine which field is shifted in your broken CSV file? For example, the case where you find "AB,CD,EF" for a 2-column table. I could provide a raw CSV reader for jagged files, but you would still have to cook the returned fields into a proper tuple...

--
Itagaki Takahiro
Attachment
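To make the demonstrated API concrete, here is a rough sketch (not part of the attached patch) of the kind of read loop a caller might write against the signature above. Per-row memory context handling and error paths are omitted, and the assumption that a null field comes back as a NULL pointer is just that, an assumption:

    #include "postgres.h"
    #include "commands/copy.h"

    /* Sketch only: drive the demonstrated API and look at the raw fields. */
    static void
    scan_raw_lines(CopyState cstate)
    {
        char  **fields;
        int     nfields;
        Oid     tupleOid;

        while (NextLineCopyFrom(cstate, &fields, &nfields, &tupleOid))
        {
            int     i;

            for (i = 0; i < nfields; i++)
            {
                /*
                 * fields[i] is the raw parsed string for field i of this
                 * line (assumed NULL for a null field); it is up to the
                 * caller to cook these into a tuple.
                 */
            }
        }
    }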
On 02/04/2011 05:49 AM, Itagaki Takahiro wrote:
> Here is a demonstration of support for jagged input files. It's a patch
> on top of the latest patch. The newly added API is:
>
> bool NextLineCopyFrom(
>     [IN] CopyState cstate,
>     [OUT] char ***fields, [OUT] int *nfields, [OUT] Oid *tupleOid)
>
> It just returns the separated fields of the next line. Fortunately, I need
> no extra code for it because it is just extracted from NextCopyFrom().

Thanks, I'll have a look at it, after an emergency job I need to attend to. But the API looks weird. Why are fields and nfields OUT params?

The issue isn't decomposing the line into raw fields. The code for doing that works fine as is, including on jagged files. See commit af1a614ec6d074fdea46de2e1c462f23fc7ddc6f, which was done for exactly this purpose. The issue is taking those fields and composing them into the expected tuple.

> I'm willing to include the change in the copy APIs,
> but we still have a few issues. See below.
>
> On Fri, Feb 4, 2011 at 16:53, Andrew Dunstan <andrew@dunslane.net> wrote:
>> The problem with COPY FROM is that nobody's come up with a good syntax for
>> allowing it as a FROM target. Doing what I want via an FDW neatly gets us
>> around that problem. But I'm quite OK with doing the hard work inside the
>> COPY code - that's what my working prototype does, in fact.
> I think it is not only a syntax issue. I found that it is hard to support
> the FORCE_NOT_NULL option for extra fields. See FIXME in the patch.
> It is a fundamental problem with supporting jagged fields.

It's not a problem at all if you turn the line into a text array. That's exactly why we've been proposing it for this. The array has however many elements are on the line.

>> One thing I'd like is to have file_fdw do something we can't do another
>> way. Currently it doesn't, so it's nice but uninteresting.
> BTW, how do you determine which field is shifted in your broken CSV file?
> For example, the case where you find "AB,CD,EF" for a 2-column table.
> I could provide a raw CSV reader for jagged files, but you would still have
> to cook the returned fields into a proper tuple...

See above. My client who deals with this situation, and has been doing so for years, treats underflowing fields as null and ignores overflowing fields. They would do the same if the data were delivered as a text array. It works very well for them.

See <https://github.com/adunstan/postgresql-dev/tree/sqlmed2> for my dev branch on this.

cheers

andrew
On 02/04/2011 05:49 AM, Itagaki Takahiro wrote:
> Here is a demonstration of support for jagged input files. It's a patch
> on top of the latest patch. The newly added API is:
>
> bool NextLineCopyFrom(
>     [IN] CopyState cstate,
>     [OUT] char ***fields, [OUT] int *nfields, [OUT] Oid *tupleOid)
>
> It just returns the separated fields of the next line. Fortunately, I need
> no extra code for it because it is just extracted from NextCopyFrom().
>
> I'm willing to include the change in the copy APIs,
> but we still have a few issues. See below.

I've looked at this, and I think it will do what I want. I haven't had time to play with it, although I hope to soon. AIUI, it basically hands back the raw parsed strings to the user, who then has the responsibility of constructing Datum and Nulls arrays to form the tuple. That should be all I need. So +1 from me for including it. In fact, +10. And many thanks.

I think we need a better name though. NextCopyFromRawFields maybe.

> On Fri, Feb 4, 2011 at 16:53, Andrew Dunstan <andrew@dunslane.net> wrote:
>> The problem with COPY FROM is that nobody's come up with a good syntax for
>> allowing it as a FROM target. Doing what I want via an FDW neatly gets us
>> around that problem. But I'm quite OK with doing the hard work inside the
>> COPY code - that's what my working prototype does, in fact.
> I think it is not only a syntax issue. I found that it is hard to support
> the FORCE_NOT_NULL option for extra fields. See FIXME in the patch.
> It is a fundamental problem with supporting jagged fields.

I don't think we need to worry about it. The caller will have access to the raw strings, so they can handle it. In fact, I'd take out that bit of code in NextLineCopyFrom, and replace it with a comment about how it's the caller's responsibility to handle.

>> One thing I'd like is to have file_fdw do something we can't do another
>> way. Currently it doesn't, so it's nice but uninteresting.
> BTW, how do you determine which field is shifted in your broken CSV file?
> For example, the case where you find "AB,CD,EF" for a 2-column table.
> I could provide a raw CSV reader for jagged files, but you would still have
> to cook the returned fields into a proper tuple...

I answered this previously, but in the case of a text array it won't even arise - the array will have however many fields have been read.

cheers

andrew
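A sketch of the "constructing Datum and Nulls arrays" step described above, for the single text[] column case. It assumes the raw fields came from the raw-fields API under discussion (NextLineCopyFrom(), or NextCopyFromRawFields() if renamed) and that a null raw field is returned as a NULL pointer; this is illustrative, not the actual file_fdw code:

    #include "postgres.h"
    #include "catalog/pg_type.h"
    #include "utils/array.h"
    #include "utils/builtins.h"

    /*
     * Sketch only: build a single text[] datum from one line's raw fields,
     * treating null raw fields as NULL array elements. Jaggedness is
     * harmless here because the array simply has nfields elements.
     */
    static Datum
    text_array_from_fields(char **fields, int nfields)
    {
        Datum  *elems = (Datum *) palloc(nfields * sizeof(Datum));
        bool   *nulls = (bool *) palloc(nfields * sizeof(bool));
        int     dims[1];
        int     lbs[1];
        int     i;

        for (i = 0; i < nfields; i++)
        {
            nulls[i] = (fields[i] == NULL);
            elems[i] = nulls[i] ? (Datum) 0 : CStringGetTextDatum(fields[i]);
        }

        dims[0] = nfields;
        lbs[0] = 1;

        return PointerGetDatum(construct_md_array(elems, nulls, 1, dims, lbs,
                                                  TEXTOID, -1, false, 'i'));
    }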
On 02/07/2011 11:34 AM, Andrew Dunstan wrote:
>
> On 02/04/2011 05:49 AM, Itagaki Takahiro wrote:
>> Here is a demonstration of support for jagged input files. It's a patch
>> on top of the latest patch. The newly added API is:
>>
>> bool NextLineCopyFrom(
>>     [IN] CopyState cstate,
>>     [OUT] char ***fields, [OUT] int *nfields, [OUT] Oid *tupleOid)
>>
>> It just returns the separated fields of the next line. Fortunately, I need
>> no extra code for it because it is just extracted from NextCopyFrom().
>>
>> I'm willing to include the change in the copy APIs,
>> but we still have a few issues. See below.
>
> I've looked at this, and I think it will do what I want. I haven't had
> time to play with it, although I hope to soon. AIUI, it basically
> hands back the raw parsed strings to the user, who then has the
> responsibility of constructing Datum and Nulls arrays to form the
> tuple. That should be all I need. So +1 from me for including it. In
> fact, +10. And many thanks.
>
> I think we need a better name though. NextCopyFromRawFields maybe.

Here is a patch against the latest revision of file_fdw to exercise this API. It includes some regression tests, and I think, apart from one or two small details plus a requirement for documentation, it is complete.

This work is also published at <https://github.com/adunstan/postgresql-dev/tree/sqlmed3>

Here's an excerpt from the regression tests:

    CREATE FOREIGN TABLE jagged_text (
        t text[]
    ) SERVER file_server
    OPTIONS (format 'csv', filename
    '/home/andrew/pgl/pg_head/contrib/file_fdw/data/jagged.csv', header
    'true', textarray 'true');

    SELECT * FROM jagged_text;
                         t
    --------------------------------------------
     {"one field"}
     {two,fields}
     {three,NULL,"fields of which one is null"}
     {"next line has no fields"}
     {}
    (5 rows)

    SELECT t[3] AS a, t[1] AS b, t[99] AS c FROM jagged_text;
                  a              |            b            | c
    -----------------------------+-------------------------+---
                                 | one field               |
                                 | two                     |
     fields of which one is null | three                   |
                                 | next line has no fields |
                                 |                         |
    (5 rows)

Overall, this API works quite nicely, and does exactly what I want, so again many thanks.

cheers

andrew
Attachment
On Tue, Feb 8, 2011 at 09:38, Andrew Dunstan <andrew@dunslane.net> wrote:
> Here is a patch against the latest revision of file_fdw to exercise this
> API. It includes some regression tests, and I think, apart from one or two
> small details plus a requirement for documentation, it is complete.

The patch contains a few fixes for typos in the original patch. Hanada-san, could you take them into the core file_fdw patch?

> CREATE FOREIGN TABLE jagged_text (
>     t text[]
> ) SERVER file_server
> OPTIONS (format 'csv', filename
> '/home/andrew/pgl/pg_head/contrib/file_fdw/data/jagged.csv', header
> 'true', textarray 'true');

There might be another approach -- we could have a jagged_file_fdw alongside file_fdw, because we cannot support some features in textarray mode, like the force_not_null option and multiple non-text[] columns.

I'll include NextCopyFromRawFields() in the COPY API patch to complete raw_fields support in CopyState. (Or, we should revert the changes related to raw_fields.) However, we'd better postpone jagged CSV support to 9.2. The design is still under discussion.

--
Itagaki Takahiro
On 02/08/2011 03:49 AM, Itagaki Takahiro wrote:
> On Tue, Feb 8, 2011 at 09:38, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Here is a patch against the latest revision of file_fdw to exercise this
>> API. It includes some regression tests, and I think, apart from one or two
>> small details plus a requirement for documentation, it is complete.
> The patch contains a few fixes for typos in the original patch.
> Hanada-san, could you take them into the core file_fdw patch?
>
>> CREATE FOREIGN TABLE jagged_text (
>>     t text[]
>> ) SERVER file_server
>> OPTIONS (format 'csv', filename
>> '/home/andrew/pgl/pg_head/contrib/file_fdw/data/jagged.csv', header
>> 'true', textarray 'true');
> There might be another approach -- we could have a jagged_file_fdw alongside
> file_fdw, because we cannot support some features in textarray mode,
> like the force_not_null option and multiple non-text[] columns.
>
> I'll include NextCopyFromRawFields() in the COPY API patch to complete
> raw_fields support in CopyState. (Or, we should revert the changes
> related to raw_fields.) However, we'd better postpone jagged CSV
> support to 9.2. The design is still under discussion.

Please do include NextCopyFromRawFields(). I think the API is very limiting without it, but very flexible with it. I also think that it's better to have contrib examples of the use of an API than not.

FORCE NOT NULL is much more of an issue for the *non* raw fields case than the reverse. In the raw fields case the caller can manage it themselves. Multiple non-text[] columns strikes me as a red herring. This isn't the only possible use of NextCopyFromRawFields(), but it is one significant (and very useful as it stands) use.

cheers

andrew
On Tue, 8 Feb 2011 17:49:09 +0900
Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote:
> On Tue, Feb 8, 2011 at 09:38, Andrew Dunstan <andrew@dunslane.net> wrote:
>> Here is a patch against the latest revision of file_fdw to exercise this
>> API. It includes some regression tests, and I think, apart from one or two
>> small details plus a requirement for documentation, it is complete.
>
> The patch contains a few fixes for typos in the original patch.
> Hanada-san, could you take them into the core file_fdw patch?

Thanks, I've applied them to my local repo.

    s/Centinel/Sentinel/
    s/Vaidate/Validate/
    s/reaind/reading/

The recent file_fdw is available here:
http://git.postgresql.org/gitweb?p=users/hanada/postgres.git;a=summary

I'll submit a revised file_fdw patch after removing the IsForeignTable() catalog lookup, along the lines of Heikki's proposal.

Regards,
--
Shigeru Hanada
On Tue, Feb 8, 2011 at 4:42 AM, Shigeru HANADA <hanada@metrosystems.co.jp> wrote:
> On Tue, 8 Feb 2011 17:49:09 +0900
> Itagaki Takahiro <itagaki.takahiro@gmail.com> wrote:
>
>> On Tue, Feb 8, 2011 at 09:38, Andrew Dunstan <andrew@dunslane.net> wrote:
>>> Here is a patch against the latest revision of file_fdw to exercise this
>>> API. It includes some regression tests, and I think, apart from one or two
>>> small details plus a requirement for documentation, it is complete.
>>
>> The patch contains a few fixes for typos in the original patch.
>> Hanada-san, could you take them into the core file_fdw patch?
>
> Thanks, I've applied them to my local repo.
>
>     s/Centinel/Sentinel/
>     s/Vaidate/Validate/
>     s/reaind/reading/
>
> The recent file_fdw is available here:
> http://git.postgresql.org/gitweb?p=users/hanada/postgres.git;a=summary
>
> I'll submit a revised file_fdw patch after removing the IsForeignTable()
> catalog lookup, along the lines of Heikki's proposal.

So I'm a bit confused. I don't see the actual COPY API change patch anywhere here. Are we close to getting something committed there?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, 8 Feb 2011 08:49:36 -0500
Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Feb 8, 2011 at 4:42 AM, Shigeru HANADA
> <hanada@metrosystems.co.jp> wrote:
>> I'll submit a revised file_fdw patch after removing the IsForeignTable()
>> catalog lookup, along the lines of Heikki's proposal.
>
> So I'm a bit confused. I don't see the actual COPY API change patch
> anywhere here. Are we close to getting something committed there?

I'm sorry, but I might have missed your point...

I replied here to answer Itagaki-san's mention of typos in the file_fdw patch.

Or do you mean that file_fdw should not depend on the "copy API change" patch?

Regards,
--
Shigeru Hanada
On Wed, Feb 9, 2011 at 7:38 AM, Shigeru HANADA <hanada@metrosystems.co.jp> wrote:
> On Tue, 8 Feb 2011 08:49:36 -0500
> Robert Haas <robertmhaas@gmail.com> wrote:
>> On Tue, Feb 8, 2011 at 4:42 AM, Shigeru HANADA
>> <hanada@metrosystems.co.jp> wrote:
>>> I'll submit a revised file_fdw patch after removing the IsForeignTable()
>>> catalog lookup, along the lines of Heikki's proposal.
>>
>> So I'm a bit confused. I don't see the actual COPY API change patch
>> anywhere here. Are we close to getting something committed there?
>
> I'm sorry, but I might have missed your point...
>
> I replied here to answer Itagaki-san's mention of typos in the
> file_fdw patch.
>
> Or do you mean that file_fdw should not depend on the "copy API change"
> patch?

I mean that this thread is entitled "exposing copy API", and I'm wondering when and if the patch to expose the COPY API is going to be committed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/09/2011 12:26 PM, Robert Haas wrote:
> On Wed, Feb 9, 2011 at 7:38 AM, Shigeru HANADA
> <hanada@metrosystems.co.jp> wrote:
>> On Tue, 8 Feb 2011 08:49:36 -0500
>> Robert Haas <robertmhaas@gmail.com> wrote:
>>> On Tue, Feb 8, 2011 at 4:42 AM, Shigeru HANADA
>>> <hanada@metrosystems.co.jp> wrote:
>>>> I'll submit a revised file_fdw patch after removing the IsForeignTable()
>>>> catalog lookup, along the lines of Heikki's proposal.
>>> So I'm a bit confused. I don't see the actual COPY API change patch
>>> anywhere here. Are we close to getting something committed there?
>> I'm sorry, but I might have missed your point...
>>
>> I replied here to answer Itagaki-san's mention of typos in the
>> file_fdw patch.
>>
>> Or do you mean that file_fdw should not depend on the "copy API change"
>> patch?
> I mean that this thread is entitled "exposing copy API", and I'm
> wondering when and if the patch to expose the COPY API is going to be
> committed.

Itagaki-san published a patch for this about 12 hours ago in the file_fdw thread, and it looks pretty committable to me.

This whole API thing is a breakout from file_fdw, because the original file_fdw submission copied huge chunks of copy.c instead of trying to leverage it.

cheers

andrew
On Wed, Feb 9, 2011 at 12:45 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Itagaki-san published a patch for this about 12 hours ago in the file_fdw
> thread, and it looks pretty committable to me.

OK, excellent.

> This whole API thing is a breakout from file_fdw, because the original
> file_fdw submission copied huge chunks of copy.c instead of trying to
> leverage it.

Yeah, I remembered that; I just got mixed up because the two patches were on the same thread, and the one that is the topic of this thread was posted elsewhere.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company