Re: Let file_fdw access COPY FROM PROGRAM - Mailing list pgsql-hackers

From Amit Langote
Subject Re: Let file_fdw access COPY FROM PROGRAM
Date
Msg-id 57df6b39-4ade-5b24-c651-da2169512968@lab.ntt.co.jp
Whole thread Raw
In response to Let file_fdw access COPY FROM PROGRAM  (Corey Huinker <corey.huinker@gmail.com>)
Responses Re: Let file_fdw access COPY FROM PROGRAM
List pgsql-hackers
Hi Corey,

Here are some comments and a review of the patch.

On 2016/06/03 5:48, Corey Huinker wrote:
> A while back, there was a push to make COPY gzip-aware. That didn't happen,
> but COPY FROM PROGRAM did, and it scratches the same itch.
> 
> I have a similar need, but with file_fdw foreign tables. I have .csv.gz
> files downloaded to the server, but those CSVs have 100+ columns in them,
> and in this case I only really care about a half dozen of those columns.
> I'd like to avoid:
> - the overhead of writing the uncompressed file to disk and then
> immediately re-reading it
> - writing unwanted columns to a temp/work table via COPY, and then
> immediately re-reading them
> - multicorn fdw because it ends up making a python string out of all data
> cells
> - a csv parsing tool like csvtool or mlr, because they output another CSV
> which must be reparsed from scratch

This feature seems desirable to me too.

> Since file_fdw leverages COPY, it seemed like it would be easy to add the
> FROM PROGRAM feature to file_fdw. I began asking questions on #postgresql
> IRC, only to discover that Adam Gomaa ( akgomaa@gmail.com ) had already
> written such a thing, but hadn't submitted it. Attached is a small rework
> of his patch, along with documentation.

Yeah, the fact that file_fdw leverages COPY makes including this feature a
breeze.  Various concerns such as security, popen() vs execve() were
addressed when COPY FROM PROGRAM/STDIN/STDOUT feature was proposed [1], so
we will not have to go through that again (hopefully).

> NOTE: The regression test includes unix commands in the program option. I
> figured that wouldn't work for win32 systems, so I checked to see what the
> regression tests do to test COPY FROM PROGRAM...and I couldn't find any. So
> I guess the test exists as a proof of concept that will get excised before
> final commit.

I am not familiar with win32 stuff too, so I don't have much to say about
that.  Maybe someone else can chime in to help with that.

About the patch:

* applies cleanly, compiles fine
* basic functionality seems to work (have not done any extensive tests though)


-     Specifies the file to be read.  Required.  Must be an absolute path
name.
+     Specifies the file to be read.  Must be an absolute path name.
+     Either <literal>filename</literal> or <literal>program</literal> must be
+     specified.  They are mutually exclusive.
+    </para>
+   </listitem>
+  </varlistentry>

The note about filename and program being mutually exclusive could be
placed at the end of list of options.  Or perhaps mention this note as
part of the description of program option, because filename is already
defined by that point.

+   <listitem>
+    <para>
+     Specifies the command to executed.

s/to executed/to be executed/g

+     Either <literal>program</literal> or <literal>filename</literal> must be
+     specified.  They are mutually exclusive.    </para>

Oh I see this has already been mentioned in program option description.
Did you intend to specify the same in both filename and program descriptions?

@@ -97,8 +99,9 @@ typedef struct FileFdwPlanStatetypedef struct FileFdwExecutionState{    char       *filename;
/*file to read */
 
-    List       *options;        /* merged COPY options, excluding filename */
-    CopyState   cstate;         /* state of reading file */
+    char       *program;        /* program to read output from */
+    List       *options;        /* merged COPY options, excluding
filename and program */
+    CopyState   cstate;         /* state of reading file or program */

Have you considered a Boolean flag is_program instead of char **program
similar to how copy.c does it?  (See a related comment further below)

-     * This is because the filename is one of those options, and we don't
want
-     * non-superusers to be able to determine which file gets read.
+     * This is because the filename or program string are two of those
+     * options, and we don't want non-superusers to be able to determine
which
+     * file gets read or what command is run.

I'm no native English speaker, but I wonder if this should read: filename
or program string *is* one of those options ... OR filename *and* program
are two of those options ... ?  Also, "are two of those options" sounds a
bit odd to me because I don't see that used as often as "one of those
whatever".  I guess that's quite a bit of nitpicking about a C comment, ;)

@@ -257,9 +264,26 @@ file_fdw_validator(PG_FUNCTION_ARGS)                ereport(ERROR,
(errcode(ERRCODE_SYNTAX_ERROR),                        errmsg("conflicting or redundant options")));
 
+            if (program)
+                ereport(ERROR,
+                        (errcode(ERRCODE_SYNTAX_ERROR),
+                         errmsg("conflicting or redundant options")));            filename = defGetString(def);
}

+        else if (strcmp(def->defname, "program") == 0)
+        {
+            if (filename)
+                ereport(ERROR,
+                        (errcode(ERRCODE_SYNTAX_ERROR),
+                         errmsg("conflicting or redundant options")));
+            if (program)
+                ereport(ERROR,
+                        (errcode(ERRCODE_SYNTAX_ERROR),
+                         errmsg("conflicting or redundant options")));
+            program = defGetString(def);
+        }

Why does it check for filename here?  Also the other way around above.

@@ -632,11 +671,18 @@ fileBeginForeignScan(ForeignScanState *node, int eflags)     * Create CopyState from FDW options.
We always acquire all columns, so     * as to match the expected ScanTupleSlot signature.     */
 
-    cstate = BeginCopyFrom(node->ss.ss_currentRelation,
-                           filename,
-                           false,
-                           NIL,
-                           options);
+    if(filename)
+        cstate = BeginCopyFrom(node->ss.ss_currentRelation,
+                               filename,
+                               false,
+                               NIL,
+                               options);
+    else
+        cstate = BeginCopyFrom(node->ss.ss_currentRelation,
+                               program,
+                               true,
+                               NIL,
+                               options)

Like I mentioned above, if there was a is_program Boolean flag instead of
separate filename and program, this would be just:

+    cstate = BeginCopyFrom(node->ss.ss_currentRelation,
+                           filename,
+                           is_program,
+                           NIL,
+                           options);

That would get rid of if-else blocks here and a couple of other places.

diff --git a/contrib/file_fdw/input/file_fdw.source
b/contrib/file_fdw/input/file_fdw.source
index 685561f..eae34a4 100644
--- a/contrib/file_fdw/input/file_fdw.source
+++ b/contrib/file_fdw/input/file_fdw.source

You forgot to include expected output diffs.

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/002101cd9190%24081c4140%241854c3c0%24%40lab.ntt.co.jp





pgsql-hackers by date:

Previous
From: Christoph Berg
Date:
Subject: Re: [PATCH] COPY vs \copy HINT
Next
From: Craig Ringer
Date:
Subject: Re: [PATCH] COPY vs \copy HINT