Re: COPY FROM WHEN condition - Mailing list pgsql-hackers

From Corey Huinker
Subject Re: COPY FROM WHEN condition
Date
Msg-id CADkLM=d94dynBhz0WXpg-8Wc2=aDCzO9J4AoikgXJ9d0uJyqAg@mail.gmail.com
Whole thread Raw
In response to Re: COPY FROM WHEN condition  (David Fetter <david@fetter.org>)
Responses Re: COPY FROM WHEN condition  (Pavel Stehule <pavel.stehule@gmail.com>)
Re: COPY FROM WHEN condition  (Tomas Vondra <tomas.vondra@2ndquadrant.com>)
Re: COPY FROM WHEN condition  (David Fetter <david@fetter.org>)
List pgsql-hackers
> Are you thinking something like having a COPY command that provides
> results in such a way that they could be referenced in a FROM clause
> (perhaps a COPY that defines a cursor…)?

That would also be nice, but what I was thinking of was that some
highly restricted subset of cases of SQL in general could lend
themselves to levels of optimization that would be impractical in
other contexts.

If COPY (or a syntactical equivalent) can return a result set, then the whole of SQL is available to filter and aggregate the results and we don't have to invent new syntax, or endure confusion whenCOPY-WHEN syntax behaves subtly different from a similar FROM-WHERE.

Also, what would we be saving computationally? The whole file (or program output) has to be consumed no matter what, the columns have to be parsed no matter what. At least some of the columns have to be converted to their assigned datatypes enough to know whether or not to filter the row, but we might be able push that logic inside a copy. I'm thinking of something like this:

SELECT x.a, sum(x.b)
FROM ( COPY INLINE '/path/to/foo.txt' FORMAT CSV ) as x( a integer, b numeric, c text, d date, e json) )
WHERE x.d >= '2018-11-01'

In this case, there is the opportunity to see the following optimizations:
- columns c and e are never referenced, and need never be turned into a datum (though we might do so just to confirm that they conform to the data type)
- if column d is converted first, we can filter on it and avoid converting columns a,b
- whatever optimizations we can infer from knowing that the two surviving columns will go directly into an aggregate

If we go this route, we can train the planner to notice other optimizations and add those mechanisms at that time, and then existing code gets faster.

If we go the COPY-WHEN route, then we have to make up new syntax for every possible future optimization.

pgsql-hackers by date:

Previous
From: Christian Ohler
Date:
Subject: Re: WIP Patch: Add a function that returns binary JSONB as a bytea
Next
From: Tom Lane
Date:
Subject: Re: WIP Patch: Add a function that returns binary JSONB as a bytea