Thread: WIP Patch: Use sortedness of CSV foreign tables for query planning

WIP Patch: Use sortedness of CSV foreign tables for query planning

From
"Etsuro Fujita"
Date:
The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c:
   /*    * If data file was sorted, and we knew it somehow, we could insert    * appropriate pathkeys into the
ForeignPathnode to tell the planner    * that.    */
 

To do this, I would like to propose new generic options for a file_fdw foreign
table to specify the sortedness of a data file.  While it is best to allow to
specify the sortedness on multiple columns, the current interface for the
generic options dose not seems to be suitable for doing it.  As a compromise, I
would like to propose single-column sortedness options and insert appropriate
pathkeys into the ForeignPath node based on these information:
sorted: Boolean option to specify whether data file is sorted by a column.key: Specifies the name of a column by which
datafile is sorted.  Required
 
when the above option is set to true.direction: Specifies the sort order: asc or desc.  The default is asc.nulls:
Specifiesthat nulls sort before or after non-nulls: first or last.
 
first is the default when direction option is set desc.  When direction option
is not set desc, last is the default.

Attached is a WIP patch implementing this feature.  I would like to demonstrate
the usefulness of the patch.  Experimental results are shown below.  Here, data
in /home/pgsql/relation.csv is sorted by aid in ascending order.

postgres=# CREATE EXTENSION file_fdw;
CREATE EXTENSION
postgres=# CREATE SERVER fs FOREIGN DATA WRAPPER file_fdw;
CREATE SERVER
postgres=# CREATE FOREIGN TABLE ft (aid INTEGER, bid INTEGER) SERVER fs OPTIONS
(filename '/home/pgsql/relation.csv', format 'csv', delimiter ',');
CREATE FOREIGN TABLE
postgres=# ANALYZE ft;
ANALYZE
postgres=# SELECT count(*) FROM ft; count
----------50000000
(1 row)

postgres=# SELECT count(DISTINCT aid) FROM ft; count
---------1000000
(1 row)

postgres=# EXPLAIN ANALYZE SELECT DISTINCT aid FROM ft ORDER BY aid;
       QUERY PLAN
 

--------------------------------------------------------------------------------
----------------------------------------------------Unique  (cost=14187375.19..14437375.19 rows=1092929 width=4)
(actualtime=48952
 
.602..62788.934 rows=1000000 loops=1)  ->  Sort  (cost=14187375.19..14312375.19 rows=50000000 width=4) (actual time=
48952.601..56635.448 rows=50000000 loops=1)        Sort Key: aid        Sort Method: external sort  Disk: 684272kB
 ->  Foreign Scan on ft  (cost=0.00..5059137.00 rows=50000000 width=4) (
 
actual time=0.073..18324.062 rows=50000000 loops=1)              Foreign File: /home/pgsql/relation.csv
ForeignFile Size: 484444500Total runtime: 63019.868 ms
 
(8 rows)

postgres=# ALTER FOREIGN TABLE ft OPTIONS ( ADD  sorted 'true', key 'aid' );
ALTER FOREIGN TABLE
postgres=# EXPLAIN ANALYZE SELECT DISTINCT aid FROM ft ORDER BY aid;
    QUERY PLAN
 

--------------------------------------------------------------------------------
----------------------------------------------Unique  (cost=0.00..5184137.00 rows=1092929 width=4) (actual
time=0.074..23124.
195 rows=1000000 loops=1)  ->  Foreign Scan on ft  (cost=0.00..5059137.00 rows=50000000 width=4)
(actualtime=0.070..17633.821rows=50000000 loops=1)        Foreign File: /home/pgsql/relation.csv        Foreign File
Size:484444500Total runtime: 23213.909 ms
 
(5 rows)

Any comments and suggestions are welcomed.

Thanks,

Best regards,
Etsuro Fujita

Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From
"Kevin Grittner"
Date:
"Etsuro Fujita" <fujita.etsuro@lab.ntt.co.jp> wrote:
> I would like to propose single-column sortedness options and
> insert appropriate pathkeys into the ForeignPath node based on
> these information:
> 
>  sorted: Boolean option to specify whether data file is sorted by
>      a column.
>  key: Specifies the name of a column by which data file is sorted.
>      Required when the above option is set to true.
>  direction: Specifies the sort order: asc or desc.  The default is
>      asc.
>  nulls: Specifies that nulls sort before or after non-nulls: first
>      or last.  first is the default when direction option is set
>      desc.  When direction option is not set desc, last is the
>      default.
For character-based columns, don't we need to know the collation?
-Kevin


Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From
"Etsuro Fujita"
Date:
Hi Kevin,

> > I would like to propose single-column sortedness options and
> > insert appropriate pathkeys into the ForeignPath node based on
> > these information:

> For character-based columns, don't we need to know the collation?

I forgot to add the collate option.  I'll add it at the next version of the
patch.

Thanks,

Best regards,
Etsuro Fujita



Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From
Robert Haas
Date:
On Thu, Aug 2, 2012 at 7:01 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
> The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c:
>
>     /*
>      * If data file was sorted, and we knew it somehow, we could insert
>      * appropriate pathkeys into the ForeignPath node to tell the planner
>      * that.
>      */
>
> To do this, I would like to propose new generic options for a file_fdw foreign
> table to specify the sortedness of a data file.  While it is best to allow to
> specify the sortedness on multiple columns, the current interface for the
> generic options dose not seems to be suitable for doing it.  As a compromise, I
> would like to propose single-column sortedness options and insert appropriate
> pathkeys into the ForeignPath node based on these information:

I am not sure it is a good idea to complicate file_fdw with frammishes
of marginal utility.  I guess I tend to view things like file_fdw as a
mechanism for getting the data into the database, not necessarily
something that you actually want to keep your data in permanently and
run complex queries against.  It seems like that's the direction we're
headed in here - statistics, indexing, etc.  I am all in favor of
having some kind of pluggable storage engine as an alternative to our
heap, but I'm not sure a flat-file is a good choice.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From
"Etsuro Fujita"
Date:
Hi Robert,

> From: Robert Haas [mailto:robertmhaas@gmail.com]

> On Thu, Aug 2, 2012 at 7:01 AM, Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp> wrote:
> > The following is a comment at fileGetForeignPaths() in contrib/file_fdw.c:
> >
> >     /*
> >      * If data file was sorted, and we knew it somehow, we could insert
> >      * appropriate pathkeys into the ForeignPath node to tell the planner
> >      * that.
> >      */
> >
> > To do this, I would like to propose new generic options for a file_fdw
foreign
> > table to specify the sortedness of a data file.  While it is best to allow
> to
> > specify the sortedness on multiple columns, the current interface for the
> > generic options dose not seems to be suitable for doing it.  As a
compromise,
> I
> > would like to propose single-column sortedness options and insert
appropriate
> > pathkeys into the ForeignPath node based on these information:
> 
> I am not sure it is a good idea to complicate file_fdw with frammishes
> of marginal utility.  I guess I tend to view things like file_fdw as a
> mechanism for getting the data into the database, not necessarily
> something that you actually want to keep your data in permanently and
> run complex queries against.

I think file_fdw is useful for managing log files such as PG CSV logs.  Since
often, such files are sorted by timestamp, I think the patch can improve the
performance of log analysis, though I have to admit my demonstration was not
realistic.

> It seems like that's the direction we're
> headed in here - statistics, indexing, etc.  I am all in favor of
> having some kind of pluggable storage engine as an alternative to our
> heap, but I'm not sure a flat-file is a good choice.

As you pointed out, I would like to allow indexing to be done for CSV foreign
tables, but that is another problem.  The submitted patch or the above comment
is not something toward indexing, so to say, an optimization of the current
file_fdw module.

Thanks,

Best regards,
Etsuro Fujita



Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From
Robert Haas
Date:
On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
> I think file_fdw is useful for managing log files such as PG CSV logs.  Since
> often, such files are sorted by timestamp, I think the patch can improve the
> performance of log analysis, though I have to admit my demonstration was not
> realistic.

Hmm, I guess I could buy that as a plausible use case.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
> <fujita.etsuro@lab.ntt.co.jp> wrote:
>> I think file_fdw is useful for managing log files such as PG CSV logs.  Since
>> often, such files are sorted by timestamp, I think the patch can improve the
>> performance of log analysis, though I have to admit my demonstration was not
>> realistic.

> Hmm, I guess I could buy that as a plausible use case.

In the particular case of PG log files, I'd bet good money against them
being *exactly* sorted by timestamp.  Clock skew between backends, or
varying amounts of time to construct and send messages, will result in
small inconsistencies.  This would generally not matter, until the
planner relied on the claim of sortedness for something like a mergejoin
... and then it would matter a lot.

In general I'm quite suspicious of the idea of believing that externally
supplied data is sorted in exactly the way that PG thinks it should
sort.  If we implement this you can bet that people will screw up, for
instance by using the wrong locale/collation to sort text data.
        regards, tom lane


Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From
Robert Haas
Date:
On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
>> <fujita.etsuro@lab.ntt.co.jp> wrote:
>>> I think file_fdw is useful for managing log files such as PG CSV logs.  Since
>>> often, such files are sorted by timestamp, I think the patch can improve the
>>> performance of log analysis, though I have to admit my demonstration was not
>>> realistic.
>
>> Hmm, I guess I could buy that as a plausible use case.
>
> In the particular case of PG log files, I'd bet good money against them
> being *exactly* sorted by timestamp.  Clock skew between backends, or
> varying amounts of time to construct and send messages, will result in
> small inconsistencies.  This would generally not matter, until the
> planner relied on the claim of sortedness for something like a mergejoin
> ... and then it would matter a lot.

Hmm, true.

> In general I'm quite suspicious of the idea of believing that externally
> supplied data is sorted in exactly the way that PG thinks it should
> sort.  If we implement this you can bet that people will screw up, for
> instance by using the wrong locale/collation to sort text data.

I think that optimizations like this are going to be essential for
things like pgsql_fdw (or other_rdms_fdw).  Despite the thorny
semantic issues, we're just not going to be able to get around it.
There will even be people who want SELECT * FROM ft ORDER BY 1 to
order by the remote side's notion of ordering rather than ours,
despite the fact that the remote side has some insane-by-PG-standards
definition of ordering.  People are going to find ways to do that kind
of thing whether we condone it or not, so we might as well start
thinking now about how we're going to live with it.  But that doesn't
answer the question of whether or not we ought to support it for
file_fdw in particular, which seems like a more arguable point.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From
"Etsuro Fujita"
Date:
> From: Robert Haas [mailto:robertmhaas@gmail.com]

> On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
> >> <fujita.etsuro@lab.ntt.co.jp> wrote:
> >>> I think file_fdw is useful for managing log files such as PG CSV logs.
Since
> >>> often, such files are sorted by timestamp, I think the patch can improve
> the
> >>> performance of log analysis, though I have to admit my demonstration was
> not
> >>> realistic.
> >
> >> Hmm, I guess I could buy that as a plausible use case.
> >
> > In the particular case of PG log files, I'd bet good money against them
> > being *exactly* sorted by timestamp.  Clock skew between backends, or
> > varying amounts of time to construct and send messages, will result in
> > small inconsistencies.  This would generally not matter, until the
> > planner relied on the claim of sortedness for something like a mergejoin
> > ... and then it would matter a lot.
> 
> Hmm, true.
> 
> > In general I'm quite suspicious of the idea of believing that externally
> > supplied data is sorted in exactly the way that PG thinks it should
> > sort.  If we implement this you can bet that people will screw up, for
> > instance by using the wrong locale/collation to sort text data.
> 
> I think that optimizations like this are going to be essential for
> things like pgsql_fdw (or other_rdms_fdw).  Despite the thorny
> semantic issues, we're just not going to be able to get around it.
> There will even be people who want SELECT * FROM ft ORDER BY 1 to
> order by the remote side's notion of ordering rather than ours,
> despite the fact that the remote side has some insane-by-PG-standards
> definition of ordering.  People are going to find ways to do that kind
> of thing whether we condone it or not, so we might as well start
> thinking now about how we're going to live with it.  But that doesn't
> answer the question of whether or not we ought to support it for
> file_fdw in particular, which seems like a more arguable point.

For file_fdw, I feel inclined to simply implement file_fdw (1) to verify the key
column is sorted in the specified way at the execution phase ie, at the (first)
scan of a data file, only when pathkeys are set, and (2) to abort the
transaction if it detects the data file is not sorted.

Thanks,

Best regards,
Etsuro Fujita



Re: WIP Patch: Use sortedness of CSV foreign tables for query planning

From
Robert Haas
Date:
On Tue, Aug 7, 2012 at 2:02 AM, Etsuro Fujita
<fujita.etsuro@lab.ntt.co.jp> wrote:
>> I think that optimizations like this are going to be essential for
>> things like pgsql_fdw (or other_rdms_fdw).  Despite the thorny
>> semantic issues, we're just not going to be able to get around it.
>> There will even be people who want SELECT * FROM ft ORDER BY 1 to
>> order by the remote side's notion of ordering rather than ours,
>> despite the fact that the remote side has some insane-by-PG-standards
>> definition of ordering.  People are going to find ways to do that kind
>> of thing whether we condone it or not, so we might as well start
>> thinking now about how we're going to live with it.  But that doesn't
>> answer the question of whether or not we ought to support it for
>> file_fdw in particular, which seems like a more arguable point.
>
> For file_fdw, I feel inclined to simply implement file_fdw (1) to verify the key
> column is sorted in the specified way at the execution phase ie, at the (first)
> scan of a data file, only when pathkeys are set, and (2) to abort the
> transaction if it detects the data file is not sorted.

That seems like an even worse idea.  People who want to access data
repeatedly should load it into tables.

Mind you, if you want to publish a version of file_fdw on PGXN that
does this, that's fine with me.  But I don't think it belongs in core,
at least not without a lot more evidence that there is a real demand
for this than we have so far.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company