Re: WIP Patch: Use sortedness of CSV foreign tables for query planning - Mailing list pgsql-hackers

From Etsuro Fujita
Subject Re: WIP Patch: Use sortedness of CSV foreign tables for query planning
Msg-id 002501cd7462$35f41ca0$a1dc55e0$@lab.ntt.co.jp
In response to Re: WIP Patch: Use sortedness of CSV foreign tables for query planning  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: WIP Patch: Use sortedness of CSV foreign tables for query planning
List pgsql-hackers
> From: Robert Haas [mailto:robertmhaas@gmail.com]

> On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
> >> <fujita.etsuro@lab.ntt.co.jp> wrote:
> >>> I think file_fdw is useful for managing log files such as PG CSV logs.
> >>> Since often, such files are sorted by timestamp, I think the patch can
> >>> improve the performance of log analysis, though I have to admit my
> >>> demonstration was not realistic.
> >
> >> Hmm, I guess I could buy that as a plausible use case.
> >
> > In the particular case of PG log files, I'd bet good money against them
> > being *exactly* sorted by timestamp.  Clock skew between backends, or
> > varying amounts of time to construct and send messages, will result in
> > small inconsistencies.  This would generally not matter, until the
> > planner relied on the claim of sortedness for something like a mergejoin
> > ... and then it would matter a lot.
> 
> Hmm, true.
> 
> > In general I'm quite suspicious of the idea of believing that externally
> > supplied data is sorted in exactly the way that PG thinks it should
> > sort.  If we implement this you can bet that people will screw up, for
> > instance by using the wrong locale/collation to sort text data.
> 
> I think that optimizations like this are going to be essential for
> things like pgsql_fdw (or other_rdms_fdw).  Despite the thorny
> semantic issues, we're just not going to be able to get around it.
> There will even be people who want SELECT * FROM ft ORDER BY 1 to
> order by the remote side's notion of ordering rather than ours,
> despite the fact that the remote side has some insane-by-PG-standards
> definition of ordering.  People are going to find ways to do that kind
> of thing whether we condone it or not, so we might as well start
> thinking now about how we're going to live with it.  But that doesn't
> answer the question of whether or not we ought to support it for
> file_fdw in particular, which seems like a more arguable point.

For file_fdw, I am inclined to simply have it (1) verify at execution time,
i.e., during the (first) scan of a data file, that the key column is sorted in
the specified way, and only when pathkeys are set, and (2) abort the
transaction if it detects that the data file is not sorted.
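
To make the idea concrete, here is a minimal standalone sketch of the check in
step (1). It is not code from the patch: the names SortCheckState,
sort_check_init, sort_check_row and first_unsorted_row are hypothetical, and it
assumes an ascending integer key (for example, an epoch timestamp) compared
with a plain <=, whereas a real implementation would have to use the
comparison operator and collation implied by the pathkeys.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-scan state; nothing like this exists in file_fdw today. */
typedef struct SortCheckState
{
    bool        have_prev;      /* has at least one row been read? */
    long        prev_key;       /* key value of the previous row */
} SortCheckState;

/* Reset the state at the start of a scan. */
static void
sort_check_init(SortCheckState *state)
{
    assert(state != NULL);
    state->have_prev = false;
    state->prev_key = 0;
}

/*
 * Feed the key of the next row.  Returns false if it is smaller than the
 * previous key, i.e. the file is not in ascending order.  Equal keys are
 * accepted.  (Assumption: plain integer comparison stands in for the real
 * sort operator.)
 */
static bool
sort_check_row(SortCheckState *state, long key)
{
    bool        ok;

    assert(state != NULL);
    ok = !state->have_prev || state->prev_key <= key;
    state->prev_key = key;
    state->have_prev = true;
    return ok;
}

/*
 * Scan all keys and return the index of the first out-of-order row,
 * or -1 if the whole input is sorted.
 */
static long
first_unsorted_row(const long *keys, size_t nkeys)
{
    SortCheckState state;
    size_t      i;

    sort_check_init(&state);
    for (i = 0; i < nkeys; i++)
    {
        if (!sort_check_row(&state, keys[i]))
            return (long) i;
    }
    return -1;
}
```

For step (2), the FDW would presumably raise the failure with
ereport(ERROR, ...) at the point where sort_check_row returns false, which
aborts the current transaction. The check costs one comparison per row and
needs only the previous key, so it can run during the ordinary scan.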

Thanks,

Best regards,
Etsuro Fujita


