> From: Robert Haas [mailto:robertmhaas@gmail.com]
> On Mon, Aug 6, 2012 at 10:33 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > Robert Haas <robertmhaas@gmail.com> writes:
> >> On Sun, Aug 5, 2012 at 10:41 PM, Etsuro Fujita
> >> <fujita.etsuro@lab.ntt.co.jp> wrote:
> >>> I think file_fdw is useful for managing log files such as PG CSV logs.
> >>> Since often, such files are sorted by timestamp, I think the patch can
> >>> improve the performance of log analysis, though I have to admit my
> >>> demonstration was not realistic.
> >
> >> Hmm, I guess I could buy that as a plausible use case.
> >
> > In the particular case of PG log files, I'd bet good money against them
> > being *exactly* sorted by timestamp. Clock skew between backends, or
> > varying amounts of time to construct and send messages, will result in
> > small inconsistencies. This would generally not matter, until the
> > planner relied on the claim of sortedness for something like a mergejoin
> > ... and then it would matter a lot.
>
> Hmm, true.
>
> > In general I'm quite suspicious of the idea of believing that externally
> > supplied data is sorted in exactly the way that PG thinks it should
> > sort. If we implement this you can bet that people will screw up, for
> > instance by using the wrong locale/collation to sort text data.
>
> I think that optimizations like this are going to be essential for
> things like pgsql_fdw (or other_rdms_fdw). Despite the thorny
> semantic issues, we're just not going to be able to get around it.
> There will even be people who want SELECT * FROM ft ORDER BY 1 to
> order by the remote side's notion of ordering rather than ours,
> despite the fact that the remote side has some insane-by-PG-standards
> definition of ordering. People are going to find ways to do that kind
> of thing whether we condone it or not, so we might as well start
> thinking now about how we're going to live with it. But that doesn't
> answer the question of whether or not we ought to support it for
> file_fdw in particular, which seems like a more arguable point.
For file_fdw, I am inclined to simply implement it so that (1) it verifies, at
execution time, that the key column is sorted in the specified way, i.e.,
during the (first) scan of the data file and only when pathkeys are set, and
(2) it aborts the transaction if it detects that the data file is not sorted.
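
To make the idea concrete, here is a rough sketch in Python rather than the
actual file_fdw C code. The names scan_sorted, NotSortedError, key_col and
check_order are hypothetical and exist only for illustration; check_order
stands in for "pathkeys are set", and raising the exception stands in for
aborting the transaction. The sketch compares keys as plain strings, whereas a
real implementation would have to use the comparison of the column's type and
collation.

```python
import csv
from typing import Iterator, List, TextIO


class NotSortedError(Exception):
    """Stands in for aborting the transaction when the file is not sorted."""


def scan_sorted(f: TextIO, key_col: int, check_order: bool) -> Iterator[List[str]]:
    """Yield CSV rows from f, optionally verifying ascending order on key_col.

    check_order models "pathkeys are set": the check runs only when the
    planner would have relied on the claimed ordering.
    """
    prev = None
    for lineno, row in enumerate(csv.reader(f), start=1):
        key = row[key_col]
        # Assumption: plain string comparison. A real check would use the
        # column's data type and collation instead.
        if check_order and prev is not None and key < prev:
            raise NotSortedError(
                "row %d: key %r sorts before previous key %r" % (lineno, key, prev)
            )
        prev = key
        yield row
```

Because the check runs while rows are being returned, a violation late in the
file is only detected after earlier rows have already been passed up, which is
why the sketch treats it as a hard error rather than something to recover from.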
Thanks,
Best regards,
Etsuro Fujita