Re: WIP: Collecting statistics on CSV file data - Mailing list pgsql-hackers

From Etsuro Fujita
Subject Re: WIP: Collecting statistics on CSV file data
Msg-id 005a01cd13b4$5f890a30$1e9b1e90$@lab.ntt.co.jp
In response to Re: WIP: Collecting statistics on CSV file data  (Shigeru HANADA <shigeru.hanada@gmail.com>)
List pgsql-hackers
Thanks, Hanada-san!

Best regards,
Etsuro Fujita

-----Original Message-----
From: Shigeru HANADA [mailto:shigeru.hanada@gmail.com] 
Sent: Friday, April 06, 2012 11:41 AM
To: Etsuro Fujita
Cc: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] WIP: Collecting statistics on CSV file data

(2012/04/05 21:10), Shigeru HANADA wrote:
> file_fdw
> ========
> This patch contains a use case of the new handler function in
> contrib/file_fdw.  Since file_fdw reads data from a flat file,
> fileAnalyzeForeignTable uses an algorithm similar to the one used for
> ordinary tables: it takes the first N rows as the initial sample, and
> then randomly replaces them with subsequent rows.  file_fdw also
> updates pg_class.relpages by calculating the number of pages from the
> size of the data file.
> 
> To allow FDWs to implement a sampling algorithm like this, several
> functions are exported from analyze.c, e.g. random_fract,
> init_selection_state, and get_next_S.
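
The sampling scheme quoted above (keep the first N rows, then randomly replace them with later rows) is reservoir sampling.  What follows is a minimal, self-contained sketch of the idea, not the code in the patch: analyze.c uses Vitter's method and skips rows via get_next_S, whereas this sketch uses the simpler "Algorithm R".  The function name reservoir_sample, its parameters, and the use of rand() are assumptions made for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical sketch: reservoir sampling (Algorithm R) over the row
 * indexes 0..nrows-1.  Stores up to targrows row indexes in sample[]
 * and returns how many were stored.
 *
 * rand() is used only for brevity; it has modulo bias and a limited
 * range, so a real implementation would want a better generator.
 */
static size_t
reservoir_sample(size_t nrows, size_t targrows, size_t *sample,
                 unsigned int seed)
{
    size_t numrows = 0;
    size_t row;

    assert(sample != NULL || targrows == 0);
    srand(seed);

    for (row = 0; row < nrows; row++)
    {
        if (numrows < targrows)
        {
            /* The first targrows rows fill the reservoir directly. */
            sample[numrows++] = row;
        }
        else
        {
            /*
             * Pick k in [0, row].  The new row replaces slot k only when
             * k falls inside the reservoir, i.e. with probability
             * targrows / (row + 1).
             */
            size_t k = (size_t) rand() % (row + 1);

            if (k < targrows)
                sample[k] = row;
        }
    }
    return numrows;
}
```

When the file has fewer rows than the target, every row ends up in the sample, in file order.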

Just after my post, Fujita-san posted another v7 patch[1], so I merged
the two v7 patches into a v8 patch.

[1] http://archives.postgresql.org/pgsql-hackers/2012-04/msg00212.php

Changes taken from Fujita-san's patch
=====================================
* Remove reporting validrows and deadrows at the end of acquire_sample_rows
of file_fdw.  Thus, it doesn't validate NOT NULL constraints any more.
* Improve get_rel_size of file_fdw, which is used in GetForeignRelSize, to
estimate the current # of tuples of the foreign table from these values:
  - # of pages/tuples as updated by the last ANALYZE
  - current file size
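
The estimate described in the last item can be sketched as follows: derive a tuple density from the page and tuple counts stored by the last ANALYZE, then scale it by the page count implied by the current file size.  This is only an illustration under stated assumptions, not the patch's code: the function name estimate_tuples, the 8192-byte block size constant, and the width-based fallback for a never-analyzed table are all hypothetical here.

```c
#include <assert.h>

/* Assumed block size for the sketch (PostgreSQL's default BLCKSZ). */
#define SKETCH_BLOCK_SIZE 8192

/*
 * Hypothetical sketch: estimate the current number of tuples in a flat
 * file from the statistics of the last ANALYZE (relpages, reltuples)
 * and the current file size in bytes.
 */
static double
estimate_tuples(long long file_size, double relpages, double reltuples,
                double fallback_tuple_width)
{
    /* Current size in pages, rounded up, and never less than one page. */
    double pages = (double) ((file_size + SKETCH_BLOCK_SIZE - 1)
                             / SKETCH_BLOCK_SIZE);

    assert(file_size >= 0);
    assert(fallback_tuple_width > 0);

    if (pages < 1)
        pages = 1;

    if (relpages > 0)
    {
        /* Tuples per page at the time of the last ANALYZE. */
        double density = reltuples / relpages;

        return density * pages;
    }

    /* Assumption: with no statistics, guess from an average row width. */
    return (double) file_size / fallback_tuple_width;
}
```

For example, a file that had 500 tuples in 5 pages at the last ANALYZE and has since grown to 10 pages would be estimated at about 1000 tuples.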
 

Additional Changes
==================
* Fix a memory leak in acquire_sample_rows, which was caused by calling
NextCopyFrom repeatedly in one long-lived memory context.  I added a
per-record temporary context, which is used while processing each record.
The main context is used to create heap tuples from sampled records, because
sample tuples must remain valid after the function ends.
* Some cosmetic changes to the documentation, e.g. removing line breaks
inside tagged elements.
* Some cosmetic changes to make the patch more readable by minimizing
differences from the master branch.
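
The memory-leak fix in the first item above follows a common pattern: do per-record scratch work in a short-lived context that is reset after every record, and copy only the sampled records into the long-lived one.  The sketch below illustrates that pattern with a toy bump allocator standing in for a memory context.  It does not use the PostgreSQL memory context API, and every name in it (Arena, arena_alloc, sample_records, and so on) is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a memory context: a fixed-size bump allocator. */
typedef struct Arena
{
    char   *base;
    size_t  size;
    size_t  used;
} Arena;

static int
arena_init(Arena *a, size_t size)
{
    a->base = malloc(size);
    a->size = size;
    a->used = 0;
    return a->base != NULL;
}

static void *
arena_alloc(Arena *a, size_t n)
{
    void *p;

    n = (n + 7) & ~(size_t) 7;      /* round up to keep 8-byte alignment */
    if (n > a->size - a->used)
        return NULL;                /* arena exhausted */
    p = a->base + a->used;
    a->used += n;
    return p;
}

/* Discard everything allocated so far; the buffer itself is reused. */
static void
arena_reset(Arena *a)
{
    a->used = 0;
}

static void
arena_free(Arena *a)
{
    free(a->base);
    a->base = NULL;
    a->size = a->used = 0;
}

/*
 * Keep every keep_every-th record.  Scratch copies live in record_ctx,
 * which is reset after each record; kept records are copied into
 * main_ctx so they stay valid after the function returns.
 * Returns the number of kept records, or -1 if an arena runs out.
 */
static int
sample_records(const char *const *records, int nrecords, int keep_every,
               Arena *main_ctx, Arena *record_ctx, const char **kept)
{
    int nkept = 0;
    int i;

    assert(keep_every > 0);
    for (i = 0; i < nrecords; i++)
    {
        size_t len = strlen(records[i]) + 1;
        char  *scratch = arena_alloc(record_ctx, len);

        if (scratch == NULL)
            return -1;
        memcpy(scratch, records[i], len);   /* per-record parsing work */

        if (i % keep_every == 0)
        {
            char *copy = arena_alloc(main_ctx, len);

            if (copy == NULL)
                return -1;
            memcpy(copy, scratch, len);
            kept[nkept++] = copy;
        }
        arena_reset(record_ctx);            /* nothing accumulates here */
    }
    return nkept;
}
```

Without the reset at the end of the loop, the scratch arena would grow with the number of records read rather than with the size of a single record, which is the shape of the leak described above.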

Changes *not* merged
====================
* Fujita-san moved the documentation of AnalyzeForeignTable to the section
"Foreign Data Wrapper Helper Functions" from "Foreign Data Wrapper Callback
Routines".  But I think the analyze handler is one of the callback
functions, though it's optional.

Please find attached a patch.

Regards,
--
Shigeru HANADA


