Thread: WIP: Collecting statistics on CSV file data
Hi there, To enable file_fdw to estimate costs of scanning a CSV file more accurately, I would like to propose a new FDW callback routine, AnalyzeForeignTable, which allows to ANALYZE command to collect statistics on a foreign table, and a corresponding file_fdw function, fileAnalyzeForeignTable. Attached is my WIP patch. Here's a summary of the implementation: void AnalyzeForeignTable (Relation relation, VacuumStmt *vacstmt, int elevel); This is a new FDW callback routine to collect statistics on a foreign table and store the results in the pg_class and pg_statistic system catalogs. This is called when ANALYZE command is executed. (ANALYZE command should be executed because autovacuum does not analyze foreign tables.) static void fileAnalyzeForeignTable(Relation relation, VacuumStmt *vacstmt, int elevel); This new file_fdw function collects and stores the same statistics on CSV file data as collected on a local table except for index related statistics by executing the sequential scan on the CSV file and acquiring sample rows using Vitter's algorithm. (It is time-consuming for a large file.) estimate_costs() (more precisely, clauselist_selectivity() in estimate_costs()) estimates baserel->rows using the statistics stored in the pg_statistic system catalog. If there are no statistics, estimate_costs() estimates it using the default statistics as in PostgreSQL 9.1. I am able to demonstrate the effectiveness of this patch. The following run is performed on a single core of a 3.00GHz Intel Xeon CPU with 8GB of RAM. Configuration settings are default except for work_mem = 256MB. We can see from this result that the optimiser selects a good plan when the foreign tables have been analyzed. I appreciate your comments and suggestions. [sample csv file data] postgres=# COPY (SELECT s.a, repeat('a', 100) FROM generate_series(1, 5000000) AS s(a)) TO '/home/pgsql/sample_csv_data1.csv' (FORMAT csv, DELIMITER ','); COPY 5000000 postgres=# COPY (SELECT (random()*10000)::int, repeat('b', 100) FROM generate_series(1, 5000000)) TO '/home/pgsql/sample_csv_data2.csv' (FORMAT csv, DELIMITER ','); COPY 5000000 [Unpatched] postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format 'csv', delimiter ','); CREATE FOREIGN TABLE postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format 'csv', delimiter ','); CREATE FOREIGN TABLE postgres=# SELECT count(*) FROM tab1; count --------- 5000000 (1 row) postgres=# SELECT count(*) FROM tab2; count --------- 5000000 (1 row) postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid; QUERY PLAN --------------------------------------------------------------------------------------------------------------------------------------------- --- Aggregate (cost=128859182.29..128859182.30 rows=1 width=0) (actual time=27321.304..27321.304 rows=1 loops=1) -> Merge Join (cost=5787102.68..111283426.33 rows=7030302383 width=0) (actual time=22181.428..26736.194 rows=4999745 loops=1) Merge Cond: (tab1.aid = tab2.aid) -> Sort (cost=1857986.37..1858198.83 rows=84983 width=4) (actual time=5964.282..5965.958 rows=10000 loops=1) Sort Key: tab1.aid Sort Method: quicksort Memory: 853kB -> Foreign Scan on tab1 (cost=0.00..1851028.44 rows=84983 width=4) (actual time=0.071..5962.382 rows=10000 loops=1) Filter: ((aid >= 0) AND (aid <= 10000)) Foreign File: /home/pgsql/sample_csv_data1.csv Foreign File Size: 543888896 -> Materialize (cost=3929116.30..4011842.29 rows=16545197 width=4) (actual time=16216.953..19550.846 rows=5000000 loops=1) -> Sort (cost=3929116.30..3970479.30 rows=16545197 width=4) (actual time=16216.947..18418.684 rows=5000000 loops=1) Sort Key: tab2.aid Sort Method: external merge Disk: 68424kB -> Foreign Scan on tab2 (cost=0.00..1719149.70 rows=16545197 width=4) (actual time=0.081..6059.630 rows=5000000 loops=1) Foreign File: /home/pgsql/sample_csv_data2.csv Foreign File Size: 529446313 Total runtime: 27350.673 ms (18 rows) [Patched] postgres=# CREATE FOREIGN TABLE tab1 (aid INTEGER, msg text) SERVER file_fs OPTIONS (filename '/home/pgsql/sample_csv_data1.csv', format 'csv', delimiter ','); CREATE FOREIGN TABLE postgres=# CREATE FOREIGN TABLE tab2 (aid INTEGER, msg text) SERVER file_fs OPTIONS (filename '/home/pgsql/sample_csv_data2.csv', format 'csv', delimiter ','); CREATE FOREIGN TABLE postgres=# ANALYZE VERBOSE tab1; INFO: analyzing "public.tab1" INFO: "tab1": scanned, containing 5000000 rows; 30000 rows in sample ANALYZE postgres=# ANALYZE VERBOSE tab2; INFO: analyzing "public.tab2" INFO: "tab2": scanned, containing 5000000 rows; 30000 rows in sample ANALYZE postgres=# EXPLAIN ANALYZE SELECT count(*) FROM tab1, tab2 WHERE tab1.aid >= 0 AND tab1.aid <= 10000 AND tab1.aid = tab2.aid; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=1282725.25..1282725.26 rows=1 width=0) (actual time=15114.325..15114.325 rows=1 loops=1) -> Hash Join (cost=591508.50..1271157.90 rows=4626940 width=0) (actual time=5964.449..14526.822 rows=4999745 loops=1) Hash Cond: (tab2.aid = tab1.aid) -> Foreign Scan on tab2 (cost=0.00..564630.00 rows=5000000 width=4) (actual time=0.070..6253.257 rows=5000000 loops=1) Foreign File: /home/pgsql/sample_csv_data2.csv Foreign File Size: 529446313 -> Hash (cost=591393.00..591393.00 rows=9240 width=4) (actual time=5964.346..5964.346 rows=10000 loops=1) Buckets: 1024 Batches: 1 Memory Usage: 352kB -> Foreign Scan on tab1 (cost=0.00..591393.00 rows=9240 width=4) (actual time=0.066..5962.222 rows=10000 loops=1) Filter: ((aid >= 0) AND (aid <= 10000)) Foreign File: /home/pgsql/sample_csv_data1.csv Foreign File Size: 543888896 Total runtime: 15114.480 ms (13 rows) Best regards, Etsuro Fujita
Attachment
Hi Fujita-san, (2011/09/12 19:40), Etsuro Fujita wrote: > Hi there, > > To enable file_fdw to estimate costs of scanning a CSV file more > accurately, I would like to propose a new FDW callback routine, > AnalyzeForeignTable, which allows to ANALYZE command to collect > statistics on a foreign table, and a corresponding file_fdw function, > fileAnalyzeForeignTable. Attached is my WIP patch. <snip> I think this is a very nice feature so that planner would be able to create smarter plan for a query which uses foreign tables. I took a look at the patch, and found that it couldn't be applied cleanly against HEAD. Please rebase your patch against current HEAD of master branch, rather than 9.1beta1. The wiki pages below would be helpful for you. http://wiki.postgresql.org/wiki/Submitting_a_Patch http://wiki.postgresql.org/wiki/Creating_Clean_Patcheshttp://wiki.postgresql.org/wiki/Reviewing_a_Patch And it would be easy to use git to follow changes made by other developers in master branch. http://wiki.postgresql.org/wiki/Working_with_Git Regards, -- Shigeru Hanada
2011/9/12 Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>: > This is called when ANALYZE command is executed. (ANALYZE > command should be executed because autovacuum does not analyze foreign > tables.) This is a good idea. However, if adding these statistics requires an explicit ANALYZE command, then we should also have a command for resetting the collected statistics -- to get it back into the un-analyzed state. Currently it looks like the only way to reset statistics is to tamper with catalogs directly, or recreate the foreign table. Regards, Marti
Marti Raudsepp <marti@juffo.org> writes: > 2011/9/12 Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>: >> This is called when ANALYZE command is executed. (ANALYZE >> command should be executed because autovacuum does not analyze foreign >> tables.) > This is a good idea. > However, if adding these statistics requires an explicit ANALYZE > command, then we should also have a command for resetting the > collected statistics -- to get it back into the un-analyzed state. Uh, why? There is no UNANALYZE operation for ordinary tables, and I've never heard anyone ask for one. If you're desperate you could manually delete the relevant rows in pg_statistic, a solution that would presumably work for foreign tables too. Probably a more interesting question is why we wouldn't change autovacuum so that it calls this automatically for foreign tables. (Note: I'm unconvinced that there's a use-case for this in the case of "real" foreign tables on a remote server --- it seems likely that the wrapper ought to ask the remote server for its current stats, instead. But it's clearly useful for non-server-backed sources such as file_fdw.) regards, tom lane
On 20-09-2011 11:12, Marti Raudsepp wrote: > 2011/9/12 Etsuro Fujita<fujita.etsuro@lab.ntt.co.jp>: >> This is called when ANALYZE command is executed. (ANALYZE >> command should be executed because autovacuum does not analyze foreign >> tables.) > > This is a good idea. > > However, if adding these statistics requires an explicit ANALYZE > command, then we should also have a command for resetting the > collected statistics -- to get it back into the un-analyzed state. > Why would you want this? If the stats aren't up to date, run ANALYZE periodically. Remember that it is part of the DBA maintenance tasks [1]. [1] http://www.postgresql.org/docs/current/static/maintenance.html -- Euler Taveira de Oliveira - Timbira http://www.timbira.com.br/ PostgreSQL: Consultoria, Desenvolvimento, Suporte24x7 e Treinamento
On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote: > Marti Raudsepp <marti@juffo.org> writes: > > 2011/9/12 Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>: > >> This is called when ANALYZE command is executed. (ANALYZE > >> command should be executed because autovacuum does not analyze foreign > >> tables.) > > > This is a good idea. > > > However, if adding these statistics requires an explicit ANALYZE > > command, then we should also have a command for resetting the > > collected statistics -- to get it back into the un-analyzed state. > > Uh, why? There is no UNANALYZE operation for ordinary tables, and > I've never heard anyone ask for one. > > If you're desperate you could manually delete the relevant rows in > pg_statistic, a solution that would presumably work for foreign tables > too. > > Probably a more interesting question is why we wouldn't change > autovacuum so that it calls this automatically for foreign tables. How about a per-table setting that tells autovacuum whether to do this? Come to think of it, all of per-FDW, per-remote and per-table settings would be handy, so people could express things like, "all CSV files except these three, all PostgreSQL connections on the 10.1.0.0/16 network, and these two tables in Oracle." Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
Excerpts from David Fetter's message of mar sep 20 21:22:32 -0300 2011: > On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote: > > Probably a more interesting question is why we wouldn't change > > autovacuum so that it calls this automatically for foreign tables. > > How about a per-table setting that tells autovacuum whether to do > this? Seems reasonable. Have autovacuum assume that foreign tables are not to be analyzed, unless some reloption is set. -- Álvaro Herrera <alvherre@commandprompt.com> The PostgreSQL Company - Command Prompt, Inc. PostgreSQL Replication, Consulting, Custom Development, 24x7 support
Hi Hanada-san, I'm very sorry for late reply. (2011/09/20 18:49), Shigeru Hanada wrote: > I took a look at the patch, and found that it couldn't be applied > cleanly against HEAD. Please rebase your patch against current HEAD of > master branch, rather than 9.1beta1. > > The wiki pages below would be helpful for you. > http://wiki.postgresql.org/wiki/Submitting_a_Patch > http://wiki.postgresql.org/wiki/Creating_Clean_Patches > http://wiki.postgresql.org/wiki/Reviewing_a_Patch > > And it would be easy to use git to follow changes made by other > developers in master branch. > http://wiki.postgresql.org/wiki/Working_with_Git Thank you for the review and the helpful information. I rebased. Please find attached a patch. I'll add the patch to the next CF. Changes: * cleanups and fixes * addition of the following to ALTER FOREIGN TABLE ALTER [COLUMN] column SET STATISTICS integer ALTER [COLUMN] column SET ( n_distinct = val ) (n_distinct only) ALTER [COLUMN] column RESET ( n_distinct ) * reflection of the force_not_null info in acquiring sample rows * documentation Best regards, Etsuro Fujita
Attachment
Hi, I'm very sorry for the late reply. (2011/09/21 10:00), Alvaro Herrera wrote: > Excerpts from David Fetter's message of mar sep 20 21:22:32 -0300 2011: >> On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote: > >>> Probably a more interesting question is why we wouldn't change >>> autovacuum so that it calls this automatically for foreign tables. >> >> How about a per-table setting that tells autovacuum whether to do >> this? > > Seems reasonable. Have autovacuum assume that foreign tables are not to > be analyzed, unless some reloption is set. Thank you for the comments. I'd like to leave that feature for future work. (But this is BTW. I'm interested in developing CREATE FOREIGN INDEX. I've examined whether there are discussions about the design and implementation of it in the archive, but could not find information. If you know anything, please tell me.) Best regards, Etsuro Fujita
On Fri, Oct 07, 2011 at 08:09:44PM +0900, Etsuro Fujita wrote: > Hi, > > I'm very sorry for the late reply. > > (2011/09/21 10:00), Alvaro Herrera wrote: > >Excerpts from David Fetter's message of mar sep 20 21:22:32 -0300 2011: > >>On Tue, Sep 20, 2011 at 11:13:05AM -0400, Tom Lane wrote: > > > >>>Probably a more interesting question is why we wouldn't change > >>>autovacuum so that it calls this automatically for foreign tables. > >> > >>How about a per-table setting that tells autovacuum whether to do > >>this? > > > >Seems reasonable. Have autovacuum assume that foreign tables are not to > >be analyzed, unless some reloption is set. > > Thank you for the comments. I'd like to leave that feature for future work. OK > (But this is BTW. I'm interested in developing CREATE FOREIGN INDEX. > I've examined whether there are discussions about the design and > implementation of it in the archive, but could not find information. > If you know anything, please tell me.) Look into the "virtual index interface" from Informix. We might want to start a wiki page on this. Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com iCal: webcal://www.tripit.com/feed/ical/people/david74/tripit.ics Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
(2011/10/07 21:56), David Fetter wrote: >> (But this is BTW. I'm interested in developing CREATE FOREIGN INDEX. >> I've examined whether there are discussions about the design and >> implementation of it in the archive, but could not find information. >> If you know anything, please tell me.) > > Look into the "virtual index interface" from Informix. Thank you for the information. > We might want to start a wiki page on this. Yeah, I think it might be better to add information to the SQL/MED wiki page: http://wiki.postgresql.org/wiki/SQL/MED Best regards, Etsuro Fujita
(2011/10/07 18:09), Etsuro Fujita wrote: > Thank you for the review and the helpful information. > I rebased. Please find attached a patch. I'll add the patch to the next CF. > > Changes: > > * cleanups and fixes > * addition of the following to ALTER FOREIGN TABLE > ALTER [COLUMN] column SET STATISTICS integer > ALTER [COLUMN] column SET ( n_distinct = val ) (n_distinct only) > ALTER [COLUMN] column RESET ( n_distinct ) > * reflection of the force_not_null info in acquiring sample rows > * documentation The new patch could be applied with some shifts. Regression tests of core and file_fdw have passed cleanly. Though I've tested only simple tests, ANALYZE works for foreign tables for file_fdw, and estimation of costs and selectivity seem appropriate. New API AnalyzeForeignTable =========================== In your design, new handler function is called instead of do_analylze_rel. IMO this hook point would be good for FDWs which can provide statistics in optimized way. For instance, pgsql_fdw can simply copy statistics from remote PostgreSQL server if they are compatible. Possible another idea is to replace acquire_sample_rows so that other FDWs can reuse most part of fileAnalyzeForeignTable and file_fdw_do_analyze_rel. And I think that AnalyzeForeignTable should be optional, and it would be very useful if a default handler is provided. Probably a default handler can use basic FDW APIs to acquire sample rows from the result of "SELECT * FROM foreign_table" with skipping periodically. It won't be efficient but I think it's not so unreasonable. Other issues ============ There are some other comments about non-critical issues. - When there is no analyzable column, vac_update_relstats is not called.Is this behavior intentional? - psql can't complete foreign table name after ANALYZE. - A new parameter has been added to vac_update_relstats in a recent commit. Perhaps 0 is OK for that parameter. - ANALYZE without relation name ignores foreign tables because get_rel_oids doesn't list foreign tables. - IMO logging "analyzing foo.bar" should not be done in AnalyzeForeignTable handler of each FDW because some FDW might forget to do it. Maybe it should be pulled up to analyze_rel or somewhere in core. - It should be mentioned in a document that foreign tables are not analyzed automatically because they are read-only. Regards, -- Shigeru Hanada
(2011/10/18 2:27), Shigeru Hanada wrote: > The new patch could be applied with some shifts. Regression tests of > core and file_fdw have passed cleanly. Though I've tested only simple > tests, ANALYZE works for foreign tables for file_fdw, and estimation of > costs and selectivity seem appropriate. Thank you for your testing. > New API AnalyzeForeignTable > =========================== > And I think that AnalyzeForeignTable should be optional, and it would be > very useful if a default handler is provided. Probably a default > handler can use basic FDW APIs to acquire sample rows from the result of > "SELECT * FROM foreign_table" with skipping periodically. It won't be > efficient but I think it's not so unreasonable. I agree with you. However, I think that it is difficult to support such a default handler in a unified way because there exist external data sources for which we cannot execute "SELECT * FROM foreign_table", e.g., web-accessible DBs limiting full access to the contents. > Other issues > ============ > There are some other comments about non-critical issues. > - When there is no analyzable column, vac_update_relstats is not called. > Is this behavior intentional? > - psql can't complete foreign table name after ANALYZE. > - A new parameter has been added to vac_update_relstats in a recent > commit. Perhaps 0 is OK for that parameter. I'll check. > - ANALYZE without relation name ignores foreign tables because > get_rel_oids doesn't list foreign tables. I think that it might be better to ignore foreign tables by default because analyzing such tables may take long depending on FDW. > - IMO logging "analyzing foo.bar" should not be done in > AnalyzeForeignTable handler of each FDW because some FDW might forget to > do it. Maybe it should be pulled up to analyze_rel or somewhere in core. > - It should be mentioned in a document that foreign tables are not > analyzed automatically because they are read-only. OK. I'll revise. > Regards, Best regards, Etsuro Fujita
> New API AnalyzeForeignTable I didn't look at the patch, but I'm using CSV foreign tables with named pipes to get near-realtime KPI calculated by postgresql. Of course, pipes can be read just once, so I wouldn't want an "automatic analyze" of foreign tables...
Hi, (2011/10/18 16:32), Leonardo Francalanci wrote: >> New API AnalyzeForeignTable > > I didn't look at the patch, but I'm using CSV foreign tables with named pipes > to get near-realtime KPI calculated by postgresql. Of course, pipes can be > read just once, so I wouldn't want an "automatic analyze" of foreign tables... The patch does not analyze on foreign tables automatically. (The issue of auto-analyze on foreign tables has been discussed. Please refer to [1].) [1] http://archives.postgresql.org/pgsql-hackers/2011-09/msg00992.php Best regards, Etsuro Fujita
Hi, I revised the patch according to Hanada-san's comments. Attached is the updated version of the patch. Changes: * pull up of logging "analyzing foo.bar" * new vac_update_relstats always called * tab-completion in psql * add "foreign tables are not analyzed automatically..." to 23.1.3 Updating Planner Statistics * some other modifications Best regards, Etsuro Fujita
Attachment
(2011/10/20 18:56), Etsuro Fujita wrote: > I revised the patch according to Hanada-san's comments. Attached is the > updated version of the patch. > > Changes: > > * pull up of logging "analyzing foo.bar" > * new vac_update_relstats always called > * tab-completion in psql > * add "foreign tables are not analyzed automatically..." to 23.1.3 > Updating Planner Statistics > * some other modifications Submission review ================= - Patch can be applied, and all regression tests passed. :) Random comments =============== - Some headers are not necessary for file_fdw.c #include "access/htup.h" #include "commands/dbcommands.h" #include "optimizer/plancat.h" #include "utils/elog.h" #include "utils/guc.h" #include "utils/lsyscache.h" #include "utils/rel.h" - It might be better to mention in document that users need to explicitly specify foreign table name to ANALYZE command. - I think backend should be aware the case which a handler is NULL. For the case of ANALYZE of foreign table, it would be worth telling user that request was not accomplished. - file_fdw_do_analyze_rel is almost copy of do_analyze_rel. IIUC, difference against do_analyze_rel are: * don't logging analyze target * don't switch userid to the owner of target table * don't measure elapsed time for autoanalyze deamon * don't handle index * some comments are removed. * samplerows are acquired by file_fdw's routine I don't see any problem here, but would you confirm that all of them are intentional? Besides, keeping format (mainly indent and folding) of this function similar to do_analyze_rel would help to follow changes in do_analyze_rel. - IMHO exporting CopyState should be avoided. One possible idea is adding new COPY API which allows to extract records from the file with skipping specified number or rate. - In your design, each FDW have to copy most of do_analyze_rel to their own source. It means that FDW authors must know much details of ANALYZE to implement ANALYZE handler. Actually, your patch exports some static functions from analyze.c. Have you considered hooking acquire_sample_rows()? Such handler should be more simple, and FDW-specific. As you say, such design requires FDWs to skip some records, but it would be difficult for some FDW (e.g. twitter_fdw) which can't pick sample data up easily. IMHO such problem *must* be solved by FDW itself. -- Shigeru Hanada
(2011/11/07 20:26), Shigeru Hanada wrote: > (2011/10/20 18:56), Etsuro Fujita wrote: >> I revised the patch according to Hanada-san's comments. Attached is the >> updated version of the patch. >> >> Changes: >> >> * pull up of logging "analyzing foo.bar" >> * new vac_update_relstats always called >> * tab-completion in psql >> * add "foreign tables are not analyzed automatically..." to 23.1.3 >> Updating Planner Statistics >> * some other modifications > > Submission review > ================= > > - Patch can be applied, and all regression tests passed. :) Thank you for your testing. I updated the patch according to your comments. Attached is the updated version of the patch. > - file_fdw_do_analyze_rel is almost copy of do_analyze_rel. IIUC, > difference against do_analyze_rel are: > * don't logging analyze target > * don't switch userid to the owner of target table > * don't measure elapsed time for autoanalyze deamon > * don't handle index > * some comments are removed. > * sample rows are acquired by file_fdw's routine > > I don't see any problem here, but would you confirm that all of them are > intentional? Yes. But in the updated version, I've refactored analyze.c a little bit to allow FDW authors to simply call do_analyze_rel(). > - In your design, each FDW have to copy most of do_analyze_rel to their > own source. It means that FDW authors must know much details of ANALYZE > to implement ANALYZE handler. Actually, your patch exports some static > functions from analyze.c. Have you considered hooking > acquire_sample_rows()? Such handler should be more simple, and > FDW-specific. As you say, such design requires FDWs to skip some > records, but it would be difficult for some FDW (e.g. twitter_fdw) which > can't pick sample data up easily. IMHO such problem *must* be solved by > FDW itself. The updated version enables FDW authors to just write their own acquire_sample_rows(). On the other hand, by retaining to hook AnalyzeForeignTable routine at analyze_rel(), higher level than acquire_sample_rows() as before, it allows FDW authors to write AnalyzeForeignTable routine for foreign tables on a remote server to ask the server for its current stats instead, as pointed out earlier by Tom Lane. Best regards, Etsuro Fujita
Attachment
(2011/11/18 16:25), Etsuro Fujita wrote: > Thank you for your testing. I updated the patch according to your > comments. Attached is the updated version of the patch. I'd like to share result of my review even though it's not fully finished. So far I looked from viewpoint of API design, code formatting, and documentation. I'll examine effectiveness of the patch and details of implementation next week, and hopefully try writing ANALYZE handler for pgsql_fdw :) New patch has correct format, and it could be applied to HEAD of master branch cleanly. All regression tests including those of contrib modules have passed. It contains changes of codes and regression tests related to the issue, and they have enough comments. IMO the document in this patch is not enough to show how to write analyze handler to FDW authors, but it can be enhanced afterward. With this patch, FDW author can provide optional ANALYZE handler which updates statistics of foreign tables. Planner would be able to generate better plan by using statistics. > Yes. But in the updated version, I've refactored analyze.c a little bit > to allow FDW authors to simply call do_analyze_rel(). <snip> > The updated version enables FDW authors to just write their own > acquire_sample_rows(). On the other hand, by retaining to hook > AnalyzeForeignTable routine at analyze_rel(), higher level than > acquire_sample_rows() as before, it allows FDW authors to write > AnalyzeForeignTable routine for foreign tables on a remote server to ask > the server for its current stats instead, as pointed out earlier by Tom > Lane. IIUC, this patch offers three options to FDWs: a) set AnalyzeForeignTable to NULL to indicate lack of capability, b) provide AnalyzeForeignTable which calls do_analyze_rel with custom sample_row_acquirer, and c) create statistics data from scratch by FDW itself by doing similar things to do_analyze_rel with given argument or copying statistics from remote PostgreSQL server. ISTM that this design is well-balanced between simplicity and flexibility. Maybe these three options would suit web-based wrappers, file-based or RDBMS wrappers, and pgsql_fdw respectively. I think that adding more details of FdwRoutine, such as purpose of new callback function and difference from required ones, would help FDW authors, including me :) I have some random comments: - I think separated typedef of sample_acquire_rows would make codes more readable. In addition, parameters of the function should be described explicitly. - I couldn't see the reason why file_fdw sets ctid of sample tuples, though I guess it's for Vitter's random sampling algorithm. If every FDW must set valid ctid to sample tuples, it should be mentioned in document of AnalyzeForeignTable. Exporting some functions from analyze.c relates this issue? - Why file_fdw skips sample tuples which have NULL value? AFAIS original acquire_sample_rows doesn't do so. - Some comment lines go past 80 columns. - Some headers included in file_fdw.c seems unnecessary. Regards, -- Shigeru Hanada
2011/11/18 Shigeru Hanada <shigeru.hanada@gmail.com>: > - I couldn't see the reason why file_fdw sets ctid of sample tuples, > though I guess it's for Vitter's random sampling algorithm. If every > FDW must set valid ctid to sample tuples, it should be mentioned in > document of AnalyzeForeignTable. Exporting some functions from > analyze.c relates this issue? If every FDW must set valid ctid to sample tuples, it should be fixed so that they don't have to, I would think. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi Hanada-san, Thank you for your valuable comments. I will improve the items pointed out by you at the next version of the patch, including documentation on the purpose of AnalyzeForeignTable, how to write it, and so on. Here I comment only one point: > - Why file_fdw skips sample tuples which have NULL value? AFAIS > original acquire_sample_rows doesn't do so. To be precise, I've implemented to skip tuples that have null values in certain column(s) but that are not allowed to contain null values in that(those) column(s) by NOT NULL constrain. file_fdw's sample_row_acquirer considers those tuples as "dead" tuples. This is for the consistency with NOT NULL constrain. (But I don't know why fileIterateForeignScan routine allows such dead tuples. I may have missed something.) Best regards, Etsuro Fujita (2011/11/18 21:00), Shigeru Hanada wrote: > (2011/11/18 16:25), Etsuro Fujita wrote: >> Thank you for your testing. I updated the patch according to your >> comments. Attached is the updated version of the patch. > > I'd like to share result of my review even though it's not fully > finished. So far I looked from viewpoint of API design, code > formatting, and documentation. I'll examine effectiveness of the patch > and details of implementation next week, and hopefully try writing > ANALYZE handler for pgsql_fdw :) > > New patch has correct format, and it could be applied to HEAD of master > branch cleanly. All regression tests including those of contrib modules > have passed. It contains changes of codes and regression tests related > to the issue, and they have enough comments. IMO the document in this > patch is not enough to show how to write analyze handler to FDW authors, > but it can be enhanced afterward. With this patch, FDW author can > provide optional ANALYZE handler which updates statistics of foreign > tables. Planner would be able to generate better plan by using statistics. > >> Yes. But in the updated version, I've refactored analyze.c a little bit >> to allow FDW authors to simply call do_analyze_rel(). > <snip> >> The updated version enables FDW authors to just write their own >> acquire_sample_rows(). On the other hand, by retaining to hook >> AnalyzeForeignTable routine at analyze_rel(), higher level than >> acquire_sample_rows() as before, it allows FDW authors to write >> AnalyzeForeignTable routine for foreign tables on a remote server to ask >> the server for its current stats instead, as pointed out earlier by Tom >> Lane. > > IIUC, this patch offers three options to FDWs: a) set > AnalyzeForeignTable to NULL to indicate lack of capability, b) provide > AnalyzeForeignTable which calls do_analyze_rel with custom > sample_row_acquirer, and c) create statistics data from scratch by FDW > itself by doing similar things to do_analyze_rel with given argument or > copying statistics from remote PostgreSQL server. > > ISTM that this design is well-balanced between simplicity and > flexibility. Maybe these three options would suit web-based wrappers, > file-based or RDBMS wrappers, and pgsql_fdw respectively. I think that > adding more details of FdwRoutine, such as purpose of new callback > function and difference from required ones, would help FDW authors, > including me :) > > I have some random comments: > > - I think separated typedef of sample_acquire_rows would make codes more > readable. In addition, parameters of the function should be described > explicitly. > - I couldn't see the reason why file_fdw sets ctid of sample tuples, > though I guess it's for Vitter's random sampling algorithm. If every > FDW must set valid ctid to sample tuples, it should be mentioned in > document of AnalyzeForeignTable. Exporting some functions from > analyze.c relates this issue? > - Why file_fdw skips sample tuples which have NULL value? AFAIS > original acquire_sample_rows doesn't do so. > - Some comment lines go past 80 columns. > - Some headers included in file_fdw.c seems unnecessary. > > Regards,
(2011/11/19 0:54), Robert Haas wrote: > 2011/11/18 Shigeru Hanada<shigeru.hanada@gmail.com>: >> - I couldn't see the reason why file_fdw sets ctid of sample tuples, >> though I guess it's for Vitter's random sampling algorithm. If every >> FDW must set valid ctid to sample tuples, it should be mentioned in >> document of AnalyzeForeignTable. Exporting some functions from >> analyze.c relates this issue? > > If every FDW must set valid ctid to sample tuples, it should be fixed > so that they don't have to, I would think. It's for neither Vitter's algorithm nor exporting functions from analyze.c. It's for "foreign index scan" on CSV file data that I plan to propose in the next CF. So, it is meaningless for now. I'm sorry. I will fix it at the next version of the patch so that they don't have to. Best regards, Etsuro Fujita
Hi Hanada-san, I updated the patch. Please find attached a patch. Best regards, Etsuro Fujita > (2011/11/18 21:00), Shigeru Hanada wrote: >> (2011/11/18 16:25), Etsuro Fujita wrote: >>> Thank you for your testing. I updated the patch according to your >>> comments. Attached is the updated version of the patch. >> >> I'd like to share result of my review even though it's not fully >> finished. So far I looked from viewpoint of API design, code >> formatting, and documentation. I'll examine effectiveness of the patch >> and details of implementation next week, and hopefully try writing >> ANALYZE handler for pgsql_fdw :) >> >> New patch has correct format, and it could be applied to HEAD of master >> branch cleanly. All regression tests including those of contrib modules >> have passed. It contains changes of codes and regression tests related >> to the issue, and they have enough comments. IMO the document in this >> patch is not enough to show how to write analyze handler to FDW authors, >> but it can be enhanced afterward. With this patch, FDW author can >> provide optional ANALYZE handler which updates statistics of foreign >> tables. Planner would be able to generate better plan by using statistics. >> >>> Yes. But in the updated version, I've refactored analyze.c a little bit >>> to allow FDW authors to simply call do_analyze_rel(). >> <snip> >>> The updated version enables FDW authors to just write their own >>> acquire_sample_rows(). On the other hand, by retaining to hook >>> AnalyzeForeignTable routine at analyze_rel(), higher level than >>> acquire_sample_rows() as before, it allows FDW authors to write >>> AnalyzeForeignTable routine for foreign tables on a remote server to ask >>> the server for its current stats instead, as pointed out earlier by Tom >>> Lane. >> >> IIUC, this patch offers three options to FDWs: a) set >> AnalyzeForeignTable to NULL to indicate lack of capability, b) provide >> AnalyzeForeignTable which calls do_analyze_rel with custom >> sample_row_acquirer, and c) create statistics data from scratch by FDW >> itself by doing similar things to do_analyze_rel with given argument or >> copying statistics from remote PostgreSQL server. >> >> ISTM that this design is well-balanced between simplicity and >> flexibility. Maybe these three options would suit web-based wrappers, >> file-based or RDBMS wrappers, and pgsql_fdw respectively. I think that >> adding more details of FdwRoutine, such as purpose of new callback >> function and difference from required ones, would help FDW authors, >> including me :) >> >> I have some random comments: >> >> - I think separated typedef of sample_acquire_rows would make codes more >> readable. In addition, parameters of the function should be described >> explicitly. >> - I couldn't see the reason why file_fdw sets ctid of sample tuples, >> though I guess it's for Vitter's random sampling algorithm. If every >> FDW must set valid ctid to sample tuples, it should be mentioned in >> document of AnalyzeForeignTable. Exporting some functions from >> analyze.c relates this issue? >> - Why file_fdw skips sample tuples which have NULL value? AFAIS >> original acquire_sample_rows doesn't do so. >> - Some comment lines go past 80 columns. >> - Some headers included in file_fdw.c seems unnecessary. >> >> Regards, > >
Attachment
(2011/12/09 21:16), Etsuro Fujita wrote: > I updated the patch. Please find attached a patch. I've examined v5 patch, and got reasonable EXPLAIN results which reflect collected statistics! As increasing STATISTICS option, estimated rows become better. Please see attached stats_*.txt for what I tested. stats_none.txt : before ANALYZE stats_100.txt : SET STATISTICS = 100 for all columns, and ANALYZE stats_10000.txt : SET STATISTICS = 10000 for all columns, and ANALYZE I think that this patch become ready for committer after some minor corrections: Couldn't set n_distinct ======================= I couldn't set n_distinct to columns of foreign tables. With some research, I noticed that ATSimplePermissions should accept ATT_FOREIGN_TABLE too for that case. In addition, regression tests for ALTER FOREIGN TABLE should be added to detect this kind of problem. Showing stats target ==================== We can see stats target of ordinary tables with \d+, but it is not available for foreign tables. I think "Stats target" column should be added even though output of \d+ for foreign tables become wider. One possible idea is to remove useless "Storage" column instead, but views have that column even though their source could come from foreign tables. Please see attached patch for these two items. Comments of FdwRoutine ====================== Mention about optional handler is obsolete. We should clearly say AnalyzeForeignTable is optional (can be set to NULL) and rest are required. IMO separating them with comment would help FDW authors to understand requirements, e.g.: typedef struct FdwRoutine { NodeTag type; /* * These Handlers are required to execute simple scan on a foreign * table. If any of them was set to NULL, scan on a foreign table * managed by such FDW would fail. */ PlanForeignScan_function PlanForeignScan; ExplainForeignScan_function ExplainForeignScan; BeginForeignScan_function BeginForeignScan; IterateForeignScan_function IterateForeignScan; ReScanForeignScan_function ReScanForeignScan; EndForeignScan_function EndForeignScan; /* * Handlers below are optional. You can set any of them to * NULL to tell PostgreSQL that the FDW doesn't have the * capability. */ AnalyzeForeignTable_function AnalyzeForeignTable; } FdwRoutine; Code formatting =============== Some code lines go past 80 columns. Message style ============= The terms 'cannot support option "n_distinct"...' used in ATPrepSetOptions seems little unusual in PostgreSQL. Should we say 'cannot set "n_distinct_inherited" for foreign tables' for that case? Typo ==== Typo "spcify" is in document of analyze. Regards, -- Shigeru Hanada
Attachment
(2011/12/12 19:33), Shigeru Hanada wrote: > (2011/12/09 21:16), Etsuro Fujita wrote: >> I updated the patch. Please find attached a patch. > > I've examined v5 patch, and got reasonable EXPLAIN results which reflect > collected statistics! As increasing STATISTICS option, estimated rows > become better. Please see attached stats_*.txt for what I > tested. > > stats_none.txt : before ANALYZE > stats_100.txt : SET STATISTICS = 100 for all columns, and ANALYZE > stats_10000.txt : SET STATISTICS = 10000 for all columns, and ANALYZE > > I think that this patch become ready for committer after some > minor corrections: > > Couldn't set n_distinct > ======================= > I couldn't set n_distinct to columns of foreign tables. With some > research, I noticed that ATSimplePermissions should accept > ATT_FOREIGN_TABLE too for that case. In addition, regression tests for > ALTER FOREIGN TABLE should be added to detect this kind of problem. > > Showing stats target > ==================== > We can see stats target of ordinary tables with \d+, but it is not > available for foreign tables. I think "Stats target" column should be > added even though output of \d+ for foreign tables become wider. One > possible idea is to remove useless "Storage" column instead, but views > have that column even though their source could come from foreign tables. > > Please see attached patch for these two items. > > Comments of FdwRoutine > ====================== > Mention about optional handler is obsolete. We should clearly say > AnalyzeForeignTable is optional (can be set to NULL) and rest are > required. IMO separating them with comment would help FDW authors to > understand requirements, e.g.: > > typedef struct FdwRoutine > { > NodeTag type; > > /* > * These Handlers are required to execute simple scan on a foreign > * table. If any of them was set to NULL, scan on a foreign table > * managed by such FDW would fail. > */ > PlanForeignScan_function PlanForeignScan; > ExplainForeignScan_function ExplainForeignScan; > BeginForeignScan_function BeginForeignScan; > IterateForeignScan_function IterateForeignScan; > ReScanForeignScan_function ReScanForeignScan; > EndForeignScan_function EndForeignScan; > > /* > * Handlers below are optional. You can set any of them to > * NULL to tell PostgreSQL that the FDW doesn't have the > * capability. > */ > AnalyzeForeignTable_function AnalyzeForeignTable; > } FdwRoutine; > > Code formatting > =============== > Some code lines go past 80 columns. > > Message style > ============= > The terms 'cannot support option "n_distinct"...' used in > ATPrepSetOptions seems little unusual in PostgreSQL. Should we say > 'cannot set "n_distinct_inherited" for foreign tables' for that case? > > Typo > ==== > Typo "spcify" is in document of analyze. Thank you for your effectiveness experiments and proposals for improvements. I updated the patch according to your proposals. Attached is the updated version of the patch. Best regards, Etsuro Fujita > Regards, > > > >
Attachment
(2011/12/13 22:00), Etsuro Fujita wrote: > Thank you for your effectiveness experiments and proposals for > improvements. I updated the patch according to your proposals. > Attached is the updated version of the patch. I think this patch could be marked as "Ready for committer" with some minor fixes. Please find attached a revised patch (v6.1). Changes from Fujita-san's patch are: * Fix a typo in src/backend/commands/analyze.c. * Connect multi-lined message string literals, because PG code style allows a line including message string literals to go past 80 columns. * Fix ATPrepSetOptions so that it uses pg_strcasecmp instead of pg_strncasecmp, becuase it's guaranteed that a) given strings are null-terminated, and b) they have no trailing. This feature would enhance cost estimation of foreign scans substantially. Great! Regards, -- Shigeru Hanada * 不明 - 自動検出 * 英語 * 日本語 * 英語 * 日本語 <javascript:void(0);>
Attachment
(2011/12/14 15:34), Shigeru Hanada wrote: > (2011/12/13 22:00), Etsuro Fujita wrote: >> Thank you for your effectiveness experiments and proposals for >> improvements. I updated the patch according to your proposals. >> Attached is the updated version of the patch. > > I think this patch could be marked as "Ready for committer" with some > minor fixes. Please find attached a revised patch (v6.1). Many thanks. Best regards, Etsuro Fujita
(2011/12/15 11:30), Etsuro Fujita wrote: > (2011/12/14 15:34), Shigeru Hanada wrote: >> I think this patch could be marked as "Ready for committer" with some >> minor fixes. Please find attached a revised patch (v6.1). I've tried to make pgsql_fdw work with this feature, and found that few static functions to be needed to exported to implement ANALYZE handler in short-cut style. The "Short-cut style" means the way to generate statistics (pg_class and pg_statistic) for foreign tables without retrieving sample data from foreign server. Attached patch (export_funcs.patch) exports examine_attribute and update_attstats which are necessary to implement ANALYZE handler for pgsql_fdw. In addition to exporting, update_attstats is also renamed to vac_update_attstats to fit with already exported function vac_update_relstats. I also attached archive of WIP pgsql_fdw with ANALYZE support. This version has better estimation than original pgsql_fdw, because it can use selectivity of qualifiers evaluated on local side to estimate number of result rows. To show the effect of ANALYZE clearly, WHERE push-down feature is disabled. Please see pgsqlAnalyzeForeignTable and store_remote_stats in pgsql_fdw.c. I used pgbench_accounts tables with 30000 records, and got reasonable rows estimation for queries below. <on remote side> postgres=# UPDATE pgbench_accounts SET filler = NULL postgres-# WHERE aid % 3 = 0; postgres=# ANALYZE; <on local side> postgres=# ANALYZE pgbench_accounts; -- needs explicit table name postgres=# EXPLAIN SELECT * FROM pgbench_accounts WHERE filler IS NULL; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Foreign Scan on pgbench_accounts (cost=100.00..40610.00 rows=100030 width=97) Filter: (filler IS NULL) Remote SQL: DECLARE pgsql_fdw_cursor_13 SCROLL CURSOR FOR SELECT aid, bid, abalance, filler FROM public.pgbench_accounts (3 rows) postgres=# EXPLAIN SELECT * FROM pgbench_accounts WHERE aid < 100; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Foreign Scan on pgbench_accounts (cost=100.00..40610.00 rows=96 width=97) Filter: (aid < 100) Remote SQL: DECLARE pgsql_fdw_cursor_14 SCROLL CURSOR FOR SELECT aid, bid, abalance, filler FROM public.pgbench_accounts (3 rows) postgres=# EXPLAIN SELECT * FROM pgbench_accounts WHERE aid < 1000; QUERY PLAN ---------------------------------------------------------------------------------------------------------------------------- Foreign Scan on pgbench_accounts (cost=100.00..40610.00 rows=1004 width=97) Filter: (aid < 1000) Remote SQL: DECLARE pgsql_fdw_cursor_15 SCROLL CURSOR FOR SELECT aid, bid, abalance, filler FROM public.pgbench_accounts (3 rows) In implementing ANALYZE handler, hardest part was copying anyarray values from remote to local. If we can make it common in core, it would help FDW authors who want to implement ANALYZE handler without retrieving sample rows from remote server. Regards, -- Shigeru Hanada
Attachment
Hi Hanada-san, Sorry for the late response. (2012/02/10 22:05), Shigeru Hanada wrote: > (2011/12/15 11:30), Etsuro Fujita wrote: >> (2011/12/14 15:34), Shigeru Hanada wrote: >>> I think this patch could be marked as "Ready for committer" with some >>> minor fixes. Please find attached a revised patch (v6.1). > > I've tried to make pgsql_fdw work with this feature, and found that few > static functions to be needed to exported to implement ANALYZE handler > in short-cut style. The "Short-cut style" means the way to generate > statistics (pg_class and pg_statistic) for foreign tables without > retrieving sample data from foreign server. That's great! Here is my review. The patch applies with some modifications and compiles cleanly. But regression tests on subqueries failed in addition to role related tests as discussed earlier. While I've not looked at the patch in detail, I have some comments: 1. The patch might need codes to handle the irregular case where ANALYZE-related catalog data such as attstattarget are different between the local and the remote. (Although we don't have the options to set such a data on a foreign table in ALTER FOREIGN TABLE.) For example, while attstattarget = -1 for some column on the local, attstattarget = 0 for that column on the remote meaning that there can be no stats available for that column. In such a case it would be better to inform the user of it. 2. It might be better for the FDW to estimate the costs of a remote query for itself without doing EXPLAIN if stats were available using this feature. While this approach is less accurate compared to the EXPLAIN approach due to the lack of information such as seq_page_cost or randam_page_cost on the remote, it is cheaper! I think such a information may be added to generic options for a foreign table, which may have been previously discussed. 3. > In implementing ANALYZE handler, hardest part was copying anyarray > values from remote to local. If we can make it common in core, it would > help FDW authors who want to implement ANALYZE handler without > retrieving sample rows from remote server. +1 from me. Best regards, Etsuro Fujita
(2011/12/13 22:00), Etsuro Fujita wrote: > Thank you for your effectiveness experiments and proposals for > improvements. I updated the patch according to your proposals. > Attached is the updated version of the patch. Hi all, I've revised the v6.1 patch and created v7 patch, though dead line of this CF is coming closer. I think that this feature provides a way to improve plans for foreign tables significantly, so I hope that this feature is available in 9.2. I'd like to show overview of the patch again for ease of review. New FDW API function ==================== This patch adds new FDW API function AnalyzeForeignTable to FdwRoutine which can be used to support updating local statistics of foreign table. This function is invoked when ANALYZE command is executed against a foreign table explicitly. This handler function is optional, so if underlying FDW set NULL for the pointer, PostgreSQL doesn't touch statistics but emits a message about skipping. It's not required to FDWs to implement fully featured analyzer by itself. They can use core routines, do_analyze_rel and others, for most difficult part of analyzing. What FDWs should do is to provide a sampling function and call do_analyze_rel with passing the sampling function as argument in their concrete AnalyzeForeignTable. AnalyzeForeignTable (or sampling function) can report FDW-specific additional information by calling ereport() with given elevel. Once we've considered an idea that FDW stores statistics information without calling do_analyze_rel, but it seems very hard to implement, and not so efficient. I tried to implement such handler in pgsql_fdw (which seeems easiest to achieve) by getting remote statistics by "SELECT * FROM pg_statistics", but it has several issues such as: 1) Highly privileged user on remote side should be mapped to ANALYZE invoker 2) Structure and semantics might be difficult on remote side, if versions are not same. 3) We need to convert anyarray to anyarray through text representation. We know type of elements, but bothering works are needed. DDL changes =========== ALTER FOREIGN TABLE supports SET STATISTICS clause and n_distinct option. Former changes per-attribute statistics target, and latter overrides calculated statistics. n_distinct_inherited is not available because foreign tables can't be inherited. psql support ============ psql completes foreign table names after the keyword "ANALYZE" in addition to ordinary tables. Of course newly added statistics target is shown in \d+ command. file_fdw ======== This patch contains a use case of new handler function in contrib/file_fdw. Since file_fdw reads data from a flat file, fileAnalyzeForeignTable uses similar algorithm to ordinary tables; it samples first N rows first, and replaces them randomly with subsequent rows. Also file_fdw updates pg_class.relpages by calculating number of pages from size of the data file. To allow FDWs to implement sampling argorighm like this, several functions are exported from analyze.c, e.g. random_fract, init_selection_state, and get_next_S. pgsql_fdw ========= Though it's not fully finished, I've implemented ANALYZE handler for pgsql_fdw. Please extract pgsql_fdw.tar.gz into contrib (or use pgxs) to try it. Regards, -- Shigeru HANADA
Attachment
(2012/04/05 21:10), Shigeru HANADA wrote: > file_fdw > ======== > This patch contains a use case of new handler function in > contrib/file_fdw. Since file_fdw reads data from a flat file, > fileAnalyzeForeignTable uses similar algorithm to ordinary tables; it > samples first N rows first, and replaces them randomly with subsequent > rows. Also file_fdw updates pg_class.relpages by calculating number of > pages from size of the data file. > > To allow FDWs to implement sampling argorighm like this, several > functions are exported from analyze.c, e.g. random_fract, > init_selection_state, and get_next_S. Just after my post, Fujita-san posted another v7 patch[1], so I merged v7 patches into v8 patch. [1] http://archives.postgresql.org/pgsql-hackers/2012-04/msg00212.php Changes taken from Fujita-san's patch ===================================== * Remove reporting validrows and deadrows at the end of acquire_sample_rows of file_fdw. Thus, it doesn't validate NOT NULL constraints any more. * Improve get_rel_size of file_fdw, which is used in GetForeignRelSize, to estimate current # of tuples of the foreign table from these values. - # of pages/tuples which are updated by last ANALYZE - current file size Additional Changes ================== * Fix memory leak in acquire_sample_rows which caused by calling NextCopyFrom repeatedly in one long-span memory context. I add per-record temporary context and it's used during processing a record. Main context is used to create heap tuples from sampled records, because sample tuples must be valid after the function ends. * Some cosmetic changes for document, e.g. remove line-break inside tagged elements. * Some cosmetic changes to make patch more readable by minimizing difference from master branch. Changes did *not* merged ======================== * Fujita-san moved document of AnalyzeForeignTable to the section "Foreign Data Wrapper Helper Functions" from "Foreign Data Wrapper Callback Routines". But I think analyze handler is one of callback functions, though it's optional. Please find attached a patch. Regards, -- Shigeru HANADA
Attachment
Thanks, Hanada-san! Best regards, Etsuro Fujita -----Original Message----- From: Shigeru HANADA [mailto:shigeru.hanada@gmail.com] Sent: Friday, April 06, 2012 11:41 AM To: Etsuro Fujita Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] WIP: Collecting statistics on CSV file data (2012/04/05 21:10), Shigeru HANADA wrote: > file_fdw > ======== > This patch contains a use case of new handler function in > contrib/file_fdw. Since file_fdw reads data from a flat file, > fileAnalyzeForeignTable uses similar algorithm to ordinary tables; it > samples first N rows first, and replaces them randomly with subsequent > rows. Also file_fdw updates pg_class.relpages by calculating number > of pages from size of the data file. > > To allow FDWs to implement sampling argorighm like this, several > functions are exported from analyze.c, e.g. random_fract, > init_selection_state, and get_next_S. Just after my post, Fujita-san posted another v7 patch[1], so I merged v7 patches into v8 patch. [1] http://archives.postgresql.org/pgsql-hackers/2012-04/msg00212.php Changes taken from Fujita-san's patch ===================================== * Remove reporting validrows and deadrows at the end of acquire_sample_rows of file_fdw. Thus, it doesn't validate NOT NULL constraints any more. * Improve get_rel_size of file_fdw, which is used in GetForeignRelSize, to estimate current # of tuples of the foreign table from these values. - # of pages/tuples which are updated by last ANALYZE- current file size Additional Changes ================== * Fix memory leak in acquire_sample_rows which caused by calling NextCopyFrom repeatedly in one long-span memory context. I add per-record temporary context and it's used during processing a record. Main context is used to create heap tuples from sampled records, because sample tuples must be valid after the function ends. * Some cosmetic changes for document, e.g. remove line-break inside tagged elements. * Some cosmetic changes to make patch more readable by minimizing difference from master branch. Changes did *not* merged ======================== * Fujita-san moved document of AnalyzeForeignTable to the section "Foreign Data Wrapper Helper Functions" from "Foreign Data Wrapper Callback Routines". But I think analyze handler is one of callback functions, though it's optional. Please find attached a patch. Regards, -- Shigeru HANADA
Shigeru HANADA <shigeru.hanada@gmail.com> writes: > Just after my post, Fujita-san posted another v7 patch[1], so I merged > v7 patches into v8 patch. I've committed a modified version of this, but right after pushing it I had a better idea about what the AnalyzeForeignTable API should do. An issue that I'd not previously figured out is how analysis of an inheritance tree could deal with foreign-table members, because it wants to estimate the members' sizes before collecting the actual sample rows. However, given that we've got the work split into a precheck phase and a sample collection phase, that's not hard to solve: we could insist that the FDW give back a size estimate in the precheck phase not the sample collection phase. I'm off to fix that up ... regards, tom lane
On Sat, Apr 7, 2012 at 4:19 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Shigeru HANADA <shigeru.hanada@gmail.com> writes: >> Just after my post, Fujita-san posted another v7 patch[1], so I merged >> v7 patches into v8 patch. > > I've committed a modified version of this, but right after pushing it > I had a better idea about what the AnalyzeForeignTable API should do. > An issue that I'd not previously figured out is how analysis of an > inheritance tree could deal with foreign-table members, because it wants > to estimate the members' sizes before collecting the actual sample rows. > However, given that we've got the work split into a precheck phase and > a sample collection phase, that's not hard to solve: we could insist > that the FDW give back a size estimate in the precheck phase not the > sample collection phase. I'm off to fix that up ... Thanks. I'll improve pgsql_fdw so that it can collect statistics of foreign data with this new API. Regards, -- Shigeru Hanada
Thanks! Best regards, Etsuro Fujita -----Original Message----- From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-owner@postgresql.org] On Behalf Of Tom Lane Sent: Saturday, April 07, 2012 4:20 AM To: Shigeru HANADA Cc: Etsuro Fujita; pgsql-hackers@postgresql.org Subject: Re: [HACKERS] WIP: Collecting statistics on CSV file data Shigeru HANADA <shigeru.hanada@gmail.com> writes: > Just after my post, Fujita-san posted another v7 patch[1], so I merged > v7 patches into v8 patch. I've committed a modified version of this, but right after pushing it I had a better idea about what the AnalyzeForeignTable API should do. An issue that I'd not previously figured out is how analysis of an inheritance tree could deal with foreign-table members, because it wants to estimate the members' sizes before collecting the actual sample rows. However, given that we've got the work split into a precheck phase and a sample collection phase, that's not hard to solve: we could insist that the FDW give back a size estimate in the precheck phase not the sample collection phase. I'm off to fix that up ... regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers