FDW for PostgreSQL - Mailing list pgsql-hackers
From | Shigeru HANADA |
---|---|
Subject | FDW for PostgreSQL |
Date | |
Msg-id | 50533E40.5090304@gmail.com Whole thread Raw |
Responses |
Re: FDW for PostgreSQL
Re: FDW for PostgreSQL |
List | pgsql-hackers |
Hi all, I'd like to propose FDW for PostgreSQL as a contrib module again. Attached patch is updated version of the patch proposed in 9.2 development cycle. For ease of review, I summarized what the patch tries to achieve. Abstract ======== This patch provides FDW for PostgreSQL which allows users to access external data stored in remote PostgreSQL via foreign tables. Of course external instance can be beyond network. And I think that this FDW could be an example of other RDBMS-based FDW, and it would be useful for proof-of-concept of FDW-related features. Note that the name has been changed from "pgsql_fdw" which was used in last proposal, since I got a comment which says that most of existing FDWs have name "${PRODUCT_NAME}_fdw" so "postgresql_fdw" or "postgres_fdw" would be better. For this issue, I posted another patch which moves existing postgresql_fdw_validator into contrib/dblink with renaming in order to reserve the name "postgresql_fdw" for this FDW. Please note that the attached patch requires dblink_fdw_validator.patch to be applied first. http://archives.postgresql.org/pgsql-hackers/2012-09/msg00454.php Query deparser ============== Now postgresql_fdw has its own SQL query deparser inside, so it's free from backend's ruleutils module. This deparser maps object names when generic options below were set. nspname of foreign table: used as namespace (schema) of relation relname of foreign table: used as relation name colname of foreign column: used as column name This mapping allows flexible schema design. SELECT optimization =================== postgresql_fdw always retrieves as much columns as foreign table from remote to avoid overhead of column mapping. However, often some of them (or sometimes all of them) are not used on local side, so postgresql_fdw uses NULL literal as such unused columns in SELECT clause of remote query. For example, let's assume one of pgbench workloads: SELECT abalance FROM pgbench_accounts WHERE aid = 1; This query generates a remote query below. In addition to bid and filler, aid is replaced with NULL because it's already evaluated on remote side. SELECT NULL, NULL, abalance, NULL FROM pgbench_accounts WHERE (aid OPERATOR(pg_catalog.=) 1); This trick would improve performance notably by reducing amount of data to be transferred. One more example. Let's assume counting rows. SELCT count(*) FROM pgbench_accounts; This query requires only existence of row, so no actual column reference is in SELECT clause. SELECT NULL, NULL, NULL, NULL FROM pgbench_accounts; WHERE push down =============== postgresql_fdw pushes down some of restrictions (IOW, top level elements in WHERE clause which are connected with AND) which can be evaluated on remote side safely. Currently the criteria "safe" is declared as whether an expression contains only: - column reference - constant of bult-in type (scalar and array) - external parameter of EXECUTE statement - built-in operator which uses built-in immutable function (operator cannot be collative unless it's "=" or "<>") - built-in immutable function Some other elements might be also safe to be pushed down, but criteria above seems enough for basic use cases. Although it might seem odd, but operators are deparsed into OPERATOR notation to avoid search_path problem. E.g. local query : WHERE col = 1 remote query: WHERE (col OPERATOR(pg_catalog.=) 1) Connection management ===================== postgresql_fdw has its own connection manager. Connection is established when first foreign scan on a server is planned, and it's pooled in the backend. If another foreign scan on same server is invoked, same connection will be used. Connection pool is per-backend. This means that different backends never share connection. postgresql_fdw_connections view shows active connections, and postgresql_fdw_disconnect() allows users to discard particular connection at arbitrary timing. Transaction management ====================== If multiple foreign tables on same foreign server is used in a local query, postgresql_fdw uses same connection to retrieve results in a transaction to make results consistent. Currently remote transaction is closed at the end of local query, so following local query might produce inconsistent result. Costs estimation ================ To estimate costs and result rows of a foreign scan, postgresql_fdw executes EXPLAIN statement on remote side, and retrieves costs and rows values from the result. For cost estimation, cost of connection establishment and data transfer are added to the base costs. Currently these two factors is hard-coded, but making them configurable is not so difficult. Executing EXPLAIN is not cheap, but remote query itself is usually very expensive, so such additional cost would be acceptable. ANALYZE support =============== postgresql_fdw supports ANALYZE to improve selectivity estimation of filtering done on local side (WHERE clauses which could not been pushed down. The sampler function retrieves all rows from remote table and skip some of them so that result fits requested size. As same as file_fdw, postgresql_fdw doesn't care order of result, because it's important for only correlation, and correlation is important for only index scans, which is not supported for this FDW. Fetching Data ============= postgresql_fdw uses single-row mode of libpq so that memory usage is kept in low level even if the result is huge. To cope with difference of encoding, postgresql_fdw automatically sets client_encoding to server encoding of local database. Future improvement ================== I have some ideas for improvement: - Provide sorted result path (requires index information?) - Provide parameterized path - Transaction mapping between local and remotes (2PC) - binary transfer (only against servers with same PG major version?) - JOIN push-down (requires support by core) Any comments and questions are welcome. -- Shigeru HANADA
Attachment
pgsql-hackers by date: