Re: Make COPY format extendable: Extract COPY TO format implementations - Mailing list pgsql-hackers

From Sutou Kouhei
Subject Re: Make COPY format extendable: Extract COPY TO format implementations
Date
Msg-id 20240725.135138.734175363436010682.kou@clear-code.com
Whole thread Raw
In response to Re: Make COPY format extendable: Extract COPY TO format implementations  (Sutou Kouhei <kou@clear-code.com>)
Responses Re: Make COPY format extendable: Extract COPY TO format implementations
Re: Make COPY format extendable: Extract COPY TO format implementations
List pgsql-hackers
Hi,

THREAD SUMMARY:

Proposal:

How about making COPY format extendable?


Background:

Currently, COPY TO/FROM supports only "text", "csv" and
"binary" formats. There are some requests to support more
COPY formats. For example:


* 2023-11: JSON and JSON lines [1]
* 2022-04: Apache Arrow [2]
* 2018-02: Apache Avro, Apache Parquet and Apache ORC [3]

There were discussions how to add support for more formats. [3][4]
In these discussions, we got a consensus about making COPY
format extendable.

[1]:
https://www.postgresql.org/message-id/flat/24e3ee88-ec1e-421b-89ae-8a47ee0d2df1%40joeconway.com#a5e6b8829f9a74dfc835f6f29f2e44c5
[2]:
https://www.postgresql.org/message-id/flat/CAGrfaBVyfm0wPzXVqm0%3Dh5uArYh9N_ij%2BsVpUtDHqkB%3DVyB3jw%40mail.gmail.com
[3]: https://www.postgresql.org/message-id/flat/20180210151304.fonjztsynewldfba%40gmail.com
[4]: https://www.postgresql.org/message-id/flat/3741749.1655952719%40sss.pgh.pa.us#2bb7af4a3d2c7669f9a49808d777a20d


Concerns:

* Performance: If we make COPY format extendable, it will
  introduce some overheads. We don't want to loss our
  optimization efforts for the current implementations by
  this. [5]
* Extendability: We don't know which API set is enough for
  custom COPY format implementations yet. We don't want to
  provide too much APIs to reduce maintenance cost.

[5]: https://www.postgresql.org/message-id/3741749.1655952719%40sss.pgh.pa.us


Implementation:

The v18 patch set is the latest patch set. [6]
It includes the following patches:

0001: This adds a basic feature (Copy{From,To}Routine)
      (This isn't enough for extending COPY format.
      This just extracts minimal procedure sets to be
      extendable as callback sets.)
0002: This uses Copy{From,To}Rountine for the existing
      formats (text, csv and binary)
      (This may not be committed because there is a
      profiling related concern. See the following section
      for details)
0003: This adds support for specifying custom format by
      "COPY ... WITH (format 'my-format')"
      (This also adds a test for this feature.)
0004: This exports Copy{From,To}StateData
      (But this isn't enough to implement custom COPY
      FROM/TO handlers as an extension.)
0005: This adds opaque member to Copy{From,To}StateData and
      export some functions to read the next data and flush
      the buffer
      (We can implement a PoC Apache Arrow COPY FROM/TO
      handler as an extension with this. [7])

[6]: https://www.postgresql.org/message-id/flat/20240724.173059.909782980111496972.kou%40clear-code.com
[7]: https://github.com/kou/pg-copy-arrow


Implementation notes:

* 0002: We use "static inline" and "constant argument" for
  optimization.
* 0002: This hides NextCopyFromRawFields() in a public
  header because it's not used in PostgreSQL and we want to
  use "static inline" for it. If it's a problem, we can keep
  it and create an internal function for "static inline".
* 0003: We use "CREATE FUNCTION" to register a custom COPY
  FROM/TO handler. It's the same approach as tablesample.
* 0004 and 0005: We can mix them but this patch set split
  them for easy to review. 0004 just moves the existing
  codes. It doesn't change the existing codes.
* PoC: I provide it as a separated repository instead of a
  patch because an extension exists as a separated project
  in general. If it's a problem, I can provide it as a patch
  for contrib/.
* This patch set still has minimal Copy{From,To}Routine. For
  example, custom COPY FROM/TO handlers can't process their
  own options with this patch set. We may add more callbacks
  to Copy{From,To}Routine later based on real world use-cases.


Performance concern:

We have a benchmark result and a profile for the change that
uses Copy{From,To}Routine for the existing formats. [8] They
are based on the v15 patch but there are no significant
difference between the v15 patch and v18 patch set.

These results show the followings:

* Runtime: The patched version is faster than HEAD.
  * The patched version: 6232ms in average
  * HEAD: 6550ms in average
* Profile: The patched version spends more percents than
  HEAD in a core function.
  * The patched version: 85.61% in CopyOneRowTo()
  * HEAD: 80.35% in CopyOneRowTo()

[8]: https://www.postgresql.org/message-id/flat/ZdbtQJ-p5H1_EDwE%40paquier.xyz


Here are related information for this benchmark/profile:

* Use -O2 for optimization build flag
  ("meson setup --buildtype=release" may be used)
* Use tmpfs for PGDATA
* Disable fsync
* Run on scissors (what is "scissors" in this context...?)   [9]
* Unlogged table may be used
* Use a table that has 30 integer columns (*1)
* Use 5M rows (*2)
* Use '/dev/null' for COPY TO (*3)
* Use blackhole_am for COPY FROM (*4)
  https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am
* perf is used but used options are unknown (sorry)


(*1) This SQL may be used to create the table:

CREATE OR REPLACE FUNCTION create_table_cols(tabname text, num_cols int)
RETURNS VOID AS
$func$
DECLARE
  query text;
BEGIN
  query := 'CREATE UNLOGGED TABLE ' || tabname || ' (';
  FOR i IN 1..num_cols LOOP
    query := query || 'a_' || i::text || ' int default 1';
    IF i != num_cols THEN
      query := query || ', ';
    END IF;
  END LOOP;
  query := query || ')';
  EXECUTE format(query);
END
$func$ LANGUAGE plpgsql;
SELECT create_table_cols ('to_tab_30', 30);
SELECT create_table_cols ('from_tab_30', 30);


(*2) This SQL may be used to insert 5M rows:

INSERT INTO to_tab_30 SELECT FROM generate_series(1, 5000000);


(*3) This SQL may be used for COPY TO:

COPY to_tab_30 TO '/dev/null' WITH (FORMAT text);


(*4) This SQL may be used for COPY FROM:

CREATE EXTENSION blackhole_am;
ALTER TABLE from_tab_30 SET ACCESS METHOD blackhole_am;
COPY to_tab_30 TO '/tmp/to_tab_30.txt' WITH (FORMAT text);
COPY from_tab_30 FROM '/tmp/to_tab_30.txt' WITH (FORMAT text);


[9]: https://www.postgresql.org/message-id/flat/Zbr6piWuVHDtFFOl%40paquier.xyz#dbbec4d5c54ef2317be01a54abaf495c


Thanks,
-- 
kou



pgsql-hackers by date:

Previous
From: Alexander Lakhin
Date:
Subject: Re: Recent 027_streaming_regress.pl hangs
Next
From: Amit Kapila
Date:
Subject: Re: long-standing data loss bug in initial sync of logical replication