Thread: Re: speedup COPY TO for partitioned table.

Re: speedup COPY TO for partitioned table.

From
jian he
Date:
On Wed, Jan 22, 2025 at 6:54 AM Melih Mutlu <m.melihmutlu@gmail.com> wrote:
>
> Hi Jian,
>
> Thanks for the patch.
>
> jian he <jian.universality@gmail.com>, 19 Ara 2024 Per, 15:03 tarihinde şunu yazdı:
>>
>> attached copy_par_regress_test.sql is a simple benchmark sql file,
>> a partitioned table with 10 partitions, 2 levels of indirection.
>> The simple benchmark shows around 7.7% improvement in my local environment.
>
>
> I confirm that the patch introduces some improvement in simple cases like the one you shared. I looked around a bit
tounderstand whether there is an obvious reason why copying from a partitioned table is not allowed, but couldn't find
one.It seems ok to me. 

hi. melih mutlu
thanks for confirmation.

> I realized that while both "COPY <partitioned_table> TO..." and "COPY (SELECT..) TO..." can return the same set of
rows,their orders may not be the same. I guess that it's hard to guess in which order find_all_inheritors() would
returntables, and that might be something we should be worried about with the patch. What do you think? 
>

in the
find_all_inheritors->find_inheritance_children->find_inheritance_children_extended

find_inheritance_children_extended we have
"""
    if (numoids > 1)
        qsort(oidarr, numoids, sizeof(Oid), oid_cmp);
"""

so the find_all_inheritors output order is deterministic?



Re: speedup COPY TO for partitioned table.

From
Melih Mutlu
Date:
Hi,

jian he <jian.universality@gmail.com>, 27 Oca 2025 Pzt, 04:47 tarihinde şunu yazdı:
in the
find_all_inheritors->find_inheritance_children->find_inheritance_children_extended

find_inheritance_children_extended we have
"""
    if (numoids > 1)
        qsort(oidarr, numoids, sizeof(Oid), oid_cmp);
"""

so the find_all_inheritors output order is deterministic?

You're right that order in find_all_inheritors is deterministic. But it's not always the same with the order of SELECT output. You can quickly see what I mean by running a slightly modified version of the example that you shared in your first email:

CREATE TABLE t3 (a INT, b int ) PARTITION BY RANGE (a);
-- change the order. first create t3_2 then t3_1
create table t3_2 partition of t3 for values from (11) to (15);
create table t3_1 partition of t3 for values from (1) to (11);
insert into t3 select g from generate_series(1, 3) g;
insert into t3 select g from generate_series(11, 11) g;

And the results of the two different COPY approaches would be:
postgres=# COPY t3 TO STDOUT;
11      \N
1       \N
2       \N
3       \N
postgres=# COPY (SELECT * FROM t3) TO STDOUT;
1       \N
2       \N
3       \N
11      \N

Notice that "COPY t3 TO STDOUT" changes the order since the partition t3_2 has been created first, hence it has a smaller OID. On the other hand, SELECT sorts the partitions based on partition boundaries, not OIDs. That's why we should always see the same order regardless of the OIDs of partitions (you can see create_range_bounds() in partbounds.c if interested in more details). One thing that might be useful in the COPY case would be using a partition descriptor to access the correct order of partitions. I believe something like (PartitionDesc) partdesc->oid should give us the partition OIDs in order. 

Thanks,
--
Melih Mutlu
Microsoft

Re: speedup COPY TO for partitioned table.

From
David Rowley
Date:
On Tue, 11 Feb 2025 at 08:10, Melih Mutlu <m.melihmutlu@gmail.com> wrote:
> jian he <jian.universality@gmail.com>, 27 Oca 2025 Pzt, 04:47 tarihinde şunu yazdı:
>> so the find_all_inheritors output order is deterministic?
>
> You're right that order in find_all_inheritors is deterministic. But it's not always the same with the order of
SELECToutput. You can quickly see what I mean by running a slightly modified version of the example that you shared in
yourfirst email: 

I think it's fine to raise the question as to whether the order
changing matters, however, I don't personally think there should be
any concerns with this. The main reason I think this is that the
command isn't the same, so the user shouldn't expect the same
behaviour. They'll need to adjust their commands to get the new
behaviour and possibly a different order.

Another few reasons are:

1) In the subquery version, the Append children are sorted by cost, so
the order isn't that predictable in the first place. (See
create_append_path() -> list_sort(subpaths,
append_total_cost_compare))
2) The order tuples are copied with COPY TO on non-partitioned tables
isn't that well defined in the first place. Two reasons for this, a)
the heap is a heap and has no defined order; and b) sync scans might
be used and the scan might start at any point in the heap and circle
back around again to the page prior to the page where the scan
started. See (table_beginscan() adds SO_ALLOW_SYNC to the flags).

I think the main thing to be concerned about regarding order is to
ensure that all rows from the same partition are copied consecutively,
and that does not seem to be at risk of changing here. This is
important as 3592e0ff9 added caching of the last found partition when
partition lookups continually find the same partition.

David



Re: speedup COPY TO for partitioned table.

From
jian he
Date:
hi.

rebased and polished patch attached, test case added.
However there is a case (the following) where
``COPY(partitioned_table)`` is much slower
(around 25% in some cases) than ``COPY (select * from partitioned_table)``.

If the partition attribute order is not the same as the partitioned table,
then for each output row, we need to create a template TupleTableSlot
and call execute_attr_map_slot,
i didn't find a work around to reduce the inefficiency.

Since the master doesn't have ``COPY(partitioned_table)``,
I guess this slowness case is allowed?


----------- the follow case is far slower than ``COPY(select * From pp) TO ``
drop table if exists pp;
CREATE TABLE pp (id  INT, val int ) PARTITION BY RANGE (id);
create table pp_1 (val int, id int);
create table pp_2 (val int, id int);
ALTER TABLE pp ATTACH PARTITION pp_1 FOR VALUES FROM (1) TO (5);
ALTER TABLE pp ATTACH PARTITION pp_2 FOR VALUES FROM (5) TO (10);
insert into pp select g, 10 + g from generate_series(1,9) g;
copy pp to stdout(header);

Attachment