Thread: Re: Unexpected Seq Scan's when using MERGE WHEN NOT MATCHED BY SOURCE
On Thu, 9 Jan 2025, 12:26 Feike Steenbergen, <feikesteenbergen@gmail.com> wrote:
I'm trying to change a few applications to fully use this, as PostgreSQL 17 added this support.
The application does something like this:
- fetch information from a source system and store it in a temp table
- run a MERGE with a table (target) in this database,
updating, inserting and deleting in a single statement
- the target table holds information for multiple systems
The temp table (source) doesn't contain the identifier for the system, as
we can inject that at runtime.
This is the shape of the statement:
MERGE INTO
merge_target AS t
USING
merge_source AS s ON (t.time = s.time AND t.device_id = $1)
WHEN MATCHED THEN
UPDATE set
value = s.value
WHEN NOT MATCHED THEN
INSERT (device_id, time, value) VALUES ($1, time, value)
WHEN NOT MATCHED BY SOURCE
AND t.device_id = $1
THEN DELETE;
If we run this however, there is always a Seq Scan against merge_target,
whereas the filter of `AND t.device_id = $1` uses a (Bitmap) Index scan
in other types of queries.
Previously we would - in a CTE - do a delete *and then* the merge.
Which performed much better as the DELETE would not do a Seq Scan.
Attached a full reproducing test case. Some numbers:
Previously, DELETE in CTE, then merge:
Planning Time: 6.700 ms
Execution Time: 7.473 ms
Using the MERGE WHEN MATCHED ON SOURCE THEN DELETE:
Planning Time: 0.994 ms
Execution Time: 65.695 ms
My questions are:
- is the Seq Scan expected by others in this mailing list as well?
- Is it required to do the Seq Scan?
- is this something that could be optimized?
Kind regards,
Feike Steenbergen
An excellent post. I wish all posters provided nice contained example cases like this one.
I note ,in the documentation, that a Warning box got added which says this...
If both WHEN NOT MATCHED BY SOURCE and WHEN NOT MATCHED [BY TARGET] clauses are specified, the MERGE command will perform a FULL join between data_source and the target table. For this to work, at least one join_condition subexpression must use an operator that can support a hash join, or all of the subexpressions must use operators that can support a merge join.
This could be a hint as to the reason maybe ? The NOT MATCHED BY SOURCE is new feature to 17. I'm looking forward to others replies here.
On Thu, 9 Jan 2025 at 20:39, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Yeah. That prevents pushing down the join condition, since in a FULL join all
> rows of both sides will contribute to the result; none can be removed ahead of
> the join.
> Not sure if this pattern is common enough to be worth trying to implement such
> an optimization.
I'm not going to pursue this much further, however for those reading along/
future reference, what we're trying to do is:
- for a given target table
- merge a subset of the data using a source table (only update any significant
changes). The subset filter `WHERE device_id=$1` is applied to both the JOIN
as well as the NOT MATCHED BY SOURCE part of the merge.
- when using MERGE ... NOT MATCHED BY SOURCE THEN DELETE currently (pg17) reads
the whole of the target table before applying the subset filter. As we plan to
merge only very small subsets (1/10,000 or so), this means that for now, this
isn't usable for that use case.
For example, for a monitoring system, we have 10's of thousands of remote
systems, for which we want to merge a snapshot of their current state inside a
central target table. That use case is currently not well supported with MERGE.
(The workaround is a DELETE inside a CTE).
I would expect this use case to be quite common, however, as this feature is
only available in PG17, it may not be used yet with MERGE, so I don't expect
any others to voice the same concern.