Re: Document efficient self-joins / UPDATE LIMIT techniques. - Mailing list pgsql-hackers

From Laurenz Albe
Subject Re: Document efficient self-joins / UPDATE LIMIT techniques.
Date
Msg-id d892fd3078b321155b0154487069b2fa8187bc6f.camel@cybertec.at
Whole thread Raw
In response to Re: Document efficient self-joins / UPDATE LIMIT techniques.  (Corey Huinker <corey.huinker@gmail.com>)
List pgsql-hackers
On Tue, 2023-10-31 at 14:12 -0400, Corey Huinker wrote:
>
>
> > About the SELECT example:
> > -------------------------
> >
> > That example belongs to UPDATE, I'd say, because that is the main
> > operation.
>
> I'm iffy on that suggestion. A big part of putting it in SELECT was the fact
> that it shows usage of SKIP LOCKED and FOR UPDATE.

I can accept that.

>
> > About the UPDATE example:
> > -------------------------
> >
> > I think that could go, because it is pretty similar to the previous
> > one.  You even use ctid in both examples.
>
> It is similar, but the idea here is to aid in discovery. A user might miss the
> technique for update if it's only documented in delete, and even if they did see
> it there, they might not realize that it works for both UPDATE and DELETE.
> We could make reference links from one to the other, but that seems like extra
> work for the reader.

I am talking about the similarity between the SELECT and the UPDATE example.
I don't agree with bloating the documentation with redundant examples just
to save a user a click.

I like the idea of a link. Perhaps:

  If you need to perform a large UPDATE in batches to avoid excessive bloat,
  deadlocks or to reduce the load on the server, look at the example in <link>.

Other observations:

  @@ -234,6 +234,35 @@ DELETE FROM films
      In some cases the join style is easier to write or faster to
      execute than the sub-select style.
     </para>
  +  <para>
  +   In situations where a single operation would consume too many resources,
  +   either causing the operation to fail or negatively impacting other workloads,
  +   it may be desirable to break up a large <command>DELETE</command> into
  +   multiple separate commands. While doing this will actually increase the
  +   total amount of work performed, it can break the work into chunks that have
  +   a more acceptable impact on other workloads.  The
  +   <glossterm linkend="glossary-sql-standard">SQL standard</glossterm> does
  +   not define a <literal>LIMIT</literal> clause for <command>DELETE</command>
  +   operations, but it is possible get the equivalent functionality through the
  +   <literal>USING</literal> clause to a
  +   <link linkend="queries-with">Common Table Expression</link> which identifies
  +   a subset of rows to be deleted, locks those rows, and returns their system
  +   column <link linkend="ddl-system-columns-ctid">ctid</link> values:

I don't think that reducing the load on the server is such a great use case
that we should recommend it as "best practice" in the documentation (because,
as your patch now mentions, it doesn't reduce the overall load).

I also don't think we need a verbal description of what the following query does.

How about something like:

"If you have to delete lots of rows, it can make sense to perform the operation
 in several smaller batches to reduce the risk of deadlocks.  The
 <glossterm linkend="glossary-sql-standard">SQL standard</glossterm> does
 not define a <literal>LIMIT</literal> clause for <command>DELETE</command>,
 but it is possible to achieve a similar effect with a self-join on
 the system column <link linkend="ddl-system-columns-ctid">ctid</link>:"

  +<programlisting>
  +WITH delete_batch AS (
  +  SELECT l.ctid
  +  FROM user_logs AS l
  +  WHERE l.status = 'archived'
  +  ORDER BY l.creation_date
  +  LIMIT 10000
  +  FOR UPDATE
  +)
  +DELETE FROM user_logs AS ul
  +USING delete_branch AS del
  +WHERE ul.ctid = del.ctid;
  +</programlisting>
  +  This allows for flexible search criteria within the CTE and an efficient self-join.
  +  </para>

The last sentence is redundant, I'd say.

But you could add:

"An added benefit is that by using an <literal>ORDER BY</literal> clause in
 the subquery, you can determine the order in which the rows will be locked
 and deleted, which will prevent deadlocks with other statements that lock
 the rows in the same order."

But if you do that, you had better use "ORDER BY id" or something else that
looks more like a unique column.

--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -1679,6 +1679,30 @@ SELECT * FROM (SELECT * FROM mytable FOR UPDATE) ss WHERE col1 = 5;
     condition is not textually within the sub-query.
    </para>

+   <para>
+    In cases where a <acronym>DML</acronym> operation involving many rows

I think we should avoid using DML.  Beginner might not know it, and it is
not an index term.  My suggestion is "data modification statement/operation".

+    must be performed, and that table experiences numerous other simultaneous
+    <acronym>DML</acronym> operations, a <literal>FOR UPDATE</literal> clause
+    used in conjunction with <literal>SKIP LOCKED</literal> can be useful for
+    performing partial <acronym>DML</acronym> operations:
+
+<programlisting>
+WITH mods AS (
+    SELECT ctid FROM mytable
+    WHERE status = 'active' AND retries > 10
+    ORDER BY id FOR UPDATE SKIP LOCKED
+)
+UPDATE mytable SET status = 'failed'
+FROM mods WHERE mytable.ctid = mods.ctid;
+</programlisting>
+
+    This allows the <acronym>DML</acronym> operation to be performed in parts, avoiding locking,
+    until such time as the set of rows that remain to be modified is small enough

"until such time as" does not sound English to me.  "Until the number of rows that remain"
would be better, in my opinion.

+    that the locking will not affect overall performance, at which point the same

"that the locking" --> "that locking them"

+    statement can be issued without the <literal>SKIP LOCKED</literal> clause to ensure
+    that no rows were overlooked. This technique has the additional benefit that it can reduce
+    the overal bloat of the updated table if the table can be vacuumed in between batch updates.
+   </para>

"overal" --> "overall"

I don't think you should use "vacuum" as a verb.
Suggestion: "if you perform <command>VACUUM</command> on the table between individual
update batches".


Yours,
Laurenz Albe



pgsql-hackers by date:

Previous
From: Tomas Vondra
Date:
Subject: Re: Statistics Import and Export
Next
From: John Naylor
Date:
Subject: Re: Extract numeric filed in JSONB more effectively