Proposing WITH ITERATIVE - Mailing list pgsql-hackers

From Jonah H. Harris
Subject Proposing WITH ITERATIVE
Date
Msg-id CADUqk8UpPEoYnGXX3_DMKyBM=t2gv0jkWQWct1mKVDv72GNN5Q@mail.gmail.com
Whole thread Raw
Responses Re: Proposing WITH ITERATIVE  (Jeff Davis <pgsql@j-davis.com>)
Re: Proposing WITH ITERATIVE  (David Fetter <david@fetter.org>)
Re: Proposing WITH ITERATIVE  (Andreas Karlsson <andreas@proxel.se>)
List pgsql-hackers
Hey, everyone.

It's been quite a while since I last contributed a patch but, as this is a new feature, I wanted to gather feedback before doing so. I've found this functionality, already in use at Universität Tübingen, to be both highly beneficial in many of my queries as well as a small change to Postgres - a good trade-off IMO.

As you know, Postgres currently supports SQL:1999 recursive common table expressions, constructed using WITH RECURSIVE, which permit the computation of growing relations (e.g., transitive closures.) Although it is possible to use this recursion for general-purpose iterations, doing so is a deviation from its intended use case. Accordingly, an iterative-only use of WITH RECURSIVE often results in sizable overhead and poor optimizer decisions. In this email, I'd like to propose a similar but iterative-optimized form of CTE - WITH ITERATIVE.

In that it can reference a relation within its definition, this iterative variant has similar capabilities as recursive CTEs. In contrast to recursive CTEs, however, rather than appending new tuples, this variant performs a replacement of the intermediate relation. As such, the final result consists of a relation containing tuples computed during the last iteration only. Why would we want this?

This type of computation pattern is often found in data and graph mining algorithms. In PageRank, for example, the initial ranks are updated in each iteration. In clustering algorithms, assignments of data tuples to clusters are updated in every iteration. Something these types of algorithms have in common is that they operate on fixed-size datasets, where only specific values (ranks, assigned clusters, etc.) are updated.

Rather than stopping when no new tuples are generated, WITH ITERATIVE stops when a user-defined predicate evaluates to true. By providing a non-appending iteration concept with a while-loop-style stop criterion, we can massively speed up query execution due to better optimizer decisions. Although it is possible to implement the types of algorithms mentioned above using recursive CTEs, this feature has two significant advantages:

First, iterative CTEs consume significantly less memory. Consider a CTE of N tuples and I iterations. With recursive CTEs, such a relation would grow to (N * I) tuples. With iterative CTEs, however, all prior iterations are discarded early. As such, the size of the relation would remain N, requiring only a maximum of (2 * N) tuples for comparisons of the current and the prior iteration. Furthermore, in queries where the number of executed iterations represents the stop criterion, recursive CTEs require even more additional per-tuple overhead - to carry along the iteration counter.

Second, iterative CTEs have lower query response times. Because of its smaller size, the time to scan and process the intermediate relation, to re-compute ranks, clusters, etc., is significantly reduced.

In short, iterative CTEs retain the flexibility of recursive CTEs, but offer a significant advantage for algorithms which don't require results computed throughout all iterations. As this feature deviates only slightly from WITH RECURSIVE, there is very little code required to support it (~250 loc.)

If there's any interest, I'll clean-up their patch and submit. Thoughts?

--
Jonah H. Harris

pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: improving basebackup.c's file-reading code
Next
From: David Zhang
Date:
Subject: Re: WIP/PoC for parallel backup