•   PostgreSQL   •   By Egor Rogov

Queries in PostgreSQL: 5. Nested loop

So far we've discussed query execution stages, statistics, and the two basic data access methods: Sequential scan and Index scan.

The next item on the list is join methods. This article will remind you what logical join types are out there, and then discuss one of three physical join methods, the Nested loop join. Additionally, we will check out the row memoization feature introduced in PostgreSQL 14.

Joins

Joins are the primary feature of SQL, the foundation that enables its power and agility. Sets of rows (whether pulled directly from a table or formed as a result of an operation) are always joined together in pairs. There are several types of joins.

  • Inner join. An INNER JOIN (or usually just JOIN) is a subset of two sets that includes all the row pairs from the two original sets that match the join condition. A join condition is what ties together columns from the first row set with columns from the second one. All the columns involved comprise what is known as the join key.

    If the join condition is an equality operator, as is often the case, then the join is called an equijoin.

    A Cartesian product (or a CROSS JOIN) of two sets includes all possible combinations of pairs of rows from both sets. This is a specific case of an inner join with a TRUE condition.

  • Outer joins. The LEFT OUTER JOIN (or the LEFT JOIN) supplements the result of an inner join with the rows from the left set which didn't have a corresponding pair in the right set (the contents of the missing right set columns are set to NULLs).

    The RIGHT JOIN works identically, except the joining order is reversed.

    The FULL JOIN is the LEFT JOIN and the RIGHT JOIN combined. It includes rows with missing pairs from both sets, as well as the INNER JOIN result.

  • Semijoins and antijoins. The semijoin is similar to a regular inner join, but with two key differences. Firstly, it includes only the rows from the first set that have a matching pair in the second set. Secondly, it includes the rows from the first set only once, even if a row happens to have multiple matches in the second set.

    The antijoin includes only those rows from the first set that didn't have a matching pair in the second set.

    SQL does not offer explicit semijoin or antijoin operations, but some expressions (EXISTS and NOT EXISTS, for example) can be used to achieve equivalent results.

All of the above are logical operations. For example, you can represent an inner join as a Cartesian product that retains only the rows that satisfy the join condition. When it comes to hardware, however, there are ways to perform an inner join much more efficiently.

PostgreSQL offers several join methods:

  • Nested loop join
  • Hash join
  • Merge join

Join methods are algorithms that execute the join operations in SQL. They often come in various flavours tailored to specific join types. For example, an inner join that uses the nested loop mode will be represented in a plan with a Nested Loop node, but a left outer join using the same mode will look like a Nested Loop Left Join node in the plan.

Different methods shine in different conditions, and it's the planner's job to select the best one cost-wise.

Nested loop join

The nested loop algorithm is based on two loops: an inner loop within an outer loop. The outer loop searches through all the rows of the first (outer) set. For every such row, the inner loop searches for matching rows in the second (inner) set. Upon discovering a pair that satisfies the join condition, the node immediately returns it to the parent node, and then resumes scanning.

The inner loop cycles as many times as there are rows in the outer set. Therefore, the algorithm efficiency depends on several factors:

  • Outer set cardinality.
  • Existence of an efficient access method that fetches the rows from the inner set.
  • Number of repeat row fetches from the inner set.

Let's look at examples.

Cartesian product

A nested loop join is the most efficient way to calculate a Cartesian product, regardless of the number of rows in both sets.

EXPLAIN SELECT *
FROM aircrafts_data a1
  CROSS JOIN aircrafts_data a2
WHERE a2.range > 5000;
                     QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop  (cost=0.00..2.78 rows=45 width=144)
   −> Seq Scan on aircrafts_data a1                      outer set
       (cost=0.00..1.09 rows=9 width=72)
   −> Materialize  (cost=0.00..1.14 rows=5 width=72)     inner set
       −> Seq Scan on aircrafts_data a2
           (cost=0.00..1.11 rows=5 width=72)
           Filter: (range > 5000)
(7 rows)

The Nested Loop node is where the algorithm is executed. It always has two children: the top one is the outer set, the bottom one is the inner set.

The inner set is represented with a Materialize node in this case. When called, the node stores the output of its child node in RAM (up to work_mem, then starts spilling to disk) and then returns it. Upon further calls the node returns the data from memory, avoiding additional table scans.

An inner join query might generate a similar plan:

EXPLAIN SELECT *
FROM tickets t
  JOIN ticket_flights tf ON tf.ticket_no = t.ticket_no
WHERE t.ticket_no = '0005432000284';
                           QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop  (cost=0.99..25.05 rows=3 width=136)
   −> Index Scan using tickets_pkey on tickets t
       (cost=0.43..8.45 rows=1 width=104)
       Index Cond: (ticket_no = '0005432000284'::bpchar)
   −> Index Scan using ticket_flights_pkey on ticket_flights tf
       (cost=0.56..16.57 rows=3 width=32)
       Index Cond: (ticket_no = '0005432000284'::bpchar)
(7 rows)

The planner realizes the equivalence and replaces the tf.ticket_no = t.ticket_no condition with tf.ticket_no = constant, essentially transforming the inner join into a Cartesian product.

Cardinality estimation. The cardinality of a Cartesian product of two sets equals the product of cardinalities of the two sets: 3 = 1 × 3.

Cost estimation. The startup cost of a join equals the sum of startup costs of its child nodes.

The total cost of a join, in this case, equals the sum of:

  • Row fetch cost for the outer set, for each row.
  • One-time row fetch cost for the inner set, for each row (because the cardinality of the outer set equals one).
  • Processing cost for each output row.
SELECT 0.43 + 0.56 AS startup_cost,
  round((
    8.45 + 16.58 +
    3 * current_setting('cpu_tuple_cost')::real
  )::numeric, 2) AS total_cost;
 startup_cost | total_cost
−−−−−−−−−−−−−−+−−−−−−−−−−−−
         0.99 |      25.06
(1 row)

The imprecision is due to a rounding error. The planner's calculations are performed in floating-point values, which are rounded down to two decimal places for plan readability, while I use rounded-down values as an input.

Let's come back to the previous example.

EXPLAIN SELECT *
FROM aircrafts_data a1
  CROSS JOIN aircrafts_data a2
WHERE a2.range > 5000;
                     QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop  (cost=0.00..2.78 rows=45 width=144)
   −> Seq Scan on aircrafts_data a1
       (cost=0.00..1.09 rows=9 width=72)
   −> Materialize  (cost=0.00..1.14 rows=5 width=72)
       −> Seq Scan on aircrafts_data a2
           (cost=0.00..1.11 rows=5 width=72)
           Filter: (range > 5000)
(7 rows)

This plan has a Materialize node, which stores rows in memory and returns them faster upon repeat requests.

In general, a total join cost comprises:

  • Row fetch cost for the outer set, for each row.
  • One-time row fetch cost for the inner set, for each row (during which materialization is done).
  • (N−1) times the cost of repeat inner set row fetch cost, for each row (where N is the number of rows in the outer set).
  • Processing cost for each output row.

In this example, after the inner rows are fetched for the first time, materialization helps us save on further fetch costs. The cost of the initial Materialize call is in the plan, the cost of further calls is not. Its calculation is beyond the scope of this article, so trust me when I say that in this case the cost of consecutive Materialize node calls is 0.0125.

Therefore, the total join cost for this example looks like this:

SELECT 0.00 + 0.00 AS startup_cost,
  round((
    1.09 +
    1.14 + 8 * 0.0125 +
    45 * current_setting('cpu_tuple_cost')::real
  )::numeric, 2) AS total_cost;
 startup_cost | total_cost
−−−−−−−−−−−−−−+−−−−−−−−−−−−
         0.00 |       2.78
(1 row)

Parameterized join

Let's take a look at a more common example, one that does not simply reduce to a Cartesian product.

CREATE INDEX ON tickets(book_ref);
EXPLAIN SELECT *
FROM tickets t
  JOIN ticket_flights tf ON tf.ticket_no = t.ticket_no
WHERE t.book_ref = '03A76D';
                           QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop  (cost=0.99..45.67 rows=6 width=136)
   −> Index Scan using tickets_book_ref_idx on tickets t
       (cost=0.43..12.46 rows=2 width=104)
       Index Cond: (book_ref = '03A76D'::bpchar)
   −> Index Scan using ticket_flights_pkey on ticket_flights tf
       (cost=0.56..16.57 rows=3 width=32)
       Index Cond: (ticket_no = t.ticket_no)
(7 rows)

Here, Nested Loop searches through the outer set (tickets) and for each outer row searches through the inner set (flights), passing down the ticket number t.ticket_no as a parameter. When the Index Scan node is called on the inner loop, it is called with the condition ticket_no = constant.

Cardinality estimation. The planner estimates that the t.book_ref = '03A76D' filter will return two rows from the outer set (rows=2), and each of these two rows will have, on average, three matches in the inner set (rows=3).

The join selectivity is the fraction of rows of a Cartesian product that remains after a join. This must exclude any rows that have NULLs in the columns that are being joined because the filter condition for a NULL is always false.

The cardinality is estimated as the Cartesian product cardinality (that is, the product of cardinalities of two sets) multiplied by the selectivity.

In this case, the cardinality estimate of the outer set is two rows. As for the inner set, no filters (except for the join condition itself) apply to it, so its cardinality equals the cardinality of the ticket_flights table.

Because the two tables are joined using a foreign key, each row of the child table will have only one matching pair in the parent table. The selectivity calculation takes that into account. In this case, the selectivity equals the reciprocal of the size of the table that the foreign key references.

Therefore, the estimate (provided that ticket_no rows don't contain any NULLs) is:

SELECT round(2  tf.reltuples  (1.0 / t.reltuples)) AS rows
FROM pg_class t, pg_class tf
WHERE t.relname = 'tickets'
AND tf.relname = 'ticket_flights';
 rows
−−−−−−
    6
(1 row)

Naturally, tables can be joined without foreign keys. In this case, the join selectivity will equal the selectivity of the join condition.

Therefore, a "universal" selectivity calculation formula for an equijoin (assuming a uniform data distribution) will look like this:

where nd1 and nd2 are the numbers of distinct join key values in the first and the second set.

The distinct values statistics show that all the ticket numbers in the tickets table are unique (which is no surprise, as ticket_no is the primary key), but in ticket_flights each ticket has about four matching rows:

SELECT t.n_distinct, tf.n_distinct
FROM pg_stats t, pg_stats tf
WHERE t.tablename = 'tickets' AND t.attname = 'ticket_no'
AND tf.tablename = 'ticket_flights' AND tf.attname = 'ticket_no';
 n_distinct | n_distinct
−−−−−−−−−−−−+−−−−−−−−−−−−
         −1 | −0.3054527
(1 row)

The estimate matches the estimate with a foreign key:

SELECT round(2  tf.reltuples 
  least(1.0/t.reltuples, 1.0/tf.reltuples/0.3054527)
) AS rows
FROM pg_class t, pg_class tf
WHERE t.relname = 'tickets' AND tf.relname = 'ticket_flights';
 rows
−−−−−−
    6
(1 row)

The planner supplements the universal formula calculation with most common value lists if this statistic is available for the join key for both tables. This gives the planner a relatively precise selectivity assessment for the rows from the MCV lists. The selectivity of the rest of the rows is still calculated as if their values are distributed uniformly. Histograms aren't used to increase the selectivity assessment quality.

In general, the selectivity of a join with no foreign key may end up worse than that of a join with a defined foreign key. This is especially true for composite keys.

Let's use the EXPLAIN ANALYZE command to check the number of actual rows and the number of inner loop calls in our plan:

EXPLAIN (analyze, timing off, summary off) SELECT *
FROM tickets t
  JOIN ticket_flights tf ON tf.ticket_no = t.ticket_no
WHERE t.book_ref = '03A76D';
                            QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop  (cost=0.99..45.67 rows=6 width=136)
   (actual rows=8 loops=1)
   −> Index Scan using tickets_book_ref_idx on tickets t
       (cost=0.43..12.46 rows=2 width=104) (actual rows=2 loops=1)
       Index Cond: (book_ref = '03A76D'::bpchar)
   −> Index Scan using ticket_flights_pkey on ticket_flights tf
       (cost=0.56..16.57 rows=3 width=32) (actual rows=4 loops=2)
       Index Cond: (ticket_no = t.ticket_no)
(8 rows)

The outer set contains two rows (actual rows=2), as expected. The inner Index Scan node was called twice (loops=2) and returned four rows on average each time (actual rows=4). The grand total is eight rows (actual rows=8).

I switched the per-step timing off mainly to keep the output readable, but it's worth noting that the timing feature may slow the execution down considerably on some platforms. If we were to switch timing back on, however, we would see that the timings are averaged, like the row counts. From there, you can multiply the timing by the loop count to get the full estimate.

Cost estimation. The cost calculation isn't much different from the previous examples. Here's our execution plan:

EXPLAIN SELECT *
FROM tickets t
  JOIN ticket_flights tf ON tf.ticket_no = t.ticket_no
WHERE t.book_ref = '03A76D';
                           QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop  (cost=0.99..45.67 rows=6 width=136)
   −> Index Scan using tickets_book_ref_idx on tickets t
       (cost=0.43..12.46 rows=2 width=104)
       Index Cond: (book_ref = '03A76D'::bpchar)
   −> Index Scan using ticket_flights_pkey on ticket_flights tf
       (cost=0.56..16.57 rows=3 width=32)
       Index Cond: (ticket_no = t.ticket_no)
(7 rows)

The repeat scan cost for the inner set is the same as the initial scan cost. The result:

SELECT 0.43 + 0.56 AS startup_cost,
  round((
    12.46 + 2 * 16.58 +
    6 * current_setting('cpu_tuple_cost')::real
  )::numeric, 2) AS total_cost;
 startup_cost | total_cost
−−−−−−−−−−−−−−+−−−−−−−−−−−−
         0.99 |      45.68
(1 row)

Row caching (memoization)

If you repeatedly scan the inner set rows with the same parameter and (consequently) get the same result every time, it might be a good idea to cache the rows for faster access.

This became possible in PostgreSQL 14 with the introduction of the Memoize node. The Memoize node resembles the Materialize node in some ways, but it is tailored specifically for parameterized joins and is much more complicated under the hood:

  • While Materialize simply materializes every row of its child node, Memoize stores separate row instances for each parameter value.
  • When reaching its maximum storage capacity, Materialize offloads any additional data on disk, but Memoize does not (because that would void any benefit of caching).

Below is a plan with a Memorize node:

EXPLAIN SELECT *
FROM flights f
  JOIN aircrafts_data a ON f.aircraft_code = a.aircraft_code
WHERE f.flight_no = 'PG0003';
                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop  (cost=5.44..387.10 rows=113 width=135)
   −> Bitmap Heap Scan on flights f
      (cost=5.30..382.22 rows=113 width=63)
      Recheck Cond: (flight_no = 'PG0003'::bpchar)
      −> Bitmap Index Scan on flights_flight_no_scheduled_depart...
          (cost=0.00..5.27 rows=113 width=0)
          Index Cond: (flight_no = 'PG0003'::bpchar)
   −> Memoize  (cost=0.15..0.27 rows=1 width=72)
       Cache Key: f.aircraft_code
       −> Index Scan using aircrafts_pkey on aircrafts_data a
           (cost=0.14..0.26 rows=1 width=72)
           Index Cond: (aircraft_code = f.aircraft_code)
(12 rows)

First, the planner allocates work_mem × hash_mem_multiplier process memory for caching purposes. The second parameter hash_mem_multiplier (1.0 by default) gives us a hint that the node searches the rows using a hash table (with open addressing in this case). The parameter (or a set of parameters) is used as the cache key.

In addition to that, all keys are put in a list. One end of the list stores "cold" keys (which haven't been used in a while), the other stores "hot" keys (used recently).

Whenever the Memoize node is called, it checks if the rows corresponding to the passed parameter value have already been cached. If they are, Memoize returns them to the parent node (Nested Loop) without calling the child node. It also puts the cache key into the hot end of the key list.

If the required rows haven't been cached yet, Memoize requests the rows from the child node, caches them and passes them upwards. The new cache key is also put into the hot end of the list.

As the cache fills up, the allocated memory might run out. When it happens, Memoize removes the coldest items from the list to free up space. The algorithm is different from the one used in buffer cache, but serves the same goal.

If a parameter matches so many rows that they can't fit into the cache even when all other entries are removed, the parameter's rows are simply not cached. There's no sense in caching a partial output because the next time the parameter comes up, Memoize will still have to call its child node to get the full output.

The cardinality and cost estimates here are similar to what we've seen before.

One notable thing here is that the Memoize node cost in the plan is just the cost of its child node plus cpu_tuple_cost and doesn't mean much as such.

The only reason we want the node in there is to reduce this cost. The Materialize node suffers the same unclarity: the "real" node cost is the repeat scan cost, which isn't listed in the plan.

The repeat scan cost for the Memoize node depends on the amount of available memory and the manner in which the cache is accessed. It also depends significantly on the number of expected distinct parameter values, which determines the number of inner set scans. With all these variables at hand, you can attempt to calculate the probability of finding a given row in the cache and the probability of evicting a given row from the cache. The first value decreases the cost estimate, the other one increases it.

The details of this calculation are irrelevant to the topic of this article.

For now, let's use our favorite EXPLAIN ANALYZE command to check out how a plan with a Memoize node is executed.

This example query selects all flights that match a specific flight path and a specific aircraft type, therefore the cache key will be the same for all the Memoize calls. The required row will not be cached upon the initial call (Misses: 1), but it will be there for all the repeat calls (Hits: 112). The cache itself takes up all of one kilobyte of memory in total.

EXPLAIN (analyze, costs off, timing off, summary off)
SELECT *
FROM flights f
  JOIN aircrafts_data a ON f.aircraft_code = a.aircraft_code
WHERE f.flight_no = 'PG0003';
                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop (actual rows=113 loops=1)
   −> Bitmap Heap Scan on flights f
       (actual rows=113 loops=1)
       Recheck Cond: (flight_no = 'PG0003'::bpchar)
       Heap Blocks: exact=2
       −> Bitmap Index Scan on flights_flight_no_scheduled_depart...
           (actual rows=113 loops=1)
           Index Cond: (flight_no = 'PG0003'::bpchar)
   −> Memoize (actual rows=1 loops=113)
       Cache Key: f.aircraft_code
       Hits: 112 Misses: 1 Evictions: 0 Overflows: 0 Memory
       Usage: 1kB
       −> Index Scan using aircrafts_pkey on aircrafts_data a
           (actual rows=1 loops=1)
           Index Cond: (aircraft_code = f.aircraft_code)
(15 rows)

Note the two zero values: Evictions and Overflows. The former one is the number of evictions from the cache and the latter one is the number of memory overflows, where the full output for a given parameter value was larger than the allocated memory size and therefore could not be cached.

High Evictions and Overflows values would indicate that the allocated cache size was insufficient. This often happens when an estimate of the number of possible parameter values is incorrect. In this case the use of the Memoize node may turn out to be quite costly. As a last resort, you can disable the use of the cache by setting the enable_memoize parameter to off.

Outer joins

You can perform a left outer join with a nested loop:

EXPLAIN SELECT *
FROM ticket_flights tf
  LEFT JOIN boarding_passes bp ON bp.ticket_no = tf.ticket_no
                              AND bp.flight_id = tf.flight_id
WHERE tf.ticket_no = '0005434026720';
                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop Left Join  (cost=1.12..33.35 rows=3 width=57)
   Join Filter: ((bp.ticket_no = tf.ticket_no) AND (bp.flight_id =
   tf.flight_id))
   −> Index Scan using ticket_flights_pkey on ticket_flights tf
       (cost=0.56..16.57 rows=3 width=32)
       Index Cond: (ticket_no = '0005434026720'::bpchar)
   −> Materialize  (cost=0.56..16.62 rows=3 width=25)
       −> Index Scan using boarding_passes_pkey on boarding_passe...
           (cost=0.56..16.61 rows=3 width=25)
           Index Cond: (ticket_no = '0005434026720'::bpchar)
(10 rows)

Note the Nested Loop Left Join node. For this particular query, the planner opted for a non-parameterized filtered join: this means that the inner row set is scanned identically every loop (which is why it's "stashed" behind the Materialize node), and the output is filtered at the Join Filter node.

The cardinality estimate of an outer join is calculated in the same way as for an inner join, but the result is the larger value between the calculation result and the cardinality of the outer set. In other words, an outer join can have more, but never fewer rows than the larger of the joined sets.

The cost estimate calculation for an outer join is entirely the same as for an inner join.

Keep in mind, however, that the planner may select a different plan for an inner join than for an outer join. Even in this example, if we force the planner to use the nested loop join, we will notice a difference in the Join Filter node because the outer join will have to check for ticket number matches to get the correct result whenever there isn't a pair in the outer set. This will slightly increase the overall cost.

SET enable_mergejoin = off;
EXPLAIN SELECT *
FROM ticket_flights tf
  JOIN boarding_passes bp ON bp.ticket_no = tf.ticket_no
                         AND bp.flight_id = tf.flight_id
WHERE tf.ticket_no = '0005434026720';
                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop  (cost=1.12..33.33 rows=3 width=57)
   Join Filter: (tf.flight_id = bp.flight_id)
   −> Index Scan using ticket_flights_pkey on ticket_flights tf
       (cost=0.56..16.57 rows=3 width=32)
       Index Cond: (ticket_no = '0005434026720'::bpchar)
   −> Materialize  (cost=0.56..16.62 rows=3 width=25)
       −> Index Scan using boarding_passes_pkey on boarding_passe...
           (cost=0.56..16.61 rows=3 width=25)
           Index Cond: (ticket_no = '0005434026720'::bpchar)
(9 rows)
RESET enable_mergejoin;

Right joins are incompatible with nested loops because the nested loop algorithm distinguishes between the inner and the outer set. The outer set is scanned fully, but if the inner one is accessed using index scan, then only the rows that match the join filter are returned. Therefore, some rows may remain unscanned.

Full joins are also incompatible for the same reason.

Antijoins and semijoins

Antijoins and semijoins are similar in the sense that for each row of the first (outer) set both algorithms want to find only one match in the second (inner) set.

The antijoin returns all the rows of the first set that didn't get a match in the second set. In other words, the algorithm takes a row from the outer set and searches the inner set for a match, and as soon as a match is found, the algorithm stops the search and jumps to the next row of the outer set.

Antijoins are useful for calculating the NOT EXISTS predicate. Example: find all the aircraft models without a defined seating pattern:

EXPLAIN SELECT *
FROM aircrafts a
WHERE NOT EXISTS (
  SELECT * FROM seats s WHERE s.aircraft_code = a.aircraft_code
);
                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop Anti Join  (cost=0.28..4.65 rows=1 width=40)
   −> Seq Scan on aircrafts_data ml  (cost=0.00..1.09 rows=9 widt...
       −> Index Only Scan using seats_pkey on seats s
           (cost=0.28..5.55 rows=149 width=4)
           Index Cond: (aircraft_code = ml.aircraft_code)
(5 rows)

The Nested Loop Anti Join node is where the antijoin is executed.

This isn't the only use of antijoins, of course. This equivalent query will also generate a plan with an antijoin node:

EXPLAIN SELECT a.*
FROM aircrafts a
  LEFT JOIN seats s ON a.aircraft_code = s.aircraft_code
WHERE s.aircraft_code IS NULL;

The semijoin returns all the rows of the first set that got a match in the second set (no searching for repeat matches here either, as they don't affect the result in any way).

Semijoins are used for calculating the EXISTS predicate. Let's find all the aircrafts with a defined seating pattern:

EXPLAIN SELECT *
FROM aircrafts a
WHERE EXISTS (
  SELECT * FROM seats s WHERE s.aircraft_code = a.aircraft_code
);
                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop Semi Join  (cost=0.28..6.67 rows=9 width=40)
   −> Seq Scan on aircrafts_data ml  (cost=0.00..1.09 rows=9 widt...
   −> Index Only Scan using seats_pkey on seats s
       (cost=0.28..5.55 rows=149 width=4)
       Index Cond: (aircraft_code = ml.aircraft_code)
(5 rows)

Nested Loop Semi Join is the node.

In this plan (and in the anitjoin plans above) the seats table has a regular row count estimate (rows=149), while we know that we only need to get one row from it. When the query is executed, the loop will stop after it gets the row, of course.

EXPLAIN (analyze, costs off, timing off, summary off) SELECT *
FROM aircrafts a
WHERE EXISTS (
  SELECT * FROM seats s WHERE s.aircraft_code = a.aircraft_code
);
                         QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop Semi Join (actual rows=9 loops=1)
   −> Seq Scan on aircrafts_data ml (actual rows=9 loops=1)
   −> Index Only Scan using seats_pkey on seats s
       (actual rows=1 loops=9)
       Index Cond: (aircraft_code = ml.aircraft_code)
       Heap Fetches: 0
(6 rows)

The semijoin cardinality estimate is calculated as usual, except that the inner set cardinality is set to one. As for the anitjoin selectivity, the estimate is also calculated as usual and then subtracted from 1.

The cost estimate for both antijoin and semijoin is calculated with regards to the fact that only a fraction of the inner set rows is scanned for most of outer set rows.

Non-equijoins

The nested loop join algorithm can join row sets using any join condition.

Naturally, if the inner set is a base table, the table has an index, the operator class of the index contains the join condition operator, then the inner set rows can be accessed very efficiently.

That aside, any two sets of rows can be joined as a Cartesian product with a filtering join condition, and the condition here can be arbitrary. Here's an example: find pairs of airports that are located next to each other.

CREATE EXTENSION earthdistance CASCADE;
EXPLAIN (costs off) SELECT *
FROM airports a1
  JOIN airports a2 ON a1.airport_code != a2.airport_code
                  AND a1.coordinates <@> a2.coordinates < 100;
                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop
   Join Filter: ((ml.airport_code <> ml_1.airport_code) AND
   ((ml.coordinates <@> ml_1.coordinates) < '100'::double precisi...
   −> Seq Scan on airports_data ml
   −> Materialize
       −> Seq Scan on airports_data ml_1
(6 rows)

Parallel mode

The nested loop join can run in the parallel mode.

The parallelization is done at the outer set level and allows the outer set to be scanned by multiple worker processes simultaneously. Each worker gets a row from the outer set and then sequentially scans the inner set all by itself.

Here's an example multiple-join query that finds all the passengers that have tickets for a specific flight:

EXPLAIN (costs off) SELECT t.passenger_name
FROM tickets t
  JOIN ticket_flights tf ON tf.ticket_no = t.ticket_no
  JOIN flights f ON f.flight_id = tf.flight_id
WHERE f.flight_id = 12345;
                             QUERY PLAN
−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−−
 Nested Loop
   −> Index Only Scan using flights_flight_id_status_idx on fligh...
       Index Cond: (flight_id = 12345)
   −> Gather
       Workers Planned: 2
       −> Nested Loop
           −> Parallel Seq Scan on ticket_flights tf
               Filter: (flight_id = 12345)
           −> Index Scan using tickets_pkey on tickets t
               Index Cond: (ticket_no = tf.ticket_no)
(10 rows)

The top-level nested loop join runs sequentially, as usual. The outer set contains a single row from the flights table that was fetched using a unique key, so the nested loop is efficient even despite the large number of rows in the inner set.

The inner set is collected in the parallel mode. Each worker gets its own rows from ticket_flights and joins them with tickets in a nested loop.

To be continued.

← Back to all articles

Egor Rogov