Re: Joins on TID - Mailing list pgsql-hackers

From Darafei "Komяpa" Praliaskouski
Subject Re: Joins on TID
Date
Msg-id CAC8Q8tL_xL-ReBmdYXyd2-EQP2w3=XON=u2nvczKc1D-E0rsow@mail.gmail.com
Whole thread Raw
In response to Re: Joins on TID  (Tom Lane <tgl@sss.pgh.pa.us>)
List pgsql-hackers
Hi,

Writing as someone who used TID joins and group by's in the past.

One use case is having a chance to peek into what will DELETE do.
A lot of GIS tables don't have any notion of ID, and dirty datasets tend to have many duplicates you need to cross-reference with something else. So, you write your query in form of 

CREATE TABLE ttt as (SELECT distinct on (ctid) ctid as ct, field1, field2, b.field3, ... from table b join othertable b on ST_Whatever(a.geom, b.geom));

<connect to table with QGIS, poke around, maybe delete some rows you doubt you want to remove>

DELETE FROM table a USING ttt b where a.ctid = b.ct;
DROP TABLE ttt;

Here:
 - distinct on ctid is used (hash?)
 - a.ctid = b.ct (hash join candidate?)

I know it's all better with proper IDs, but sometimes it works like that, usually just once per dataset.


сб, 22 дек. 2018 г. в 19:31, Tom Lane <tgl@sss.pgh.pa.us>:
Simon Riggs <simon@2ndquadrant.com> writes:
> On Sat, 22 Dec 2018 at 04:31, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> BTW, if we're to start taking joins on TID seriously, we should also
>> add the missing hash opclass for TID, so that you can do hash joins
>> when dealing with a lot of rows.

> I don't think we are trying to do TID joins more seriously, just fix a
> special case.
> The case cited requires the batches of work to be small, so nested loops
> works fine.
> Looks to me that Edmund is trying to solve the same problem. If so, this is
> the best solution.

No, I think what Edmund is on about is unrelated, except that it touches
some of the same code.  He's interested in problems like "find the last
few tuples in this table".  You can solve that today, with e.g.
"SELECT ... WHERE ctid >= '(n,1)'", but you get a stupidly inefficient
plan.  If we think that's a use-case worth supporting then it'd be
reasonable to provide less inefficient implementation(s).

What I'm thinking about in this thread is joins on TID, which we have only
very weak support for today --- you'll basically always wind up with a
mergejoin, which requires full-table scan and sort of its inputs.  Still,
that's better than a naive nestloop, and for years we've been figuring
that that was good enough.  Several people in the other thread that
I cited felt that that isn't good enough.  But if we think it's worth
taking seriously, then IMO we need to add both parameterized scans (for
nestloop-with-inner-fetch-by-tid) and hash join, because each of those
can dominate depending on how many tuples you're joining.

                        regards, tom lane

--
Darafei Praliaskouski
Support me: http://patreon.com/komzpa

pgsql-hackers by date:

Previous
From: ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Date:
Subject: Re: [PATCH] Improve tab completion for CREATE TABLE
Next
From: John Naylor
Date:
Subject: Re: reducing the footprint of ScanKeyword (was Re: Large writable variables)