Re: big joins not converging - Mailing list pgsql-performance

From fork
Subject Re: big joins not converging
Date
Msg-id loom.20110311T173103-318@post.gmane.org
Whole thread Raw
In response to big joins not converging  (Dan Ancona <da@vizbang.com>)
List pgsql-performance
Dan Ancona <da <at> vizbang.com> writes:

>  his is a choice between
> developing some in-house capacity for this and sending people to
> various vendors so we'll probably lean on the vendors for now, at
> least while we work on it.

I would try to do the record matching in house and see how far you get, even if
you are talking to vendors concurrently.  You might get lucky, and you will
learn a lot about your data and how much to expect and pay for vendor solutions.

I would: Try building multi column indices on both tables for what you think are
the same rows, and match deterministically (if you have a key like social
security, then do this again on full names).  Examine your data to see what
hits, what misses, what hits multiple.  If you know there is a "good" and an
"iffy" table, you can use a left outer, otherwise you need a full outer.   Then
put all your leftovers from each into new tables, and try again with something
fuzzy.

If you build the indices and use "=" and it is still slow, ask again here --
that shouldn't happen.

> And you're right fork, Record Linkage is in fact an entire academic
> discipline!

Indeed.  Look for "blocking" and "editing" with your data first, I think.

I find this problem pretty interesting, so I would love to hear your results.  I
am right now matching building permits to assessor parcels....  I wish I was
using PG ...

pgsql-performance by date:

Previous
From: "hans wulf"
Date:
Subject: ANTI-JOIN needs table, index scan not possible?
Next
From: Maciek Sakrejda
Date:
Subject: Re: ANTI-JOIN needs table, index scan not possible?