Thread: big joins not converging
Hi postgressers -
As part of my work with voter file data, I pretty regularly have to join one large-ish (over 500k rows) table to another. Sometimes the join is via a text field (countyname) + an integer (voter id). I've noticed that sometimes this converges and sometimes it doesn't, seemingly regardless of how I index things. So I'm looking for general thoughts on joining large tables, but I'm also running into a specific issue with the following, slightly different query:
This one is between two tables: a 754k row list of voters and a 445k row list of property owners. (I'm trying to find records where the owner name matches the voter name at the same address.) I have single-column btree indices built on all the relevant fields, and multicolumn indices built across all the columns I'm matching. The full schemas of both tables are below. The machine is an older-ish (about three years old) dual-core Pentium w/ 4GB RAM running FreeBSD; more details below.
This is the query I've come up with so far:
explain analyze
update vanalameda set ownerflag = 'exact'
from aralameda where
vanalameda.streetno ~~ aralameda.streetnum and
vanalameda.streetname ~~ aralameda.streetname and
vanalameda.lastname ~~ aralameda.ownername and
vanalameda.firstname ~~ aralameda.ownername;
With the analyze included, the query didn't complete after running overnight. If I drop the analyze and just explain, I get this:
"Nested Loop (cost=46690.74..15384448712.74 rows=204 width=204)"
" Join Filter: (((vanalameda.streetno)::text ~~ (aralameda.streetnum)::text) AND ((vanalameda.streetname)::text ~~ (aralameda.streetname)::text) AND ((vanalameda.lastname)::text ~~ (aralameda.ownername)::text) AND ((vanalameda.firstname)::text ~~ (aralameda.ownername)::text))"
" -> Seq Scan on vanalameda (cost=0.00..26597.80 rows=734780 width=204)"
" -> Materialize (cost=46690.74..58735.87 rows=444613 width=113)"
" -> Seq Scan on aralameda (cost=0.00..38647.13 rows=444613 width=113)"
One general question: does the width of the tables (i.e., the number of columns not being joined, and the size of those fields) matter? The tables do have a lot of extra columns that I could slice out.
Thanks so much!
Dan
System:
client: pgadmin III, Mac OS
server:
select version();
PostgreSQL 8.3.7 on i386-portbld-freebsd7.2, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD]
(installed from freebsd package system, default configuration)
%sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu'
hw.machine: i386
hw.model: Genuine Intel(R) CPU 2160 @ 1.80GHz
hw.ncpu: 2
hw.machine_arch: i386
w/ 4GB RAM, 1 1GB disk, no RAID.
Here are the tables:
Table "public.aralameda"
Column | Type | Modifiers
-----------------+-----------------------+-----------
dt000o039001010 | character varying(13) |
o3901010 | character varying(15) |
dt17 | character varying(2) |
dt046 | character varying(3) |
streetnum | character varying(10) |
streetname | character varying(50) |
unitnum | character varying(10) |
city | character varying(30) |
zip | character varying(5) |
unk3 | character varying(1) |
crap1 | character varying(12) |
crap2 | character varying(12) |
crap3 | character varying(12) |
crap4 | character varying(12) |
crap5 | character varying(12) |
crap6 | character varying(12) |
crap7 | character varying(12) |
crap8 | character varying(12) |
crap9 | character varying(12) |
crap10 | character varying(12) |
dt2009 | character varying(4) |
dt066114 | character varying(6) |
crap11 | character varying(8) |
crap12 | character varying(8) |
ownername | character varying(50) |
careofname | character varying(50) |
unk4 | character varying(1) |
maddr1 | character varying(60) |
munitnum | character varying(10) |
mcitystate | character varying(30) |
mzip | character varying(5) |
mplus4 | character varying(4) |
dt40 | character varying(2) |
dt4 | character varying(1) |
crap13 | character varying(8) |
d | character varying(1) |
dt0500 | character varying(4) |
unk6 | character varying(1) |
crap14 | character varying(8) |
unk7 | character varying(1) |
Indexes:
"arall" btree (streetnum, streetname, ownername)
"aroname" btree (ownername)
"arstreetname" btree (streetname)
"arstreetnum" btree (streetnum)
Table "public.vanalameda"
Column | Type | Modifiers
---------------+-----------------------+-----------
vanid | character varying(8) |
lastname | character varying(25) |
firstname | character varying(16) |
middlename | character varying(16) |
suffix | character varying(3) |
streetno | character varying(5) |
streetnohalf | character varying(3) |
streetprefix | character varying(2) |
streetname | character varying(24) |
streettype | character varying(4) |
streetsuffix | character varying(2) |
apttype | character varying(4) |
aptno | character varying(8) |
city | character varying(13) |
state | character varying(2) |
zip5 | character varying(5) |
zip4 | character varying(4) |
vaddress | character varying(33) |
maddress | character varying(41) |
mcity | character varying(25) |
mstate | character varying(2) |
mzip5 | character varying(5) |
mzip4 | character varying(4) |
mstreetno | character varying(6) |
mstreetnohalf | character varying(9) |
mstreetprefix | character varying(2) |
mstreetname | character varying(40) |
mstreettype | character varying(4) |
mstreetsuffix | character varying(2) |
mapttype | character varying(4) |
maptno | character varying(13) |
dob | character varying(10) |
countyfileid | character varying(7) |
countyid | character varying(3) |
affno | character varying(12) |
ownerflag | character varying(20) |
Indexes:
"vanall" btree (streetno, streetname, lastname, firstname)
"vanfname" btree (firstname)
"vanlname" btree (lastname)
"vanstreetname" btree (streetname)
"vanstreetno" btree (streetno)
On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:

> [original message quoted in full; snipped]

Is there any reason you're using '~~' to compare values, rather than '='?

If you're intentionally using LIKE-style comparisons then there are some other things you can do, but I don't think you mean to do that, for streetno and streetname anyway. Switching to an equality comparison should let your query use an index, most usefully one on (streetname, streetnum) probably.

I'm not sure what you're intending by comparing ownername to both firstname and lastname. I don't think that'll do anything useful, and I doubt it'll ever match. Are you expecting firstname and lastname to be substrings of ownername? If so, you might need to use wildcards with the like.

(Also, performance and smart use of indexes tends to get better in newer versions of postgresql. You might want to upgrade to 9.0.3 too.)

Cheers,
  Steve
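[A sketch of what Steve suggests might look like the following. The LIKE-wildcard handling of ownername is an assumption on my part -- the thread never settles how ownername is formatted -- and the index name is made up:]

```sql
-- Hypothetical index supporting the equality join, per Steve's
-- suggestion of (streetname, streetnum).
CREATE INDEX ar_addr ON aralameda (streetname, streetnum);

-- Equality on the address columns (index-friendly); substring
-- matching on the name, assuming lastname/firstname each appear
-- somewhere inside ownername. This is a guess, not the thread's fix.
UPDATE vanalameda v
SET ownerflag = 'exact'
FROM aralameda a
WHERE v.streetno   = a.streetnum
  AND v.streetname = a.streetname
  AND a.ownername LIKE '%' || v.lastname || '%'
  AND a.ownername LIKE '%' || v.firstname || '%';
```

[Note the leading-wildcard LIKEs on ownername still can't use a plain btree index; the equality conditions on the address columns are what let the planner avoid the full nested-loop cross product.]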
Steve Atkins <steve <at> blighty.com> writes:

> On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
>
>> Hi postgressers -
>>
>> As part of my work with voter file data, I pretty regularly have to join one large-ish (over 500k rows) table to another. Sometimes this is via a text field (countyname) + integer (voter id). I've noticed sometimes this converges and sometimes it doesn't, seemingly regardless of how I index things.

By "converge" you mean "finish running" -- "converge" has a lot of other overtones for us amateur math types.

Note that I think you are doing "record linkage", which is a stepchild academic discipline of its own these days. It might bear some research. There is also a CDC matching program for text files, freely downloadable for Windows (ack), if you hunt for it.

For now, my first thought is that you should try a few different matches, maybe via PL/pgSQL functions, cascading the non-hits to the next step in the process while shrinking your tables. Upcase and delete all spaces, etc. First use equality on all columns, which should be able to use indices, and separate those records. Then try equality on a few columns. Then try some super fuzzy regexes on a few columns. Etc.

You will also have to give some thought to scoring a match, with perfection a one, but, say, name and birthday the same with all else different a .75, etc.

Also, soundex(), levenshtein(), and other fuzzy string tools are your friend. I want to write a version of SAS's COMPGED for Postgres, but I haven't gotten round to it yet.
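[The cascade fork describes could be sketched along these lines. The "lastname firstname" normalization of ownername and the distance threshold are assumptions for illustration; soundex() and levenshtein() come from contrib/fuzzystrmatch, which must be installed:]

```sql
-- Pass 1: strict equality on normalized values (can use indices).
UPDATE vanalameda v
SET ownerflag = 'exact'
FROM aralameda a
WHERE upper(v.streetno)   = upper(a.streetnum)
  AND upper(v.streetname) = upper(a.streetname)
  AND upper(a.ownername)  = upper(v.lastname || ' ' || v.firstname);

-- Pass 2: only the leftovers, with fuzzier comparisons.
-- soundex() tolerates spelling variation in the street name;
-- levenshtein() <= 3 is an arbitrary illustrative threshold.
UPDATE vanalameda v
SET ownerflag = 'fuzzy'
FROM aralameda a
WHERE v.ownerflag IS NULL
  AND v.streetno = a.streetnum
  AND soundex(v.streetname) = soundex(a.streetname)
  AND levenshtein(upper(v.lastname || ' ' || v.firstname),
                  upper(a.ownername)) <= 3;
```

[Each pass shrinks the set the next, more expensive pass has to consider, which is the point of the cascade.]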
On Mar 10, 2011, at 3:48 PM, fork wrote:

[much thoughtfulness]

> Steve Atkins <steve <at> blighty.com> writes:
> [also much thoughtfulness]

Steve and fork -- thank you, this is super helpful. I meant to tweak that exact search before sending this around; sorry if that was confusing. That was meant to be a placeholder for [some set of matches that works]. And yes, "not converging" was incorrect, I did mean "not finishing."

Taken together, your answers make it pretty clear that there's no particularly obvious easy solution I'm missing; this really is kind of tricky. This is a choice between developing some in-house capacity for this and sending people to various vendors, so we'll probably lean on the vendors for now, at least while we work on it. I've gotten my head partway around PL/pgSQL functions; I may give that another try.

And you're right, fork: Record Linkage is in fact an entire academic discipline! I had no idea; this is fascinating and helpful: http://en.wikipedia.org/wiki/Record_linkage

Thanks so much!

Dan
Dan Ancona <da <at> vizbang.com> writes:

> This is a choice between developing some in-house capacity for this
> and sending people to various vendors so we'll probably lean on the
> vendors for now, at least while we work on it.

I would try to do the record matching in house and see how far you get, even if you are talking to vendors concurrently. You might get lucky, and you will learn a lot about your data and how much to expect and pay for vendor solutions.

I would: Try building multicolumn indices on both tables for what you think are the same rows, and match deterministically (if you have a key like social security, then do this again on full names). Examine your data to see what hits, what misses, what hits multiple. If you know there is a "good" and an "iffy" table, you can use a left outer join; otherwise you need a full outer join. Then put all your leftovers from each into new tables, and try again with something fuzzy.

If you build the indices and use "=" and it is still slow, ask again here -- that shouldn't happen.

> And you're right fork, Record Linkage is in fact an entire academic
> discipline!

Indeed. Look for "blocking" and "editing" with your data first, I think.

I find this problem pretty interesting, so I would love to hear your results. I am right now matching building permits to assessor parcels... I wish I was using PG ...
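[The leftovers step fork describes -- deterministic match first, then collect the non-hits into a smaller table for the fuzzy pass -- might be sketched like this; the existing multicolumn indices (arall, vanall) already cover the deterministic join, and the table name van_leftover is made up:]

```sql
-- Rows from the voter table with no deterministic address match in
-- the assessor table. The LEFT OUTER JOIN keeps every voter row;
-- the IS NULL test selects the ones that found no partner.
CREATE TABLE van_leftover AS
SELECT v.*
FROM vanalameda v
LEFT OUTER JOIN aralameda a
  ON  v.streetno   = a.streetnum
  AND v.streetname = a.streetname
WHERE a.streetnum IS NULL;
```

[The fuzzy pass then runs against van_leftover only, which should be far smaller than the original 754k rows.]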