Thread: big joins not converging
Hi postgressers -
As part of my work with voter file data, I pretty regularly have to join one large-ish (over 500k rows) table to another. Sometimes the join is via a text field (countyname) + an integer (voter id). I've noticed that sometimes this converges and sometimes it doesn't, seemingly regardless of how I index things. So I'm looking for general thoughts on joining large tables, but I'm also running into a specific issue with the following, slightly different query:
This one is between two tables: a 754k row list of voters and a 445k row list of property owners. (I'm trying to find records where the owner name matches the voter name at the same address.) I have single-column btree indices built on all the relevant fields, and multicolumn indices built across all the columns I'm matching. The full schemas of both tables are below. The machine is an older-ish (about three years old) dual-core Pentium w/ 4GB RAM running FreeBSD; more details below.
This is the query I've come up with so far:
explain analyze
update vanalameda set ownerflag = 'exact'
from aralameda where
vanalameda.streetno ~~ aralameda.streetnum and
vanalameda.streetname ~~ aralameda.streetname and
vanalameda.lastname ~~ aralameda.ownername and
vanalameda.firstname ~~ aralameda.ownername;
With the analyze included, the query didn't complete after running overnight. If I drop the analyze and just explain, I get this:
"Nested Loop (cost=46690.74..15384448712.74 rows=204 width=204)"
" Join Filter: (((vanalameda.streetno)::text ~~ (aralameda.streetnum)::text) AND ((vanalameda.streetname)::text ~~ (aralameda.streetname)::text) AND ((vanalameda.lastname)::text ~~ (aralameda.ownername)::text) AND ((vanalameda.firstname)::text ~~ (aralameda.ownername)::text))"
" -> Seq Scan on vanalameda (cost=0.00..26597.80 rows=734780 width=204)"
" -> Materialize (cost=46690.74..58735.87 rows=444613 width=113)"
" -> Seq Scan on aralameda (cost=0.00..38647.13 rows=444613 width=113)"
One general question: does the width of the tables (i.e., the number of columns not being joined, and the size of those fields) matter? The tables do have a lot of extra columns that I could slice out.
Thanks so much!
Dan
System:
client: pgadmin III, Mac OS
server:
select version();
PostgreSQL 8.3.7 on i386-portbld-freebsd7.2, compiled by GCC cc (GCC) 4.2.1 20070719 [FreeBSD]
(installed from freebsd package system, default configuration)
%sysctl -a | egrep -i 'hw.machine|hw.model|hw.ncpu'
hw.machine: i386
hw.model: Genuine Intel(R) CPU 2160 @ 1.80GHz
hw.ncpu: 2
hw.machine_arch: i386
w/ 4GB RAM, 1 1GB disk, no RAID.
Here are the tables:
Table "public.aralameda"
Column | Type | Modifiers
-----------------+-----------------------+-----------
dt000o039001010 | character varying(13) |
o3901010 | character varying(15) |
dt17 | character varying(2) |
dt046 | character varying(3) |
streetnum | character varying(10) |
streetname | character varying(50) |
unitnum | character varying(10) |
city | character varying(30) |
zip | character varying(5) |
unk3 | character varying(1) |
crap1 | character varying(12) |
crap2 | character varying(12) |
crap3 | character varying(12) |
crap4 | character varying(12) |
crap5 | character varying(12) |
crap6 | character varying(12) |
crap7 | character varying(12) |
crap8 | character varying(12) |
crap9 | character varying(12) |
crap10 | character varying(12) |
dt2009 | character varying(4) |
dt066114 | character varying(6) |
crap11 | character varying(8) |
crap12 | character varying(8) |
ownername | character varying(50) |
careofname | character varying(50) |
unk4 | character varying(1) |
maddr1 | character varying(60) |
munitnum | character varying(10) |
mcitystate | character varying(30) |
mzip | character varying(5) |
mplus4 | character varying(4) |
dt40 | character varying(2) |
dt4 | character varying(1) |
crap13 | character varying(8) |
d | character varying(1) |
dt0500 | character varying(4) |
unk6 | character varying(1) |
crap14 | character varying(8) |
unk7 | character varying(1) |
Indexes:
"arall" btree (streetnum, streetname, ownername)
"aroname" btree (ownername)
"arstreetname" btree (streetname)
"arstreetnum" btree (streetnum)
Table "public.vanalameda"
Column | Type | Modifiers
---------------+-----------------------+-----------
vanid | character varying(8) |
lastname | character varying(25) |
firstname | character varying(16) |
middlename | character varying(16) |
suffix | character varying(3) |
streetno | character varying(5) |
streetnohalf | character varying(3) |
streetprefix | character varying(2) |
streetname | character varying(24) |
streettype | character varying(4) |
streetsuffix | character varying(2) |
apttype | character varying(4) |
aptno | character varying(8) |
city | character varying(13) |
state | character varying(2) |
zip5 | character varying(5) |
zip4 | character varying(4) |
vaddress | character varying(33) |
maddress | character varying(41) |
mcity | character varying(25) |
mstate | character varying(2) |
mzip5 | character varying(5) |
mzip4 | character varying(4) |
mstreetno | character varying(6) |
mstreetnohalf | character varying(9) |
mstreetprefix | character varying(2) |
mstreetname | character varying(40) |
mstreettype | character varying(4) |
mstreetsuffix | character varying(2) |
mapttype | character varying(4) |
maptno | character varying(13) |
dob | character varying(10) |
countyfileid | character varying(7) |
countyid | character varying(3) |
affno | character varying(12) |
ownerflag | character varying(20) |
Indexes:
"vanall" btree (streetno, streetname, lastname, firstname)
"vanfname" btree (firstname)
"vanlname" btree (lastname)
"vanstreetname" btree (streetname)
"vanstreetno" btree (streetno)
On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:

> [original message quoted in full; snipped]

Is there any reason you're using '~~' to compare values, rather than '='?

If you're intentionally using LIKE-style comparisons then there are some other things you can do, but I don't think you mean to do that, for streetno and streetname anyway. Switching to an equality comparison should let your query use an index, most usefully one on (streetname, streetnum) probably.

I'm not sure what you're intending by comparing ownername to both firstname and lastname. I don't think that'll do anything useful, and I doubt it'll ever match. Are you expecting firstname and lastname to be substrings of ownername? If so, you might need to use wildcards with the like.

(Also, performance and smart use of indexes tends to get better in newer versions of postgresql. You might want to upgrade to 9.0.3 too.)

Cheers,
  Steve
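[A sketch of what Steve suggests might look like the following. The LIKE-wildcard handling of ownername is an assumption on my part -- the thread never settles how ownername is formatted -- and the index name is made up:]

```sql
-- Hypothetical index supporting the equality join, per Steve's
-- suggestion of (streetname, streetnum).
CREATE INDEX ar_addr ON aralameda (streetname, streetnum);

-- Equality on the address columns (index-friendly); substring
-- matching on the name, assuming lastname/firstname each appear
-- somewhere inside ownername. This is a guess, not the thread's fix.
UPDATE vanalameda v
SET ownerflag = 'exact'
FROM aralameda a
WHERE v.streetno   = a.streetnum
  AND v.streetname = a.streetname
  AND a.ownername LIKE '%' || v.lastname || '%'
  AND a.ownername LIKE '%' || v.firstname || '%';
```

[Note the leading-wildcard LIKEs on ownername still can't use a plain btree index; the equality conditions on the address columns are what let the planner avoid the full nested-loop cross product.]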
Steve Atkins <steve <at> blighty.com> writes:

> On Mar 10, 2011, at 1:25 PM, Dan Ancona wrote:
>
>> Hi postgressers -
>>
>> As part of my work with voter file data, I pretty regularly have to join one large-ish (over 500k rows) table to another. Sometimes this is via a text field (countyname) + integer (voter id). I've noticed sometimes this converges and sometimes it doesn't, seemingly regardless of how I index things.

By "converge" you mean "finish running" -- "converge" has a lot of other overtones for us amateur math types.

Note that I think you are doing "record linkage", which is a stepchild academic discipline of its own these days. It might bear some research. There is also a CDC matching program for text files, freely downloadable for Windows (ack), if you hunt for it.

For now, my first thought is that you should try a few different matches, maybe via PL/pgSQL functions, cascading the non-hits to the next step in the process while shrinking your tables. Upcase and delete all spaces, etc. First use equality on all columns, which should be able to use indices, and separate those records. Then try equality on a few columns. Then try some super fuzzy regexes on a few columns. Etc.

You will also have to give some thought to scoring a match, with perfection a one, but, say, name and birthday the same with all else different a .75, etc.

Also, soundex(), levenshtein(), and other fuzzy string tools are your friend. I want to write a version of SAS's COMPGED for Postgres, but I haven't gotten round to it yet.
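[The cascade fork describes could be sketched along these lines. The "lastname firstname" normalization of ownername and the distance threshold are assumptions for illustration; soundex() and levenshtein() come from contrib/fuzzystrmatch, which must be installed:]

```sql
-- Pass 1: strict equality on normalized values (can use indices).
UPDATE vanalameda v
SET ownerflag = 'exact'
FROM aralameda a
WHERE upper(v.streetno)   = upper(a.streetnum)
  AND upper(v.streetname) = upper(a.streetname)
  AND upper(a.ownername)  = upper(v.lastname || ' ' || v.firstname);

-- Pass 2: only the leftovers, with fuzzier comparisons.
-- soundex() tolerates spelling variation in the street name;
-- levenshtein() <= 3 is an arbitrary illustrative threshold.
UPDATE vanalameda v
SET ownerflag = 'fuzzy'
FROM aralameda a
WHERE v.ownerflag IS NULL
  AND v.streetno = a.streetnum
  AND soundex(v.streetname) = soundex(a.streetname)
  AND levenshtein(upper(v.lastname || ' ' || v.firstname),
                  upper(a.ownername)) <= 3;
```

[Each pass shrinks the set the next, more expensive pass has to consider, which is the point of the cascade.]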
On Mar 10, 2011, at 3:48 PM, fork wrote:

[much thoughtfulness]

> Steve Atkins <steve <at> blighty.com> writes:
> [also much thoughtfulness]

Steve and fork -- thank you, this is super helpful. I meant to tweak that exact search before sending this around; sorry if that was confusing. That was meant to be a placeholder for [some set of matches that works]. And yes, "not converging" was incorrect, I did mean "not finishing."

Taken together, your answers make it pretty clear that there's no particularly obvious easy solution I'm missing; this really is kind of tricky. This is a choice between developing some in-house capacity for this and sending people to various vendors, so we'll probably lean on the vendors for now, at least while we work on it. I've gotten my head partway around PL/pgSQL functions; I may give that another try.

And you're right, fork: Record Linkage is in fact an entire academic discipline! I had no idea; this is fascinating and helpful: http://en.wikipedia.org/wiki/Record_linkage

Thanks so much!

Dan
Dan Ancona <da <at> vizbang.com> writes:

> This is a choice between developing some in-house capacity for this
> and sending people to various vendors so we'll probably lean on the
> vendors for now, at least while we work on it.

I would try to do the record matching in house and see how far you get, even if you are talking to vendors concurrently. You might get lucky, and you will learn a lot about your data and how much to expect and pay for vendor solutions.

I would: Try building multicolumn indices on both tables for what you think are the same rows, and match deterministically (if you have a key like social security, then do this again on full names). Examine your data to see what hits, what misses, what hits multiple. If you know there is a "good" and an "iffy" table, you can use a left outer join; otherwise you need a full outer join. Then put all your leftovers from each into new tables, and try again with something fuzzy.

If you build the indices and use "=" and it is still slow, ask again here -- that shouldn't happen.

> And you're right fork, Record Linkage is in fact an entire academic
> discipline!

Indeed. Look for "blocking" and "editing" with your data first, I think.

I find this problem pretty interesting, so I would love to hear your results. I am right now matching building permits to assessor parcels... I wish I was using PG ...
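[The leftovers step fork describes -- deterministic match first, then collect the non-hits into a smaller table for the fuzzy pass -- might be sketched like this; the existing multicolumn indices (arall, vanall) already cover the deterministic join, and the table name van_leftover is made up:]

```sql
-- Rows from the voter table with no deterministic address match in
-- the assessor table. The LEFT OUTER JOIN keeps every voter row;
-- the IS NULL test selects the ones that found no partner.
CREATE TABLE van_leftover AS
SELECT v.*
FROM vanalameda v
LEFT OUTER JOIN aralameda a
  ON  v.streetno   = a.streetnum
  AND v.streetname = a.streetname
WHERE a.streetnum IS NULL;
```

[The fuzzy pass then runs against van_leftover only, which should be far smaller than the original 754k rows.]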