Thread: Q: Table scans on set difference
What's happening here? I have two tables, encounter_properties_table with about 100000 rows and xfiles with about 500000 rows. The structures of these tables is as follows: Table "public.encounter_properties_table" Column | Type | Modifiers ----------------+--------------------------+----------- timestamp | timestamp with time zone | not null practice_id | integer | not null patient_id | bigint | not null properties | text | modified_by | bigint | not null client_version | integer | file_name | character varying(255) | Indexes: "encounter_properties_table_pkey" primary key, btree (patient_id) "fn_ix" btree (file_name) and Table "public.xfiles" Column | Type | Modifiers ----------+------------------------+----------- filename | character varying(100) | not null Indexes: "xfiles_ix1" btree (filename) The following query shows that PostgreSQL 7.4 is doing table scans on both tables: explain select file_name from encounter_properties_table where file_name not in (select filename from xfiles); QUERY PLAN ---------------------------------------------------------------------------- ------------ Seq Scan on encounter_properties_table (cost=0.00..1030610198.10 rows=85828 width=58) Filter: (NOT (subplan)) SubPlan -> Seq Scan on xfiles (cost=0.00..10755.44 rows=500944 width=59) (4 rows) I ran vacumm analyze on both tables. We aborted this query when it had not finished after 4 hours. We ran the same query on SQLServer 2005 with the same data and it took under one second to finish. Any ideas?
Attachment
G. Ralph Kuntz, MD wrote: > What's happening here? > > I have two tables, encounter_properties_table with about 100000 rows and > xfiles with about 500000 rows. The structures of these tables is as follows: > file_name | character varying(255) | > Table "public.xfiles" > Column | Type | Modifiers > ----------+------------------------+----------- > filename | character varying(100) | not null These columns are of different types, you're forcing a typecast on every row comparison; I think the varchar(100)'s will be upscaled to varchar(255) on comparison. My advice: use the text type. It's more flexible (practically no size limit) and faster. > explain select file_name from encounter_properties_table where file_name not > in (select filename from xfiles); What about: explain select file_name from encounter_properties_table where not exists ( select file_name from xfiles where filename = file_name); I often even use "select 1" - a constant - because I'm not interested in the value, but apparently selecting a column is marginally faster than selecting a constant. Testing will prove it, I thought I'd mention the possibilit. > I ran vacumm analyze on both tables. > > We aborted this query when it had not finished after 4 hours. Probably due to the type cast. We used to run into this problem when using bigint index columns. We changed them into int (which was sufficient) and the speed went up a lot. Later we determined - with input from this list - that the cause wasn't the size of the column but the type casting required to match the constant integer values in our queries. In our case explicit casting of our constant values helped. > We ran the same query on SQLServer 2005 with the same data and it took under > one second to finish. -- Alban Hertroys alban@magproductions.nl magproductions b.v. T: ++31(0)534346874 F: ++31(0)534346876 M: I: www.magproductions.nl A: Postbus 416 7500 AK Enschede // Integrate Your World //
Alban Hertroys <alban@magproductions.nl> writes: > G. Ralph Kuntz, MD wrote: >> explain select file_name from encounter_properties_table where file_name not >> in (select filename from xfiles); > What about: > explain select file_name from encounter_properties_table > where not exists ( > select file_name from xfiles where filename = file_name); If you only need the file name, an EXCEPT would probably work much better: select file_name from encounter_properties_table except select filename from xfiles; Another possibility is to abuse the outer join machinery: select file_name, ... from encounter_properties_table l left join xfiles r on l.file_name = r.filename where r.filename is null; Generally speaking, NOT IN performance is going to suck unless the sub-select is small enough to fit in a hashtable. You could consider increasing work_mem enough that it would fit, but with 500K filenames needed, that's probably not going to win. regards, tom lane