Thread: Question on a select
Hi all, This is my first post here so please let me know if I miss any list guidelines. :) I was hoping to get some help, advice or pointers to an answer for a somewhat odd (to me at least) SELECT. What I am trying to do is select that values from one table where matching values do not exist in another table. For example: Let's say 'table_a' has the columns 'a_name, a_type, a_dir, a_<others>' and 'table_b' has the columns 'b_name, b_type, b_dir, b_<others>' where 'others' are columns unique to each table. What I need to do is select all the values in 'a_name, a_type, a_dir' from 'table_a' where there is no matching entries in "table_b's" 'b_name, b_type, b_dir'. I know I could do something like: SELECT a_name, a_type, a_dir FROM table_a; and then loop through all the returned values and for each do a matching select from 'table_b' and use my program to catch the ones not in 'table_b'. This is not very efficient though and I will be searching through tables that could have several hundred thousand entries so the inefficiency would be amplified. Is there some way to use a join or something similar to do this? Thank you all! Madison
There are several ways. I am making the simplifying assumption that name, type and dir cannot be NULL in either table. If they are the query is a little more complicated. The following are a couple of many techniques. SELECT a.a_name , a.a_type , a.a_dir FROM a_table a WHERE NOT EXISTS ( SELECT NULL FROM b_table b WHERE b.b_name = a.a_name AND b.b_type = a.a_type AND b.b_dir = a.a_dir ) SELECT a.a_name , a.a_type , a.a_dir FROM a_table a LEFT JOIN b_table b ON a.a_table = b.b_table AND a.a_type = b.b_type AND a.a_dir = b.b_type WHERE b.b_table IS NULL // assumes that b.b_table is a not null column. Let's say that dir could be null and dir is a string, then (assuming that dir can never be 'xyz') you could say something like COALESCE(a.a_dir,'xyz') = COALESCE(b.b_dir,'xyz') Since NULL never equal NULL, if you want NULL in one table to match a NULL in another table, you need to change it to something not NULL. However this depends on what you want in your application. Queries like this are used often to check the integrity of your data. Examples of this are 1) What orders don't have order items? 2) What books have no authors? etc. ----- Original Message ----- From: "Madison Kelly" <linux@alteeve.com> To: "PgSQL General List" <pgsql-general@postgresql.org> Sent: Saturday, January 01, 2005 7:32 PM Subject: [GENERAL] Question on a select > Hi all, > > This is my first post here so please let me know if I miss any list > guidelines. :) > > I was hoping to get some help, advice or pointers to an answer for a > somewhat odd (to me at least) SELECT. What I am trying to do is select > that values from one table where matching values do not exist in another > table. > > For example: > > Let's say 'table_a' has the columns 'a_name, a_type, a_dir, a_<others>' > and 'table_b' has the columns 'b_name, b_type, b_dir, b_<others>' where > 'others' are columns unique to each table. What I need to do is select all > the values in 'a_name, a_type, a_dir' from 'table_a' where there is no > matching entries in "table_b's" 'b_name, b_type, b_dir'. > > I know I could do something like: > > SELECT a_name, a_type, a_dir FROM table_a; > > and then loop through all the returned values and for each do a matching > select from 'table_b' and use my program to catch the ones not in > 'table_b'. This is not very efficient though and I will be searching > through tables that could have several hundred thousand entries so the > inefficiency would be amplified. Is there some way to use a join or > something similar to do this? > > Thank you all! > > Madison > > ---------------------------(end of broadcast)--------------------------- > TIP 6: Have you searched our list archives? > > http://archives.postgresql.org >
On Sat, Jan 01, 2005 at 22:32:17 -0500, Madison Kelly <linux@alteeve.com> wrote: > Hi all, > > This is my first post here so please let me know if I miss any list > guidelines. :) > > I was hoping to get some help, advice or pointers to an answer for a > somewhat odd (to me at least) SELECT. What I am trying to do is select > that values from one table where matching values do not exist in another > table. > > For example: > > Let's say 'table_a' has the columns 'a_name, a_type, a_dir, > a_<others>' and 'table_b' has the columns 'b_name, b_type, b_dir, > b_<others>' where 'others' are columns unique to each table. What I need > to do is select all the values in 'a_name, a_type, a_dir' from 'table_a' > where there is no matching entries in "table_b's" 'b_name, b_type, b_dir'. SELECT a_name, a_type, a_dir, a_<others> FROM table_a WHERE a_name, a_type, a_dir NOT IN ( SELECT b_name, b_type, b_dir FROM table_b) ; In pre 7.4 versions or if there are NULLs in the key columns for table_b then you probably want to use NOT EXISTS (with a moodified WHERE clause) instead on NOT IN.
Vincent Hikida wrote: > There are several ways. I am making the simplifying assumption that > name, type and dir cannot be NULL in either table. If they are the query > is a little more complicated. > > The following are a couple of many techniques. > > SELECT a.a_name > , a.a_type > , a.a_dir > FROM a_table a > WHERE NOT EXISTS > ( SELECT NULL > FROM b_table b > WHERE b.b_name = a.a_name > AND b.b_type = a.a_type > AND b.b_dir = a.a_dir > ) > > SELECT a.a_name > , a.a_type > , a.a_dir > FROM a_table a > LEFT JOIN b_table b > ON a.a_table = b.b_table > AND a.a_type = b.b_type > AND a.a_dir = b.b_type > WHERE b.b_table IS NULL // assumes that > b.b_table is a not null column. > > Let's say that dir could be null and dir is a string, then (assuming > that dir can never be 'xyz') you could say something like > > COALESCE(a.a_dir,'xyz') = COALESCE(b.b_dir,'xyz') > > Since NULL never equal NULL, if you want NULL in one table to match a > NULL in another table, you need to change it to something not NULL. > However this depends on what you want in your application. > > Queries like this are used often to check the integrity of your data. > Examples of this are 1) What orders don't have order items? 2) What > books have no authors? etc. This is wonderful! Thank you for responding so quickly! :) I should mention that I am still very much learning so I apologise in advance if I miss the obvious. ^.^; They are all 'not null' and I am trying to do exactly the kind of task you described. I tried the first example on my DB and got a syntax error: tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM file_info_1 a WHERE NOT EXIST (SELECT NULL FROM file_set_1 b WHERE b.fs_name=a.file_name, b.fs_parent_dir=a.file_parent_dir, b.fs_type=a.file_type); ERROR: syntax error at or near "SELECT" at character 88 The second example you gave seems to work perfectly (as I will show near the bottom of this email). What are the benefits and down sides of each method? Is there a simple reason why the first method failed (probably a typo I imagine...)? A little more specifics about my DB: 'file_info_1' and 'file_set_1' are two tables I use to store information of files and directories (this is a backup program). 'file_info_#' stores dynamic info like file size, owner and such. This table is dropped and recreated before a new scan of the partition creates a mass 'COPY' load (the '_1' indicates the first partition). 'file_set_#' stores static information such as "has the file been selected for backup" which is why I keep it in a separate table. I want to run this select first to write entries for newly added files and directories (the values will match the file's parent) and then again in reverse to remove from 'file_set_#' entries that no longer exist on the partition. If it helps, here is the structure of the tables: CREATE TABLE file_info_ID ( file_acc_time bigint not null, file_group_name varchar(255) not null, file_group_uid int not null, file_mod_time bigint not null, file_name varchar(255) not null, file_parent_dir varchar(255) not null, file_perm varchar(10) not null, file_size bigint not null, file_type varchar(2) not null default 'f', file_user_name varchar(255) not null, file_user_uid int not null ); CREATE TABLE file_set_# ( fs_backup boolean not null default 't', fs_display boolean not null default 'f', fs_name varchar(255) not null, fs_parent_dir varchar(255) not null, fs_restore boolean not null default 'f', fs_type varchar(2) not null default 'f' ); And here is some sample data that I have to work with (yes, it's a win2k partition... I use it to test other aspects of my program and, if I blow it away, I won't be upset. ^.^; All of this is being done on a Fedora Core 3 install in case it makes a difference): tle-bu=> SELECT file_type, file_parent_dir, file_name FROM file_info_1 WHERE file_parent_dir='/' LIMIT 30; file_type | file_parent_dir | file_name -----------+-----------------+------------------------ d | / | . d | / | downloads d | / | Documents and Settings d | / | Program Files f | / | io.sys f | / | msdos.sys f | / | _NavCClt.Log d | / | WUTemp d | / | Recycled f | / | pagefile.sys d | / | winnt f | / | ntldr f | / | ntdetect.com f | / | boot.ini f | / | config.sys f | / | autoexec.bat f | / | t5r4e3w2q1.exe f | / | 1q2w3e4r5t.exe f | / | logon.exe f | / | arcldr.exe f | / | arcsetup.exe (21 rows) tle-bu=> SELECT fs_type, fs_parent_dir, fs_name FROM file_set_1 WHERE fs_parent_dir='/' LIMIT 30; fs_type | fs_parent_dir | fs_name ---------+---------------+------------------------ d | / | . d | / | downloads d | / | Documents and Settings d | / | Program Files d | / | WUTemp d | / | Recycled d | / | winnt (7 rows) In this example I deleted manually all the 'f' entries so that when I do the select I should get: file_type | file_parent_dir | file_name -----------+-----------------+------------------------ f | / | io.sys f | / | msdos.sys f | / | _NavCClt.Log f | / | pagefile.sys f | / | ntldr f | / | ntdetect.com f | / | boot.ini f | / | config.sys f | / | autoexec.bat f | / | t5r4e3w2q1.exe f | / | 1q2w3e4r5t.exe f | / | logon.exe f | / | arcldr.exe f | / | arcsetup.exe Which is exactly what your second example provides: tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM file_info_1 a LEFT JOIN file_set_1 b ON a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type WHERE b.fs_name IS NULL; file_name | file_parent_dir | file_type ----------------+-----------------+----------- 1q2w3e4r5t.exe | / | f arcldr.exe | / | f arcsetup.exe | / | f autoexec.bat | / | f boot.ini | / | f config.sys | / | f io.sys | / | f logon.exe | / | f msdos.sys | / | f _NavCClt.Log | / | f ntdetect.com | / | f ntldr | / | f pagefile.sys | / | f t5r4e3w2q1.exe | / | f (14 rows) Thank you very much for your help! Madison
Bruno Wolff III wrote: > SELECT a_name, a_type, a_dir, a_<others> FROM table_a > WHERE a_name, a_type, a_dir NOT IN ( > SELECT b_name, b_type, b_dir FROM table_b) > ; > > In pre 7.4 versions or if there are NULLs in the key columns for table_b > then you probably want to use NOT EXISTS (with a moodified WHERE clause) > instead on NOT IN. Hi Bruno, Thank you for replying! I tried your example but I am getting a syntax error: tle-bu=> SELECT file_name, file_parent_dir, file_type FROM file_info_1 WHERE file_name, file_parent_dir, file_type NOT IN (SELECT fs_name, fs_parent_dir, fs_type FROM file_set_1); ERROR: syntax error at or near "," at character 78 I just replied to Vincent's post with a lot of detail on what I am trying to do and how my DB is constructed. His second example worked but I also had a syntax error on his first example. This program will be working with very large data sets so I would love to get your method working so that I could try benchmarking them to see which, in my application, would be most effective. Thank you very kindly for helping! Madison
On Sun, Jan 02, 2005 at 01:58:20 -0500, Madison Kelly <linux@alteeve.com> wrote: > Bruno Wolff III wrote: > >SELECT a_name, a_type, a_dir, a_<others> FROM table_a > > WHERE a_name, a_type, a_dir NOT IN ( > > SELECT b_name, b_type, b_dir FROM table_b) > >; > > > >In pre 7.4 versions or if there are NULLs in the key columns for table_b > >then you probably want to use NOT EXISTS (with a moodified WHERE clause) > >instead on NOT IN. > > Hi Bruno, > > Thank you for replying! I tried your example but I am getting a > syntax error: > > tle-bu=> SELECT file_name, file_parent_dir, file_type FROM file_info_1 > WHERE file_name, file_parent_dir, file_type NOT IN (SELECT fs_name, > fs_parent_dir, fs_type FROM file_set_1); > ERROR: syntax error at or near "," at character 78 There should be parenthesis around the list to test. WHERE a_name, a_type, a_dir NOT IN ( should be WHERE (a_name, a_type, a_dir) NOT IN ( > > I just replied to Vincent's post with a lot of detail on what I am > trying to do and how my DB is constructed. His second example worked but > I also had a syntax error on his first example. This program will be > working with very large data sets so I would love to get your method > working so that I could try benchmarking them to see which, in my > application, would be most effective. I believe that the NOT IN query should run comparably to the LEFT JOIN example supplied by the other person (at least in recent versions of Postgres). I would expect this to run faster than using NOT EXISTS. You probably want to try all 3. The semantics of the three ways of doing this are not all equivalent if there are NULLs in the data being used to eliminate rows. As you indicated you don't have NULLs this shouldn't be a problem. Another way to write this is using set different (EXCEPT or EXCEPT ALL) using the key fields and then joining back to table a to pick up the other fields. However this will almost certianly be slower than the other methods.
Bruno Wolff III wrote: > There should be parenthesis around the list to test. > WHERE a_name, a_type, a_dir NOT IN ( > should be > WHERE (a_name, a_type, a_dir) NOT IN ( That did it (I think)! > > I believe that the NOT IN query should run comparably to the LEFT JOIN > example supplied by the other person (at least in recent versions of > Postgres). I would expect this to run faster than using NOT EXISTS. > You probably want to try all 3. The semantics of the three ways of doing > this are not all equivalent if there are NULLs in the data being used > to eliminate rows. As you indicated you don't have NULLs this shouldn't > be a problem. > > Another way to write this is using set different (EXCEPT or EXCEPT ALL) > using the key fields and then joining back to table a to pick up the > other fields. However this will almost certianly be slower than the > other methods. Something odd, now that I have the other method working (I think)... tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM file_info_1 a LEFT JOIN file_set_1 b ON a.file_name=b.fs_name AND a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type WHERE b.fs_name IS NULL; returns the results in roughly 1 or 2 seconds on a test data set of 15,000 entries. I have an index on both 'file_info_1' covering 'fs_name', 'fs_parent_dir' and 'fs_type' and on 'file_set_1' covering 'file_name', 'file_parent_dir' and 'file_type'. When I try the seconds method though: tle-bu=> SELECT file_name, file_parent_dir, file_type FROM file_info_1 WHERE (file_name, file_parent_dir, file_type) NOT IN (SELECT fs_name, fs_parent_dir, fs_type FROM file_set_1); It took so long to process that after roughly three minutes I stopped the query for fear of overheating my laptop (which happend a while back forcing a thermal shut down). The indexes are: CREATE INDEX file_info_#_display_idx ON file_info_# (file_type, file_parent_dir, file_name); CREATE INDEX file_set_#_sync_idx ON file_set_# (fs_name, fs_parent_dir, fs_type) Are these not effective for the second query? If not, what should I change or add? If so, would you have any insight into why there is such an incredible difference in performance? Thanks very much again!! Madison
> > They are all 'not null' and I am trying to do exactly the kind of task > you described. I tried the first example on my DB and got a syntax error: > > tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM > file_info_1 a WHERE NOT EXIST (SELECT NULL FROM file_set_1 b WHERE > b.fs_name=a.file_name, b.fs_parent_dir=a.file_parent_dir, > b.fs_type=a.file_type); > ERROR: syntax error at or near "SELECT" at character 88 > I've quickly read the thread and I don't think you got an answer as to why you are getting a syntax error here. Your query shows something line WHERE b.fs_name=a.file_name, b.fs_parent_dir=a.file_parent_dir you need to put an AND instead of a comma: WHERE b.fs_name=a.file_name AND b.fs_parent_dir=a.file_parent_dir As for which of the queries is best I don't know. My background is as an Oracle developer. I think that Bruno already suggested testing the three queries. There is a trace utility which shows some of what happens under the covers of a query. I've used it extensively in Oracle but have never used it in Postgresql. If I understand what you said, the NOT IN was significantly slower. That has been my experience in Oracle long time ago so I've tended to shy away from that syntax. I'm sure optimizers are much better now then when I experimented with NOT IN but my coworker who tried it in Oracle was getting a slower response than with a subselect about a year ago. Theoretically if 3 queries are logically equivalent as the three queries you've been given, an optimizer should find the same best query plan to execute it. I don't think that optimizers are that smart yet. The outer join is probably doing either a sort merge or a hash join. In your application this should be the best option. (A sort merge sorts both tables first or at least the key columns and then merges the tables together.) Bruno said that the subselect would be slower. It may be that he thinks it will do a nested loop. That is that it will read each row in table A and try to find that concatenated key in table B's index. I don't think that a nested loop would be very good in your particular application. As for the indexes you set up, I think they are correct indexes. Vincent
> > The indexes are: > > CREATE INDEX file_info_#_display_idx ON file_info_# (file_type, > file_parent_dir, file_name); > CREATE INDEX file_set_#_sync_idx ON file_set_# (fs_name, fs_parent_dir, > fs_type) > > Are these not effective for the second query? If not, what should I > change or add? If so, would you have any insight into why there is such an > incredible difference in performance? > I didn't look at your indexes closely enough. When you have concatenated index, you want to have the most selective colum first. I guess that file_type is not very selective. file_name is probably the most selective. In the above, the index on file_set_# is optimal. The index on file_info_# is suboptimal. However, if the query is doing a hash join or sort merge, an index is not used so the index doesn't matter. However, you probably do other queries that do use the index so it should be fixed. Vincent
Vincent Hikida wrote: > >> >> The indexes are: >> >> CREATE INDEX file_info_#_display_idx ON file_info_# (file_type, >> file_parent_dir, file_name); >> CREATE INDEX file_set_#_sync_idx ON file_set_# (fs_name, >> fs_parent_dir, fs_type) >> >> Are these not effective for the second query? If not, what should I >> change or add? If so, would you have any insight into why there is >> such an incredible difference in performance? >> > > I didn't look at your indexes closely enough. When you have concatenated > index, you want to have the most selective colum first. I guess that > file_type is not very selective. file_name is probably the most > selective. In the above, the index on file_set_# is optimal. The index > on file_info_# is suboptimal. > > However, if the query is doing a hash join or sort merge, an index is > not used so the index doesn't matter. However, you probably do other > queries that do use the index so it should be fixed. > > Vincent Thank you, Vincent! I didn't realize that the order made a difference. A sign of how much learning I need to do. :p For reference, I think 'file_parent_dir' and 'fs_parent_dir' are the most important because I do an 'ORDER BY [fs|file]_parent_dir ASC' on most queries. I've made the changes, thank you again! Madison
> I didn't realize that the order made a difference. A sign of how much > learning I need to do. :p For reference, I think 'file_parent_dir' and > 'fs_parent_dir' are the most important because I do an 'ORDER BY > [fs|file]_parent_dir ASC' on most queries. I've made the changes, thank > you again! If you SELECT ... WHERE condition on A order by B : an index on A will be used, but an index on B won't If you SELECT ... WHERE condition on A order by A, B : an index on A,B will be used because it will give the rows in already sorted order