Thread: Question on a select

Question on a select

From
Madison Kelly
Date:
Hi all,

   This is my first post here so please let me know if I miss any list
guidelines. :)

   I was hoping to get some help, advice or pointers to an answer for a
somewhat odd (to me at least) SELECT. What I am trying to do is select
that values from one table where matching values do not exist in another
table.

For example:

   Let's say 'table_a' has the columns 'a_name, a_type, a_dir,
a_<others>' and 'table_b' has the columns 'b_name, b_type, b_dir,
b_<others>' where 'others' are columns unique to each table. What I need
to do is select all the values in 'a_name, a_type, a_dir' from 'table_a'
where there is no matching entries in "table_b's" 'b_name, b_type, b_dir'.

   I know I could do something like:

SELECT a_name, a_type, a_dir FROM table_a;

   and then loop through all the returned values and for each do a
matching select from 'table_b' and use my program to catch the ones not
in 'table_b'. This is not very efficient though and I will be searching
through tables that could have several hundred thousand entries so the
inefficiency would be amplified. Is there some way to use a join or
something similar to do this?

   Thank you all!

Madison

Re: Question on a select

From
"Vincent Hikida"
Date:
There are several ways. I am making the simplifying assumption that name,
type and dir cannot be NULL in either table. If they are the query is a
little more complicated.

The following are a couple of many techniques.

SELECT a.a_name
             , a.a_type
             , a.a_dir
     FROM a_table a
  WHERE NOT EXISTS
              ( SELECT NULL
                     FROM b_table b
                  WHERE b.b_name      = a.a_name
                        AND b.b_type        = a.a_type
                        AND b.b_dir          = a.a_dir
              )

SELECT a.a_name
             ,  a.a_type
             ,  a.a_dir
    FROM a_table      a
                 LEFT JOIN b_table b
                        ON  a.a_table    = b.b_table
                      AND a.a_type     = b.b_type
                      AND a.a_dir       = b.b_type
 WHERE b.b_table IS NULL                           // assumes that b.b_table
is a not null column.

Let's say that dir could be null and dir is a string, then (assuming that
dir can never be 'xyz') you could say something like

COALESCE(a.a_dir,'xyz')  = COALESCE(b.b_dir,'xyz')

Since NULL never equal NULL, if you want NULL in one table to match a NULL
in another table, you need to change it to something not NULL. However this
depends on what you want in your application.

Queries like this are used often to check the integrity of your data.
Examples of this are 1) What orders don't have order items?  2) What books
have no authors? etc.


----- Original Message -----
From: "Madison Kelly" <linux@alteeve.com>
To: "PgSQL General List" <pgsql-general@postgresql.org>
Sent: Saturday, January 01, 2005 7:32 PM
Subject: [GENERAL] Question on a select


> Hi all,
>
>   This is my first post here so please let me know if I miss any list
> guidelines. :)
>
>   I was hoping to get some help, advice or pointers to an answer for a
> somewhat odd (to me at least) SELECT. What I am trying to do is select
> that values from one table where matching values do not exist in another
> table.
>
> For example:
>
>   Let's say 'table_a' has the columns 'a_name, a_type, a_dir, a_<others>'
> and 'table_b' has the columns 'b_name, b_type, b_dir, b_<others>' where
> 'others' are columns unique to each table. What I need to do is select all
> the values in 'a_name, a_type, a_dir' from 'table_a' where there is no
> matching entries in "table_b's" 'b_name, b_type, b_dir'.
>
>   I know I could do something like:
>
> SELECT a_name, a_type, a_dir FROM table_a;
>
>   and then loop through all the returned values and for each do a matching
> select from 'table_b' and use my program to catch the ones not in
> 'table_b'. This is not very efficient though and I will be searching
> through tables that could have several hundred thousand entries so the
> inefficiency would be amplified. Is there some way to use a join or
> something similar to do this?
>
>   Thank you all!
>
> Madison
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>               http://archives.postgresql.org
>


Re: Question on a select

From
Bruno Wolff III
Date:
On Sat, Jan 01, 2005 at 22:32:17 -0500,
  Madison Kelly <linux@alteeve.com> wrote:
> Hi all,
>
>   This is my first post here so please let me know if I miss any list
> guidelines. :)
>
>   I was hoping to get some help, advice or pointers to an answer for a
> somewhat odd (to me at least) SELECT. What I am trying to do is select
> that values from one table where matching values do not exist in another
> table.
>
> For example:
>
>   Let's say 'table_a' has the columns 'a_name, a_type, a_dir,
> a_<others>' and 'table_b' has the columns 'b_name, b_type, b_dir,
> b_<others>' where 'others' are columns unique to each table. What I need
> to do is select all the values in 'a_name, a_type, a_dir' from 'table_a'
> where there is no matching entries in "table_b's" 'b_name, b_type, b_dir'.

SELECT a_name, a_type, a_dir, a_<others> FROM table_a
  WHERE a_name, a_type, a_dir NOT IN (
    SELECT b_name, b_type, b_dir FROM table_b)
;

In pre 7.4 versions or if there are NULLs in the key columns for table_b
then you probably want to use NOT EXISTS (with a moodified WHERE clause)
instead on NOT IN.

Re: Question on a select

From
Madison Kelly
Date:
Vincent Hikida wrote:
> There are several ways. I am making the simplifying assumption that
> name, type and dir cannot be NULL in either table. If they are the query
> is a little more complicated.
>
> The following are a couple of many techniques.
>
> SELECT a.a_name
>             , a.a_type
>             , a.a_dir
>     FROM a_table a
>  WHERE NOT EXISTS
>              ( SELECT NULL
>                     FROM b_table b
>                  WHERE b.b_name      = a.a_name
>                        AND b.b_type        = a.a_type
>                        AND b.b_dir          = a.a_dir
>              )
>
> SELECT a.a_name
>             ,  a.a_type
>             ,  a.a_dir
>    FROM a_table      a
>                 LEFT JOIN b_table b
>                        ON  a.a_table    = b.b_table
>                      AND a.a_type     = b.b_type
>                      AND a.a_dir       = b.b_type
> WHERE b.b_table IS NULL                           // assumes that
> b.b_table is a not null column.
>
> Let's say that dir could be null and dir is a string, then (assuming
> that dir can never be 'xyz') you could say something like
>
> COALESCE(a.a_dir,'xyz')  = COALESCE(b.b_dir,'xyz')
>
> Since NULL never equal NULL, if you want NULL in one table to match a
> NULL in another table, you need to change it to something not NULL.
> However this depends on what you want in your application.
>
> Queries like this are used often to check the integrity of your data.
> Examples of this are 1) What orders don't have order items?  2) What
> books have no authors? etc.

   This is wonderful! Thank you for responding so quickly! :) I should
mention that I am still very much learning so I apologise in advance if
I miss the obvious. ^.^;

   They are all 'not null' and I am trying to do exactly the kind of
task you described. I tried the first example on my DB and got a syntax
error:

tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM
file_info_1 a WHERE NOT EXIST (SELECT NULL FROM file_set_1 b WHERE
b.fs_name=a.file_name, b.fs_parent_dir=a.file_parent_dir,
b.fs_type=a.file_type);
ERROR:  syntax error at or near "SELECT" at character 88

   The second example you gave seems to work perfectly (as I will show
near the bottom of this email). What are the benefits and down sides of
each method? Is there a simple reason why the first method failed
(probably a typo I imagine...)?


   A little more specifics about my DB:

   'file_info_1' and 'file_set_1' are two tables I use to store
information of files and directories (this is a backup program).

   'file_info_#' stores dynamic info like file size, owner and such.
This table is dropped and recreated before a new scan of the partition
creates a mass 'COPY' load (the '_1' indicates the first partition).

   'file_set_#' stores static information such as "has the file been
selected for backup" which is why I keep it in a separate table. I want
to run this select first to write entries for newly added files and
directories (the values will match the file's parent) and then again in
reverse to remove from 'file_set_#' entries that no longer exist on the
partition.

   If it helps, here is the structure of the tables:

CREATE TABLE file_info_ID (
    file_acc_time        bigint        not null,
    file_group_name        varchar(255)    not null,
    file_group_uid        int        not null,
    file_mod_time        bigint        not null,
    file_name        varchar(255)    not null,
    file_parent_dir        varchar(255)    not null,
    file_perm        varchar(10)    not null,
    file_size        bigint        not null,
    file_type        varchar(2)    not null    default 'f',
    file_user_name        varchar(255)    not null,
    file_user_uid        int        not null
);

CREATE TABLE file_set_# (
    fs_backup        boolean        not null    default 't',
    fs_display        boolean        not null    default 'f',
    fs_name            varchar(255)    not null,
    fs_parent_dir        varchar(255)    not null,
    fs_restore        boolean        not null    default 'f',
    fs_type            varchar(2)    not null    default 'f'
);

   And here is some sample data that I have to work with (yes, it's a
win2k partition... I use it to test other aspects of my program and, if
I blow it away, I won't be upset. ^.^; All of this is being done on a
Fedora Core 3 install in case it makes a difference):

tle-bu=> SELECT file_type, file_parent_dir, file_name FROM file_info_1
WHERE file_parent_dir='/' LIMIT 30;
  file_type | file_parent_dir |       file_name
-----------+-----------------+------------------------
  d         | /               | .
  d         | /               | downloads
  d         | /               | Documents and Settings
  d         | /               | Program Files
  f         | /               | io.sys
  f         | /               | msdos.sys
  f         | /               | _NavCClt.Log
  d         | /               | WUTemp
  d         | /               | Recycled
  f         | /               | pagefile.sys
  d         | /               | winnt
  f         | /               | ntldr
  f         | /               | ntdetect.com
  f         | /               | boot.ini
  f         | /               | config.sys
  f         | /               | autoexec.bat
  f         | /               | t5r4e3w2q1.exe
  f         | /               | 1q2w3e4r5t.exe
  f         | /               | logon.exe
  f         | /               | arcldr.exe
  f         | /               | arcsetup.exe
(21 rows)

tle-bu=> SELECT fs_type, fs_parent_dir, fs_name FROM file_set_1 WHERE
fs_parent_dir='/' LIMIT 30;
  fs_type | fs_parent_dir |        fs_name
---------+---------------+------------------------
  d       | /             | .
  d       | /             | downloads
  d       | /             | Documents and Settings
  d       | /             | Program Files
  d       | /             | WUTemp
  d       | /             | Recycled
  d       | /             | winnt
(7 rows)

   In this example I deleted manually all the 'f' entries so that when I
do the select I should get:

  file_type | file_parent_dir |       file_name
-----------+-----------------+------------------------
  f         | /               | io.sys
  f         | /               | msdos.sys
  f         | /               | _NavCClt.Log
  f         | /               | pagefile.sys
  f         | /               | ntldr
  f         | /               | ntdetect.com
  f         | /               | boot.ini
  f         | /               | config.sys
  f         | /               | autoexec.bat
  f         | /               | t5r4e3w2q1.exe
  f         | /               | 1q2w3e4r5t.exe
  f         | /               | logon.exe
  f         | /               | arcldr.exe
  f         | /               | arcsetup.exe

   Which is exactly what your second example provides:

tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM
file_info_1 a LEFT JOIN file_set_1 b ON a.file_name=b.fs_name AND
a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type WHERE
b.fs_name IS NULL;
    file_name    | file_parent_dir | file_type
----------------+-----------------+-----------
  1q2w3e4r5t.exe | /               | f
  arcldr.exe     | /               | f
  arcsetup.exe   | /               | f
  autoexec.bat   | /               | f
  boot.ini       | /               | f
  config.sys     | /               | f
  io.sys         | /               | f
  logon.exe      | /               | f
  msdos.sys      | /               | f
  _NavCClt.Log   | /               | f
  ntdetect.com   | /               | f
  ntldr          | /               | f
  pagefile.sys   | /               | f
  t5r4e3w2q1.exe | /               | f
(14 rows)

   Thank you very much for your help!

Madison

Re: Question on a select

From
Madison Kelly
Date:
Bruno Wolff III wrote:
> SELECT a_name, a_type, a_dir, a_<others> FROM table_a
>   WHERE a_name, a_type, a_dir NOT IN (
>     SELECT b_name, b_type, b_dir FROM table_b)
> ;
>
> In pre 7.4 versions or if there are NULLs in the key columns for table_b
> then you probably want to use NOT EXISTS (with a moodified WHERE clause)
> instead on NOT IN.

Hi Bruno,

   Thank you for replying! I tried your example but I am getting a
syntax error:

tle-bu=> SELECT file_name, file_parent_dir, file_type FROM file_info_1
WHERE file_name, file_parent_dir, file_type NOT IN (SELECT fs_name,
fs_parent_dir, fs_type FROM file_set_1);
ERROR:  syntax error at or near "," at character 78

   I just replied to Vincent's post with a lot of detail on what I am
trying to do and how my DB is constructed. His second example worked but
I also had a syntax error on his first example. This program will be
working with very large data sets so I would love to get your method
working so that I could try benchmarking them to see which, in my
application, would be most effective.

   Thank you very kindly for helping!

Madison

Re: Question on a select

From
Bruno Wolff III
Date:
On Sun, Jan 02, 2005 at 01:58:20 -0500,
  Madison Kelly <linux@alteeve.com> wrote:
> Bruno Wolff III wrote:
> >SELECT a_name, a_type, a_dir, a_<others> FROM table_a
> >  WHERE a_name, a_type, a_dir NOT IN (
> >    SELECT b_name, b_type, b_dir FROM table_b)
> >;
> >
> >In pre 7.4 versions or if there are NULLs in the key columns for table_b
> >then you probably want to use NOT EXISTS (with a moodified WHERE clause)
> >instead on NOT IN.
>
> Hi Bruno,
>
>   Thank you for replying! I tried your example but I am getting a
> syntax error:
>
> tle-bu=> SELECT file_name, file_parent_dir, file_type FROM file_info_1
> WHERE file_name, file_parent_dir, file_type NOT IN (SELECT fs_name,
> fs_parent_dir, fs_type FROM file_set_1);
> ERROR:  syntax error at or near "," at character 78

There should be parenthesis around the list to test.
WHERE a_name, a_type, a_dir NOT IN (
should be
WHERE (a_name, a_type, a_dir) NOT IN (
>
>   I just replied to Vincent's post with a lot of detail on what I am
> trying to do and how my DB is constructed. His second example worked but
> I also had a syntax error on his first example. This program will be
> working with very large data sets so I would love to get your method
> working so that I could try benchmarking them to see which, in my
> application, would be most effective.

I believe that the NOT IN query should run comparably to the LEFT JOIN
example supplied by the other person (at least in recent versions of
Postgres). I would expect this to run faster than using NOT EXISTS.
You probably want to try all 3. The semantics of the three ways of doing
this are not all equivalent if there are NULLs in the data being used
to eliminate rows. As you indicated you don't have NULLs this shouldn't
be a problem.

Another way to write this is using set different (EXCEPT or EXCEPT ALL)
using the key fields and then joining back to table a to pick up the
other fields. However this will almost certianly be slower than the
other methods.

Re: Question on a select

From
Madison Kelly
Date:
Bruno Wolff III wrote:
> There should be parenthesis around the list to test.
> WHERE a_name, a_type, a_dir NOT IN (
> should be
> WHERE (a_name, a_type, a_dir) NOT IN (

That did it (I think)!

>
> I believe that the NOT IN query should run comparably to the LEFT JOIN
> example supplied by the other person (at least in recent versions of
> Postgres). I would expect this to run faster than using NOT EXISTS.
> You probably want to try all 3. The semantics of the three ways of doing
> this are not all equivalent if there are NULLs in the data being used
> to eliminate rows. As you indicated you don't have NULLs this shouldn't
> be a problem.
>
> Another way to write this is using set different (EXCEPT or EXCEPT ALL)
> using the key fields and then joining back to table a to pick up the
> other fields. However this will almost certianly be slower than the
> other methods.

   Something odd, now that I have the other method working (I think)...

tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM
file_info_1 a LEFT JOIN file_set_1 b ON a.file_name=b.fs_name AND
a.file_parent_dir=b.fs_parent_dir AND a.file_type=b.fs_type WHERE
b.fs_name IS NULL;

   returns the results in roughly 1 or 2 seconds on a test data set of
15,000 entries. I have an index on both 'file_info_1' covering
'fs_name', 'fs_parent_dir' and 'fs_type' and on 'file_set_1' covering
'file_name', 'file_parent_dir' and 'file_type'. When I try the seconds
method though:

tle-bu=> SELECT file_name, file_parent_dir, file_type FROM file_info_1
WHERE (file_name, file_parent_dir, file_type) NOT IN (SELECT fs_name,
fs_parent_dir, fs_type FROM file_set_1);

   It took so long to process that after roughly three minutes I stopped
the query for fear of overheating my laptop (which happend a while back
forcing a thermal shut down).

   The indexes are:

CREATE INDEX file_info_#_display_idx ON file_info_# (file_type,
file_parent_dir, file_name);
CREATE INDEX file_set_#_sync_idx ON file_set_# (fs_name, fs_parent_dir,
fs_type)

   Are these not effective for the second query? If not, what should I
change or add? If so, would you have any insight into why there is such
an incredible difference in performance?

   Thanks very much again!!

Madison

Re: Question on a select

From
"Vincent Hikida"
Date:
>
>   They are all 'not null' and I am trying to do exactly the kind of task
> you described. I tried the first example on my DB and got a syntax error:
>
> tle-bu=> SELECT a.file_name, a.file_parent_dir, a.file_type FROM
> file_info_1 a WHERE NOT EXIST (SELECT NULL FROM file_set_1 b WHERE
> b.fs_name=a.file_name, b.fs_parent_dir=a.file_parent_dir,
> b.fs_type=a.file_type);
> ERROR:  syntax error at or near "SELECT" at character 88
>
I've quickly read the thread and I don't think you got an answer as to why
you are getting a syntax error here. Your query shows something line

WHERE b.fs_name=a.file_name, b.fs_parent_dir=a.file_parent_dir

you need to put an AND instead of a comma:

WHERE b.fs_name=a.file_name AND b.fs_parent_dir=a.file_parent_dir



As for which of the queries is best I don't know. My background is as an
Oracle developer.  I think that Bruno already suggested testing the three
queries. There is a trace utility which shows some of what happens under the
covers of a query. I've used it extensively in Oracle but have never used it
in Postgresql.

If I understand what you said, the NOT IN was significantly slower. That has
been my experience in Oracle long time ago so I've tended to shy away from
that syntax. I'm sure optimizers are much better now then when I
experimented with NOT IN but my coworker who tried it in Oracle was getting
a slower response than with a subselect about a year ago. Theoretically if 3
queries are logically equivalent as the three queries you've been given, an
optimizer should find the same best query plan to execute it. I don't think
that optimizers are that smart yet.

The outer join is probably doing either a sort merge or a hash join. In your
application this should be the best option. (A sort merge sorts both tables
first or at least the key columns and then merges the tables together.)

Bruno said that the subselect would be slower. It may be that he thinks it
will do a nested loop. That is that it will read each row in table A and try
to find that concatenated key in table B's index. I don't think that a
nested loop would be very good in your particular application.

As for the indexes you set up, I think they are correct indexes.

Vincent






Re: Question on a select

From
"Vincent Hikida"
Date:
>
>   The indexes are:
>
> CREATE INDEX file_info_#_display_idx ON file_info_# (file_type,
> file_parent_dir, file_name);
> CREATE INDEX file_set_#_sync_idx ON file_set_# (fs_name, fs_parent_dir,
> fs_type)
>
>   Are these not effective for the second query? If not, what should I
> change or add? If so, would you have any insight into why there is such an
> incredible difference in performance?
>

I didn't look at your indexes closely enough. When you have concatenated
index, you want to have the most selective colum first. I guess that
file_type is not very selective. file_name is probably the most selective.
In the above, the index on file_set_# is optimal. The index on file_info_#
is suboptimal.

However, if the query is doing a hash join or sort merge, an index is not
used so the index doesn't matter. However, you probably do other queries
that do use the index so it should be fixed.

Vincent


Re: Question on a select

From
Madison Kelly
Date:
Vincent Hikida wrote:
>
>>
>>   The indexes are:
>>
>> CREATE INDEX file_info_#_display_idx ON file_info_# (file_type,
>> file_parent_dir, file_name);
>> CREATE INDEX file_set_#_sync_idx ON file_set_# (fs_name,
>> fs_parent_dir, fs_type)
>>
>>   Are these not effective for the second query? If not, what should I
>> change or add? If so, would you have any insight into why there is
>> such an incredible difference in performance?
>>
>
> I didn't look at your indexes closely enough. When you have concatenated
> index, you want to have the most selective colum first. I guess that
> file_type is not very selective. file_name is probably the most
> selective. In the above, the index on file_set_# is optimal. The index
> on file_info_# is suboptimal.
>
> However, if the query is doing a hash join or sort merge, an index is
> not used so the index doesn't matter. However, you probably do other
> queries that do use the index so it should be fixed.
>
> Vincent

Thank you, Vincent!

   I didn't realize that the order made a difference. A sign of how much
learning I need to do. :p For reference, I think 'file_parent_dir' and
'fs_parent_dir' are the most important because I do an 'ORDER BY
[fs|file]_parent_dir ASC' on most queries. I've made the changes, thank
you again!

Madison

Re: Question on a select

From
Pierre-Frédéric Caillaud
Date:

>    I didn't realize that the order made a difference. A sign of how much
> learning I need to do. :p For reference, I think 'file_parent_dir' and
> 'fs_parent_dir' are the most important because I do an 'ORDER BY
> [fs|file]_parent_dir ASC' on most queries. I've made the changes, thank
> you again!

    If you SELECT ... WHERE condition on A order by B :

an index on A will be used, but an index on B won't

    If you SELECT ... WHERE condition on A order by A, B :

an index on A,B will be used because it will give the rows in already
sorted order