Thread: UPDATE on NOT JOIN

UPDATE on NOT JOIN

From
Gabriel Biberian
Date:
Hello,

I'm working on a system that detects changes in someone's filesystem.
We first scan the entire filesystem which is dumped into the table
'filesystem' containing the full_path of every file and it's
corresponding md5 hash.
We then do subsequent scans of the filesystem which are dumped into a
temporary table and we would like to find out which files are not
present anymore.  To do so, we update the field 'dead' in the
'filesystem' table for every row that does not exist in the 'temporary'
table.

\d filesystem;
               Table « public.filesystem »
   Colonne   |          Type          |    Modificateurs
------------+------------------------+----------------------
  hash       | character(32)          |
  full_name  | text                   |
  dead       | integer                | default 0
Index :
     « filesystem_unique » UNIQUE, btree (hash)

\d temporary;
               Table « public.filesystem »
   Colonne   |          Type          |    Modificateurs
------------+------------------------+----------------------
  hash       | character(32)          |
  full_name  | text                   |
Index :
     « temporary_unique » UNIQUE, btree (hash)

Currently, i use the following query to update the filesystem table with
the missing files :
UPDATE filesystem SET dead=some_value WHERE dead=0 AND (SELECT 1 FROM
temporary AS t WHERE t.hash=filesystem.hash LIMIT 1) IS NULL

This works correctly for regular filesystems.  But when the 'filesystem'
table contains more than a few million rows, the update query can take days.

Here's an explain of the query :
=# UPDATE filesystem SET dead=some_value WHERE dead=0 AND (SELECT 1 FROM
temporary AS t WHERE t.hash=filesystem.hash LIMIT 1) IS NULL
                                                       QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
  Seq Scan on filesystem  (cost=0.00..97536479.02 rows=25747 width=241)
    Filter: ((dead = 0) AND ((subplan) IS NULL))
    SubPlan
      ->  Limit  (cost=0.00..9.53 rows=1 width=0)
            ->  Index Scan using temporary_hash on temporary t
(cost=0.00..9.53 rows=1 width=0)
                  Index Cond: (hash = $0)
(6 lignes)

Is there a better way to update a table if it doesn't join another table ?

Best Regards,

Gabriel Biberian

Re: UPDATE on NOT JOIN

From
Marti Raudsepp
Date:
On Wed, Feb 15, 2012 at 20:33, Gabriel Biberian
<admin@beemotechnologie.com> wrote:
> Currently, i use the following query to update the filesystem table with the
> missing files :
> UPDATE filesystem SET dead=some_value WHERE dead=0 AND (SELECT 1 FROM
> temporary AS t WHERE t.hash=filesystem.hash LIMIT 1) IS NULL

I don't know if this solves your problem entirely, but an obvious
improvement would be using the NOT EXISTS (SELECT ...) construct:

UPDATE filesystem SET dead=some_value WHERE dead=0 AND NOT EXISTS
(SELECT 1 FROM temporary AS t WHERE t.hash=filesystem.hash);

PostgreSQL 8.4+ can optimize this into an "anti join" query (you
didn't mention what version you are using).

Also, if your hardware isn't very limited, you should increase the
work_mem setting from the default (1MB).

If the above doesn't help significantly, please post the full EXPLAIN
ANALYZE output.

Regards,
Marti