Thread: Scrub one large table against another

Scrub one large table against another

From
Brendan Curran
Date:
I currently have a db supporting what is for the most part an OLAP data
warehousing application.

One table (good data) has roughly 120 million rows, divided into roughly
40 different relational groups (logically by foreign key). Every time I
add data to this table, I need to afterwards scrub that group against
known "bad data" which is housed in a second table that has roughly 21
million rows.

The 120 million row good data table is called "email_record"
The 21 million row bad data table is called "suppress"

There are separate btree indexes on 'email_record_id', 'email_list_id'
and 'email' on both tables.

Each time I scrub data I pull out anywhere from 1 to 5 million rows from
the good table (depending on the size of the group I'm scrubbing) and
compare them against the 21 million rows in the 'suppress' table.

So far I've done this using a temporary staging table that stores only
the email_record_id for each row from the relevant group of the good
table. I use a PL/pgSQL function that does roughly the following (I've
included only the SQL syntax and inserted the constant '9' where I would
normally use a variable):

The relevant columns: email_record_id int8, email varchar(255), email_list_id int8
-------------------------------------------------------------

CREATE TEMP TABLE temp_list_suppress(email_record_id int8);

INSERT INTO temp_list_suppress
    SELECT email_record_id from ONLY email_record er
    WHERE email_list_id = 9 AND email IN
    (select email from suppress);

CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id );

INSERT INTO er_banned
SELECT * from ONLY email_record er WHERE EXISTS
(SELECT 1 from temp_list_suppress ts where er.email_record_id =
ts.email_record_id);

DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN
    (SELECT email_record_id from temp_list_suppress);

TRUNCATE TABLE temp_list_suppress;
DROP TABLE temp_list_suppress;
--------------------------------------------------------------

The performance is dreadful; is there a more efficient way to do this?
Would I be better off just grabbing * initially from the good table
instead of just the id to avoid more sequential searches later? Here are
my configs:

Debian
Postgres 8.1.4
dual Xeon
RAM: 4 GB
RAID 5

# - Memory -
shared_buffers = 3000
work_mem = 92768
maintenance_work_mem = 128384

autovacuum is turned off, and the db is analyzed and vacuumed regularly.


Regards,
Brendan

Re: Scrub one large table against another

From
Tom Lane
Date:
Brendan Curran <brendan.curran@gmail.com> writes:
> CREATE TEMP TABLE temp_list_suppress(email_record_id int8);

> INSERT INTO temp_list_suppress
>     SELECT email_record_id from ONLY email_record er
>     WHERE email_list_id = 9 AND email IN
>     (select email from suppress);

> CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id );

> INSERT INTO er_banned
> SELECT * from ONLY email_record er WHERE EXISTS
> (SELECT 1 from temp_list_suppress ts where er.email_record_id =
> ts.email_record_id)';

> DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN
>     (SELECT email_record_id from temp_list_suppress);

> TRUNCATE TABLE temp_list_suppress;
> DROP TABLE temp_list_suppress;

> The performance is dreadful, is there a more efficient way to do this?

Have you tried doing EXPLAIN ANALYZE of each of the INSERT/DELETE steps?
If you don't even know which part is slow, it's hard to improve.

It would probably help to do an "ANALYZE temp_list_suppress" right after
populating the temp table.  As you have it, the second insert and delete
are being planned with nothing more than a row count (obtained during
CREATE INDEX) and no stats about distribution of the table contents.

Also, I'd be inclined to try replacing the EXISTS with an IN test;
in recent PG versions the planner is generally smarter about IN.
(Is there a reason why you are doing the INSERT one way and the
DELETE the other?)
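
For example, a minimal sketch of those two changes applied to the original
script (same table and column names as above; untested):

CREATE TEMP TABLE temp_list_suppress(email_record_id int8);

INSERT INTO temp_list_suppress
    SELECT email_record_id FROM ONLY email_record er
    WHERE email_list_id = 9 AND email IN
    (SELECT email FROM suppress);

CREATE INDEX unique_id_index ON temp_list_suppress (email_record_id);

-- give the planner real statistics before the next two statements are planned
ANALYZE temp_list_suppress;

-- use IN here as well, mirroring the DELETE below
INSERT INTO er_banned
    SELECT * FROM ONLY email_record er
    WHERE er.email_record_id IN
    (SELECT email_record_id FROM temp_list_suppress);

DELETE FROM ONLY email_record
    WHERE email_list_id = 9 AND email_record_id IN
    (SELECT email_record_id FROM temp_list_suppress);

DROP TABLE temp_list_suppress;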

BTW, that TRUNCATE right before the DROP seems quite useless,
although it's not the main source of your problem.

            regards, tom lane

Re: Scrub one large table against another

From
Brendan Curran
Date:

Tom Lane wrote:
> Brendan Curran <brendan.curran@gmail.com> writes:
>> CREATE TEMP TABLE temp_list_suppress(email_record_id int8);
>
>> INSERT INTO temp_list_suppress
>>     SELECT email_record_id from ONLY email_record er
>>     WHERE email_list_id = 9 AND email IN
>>     (select email from suppress);
>
>> CREATE INDEX unique_id_index on temp_list_suppress ( email_record_id );
>
>> INSERT INTO er_banned
>> SELECT * from ONLY email_record er WHERE EXISTS
>> (SELECT 1 from temp_list_suppress ts where er.email_record_id =
>> ts.email_record_id)';
>
>> DELETE FROM ONLY email_record WHERE email_list_id = 9 AND email_record_id IN
>>     (SELECT email_record_id from temp_list_suppress);
>
>> TRUNCATE TABLE temp_list_suppress;
>> DROP TABLE temp_list_suppress;
>
>> The performance is dreadful, is there a more efficient way to do this?
>
> Have you tried doing EXPLAIN ANALYZE of each of the INSERT/DELETE steps?
> If you don't even know which part is slow, it's hard to improve.

FIRST INSERT (Just the select is explained):
Hash Join  (cost=8359220.68..9129843.00 rows=800912 width=32)
   Hash Cond: (("outer".email)::text = ("inner".email)::text)
   ->  Unique  (cost=4414093.19..4522324.49 rows=21646260 width=25)
         ->  Sort  (cost=4414093.19..4468208.84 rows=21646260 width=25)
               Sort Key: suppress.email
               ->  Seq Scan on suppress  (cost=0.00..393024.60 rows=21646260 width=25)
   ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32)
         ->  Bitmap Heap Scan on email_record er  (cost=38464.83..3899868.47 rows=4606808 width=32)
               Recheck Cond: (email_list_id = 13)
               ->  Bitmap Index Scan on list  (cost=0.00..38464.83 rows=4606808 width=0)
                     Index Cond: (email_list_id = 13)

SECOND INSERT (Using EXISTS):
Seq Scan on email_record er  (cost=0.00..381554175.29 rows=62254164 width=1863)
   Filter: (subplan)
   SubPlan
     ->  Index Scan using er_primeq_pk on er_primeq eq  (cost=0.00..3.03 rows=1 width=0)
           Index Cond: ($0 = email_record_id)

SECOND INSERT (Using IN):
Nested Loop  (cost=26545.94..2627497.28 rows=27134 width=1863)
   ->  HashAggregate  (cost=26545.94..33879.49 rows=733355 width=8)
         ->  Seq Scan on er_primeq  (cost=0.00..24712.55 rows=733355 width=8)
   ->  Index Scan using email_record_pkey on email_record er  (cost=0.00..3.52 rows=1 width=1863)
         Index Cond: (er.email_record_id = "outer".email_record_id)
         Filter: (email_list_id = 13)

DELETE:
Nested Loop  (cost=26545.94..2627497.28 rows=50846 width=6)
   ->  HashAggregate  (cost=26545.94..33879.49 rows=733355 width=8)
         ->  Seq Scan on er_primeq  (cost=0.00..24712.55 rows=733355 width=8)
   ->  Index Scan using email_record_pkey on email_record  (cost=0.00..3.52 rows=1 width=14)
         Index Cond: (email_record.email_record_id = "outer".email_record_id)
         Filter: (email_list_id = 9)


To get this explain data I used a sample "temp_suppress" table that
contained about 700k rows and was indexed but not analyzed...


>
> It would probably help to do an "ANALYZE temp_list_suppress" right after
> populating the temp table.  As you have it, the second insert and delete
> are being planned with nothing more than a row count (obtained during
> CREATE INDEX) and no stats about distribution of the table contents.
>
> Also, I'd be inclined to try replacing the EXISTS with an IN test;
> in recent PG versions the planner is generally smarter about IN.
> (Is there a reason why you are doing the INSERT one way and the
> DELETE the other?)
>
> BTW, that TRUNCATE right before the DROP seems quite useless,
> although it's not the main source of your problem.
>
>             regards, tom lane
>

Re: Scrub one large table against another

From
Tom Lane
Date:
Brendan Curran <brendan.curran@gmail.com> writes:
> Tom Lane wrote:
>> Have you tried doing EXPLAIN ANALYZE of each of the INSERT/DELETE steps?

> FIRST INSERT (Just the select is explained):

EXPLAIN ANALYZE, please, not just EXPLAIN.

            regards, tom lane

Re: Scrub one large table against another

From
Brendan Curran
Date:

Tom Lane wrote:
> Brendan Curran <brendan.curran@gmail.com> writes:
>> Tom Lane wrote:
>>> Have you tried doing EXPLAIN ANALYZE of each of the INSERT/DELETE steps?
>
>> FIRST INSERT (Just the select is explained):
>
> EXPLAIN ANALYZE, please, not just EXPLAIN.
>
>             regards, tom lane
>

Sorry, here is the EXPLAIN ANALYZE output of that first SELECT

EXPLAIN ANALYZE SELECT email_record_id from ONLY email_record er
    WHERE email_list_id = 13 AND email IN
    (select email from suppress);

Hash Join  (cost=8359220.68..9129843.00 rows=800912 width=8) (actual time=2121601.603..2121601.603 rows=0 loops=1)
   Hash Cond: (("outer".email)::text = ("inner".email)::text)
   ->  Unique  (cost=4414093.19..4522324.49 rows=21646260 width=25) (actual time=1165955.907..1434439.731 rows=21646261 loops=1)
         ->  Sort  (cost=4414093.19..4468208.84 rows=21646260 width=25) (actual time=1165955.903..1384667.715 rows=21646261 loops=1)
               Sort Key: suppress.email
               ->  Seq Scan on suppress  (cost=0.00..393024.60 rows=21646260 width=25) (actual time=37.784..609848.551 rows=21646261 loops=1)
   ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual time=554522.983..554522.983 rows=3245336 loops=1)
         ->  Bitmap Heap Scan on email_record er  (cost=38464.83..3899868.47 rows=4606808 width=32) (actual time=275640.435..541342.727 rows=3245336 loops=1)
               Recheck Cond: (email_list_id = 13)
               ->  Bitmap Index Scan on list  (cost=0.00..38464.83 rows=4606808 width=0) (actual time=275102.037..275102.037 rows=5172979 loops=1)
                     Index Cond: (email_list_id = 13)
Total runtime: 2122693.864 ms


So much time is being spent in the Unique and Sort leaves... I would
think that it wouldn't need to do the unique portion, since there is no
DISTINCT clause...

Re: Scrub one large table against another

From
"Jim C. Nasby"
Date:
On Tue, Oct 10, 2006 at 05:46:18PM -0600, Brendan Curran wrote:
>
>
> Tom Lane wrote:
> >Brendan Curran <brendan.curran@gmail.com> writes:
> >>Tom Lane wrote:
> >>>Have you tried doing EXPLAIN ANALYZE of each of the INSERT/DELETE steps?
> >
> >>FIRST INSERT (Just the select is explained):
> >
> >EXPLAIN ANALYZE, please, not just EXPLAIN.
> >
> >            regards, tom lane
> >
>
> Sorry, here is the EXPLAIN ANALYZE output of that first SELECT
>
> EXPLAIN ANALYZE SELECT email_record_id from ONLY email_record er
>     WHERE email_list_id = 13 AND email IN
>     (select email from suppress);
>
> Hash Join  (cost=8359220.68..9129843.00 rows=800912 width=8) (actual
> time=2121601.603..2121601.603 rows=0 loops=1)
>   Hash Cond: (("outer".email)::text = ("inner".email)::text)
>   ->  Unique  (cost=4414093.19..4522324.49 rows=21646260 width=25)
> (actual time=1165955.907..1434439.731 rows=21646261 loops=1)
>         ->  Sort  (cost=4414093.19..4468208.84 rows=21646260 width=25)
> (actual time=1165955.903..1384667.715 rows=21646261 loops=1)
>               Sort Key: suppress.email
>               ->  Seq Scan on suppress  (cost=0.00..393024.60
> rows=21646260 width=25) (actual time=37.784..609848.551 rows=21646261
> loops=1)
>   ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual
> time=554522.983..554522.983 rows=3245336 loops=1)
>         ->  Bitmap Heap Scan on email_record er
> (cost=38464.83..3899868.47 rows=4606808 width=32) (actual
> time=275640.435..541342.727 rows=3245336 loops=1)
>               Recheck Cond: (email_list_id = 13)
>               ->  Bitmap Index Scan on list  (cost=0.00..38464.83
> rows=4606808 width=0) (actual time=275102.037..275102.037 rows=5172979
> loops=1)
>                     Index Cond: (email_list_id = 13)
> Total runtime: 2122693.864 ms
>
>
> So much time is being spent in the Unique and Sort leaves... I would
> think that it wouldn't need to do the unique portion, since there is no
> DISTINCT clause...

I think that's coming about because of the IN. Try a simple join
instead...

SELECT email_record_id FROM ONLY email_record er JOIN suppress s USING
(email) WHERE er.email_list_id = 13;
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Scrub one large table against another

From
Tom Lane
Date:
Brendan Curran <brendan.curran@gmail.com> writes:
> So much time is being spent in the Unique and Sort leaves... I would
> think that it wouldn't need to do the unique portion, since there is no
> DISTINCT clause...

There's nothing in that query suggesting that suppress.email is unique.
If you know that it is, try using a plain join instead of an IN.

            regards, tom lane

Re: Scrub one large table against another

From
Brendan Curran
Date:
Tom Lane wrote:
> Brendan Curran <brendan.curran@gmail.com> writes:
>> So much time is being spent in the Unique and Sort leaves... I would
>> think that it wouldn't need to do the unique portion, since there is no
>> DISTINCT clause...
>
> There's nothing in that query suggesting that suppress.email is unique.
> If you know that it is, try using a plain join instead of an IN.
>
>             regards, tom lane
>


Interestingly, and thank you to Tom and Jim, the explicit JOIN improved
performance tremendously (RESULTS BELOW). I converted the entire query
to use explicit joins instead of IN and EXISTS and discovered acceptable
performance. I think the next place to go from here is RAID1/RAID10 and
possibly partitioning my large table (Welcome to DDL insanity, right?).
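
For reference, a rough sketch of what 8.1-style partitioning by
email_list_id might look like (inheritance plus constraint exclusion; the
child table name is made up and this is untested):

-- one child table per list, carrying a CHECK constraint the planner can use
CREATE TABLE email_record_list_9 (
    CHECK (email_list_id = 9)
) INHERITS (email_record);

CREATE INDEX email_record_list_9_email ON email_record_list_9 (email);

-- with constraint_exclusion enabled, a query filtered on email_list_id
-- only has to visit the matching child table
SET constraint_exclusion = on;
SELECT count(*) FROM email_record WHERE email_list_id = 9;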

I have to add that I'm a little surprised the documentation is so
generous to IN and EXISTS. Is there something amiss in my configuration
that prevents them from performing correctly? If not, I can't imagine a
time when IN or EXISTS would be more performant than an explicit JOIN...

Additionally, I manually scrub for duplicates at the group level in the
email_record table to keep my records unique. I would like to use a
unique constraint, but have found that batching in JDBC is impossible
due to irrecoverable errors even when using BEFORE INSERT triggers to
just return NULL if a record exists already. Has anyone got an elegant
solution for the 'add only if not exists already' problem similar to
MSSQL's MERGE command?

Just one more thing... I have found that maintaining a btree index on a
varchar(255) value is extremely expensive on insert/update/delete. It is
unfortunately necessary for me to maintain this index for queries and
reports, so I am transitioning to using an unindexed staging table to
import data into before merging it with the larger table. All the docs
and posts recommend dropping the index, importing the data, and then
recreating the index, which is untenable on a daily / bi-weekly basis.
Is there a more elegant solution to this indexing problem?
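
Roughly what I mean by that staging approach, as a sketch (the staging
table name and file path are made up, and I haven't tested this exact form):

-- unindexed staging table with the same shape as the big table
CREATE TABLE email_record_staging (LIKE email_record);

-- bulk-load the day's batch into the staging table
COPY email_record_staging FROM '/path/to/batch.csv' WITH CSV;

-- merge into the indexed table in one set-based statement
-- instead of row-at-a-time JDBC inserts
INSERT INTO email_record
    SELECT * FROM email_record_staging;

TRUNCATE email_record_staging;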

Thank you for all of your help!

EXPLAIN ANALYZE result comparison...

1. EXPLAIN ANALYZE SELECT email_record_id from ONLY email_record er
     WHERE email_list_id = 13 AND email IN
     (select email from suppress);

Hash Join  (cost=8359220.68..9129843.00 rows=800912 width=8) (actual time=2121601.603..2121601.603 rows=0 loops=1)
   Hash Cond: (("outer".email)::text = ("inner".email)::text)
   ->  Unique  (cost=4414093.19..4522324.49 rows=21646260 width=25) (actual time=1165955.907..1434439.731 rows=21646261 loops=1)
         ->  Sort  (cost=4414093.19..4468208.84 rows=21646260 width=25) (actual time=1165955.903..1384667.715 rows=21646261 loops=1)
               Sort Key: suppress.email
               ->  Seq Scan on suppress  (cost=0.00..393024.60 rows=21646260 width=25) (actual time=37.784..609848.551 rows=21646261 loops=1)
   ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual time=554522.983..554522.983 rows=3245336 loops=1)
         ->  Bitmap Heap Scan on email_record er  (cost=38464.83..3899868.47 rows=4606808 width=32) (actual time=275640.435..541342.727 rows=3245336 loops=1)
               Recheck Cond: (email_list_id = 13)
               ->  Bitmap Index Scan on list  (cost=0.00..38464.83 rows=4606808 width=0) (actual time=275102.037..275102.037 rows=5172979 loops=1)
                     Index Cond: (email_list_id = 13)
Total runtime: 2,122,693.864 ms
--------------------------------------------------------

2. EXPLAIN ANALYZE SELECT email_record_id FROM ONLY email_record er
     JOIN suppress s USING (email) WHERE er.email_list_id = 13;

Hash Join  (cost=3945127.49..5000543.11 rows=800912 width=8) (actual time=808874.088..808874.088 rows=0 loops=1)
   Hash Cond: (("outer".email)::text = ("inner".email)::text)
   ->  Seq Scan on suppress s  (cost=0.00..393024.60 rows=21646260 width=25) (actual time=661.518..216933.399 rows=21646261 loops=1)
   ->  Hash  (cost=3899868.47..3899868.47 rows=4606808 width=32) (actual time=494294.932..494294.932 rows=3245336 loops=1)
         ->  Bitmap Heap Scan on email_record er  (cost=38464.83..3899868.47 rows=4606808 width=32) (actual time=242198.226..485942.542 rows=3245336 loops=1)
               Recheck Cond: (email_list_id = 13)
               ->  Bitmap Index Scan on list  (cost=0.00..38464.83 rows=4606808 width=0) (actual time=241769.786..241769.786 rows=5172979 loops=1)
                     Index Cond: (email_list_id = 13)
Total runtime: 808,884.387 ms



Re: Scrub one large table against another (vmstat output)

From
Brendan Curran
Date:
>
> What prevents you from using an aggregate function?
>

I guess I could actually obtain the results in an aggregate function and use those to maintain a
summary table. There is a web view that requires 'as accurate as possible' numbers to be queried per
group (all 40 groups are displayed on the same page) and so constant aggregates over the entire
table would be a nightmare.

> Probably not 2x, but better performance than now. You probably don't
> want RAID 1, depending on your setup, many list member swear by RAID 10.
> Of course, your setup will depend on how much money you have to burn.
> That said, RAID 1 testing will allow you to determine the upper bounds
> of your hardware. Some folks say they get better performance with WAL
> off the main RAID, some keep it on. Only testing will allow you to
> determine what is optimal.

I will have to try moving WAL off those raid spindles, I have seen the posts regarding this.

> In the meantime, you need to identify the
> bottleneck of your operation. You should collect vmstat and iostat
> statistics for your present setup. Good luck!
>

I have to confess that I am a bit of a novice with vmstat. Below is a sample of my vmstat output
while running two scrubbing queries simultaneously:

machine:/dir# vmstat -S M 2
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
  r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
  0  1      4    117     15   2962    0    0   100    25   96  107  2  0 86 11
  0  3      4    117     15   2962    0    0  4884  1860  415  841 18  1 52 29
  1  1      4    115     15   2964    0    0  2246  1222  462  394  8  0 51 41
  0  2      4    114     14   2967    0    0  3932  2238  485  613 12  0 62 25
  1  1      4    115     13   2966    0    0  3004  1684  507  609  8  0 60 31
  0  3      4    116     13   2965    0    0  4688  4000  531  613 15  1 52 33
  1  1      4    117     13   2964    0    0  2890   268  433  441  9  1 58 32
  0  1      4    114     13   2968    0    0  2802  4708  650  501  8  1 64 28
  0  2      4    114     13   2968    0    0  4850  1696  490  574 15  1 57 27
  0  2      4    116     13   2966    0    0  4300  3062  540  520 13  1 61 26
  0  2      4    115     13   2966    0    0  3292  3608  549  455 10  1 65 24
  0  3      4    115     13   2966    0    0  4856  2098  505  564 15  1 59 26
  0  3      4    115     13   2966    0    0  1608  2314  447  413  4  0 63 33
  0  3      4    116     13   2966    0    0  6206  1664  442  649 18  1 52 29
  1  1      4    115     13   2966    0    0  1886  1262  464  412  5  0 60 35
  0  3      4    118     13   2964    0    0  2510  4138  571  493  7  1 64 28
  1  1      4    117     13   2964    0    0  1632    56  325  373  5  0 53 42
  0  3      4    116     13   2965    0    0  5358  3510  504  649 14  1 59 26
  1  1      4    118     13   2964    0    0  2814   920  447  403  8  0 63 29

I know that wa is the time spent waiting on IO, but I lack a benchmark to determine just what I
should expect from my hardware (three 146GB U320 SCSI 10k drives in raid 5 on a Dell PERC4ei PE2850
controller). Those drives are dedicated completely to a /data mount that contains only
/data/postgresql/8.1/main. I have another two drives in raid 1 for everything else (OS, apps, etc.).
Can you give me any pointers based on that vmstat output?

Regards and Thanks,
Brendan

Re: Scrub one large table against another

From
"Jim C. Nasby"
Date:
On Wed, Oct 11, 2006 at 10:53:41AM -0600, Brendan Curran wrote:
> Interestingly, and thank you to Tom and Jim, the explicit JOIN improved
> performance tremendously (RESULTS BELOW). I converted the entire query
> to use explicit joins instead of IN and EXISTS and discovered acceptable
> performance. I think the next place to go from here is RAID1/RAID10 and
> possibly partitioning my large table (Welcome to DDL insanity, right?).

Remember that partitioning is not a magic bullet: it only helps in cases
where you need to keep a lot of data, but normally only access a small
portion of it.

WAL on RAID 5 without a really good controller will probably kill you,
and having the data there isn't much better. You'll probably be better
off with either one RAID 10 or two RAID 1s.

> I have to add that I'm a little surprised the documentation is so
> generous to IN and EXISTS. Is there something amiss in my configuration
> that prevents them from performing correctly? If not, I can't imagine a
> time when IN or EXISTS would be more performant than an explicit JOIN...

Well, IN != EXISTS != JOIN. EXISTS just stops as soon as it finds a
record. For some cases, it's equivalent to IN, but not all. IN has to
de-duplicate its list in some fashion. For small IN lists, you can do
this with an OR, but at some point you need to switch to an actual
unique (actually, I suspect the difference in PostgreSQL just depends on
whether you passed values into IN or a subquery). A join, on the other
hand, doesn't worry about duplicates at all. There may be some brains in
the planner that realize if a subquery will return a unique set (i.e.
you're querying on a primary key).
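
A contrived example of the semantic difference (not from the thread): if
the same email appears twice in suppress, the plain join returns the
matching email_record row twice, while IN returns it once.

SELECT er.email_record_id
FROM email_record er
JOIN suppress s USING (email)
WHERE er.email_list_id = 13;        -- one output row per matching pair

SELECT er.email_record_id
FROM email_record er
WHERE er.email_list_id = 13
  AND er.email IN (SELECT email FROM suppress);   -- de-duplicated by IN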

> Additionally, I manually scrub for duplicates at the group level in the
> email_record table to keep my records unique. I would like to use a
> unique constraint, but have found that batching in JDBC is impossible
> due to irrecoverable errors even when using BEFORE INSERT triggers to
> just return NULL if a record exists already. Has anyone got an elegant
> solution for the 'add only if not exists already' problem similar to
> MSSQL's MERGE command?

Your best bet (until we have something akin to MERGE, hopefully in 8.3)
is to load the data into a TEMP table and de-dupe it from there.
Depending on what you're doing you might want to delete it, or update an
ID column in the temp table. Note that this assumes only one process is
loading data at any time; if that's not the case, you have to get
trickier.
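
A sketch of that pattern with the thread's column names (the load table
name is made up, and it assumes uniqueness is per (email_list_id, email);
untested):

CREATE TEMP TABLE email_record_load (LIKE email_record);

-- ... bulk-load the incoming batch into email_record_load here ...

INSERT INTO email_record
    SELECT DISTINCT ON (l.email_list_id, l.email) l.*
    FROM email_record_load l
    WHERE NOT EXISTS (
        SELECT 1 FROM email_record er
        WHERE er.email_list_id = l.email_list_id
          AND er.email = l.email
    );
-- DISTINCT ON picks an arbitrary row per duplicate within the batch;
-- add an ORDER BY if the choice matters.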

> Just one more thing... I have found that maintaining a btree index on a
> varchar(255) value is extremely expensive on insert/update/delete. It is
> unfortunately necessary for me to maintain this index for queries and
> reports so I am transitioning to using an unindexed staging table to
> import data into before merging it with the larger table. All the docs
> and posts recommend is to drop the index, import your data, and then
> create the index again. This is untenable on a daily / bi-weekly basis.
> Is there a more elegant solution to this indexing problem?

You might be happier with tsearch than a regular index.
--
Jim Nasby                                            jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

Re: Scrub one large table against another (vmstat output)

From
Markus Schaber
Date:
Hi, Brendan,

Brendan Curran wrote:
>> What prevents you from using an aggregate function?
>
> I guess I could actually obtain the results in an aggregate function and
> use those to maintain a summary table. There is a web view that requires
> 'as accurate as possible' numbers to be queried per group (all 40 groups
> are displayed on the same page) and so constant aggregates over the
> entire table would be a nightmare.

That sounds just like a case for GROUP BY and a materialized view.
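
A bare-bones sketch of that (a summary table refreshed with a GROUP BY
after each load/scrub, since there is no built-in materialized view
support; table and column names are made up):

CREATE TABLE email_record_counts (
    email_list_id int8 PRIMARY KEY,
    record_count  int8 NOT NULL
);

-- refresh after each load/scrub cycle rather than on every page view
BEGIN;
DELETE FROM email_record_counts;
INSERT INTO email_record_counts
    SELECT email_list_id, count(*)
    FROM email_record
    GROUP BY email_list_id;
COMMIT;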

Markus
--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS

Fight against software patents in Europe! www.ffii.org
www.nosoftwarepatents.org

Re: Scrub one large table against another

From
Brendan Curran
Date:
> Well, IN != EXISTS != JOIN. Exists just stops as soon as it finds a
> record. For some cases, it's equivalent to IN, but not all. IN has to
> de-duplicate it's list in some fashion. For small IN lists, you can do
> this with an OR, but at some point you need to switch to an actual
> unique (actually, I suspect the difference in PostgreSQL just depends on
> if you passed values into IN or a subquery). A join on the other hand
> doesn't worry about duplicates at all. There may be some brains in the
> planner that realize if a subquery will return a unique set (ie: you're
> querying on a primary key).
>

I agree, and now that I consider it, it makes sense that IN would force the planner to implement some
form of uniqueness check - possibly leveraging a PK or unique index if one is already available. Maybe
I'll tack up a note to the online documentation letting people know, so that it's a little more
explicitly clear that when you use IN on data that isn't known to be unique (to the planner, i.e.
post-ANALYZE) you get the baggage of a forced unique step whether you need it or not. Or perhaps someone
who knows the internals of the planner a little better than I do should put some info up regarding that?

>
>> Just one more thing... I have found that maintaining a btree index on a
>> varchar(255) value is extremely expensive on insert/update/delete. It is
>> unfortunately necessary for me to maintain this index for queries and
>> reports so I am transitioning to using an unindexed staging table to
>> import data into before merging it with the larger table. All the docs
>> and posts recommend is to drop the index, import your data, and then
>> create the index again. This is untenable on a daily / bi-weekly basis.
>> Is there a more elegant solution to this indexing problem?
>
> You might be happier with tsearch than a regular index.

Thanks, I'll look into using tsearch2 as a possibility. From what I've seen so far it would add
quite a bit of complexity (necessary updates after inserts, proprietary query syntax that might
require a large amount of specialization from client apps) but in the end the overhead may be less
than that of maintaining the btree.

Thanks and Regards,
B

Re: Scrub one large table against another

From
Tom Lane
Date:
Brendan Curran <brendan.curran@gmail.com> writes:
> I'll tack up a note to the online documentation letting people know so
> that it's a little more explicitly clear that when you choose IN on
> data that isn't explicitly unique (to the planner i.e.  post-analyze)
> you get the baggage of a forced unique whether you need it or not. Or
> perhaps someone that knows the internals of the planner a little
> better than me should put some info up regarding that?

You get a forced unique step, period --- the planner doesn't try to
shortcut on the basis of noticing a relevant unique constraint.
We have some plan techniques that might look like they are not checking
uniqueness (eg, an "IN Join") but they really are.

This is an example of what I was talking about just a minute ago, about
not wanting to rely on constraints that could go away while the plan is
still potentially usable.  It's certainly something that we should look
at adding as soon as the plan-invalidation infrastructure is there to
make it safe to do.

            regards, tom lane