Re: NOT IN doesn't use index? (fwd) - Mailing list pgsql-performance

From Joe Conway
Subject Re: NOT IN doesn't use index? (fwd)
Date
Msg-id 3EB42586.6080301@joeconway.com
Whole thread Raw
In response to Re: NOT IN doesn't use index? (fwd)  (Becky Neville <rebecca.neville@yale.edu>)
List pgsql-performance
Becky Neville wrote:
> I think that list is actually (gulp) hard coded.  It's not my query.
> I am trying to speed it up for someone else - to hopefully learn something
> in the process that isn't dependent on what version of postgres i'm
> running :)
>
> I assume it's from another table but can't find it on their data
> model at the moment.  Those are all valid billing codes.  The query
> is checking to see if anyone was billed under an invalid code.  So if
> everything is ok, the query returns nothing.

Yeah -- that sounds like there has to be a table of valid codes
somewhere. In that case you can substitute the "valid_codes" table in
the left join where I had the subselect with all the UNIONs.
Alternatively you might find a NOT EXISTS method would work best. If
there isn't a "valid_codes" table, but that hard coded list is static,
perhaps you could build one and use that.

> But there must be more to it than that...otherwise, they could just
> add a Valid flag to the lookup table.

Well I certainly wouldn't query a whole table of historical information
over and over. Can you use and date column (suitably indexed) to just
check recent transactions (like since the last time you checked)?

> If you have any ideas for speeding it up other than using another
> table please let me know.  It only takes me 9 min to run with 2 mil
> rows but it takes them 7 hours (51 mil rows in Oracle with many other
> jobs running and poor system maintenance.)

As above, are all 51 million rows recent transactions, or is that all of
eternity? If its the latter, I'd scan the whole thing once and produce a
report, or maybe a "transactions_with_invalid_codes" table.

 From that point on, I'd only check the transactions since the last time
I'd checked, either based on a timestamp or even a sequence generated id
field. All you need to do is save off the max value each time you run,
and then use that as the starting point next time.

HTH,

Joe


pgsql-performance by date:

Previous
From: Becky Neville
Date:
Subject: Re: NOT IN doesn't use index? (fwd)
Next
From: brew@theMode.com
Date:
Subject: Re: why is the db so slow?