Thread: tricky query

tricky query

From
"Merlin Moncure"
Date:
I need a fast way (sql only preferred) to solve the following problem:

I need the smallest integer that is greater than zero that is not in the
column of a table.  In other words, if an 'id' column has values
1,2,3,4,6 and 7, I need a query that returns the value of 5.

I've already worked out a query using generate_series (not scalable) and
pl/pgsql.  An SQL only solution would be preferred, am I missing
something obvious?

Merlin

Re: tricky query

From
Bruno Wolff III
Date:
On Tue, Jun 28, 2005 at 10:21:16 -0400,
  Merlin Moncure <merlin.moncure@rcsonline.com> wrote:
> I need a fast way (sql only preferred) to solve the following problem:
>
> I need the smallest integer that is greater than zero that is not in the
> column of a table.  In other words, if an 'id' column has values
> 1,2,3,4,6 and 7, I need a query that returns the value of 5.
>
> I've already worked out a query using generate_series (not scalable) and
> pl/pgsql.  An SQL only solution would be preferred, am I missing
> something obvious?

I would expect that using generate_series from 1 to the max (using
order by and limit 1 to avoid extra sequential scans) and subtracting
out the current list using except and then taking the minimum value
would be the best way to do this if the list is pretty dense and
you don't want to change the structure.
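
Untested sketch of that idea (assuming a table t with an integer
column id; it returns NULL when there is no gap below the max):

  SELECT min(x) AS first_free
    FROM (SELECT generate_series(1, (SELECT id FROM t
                                      ORDER BY id DESC LIMIT 1)) AS x
          EXCEPT
          SELECT id FROM t) ss;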

If it is sparse, then you can do a special check for 1, and if that
is present find the first row whose successor is not in the table.
That shouldn't be too slow.
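
Roughly (again assuming table t with integer column id; untested):

  -- special-case 1 first:
  SELECT 1 AS first_free
   WHERE NOT EXISTS (SELECT 1 FROM t WHERE id = 1);

  -- if 1 is present, find the first row whose successor is missing:
  SELECT id + 1 AS first_free
    FROM t
   WHERE NOT EXISTS (SELECT 1 FROM t t2 WHERE t2.id = t.id + 1)
   ORDER BY id LIMIT 1;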

If you are willing to change the structure you might keep one row for
each number and use a flag to mark which ones are empty. If there are
relatively few empty rows at any time, then you can create a partial
index on the row number for only empty rows.
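
For example (a sketch; the names are made up):

  CREATE TABLE slots (id integer PRIMARY KEY, empty boolean NOT NULL);
  CREATE INDEX slots_empty_idx ON slots (id) WHERE empty;

  -- first free slot:
  SELECT id FROM slots WHERE empty ORDER BY id LIMIT 1;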

Re: tricky query

From
John A Meinel
Date:
Merlin Moncure wrote:

>I need a fast way (sql only preferred) to solve the following problem:
>
>I need the smallest integer that is greater than zero that is not in the
>column of a table.  In other words, if an 'id' column has values
>1,2,3,4,6 and 7, I need a query that returns the value of 5.
>
>I've already worked out a query using generate_series (not scalable) and
>pl/pgsql.  An SQL only solution would be preferred, am I missing
>something obvious?
>
>Merlin
>
>

Not so bad. Try something like this:

SELECT min(id+1) as id_new FROM table
    WHERE (id+1) NOT IN (SELECT id FROM table);

Now, this requires probably a sequential scan, but I'm not sure how you
can get around that.
Maybe if you got trickier and did some ordering and limits. The above
seems to give the right answer, though.

I don't know how big you want to scale to.

You might try something like:
SELECT id+1 as id_new FROM t
    WHERE (id+1) NOT IN (SELECT id FROM t)
    ORDER BY id LIMIT 1;

John
=:->



Re: tricky query

From
John A Meinel
Date:
Merlin Moncure wrote:

>>Not so bad. Try something like this:
>>
>>SELECT min(id+1) as id_new FROM table
>>    WHERE (id+1) NOT IN (SELECT id FROM table);
>>
>>Now, this requires probably a sequential scan, but I'm not sure how you
>>can get around that.
>>Maybe if you got trickier and did some ordering and limits. The above
>>seems to give the right answer, though.
>>
>
>it does, but it is still faster than generate_series(), which requires
>both a seqscan and a materialization of the function.
>
>
>
>>I don't know how big you want to scale to.
>>
>>
>
>big. :)
>
>merlin
>
>

See my follow-up post, which enables an index scan. On my system with
90k rows, it takes no apparent time.
(0.000ms)
John
=:->



Re: tricky query

From
Sam Mason
Date:
Merlin Moncure wrote:
>I've already worked out a query using generate_series (not scalable) and
>pl/pgsql.  An SQL only solution would be preferred, am I missing
>something obvious?

I would be tempted to join the table to itself like:

  SELECT id+1
  FROM foo
  WHERE id > 0
    AND id NOT IN (SELECT id-1 FROM foo)
  ORDER BY id
  LIMIT 1;

Seems to work for me.  Not sure if that's good enough for you, but
it may help.

  Sam

Re: tricky query

From
John A Meinel
Date:
John A Meinel wrote:

> Merlin Moncure wrote:
>
>> I need a fast way (sql only preferred) to solve the following problem:
>>
>> I need the smallest integer that is greater than zero that is not in the
>> column of a table.  In other words, if an 'id' column has values
>> 1,2,3,4,6 and 7, I need a query that returns the value of 5.
>>
>> I've already worked out a query using generate_series (not scalable) and
>> pl/pgsql.  An SQL only solution would be preferred, am I missing
>> something obvious?
>>
>> Merlin
>>
>>
>
> Not so bad. Try something like this:
>
> SELECT min(id+1) as id_new FROM table
>    WHERE (id+1) NOT IN (SELECT id FROM table);
>
> Now, this requires probably a sequential scan, but I'm not sure how you
> can get around that.
> Maybe if you got trickier and did some ordering and limits. The above
> seems to give the right answer, though.
>
> I don't know how big you want to scale to.
>
> You might try something like:
> SELECT id+1 as id_new FROM t
>    WHERE (id+1) NOT IN (SELECT id FROM t)
>    ORDER BY id LIMIT 1;
>
> John
> =:->

Well, I was able to improve it to use appropriate index scans.
Here is the query:

SELECT t1.id+1 as id_new FROM id_test t1
    WHERE NOT EXISTS
        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
    ORDER BY t1.id LIMIT 1;

I created a test table which has 90k randomly inserted rows. And this is
what EXPLAIN ANALYZE says:

                                                               QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..12.10 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1)
   ->  Index Scan using id_test_pkey on id_test t1  (cost=0.00..544423.27 rows=45000 width=4) (actual time=0.000..0.000 rows=1 loops=1)
         Filter: (NOT (subplan))
         SubPlan
           ->  Index Scan using id_test_pkey on id_test t2  (cost=0.00..6.01 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=15)
                 Index Cond: (id = ($0 + 1))
 Total runtime: 0.000 ms
(7 rows)

The only thing I have is a primary key index on id_test(id);

John
=:->



Re: tricky query

From
"Merlin Moncure"
Date:
> Not so bad. Try something like this:
>
> SELECT min(id+1) as id_new FROM table
>     WHERE (id+1) NOT IN (SELECT id FROM table);
>
> Now, this requires probably a sequential scan, but I'm not sure how you
> can get around that.
> Maybe if you got trickier and did some ordering and limits. The above
> seems to give the right answer, though.

it does, but it is still faster than generate_series(), which requires
both a seqscan and a materialization of the function.

> I don't know how big you want to scale to.

big. :)

merlin

Re: tricky query

From
"Merlin Moncure"
Date:
John Meinel wrote:
> See my follow-up post, which enables an index scan. On my system with
> 90k rows, it takes no apparent time.
> (0.000ms)
> John
> =:->

Confirmed.  Hats off to you, the above is some really wicked querying.
IIRC I posted the same question several months ago with no response and
had given up on it.  I think your solution (smallest X+1 not in X) is a
good candidate for general bits, so I'm passing this to varlena for
review :)

SELECT t1.id+1 as id_new FROM id_test t1
    WHERE NOT EXISTS
        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
    ORDER BY t1.id LIMIT 1;

Merlin

Re: tricky query

From
"Merlin Moncure"
Date:
> Merlin Moncure wrote:
>
> > I need a fast way (sql only preferred) to solve the following problem:
> > I need the smallest integer that is greater than zero that is not in the
> > column of a table.
> >
> > I've already worked out a query using generate_series (not scalable) and
> > pl/pgsql.  An SQL only solution would be preferred, am I missing
> > something obvious?

> Probably not, but I thought about this "brute-force" approach... :-)
> This should work well provided that:
>
> - you have a finite number of integers. Your column should have a biggest
>    integer value with a reasonable maximum like 100,000 or 1,000,000.
>    #define YOUR_MAX 99999
[...]
:-) The generate_series function does the same thing, only a little
bit faster (although less portable).

generate_series(m,n) returns the set of integers from m to n with time
complexity n - m.  I use it for cases where I need to increment
something, for example:

select now()::date + d from generate_series(0,355) as d;

returns days from today until 355 days from now.

Merlin

Re: tricky query

From
Cosimo Streppone
Date:
Merlin Moncure wrote:

> I need a fast way (sql only preferred) to solve the following problem:
> I need the smallest integer that is greater than zero that is not in the
> column of a table.
>
> I've already worked out a query using generate_series (not scalable) and
> pl/pgsql.  An SQL only solution would be preferred, am I missing
> something obvious?

Probably not, but I thought about this "brute-force" approach... :-)
This should work well provided that:

- you have a finite number of integers. Your column should have a biggest
   integer value with a reasonable maximum like 100,000 or 1,000,000.
   #define YOUR_MAX 99999

- you can accept that query execution time depends on smallest integer found.
   The bigger the found integer, the slower the execution.

Ok, so:

Create a relation "integers" (or whatever) with every single integer from 1 to
YOUR_MAX:

   CREATE TABLE integers (id integer primary key);
   INSERT INTO integers (id) VALUES (1);
   INSERT INTO integers (id) VALUES (2);
   ...
   INSERT INTO integers (id) VALUES (YOUR_MAX);
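
Assuming generate_series is available, the fill can also be done in
one statement:

   -- fill 1..YOUR_MAX in a single pass:
   INSERT INTO integers (id)
        SELECT g FROM generate_series(1, 99999) AS g;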

Create your relation:

   CREATE TABLE merlin (id integer primary key);
   <and fill it with values>

Query is simple now:

   SELECT a.id FROM integers a
     LEFT JOIN merlin b ON a.id=b.id
         WHERE b.id IS NULL
      ORDER BY a.id LIMIT 1;

Execution times with 100k tuples in "integers" and
99,999 tuples in "merlin":

   >\timing
   Timing is on.
   >select i.id from integers i left join merlin s on i.id=s.id where s.id is null order by i.id limit 1;
    99999

   Time: 233.618 ms
   >insert into merlin (id) values (99999);
   INSERT 86266614 1
   Time: 0.579 ms
   >delete from merlin where id=241;
   DELETE 1
   Time: 0.726 ms
   >select i.id from integers i left join merlin s on i.id=s.id where s.id is null order by i.id limit 1;
    241

   Time: 1.336 ms
   >

--
Cosimo


Re: tricky query

From
John A Meinel
Date:
Merlin Moncure wrote:

>John Meinel wrote:
>
>
>>See my follow-up post, which enables an index scan. On my system with
>>90k rows, it takes no apparent time.
>>(0.000ms)
>>John
>>=:->
>>
>>
>
>Confirmed.  Hats off to you, the above is some really wicked querying.
>IIRC I posted the same question several months ago with no response and
>had given up on it.  I think your solution (smallest X+1 not in X) is a
>good candidate for general bits, so I'm passing this to varlena for
>review :)
>
>SELECT t1.id+1 as id_new FROM id_test t1
>    WHERE NOT EXISTS
>        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>    ORDER BY t1.id LIMIT 1;
>
>Merlin
>
>
Just be aware that as your table fills its holes, this query gets
slower and slower.
I've been doing some testing, and it starts at 0.00 when the first empty
entry is something like 3, but when you start getting to 16k it starts
taking more like 200 ms.

So it kind of depends how your table fills (and empties I suppose).

The earlier query was slower overall (since it took 460ms to read in the
whole table).
I filled up the table such that 63713 is the first empty space, and it
takes 969ms to run.
So actually if your table is mostly full, the first form is better.

But if you are going to have 100k rows, with basically random
distribution of empties, then the NOT EXISTS works quite well.

Just be aware of the tradeoff. I'm pretty sure the WHERE NOT EXISTS will
always use a looping structure, and go through the index in order.

John
=:->



Re: tricky query

From
Sam Mason
Date:
John A Meinel wrote:
>SELECT t1.id+1 as id_new FROM id_test t1
>   WHERE NOT EXISTS
>       (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>   ORDER BY t1.id LIMIT 1;

This works well on sparse data, as it only requires as many index
accesses as it takes to find the first gap.  The simpler "NOT IN"
version that everybody seems to have posted the first time round
has a roughly constant startup time (it depends on the number of
rows, not the gap position), but the actual time it spends searching
for the gap is much lower.

I guess the version you use depends on how sparse you expect the
data to be.  If you expect your query to have to search through
more than half the table before finding the gap then you're better
off using the "NOT IN" version, otherwise the "NOT EXISTS" version
is faster -- on my system anyway.

Hope that's interesting!


  Sam

Re: tricky query

From
Cosimo Streppone
Date:
John A Meinel wrote:
> John A Meinel wrote:
>> Merlin Moncure wrote:
>>
>>> I need the smallest integer that is greater than zero that is not in the
>>> column of a table.  In other words, if an 'id' column has values
>>> 1,2,3,4,6 and 7, I need a query that returns the value of 5.
>>
 >> [...]
 >
> Well, I was able to improve it to use appropriate index scans.
> Here is the query:
>
> SELECT t1.id+1 as id_new FROM id_test t1
>    WHERE NOT EXISTS
>        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>    ORDER BY t1.id LIMIT 1;

I'm very interested in this "tricky query".
Sorry John, but if I populate the `id_test' relation
with only 5 tuples with id values (10, 11, 12, 13, 14),
the result of this query is:

   cosimo=> create table id_test (id integer primary key);
   NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index 'id_test_pkey' for table 'id_test'
   CREATE TABLE
   cosimo=> insert into id_test values (10); -- and 11, 12, 13, 14
   INSERT 7457570 1
   INSERT 7457571 1
   INSERT 7457572 1
   INSERT 7457573 1
   INSERT 7457574 1
   cosimo=> SELECT t1.id+1 as id_new FROM id_test t1 WHERE NOT EXISTS (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1) ORDER BY t1.id LIMIT 1;
    id_new
   --------
        15
   (1 row)

which, if I understand correctly, is the wrong answer to the problem.
At this point, I'm starting to think I need some sleep... :-)

--
Cosimo


Re: tricky query

From
John A Meinel
Date:
Merlin Moncure wrote:

>>On Tue, Jun 28, 2005 at 12:02:09 -0400,
>>  Merlin Moncure <merlin.moncure@rcsonline.com> wrote:
>>
>>>Confirmed.  Hats off to you, the above is some really wicked querying.
>>>IIRC I posted the same question several months ago with no response and
>>>had given up on it.  I think your solution (smallest X+1 not in X) is a
>>>good candidate for general bits, so I'm passing this to varlena for
>>>review :)
>>>
>>>SELECT t1.id+1 as id_new FROM id_test t1
>>>    WHERE NOT EXISTS
>>>        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>>>    ORDER BY t1.id LIMIT 1;
>>>
>>You need to rework this to check to see if row '1' is missing. The
>>above returns the start of the first gap after the first row that
>>isn't missing.
>>
>
>Correct.
>
>In fact, I left out a detail in my original request in that I had a
>starting value (easily supplied with where clause)...so what I was
>really looking for was a query which started at a supplied value and
>looped forwards looking for an empty slot.  John's supplied query is a
>drop-in replacement for a plpgsql routine which does exactly this.
>
>The main problem with the generate_series approach is that there is no
>convenient way to determine the upper bound to supply.  Also, in some
>corner cases of my problem domain the performance was not good.
>
>Merlin
>
>
Actually, if you already have a lower bound, then you can change it to:

SELECT t1.id+1 as id_new FROM id_test t1
    WHERE t1.id > id_min
    AND NOT EXISTS
        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
    ORDER BY t1.id LIMIT 1;

This would actually really help performance if you have a large table
and then empty entries start late.

On my system, where the first empty entry is around 64k, doing where id > 60000
speeds it up back to 80ms instead of 1000ms.
John
=:->



Re: tricky query

From
"Merlin Moncure"
Date:
> On Tue, Jun 28, 2005 at 12:02:09 -0400,
>   Merlin Moncure <merlin.moncure@rcsonline.com> wrote:
> >
> > Confirmed.  Hats off to you, the above is some really wicked querying.
> > IIRC I posted the same question several months ago with no response and
> > had given up on it.  I think your solution (smallest X+1 not in X) is a
> > good candidate for general bits, so I'm passing this to varlena for
> > review :)
> >
> > SELECT t1.id+1 as id_new FROM id_test t1
> >     WHERE NOT EXISTS
> >         (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
> >     ORDER BY t1.id LIMIT 1;
>
> You need to rework this to check to see if row '1' is missing. The
> above returns the start of the first gap after the first row that
> isn't missing.

Correct.

In fact, I left out a detail in my original request in that I had a
starting value (easily supplied with where clause)...so what I was
really looking for was a query which started at a supplied value and
looped forwards looking for an empty slot.  John's supplied query is a
drop-in replacement for a plpgsql routine which does exactly this.

The main problem with the generate_series approach is that there is no
convenient way to determine the upper bound to supply.  Also, in some
corner cases of my problem domain the performance was not good.

Merlin



Re: tricky query

From
Bruno Wolff III
Date:
On Tue, Jun 28, 2005 at 12:02:09 -0400,
  Merlin Moncure <merlin.moncure@rcsonline.com> wrote:
>
> Confirmed.  Hats off to you, the above is some really wicked querying.
> IIRC I posted the same question several months ago with no response and
> had given up on it.  I think your solution (smallest X+1 not in X) is a
> good candidate for general bits, so I'm passing this to varlena for
> review :)
>
> SELECT t1.id+1 as id_new FROM id_test t1
>     WHERE NOT EXISTS
>         (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>     ORDER BY t1.id LIMIT 1;

You need to rework this to check to see if row '1' is missing. The
above returns the start of the first gap after the first row that
isn't missing.
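
One way to cover that case (an untested sketch, not a drop-in from the
thread):

SELECT first_free FROM (
    SELECT 1 AS first_free
     WHERE NOT EXISTS (SELECT 1 FROM id_test WHERE id = 1)
    UNION ALL
    SELECT t1.id+1 FROM id_test t1
     WHERE NOT EXISTS (SELECT 1 FROM id_test t2 WHERE t2.id = t1.id+1)
) ss ORDER BY first_free LIMIT 1;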

Re: tricky query

From
"Merlin Moncure"
Date:
Cosimo wrote:
> I'm very interested in this "tricky query".
> Sorry John, but if I populate the `id_test' relation
> with only 5 tuples with id values (10, 11, 12, 13, 14),
> the result of this query is:
>
>    cosimo=> create table id_test (id integer primary key);
>    NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index 'id_test_pkey' for table 'id_test'
>    CREATE TABLE
>    cosimo=> insert into id_test values (10); -- and 11, 12, 13, 14
>    INSERT 7457570 1
>    INSERT 7457571 1
>    INSERT 7457572 1
>    INSERT 7457573 1
>    INSERT 7457574 1
>    cosimo=> SELECT t1.id+1 as id_new FROM id_test t1 WHERE NOT EXISTS (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1) ORDER BY t1.id LIMIT 1;
>     id_new
>    --------
>         15
>    (1 row)
>
> which, if I understand correctly, is the wrong answer to the problem.
> At this point, I'm starting to think I need some sleep... :-)

Correct, in that John's query returns the first empty slot above an
existing filled slot (correct behavior in my case).  You could flip
things around a bit to get around this, though.

Merlin

Re: tricky query

From
Sebastian Hennebrueder
Date:
John A Meinel wrote:

> John A Meinel wrote:
>
> Well, I was able to improve it to use appropriate index scans.
> Here is the query:
>
> SELECT t1.id+1 as id_new FROM id_test t1
>    WHERE NOT EXISTS
>        (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>    ORDER BY t1.id LIMIT 1;
>
> I created a test table which has 90k randomly inserted rows. And this is
> what EXPLAIN ANALYZE says:
>
>


As Cosimo stated, the result can be wrong. The result is always wrong
when the id with value 1 does not exist.

--
Best Regards / Viele Grüße

Sebastian Hennebrueder

----

http://www.laliluna.de

Tutorials for JSP, JavaServer Faces, Struts, Hibernate and EJB

Get support, education and consulting for these technologies - uncomplicated and cheap.


optimized counting of web statistics

From
Billy extyeightysix
Date:
Hola folks,

I have a web statistics Pg database (user agent, urls, referrer, etc)
that is part of an online web survey system. All of the data derived
from analyzing web server logs is stored in one large table with each
record representing an analyzed webserver log entry.

Currently all reports are generated when the logs are being analyzed
and before the data ever goes into the large table I mention above.
Well, the time has come to build an interface that will allow a user
to make ad-hoc queries against the stats and that is why I am emailing
the performance list.

I need to allow the user to specify any fields and values in a query.
For example,

"I want to see a report about all users from Germany that have flash
installed"  or
"I want to see a report about all users from Africa that are using FireFox 1"

I do not believe that storing all of the data in one big table is the
correct way to go about this. So, I am asking for suggestions,
pointers and any kind of info that anyone can share on how to store
this data set in an optimized manner.

Also, I have created a prototype and done some testing using the
colossal table. Actually finding all of the rows that satisfy the
query is pretty fast and is not a problem.  The bottleneck in the
whole process is actually counting each data point (how many times a
url was visited, or how many times a url referred the user to the
website). So more specifically I am wondering if there is way to store
and retrieve the data such that it speeds up the counting of the
statistics.

Lastly, this will become an open source tool that is akin to urchin,
awstats, etc. The difference is that this software is part of a suite
of tools for doing online web surveys and it maps web stats to the
survey respondent data.  This can give web site managers a very clear
view of what type of people come to the site and how those types use
the site.

Thanks in advance,

exty

Re: optimized counting of web statistics

From
Billy extyeightysix
Date:
> The bottleneck in the
> whole process is actually counting each data point (how many times a
> url was visited, or how many times a url referred the user to the
> website). So more specifically I am wondering if there is way to store
> and retrieve the data such that it speeds up the counting of the
> statistics.

This is misleading; the counting is being done by Perl.  So what is
happening is that I am locating all of the rows via a cursor and then
calculating the stats using Perl hashes.  No counting is being done in
the DB.  Maybe it would be much faster to count in the DB somehow?
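
Counting in the DB would look something like this (a guess at the
schema; assuming the big table is "hits" with a "url" column):

  SELECT url, count(*) AS visits
    FROM hits
   GROUP BY url;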


exty

Re: optimized counting of web statistics

From
Matthew Nuzum
Date:
On 6/28/05, Billy extyeightysix <exty86@gmail.com> wrote:
> Hola folks,
>
> I have a web statistics Pg database (user agent, urls, referrer, etc)
> that is part of an online web survey system. All of the data derived
> from analyzing web server logs is stored in one large table with each
> record representing an analyzed webserver log entry.
>
> Currently all reports are generated when the logs are being analyzed
> and before the data ever goes into the large table I mention above.
> Well, the time has come to build an interface that will allow a user
> to make ad-hoc queries against the stats and that is why I am emailing
> the performance list.

Load your data into a big table, then pre-process into additional
tables that have data better organized for running your reports.

For example, you may want a table that shows a sum of all hits for
each site, for each hour of the day. You may want an additional table
that shows the sum of all page views, or maybe sessions for each site
for each hour of the day.

So, if you manage a single site, each day you will add 24 new records
to the sum table.

You may want the following fields:
site (string)
atime (timestamptz)
hour_of_day (int)
day_of_week (int)
total_hits (int8)

A record may look like this:
 site               | atime                       | hour_of_day | day_of_week | total_hits
 'www.yoursite.com' | '2005-06-28 16:00:00 -0400' | 18          | 2           | 350

Index all of the fields except total_hits (unless you want a report
that shows all hours where hits were greater than x or less than x).
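
In DDL form, that might look like the following (a sketch; the table
name matches the query below, everything else is assumed):

  CREATE TABLE summary_table (
      site        text,
      atime       timestamptz,
      hour_of_day int,
      day_of_week int,
      total_hits  int8
  );
  CREATE INDEX summary_site_idx  ON summary_table (site);
  CREATE INDEX summary_atime_idx ON summary_table (atime);
  CREATE INDEX summary_hod_idx   ON summary_table (hour_of_day);
  CREATE INDEX summary_dow_idx   ON summary_table (day_of_week);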

Doing:
select sum(total_hits) as total_hits from summary_table
  where atime between (now() - '7 days'::interval) and now();
should be pretty fast.

You can also normalize your data, such as referrers, user agents, etc.,
and create similar tables to the above.
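
For instance (hypothetical names, just to illustrate the idea):

  CREATE TABLE referrers (
      id       serial PRIMARY KEY,
      referrer text UNIQUE
  );
  -- the big log table then stores a small referrer_id integer
  -- (REFERENCES referrers) instead of the full referrer string.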

In case you haven't guessed, I've already done this very thing.

I do my batch processing daily using a python script I've written. I
found that trying to do it with pl/pgsql took more than 24 hours to
process 24 hours worth of logs. I then used C# and in memory hash
tables to drop the time to 2 hours, but I couldn't get mono installed
on some of my older servers. Python proved the fastest and I can
process 24 hours worth of logs in about 15 minutes. Common reports run
in < 1 sec and custom reports run in < 15 seconds (usually).
--
Matthew Nuzum
www.bearfruit.org

Re: optimized counting of web statistics

From
Rudi Starcevic
Date:
Hi,

>I do my batch processing daily using a python script I've written. I
>found that trying to do it with pl/pgsql took more than 24 hours to
>process 24 hours worth of logs. I then used C# and in memory hash
>tables to drop the time to 2 hours, but I couldn't get mono installed
>on some of my older servers. Python proved the fastest and I can
>process 24 hours worth of logs in about 15 minutes. Common reports run
>in < 1 sec and custom reports run in < 15 seconds (usually).
>
>

When you say you do your batch processing in a Python script, do you
mean you are using 'plpython' inside PostgreSQL, or using Python to
execute select statements and crunch the data 'outside' PostgreSQL?

Your reply is very interesting.

Thanks.
Regards,
Rudi.


Re: optimized counting of web statistics

From
Matthew Nuzum
Date:
On 6/29/05, Rudi Starcevic <tech@wildcash.com> wrote:
> Hi,
>
> >I do my batch processing daily using a python script I've written. I
> >found that trying to do it with pl/pgsql took more than 24 hours to
> >process 24 hours worth of logs. I then used C# and in memory hash
> >tables to drop the time to 2 hours, but I couldn't get mono installed
> >on some of my older servers. Python proved the fastest and I can
> >process 24 hours worth of logs in about 15 minutes. Common reports run
> >in < 1 sec and custom reports run in < 15 seconds (usually).
> >
> >
>
> When you say you do your batch processing in a Python script, do you
> mean you are using 'plpython' inside PostgreSQL, or using Python to
> execute select statements and crunch the data 'outside' PostgreSQL?
>
> Your reply is very interesting.

Sorry for not making that clear... I don't use plpython, I'm using an
external python program that makes database connections, creates
dictionaries and does the normalization/batch processing in memory. It
then saves the changes to a textfile which is copied using psql.

I've tried many things and while this is RAM intensive, it is by far
the fastest approach I've found. I've also modified the python program
to optionally use disk-based dictionaries based on (I think) gdbm. This
significantly increases the time to closer to 25 min. ;-) but drops
the memory usage by an order of magnitude.

To be fair to C# and .Net, I think that python and C# can do it
equally fast, but between the time of creating the C# version and the
python version I learned some new optimization techniques. I feel that
both are powerful languages. (To be fair to python, I can write the
dictionary lookup code in 25% (approx.) fewer lines than similar hash
table code in C#. I could go on but I think I'm starting to get off
topic.)
--
Matthew Nuzum
www.bearfruit.org

Re: tricky query

From
Dawid Kuroczko
Date:
On 6/28/05, John A Meinel <john@arbash-meinel.com> wrote:

> Actually, if you already have a lower bound, then you can change it to:
>
> SELECT t1.id+1 as id_new FROM id_test t1
>     WHERE t1.id > id_min
>         AND NOT EXISTS
>         (SELECT t2.id FROM id_test t2 WHERE t2.id = t1.id+1)
>     ORDER BY t1.id LIMIT 1;
>
> This would actually really help performance if you have a large table
> and then empty entries start late.

You can also boost performance by creating a functional index!

CREATE UNIQUE INDEX id_test_id1_index ON id_test ((id+1));

...and then joining two tables and filtering results.  PostgreSQL (8.x)
will do a Merge Full Join, which will use both indexes:

SELECT t2.id+1 FROM id_test t1
  FULL OUTER JOIN id_test t2 ON (t1.id = t2.id+1)
 WHERE t1.id IS NULL LIMIT 1;

 Limit  (cost=0.00..1.52 rows=1 width=4)
   ->  Merge Full Join  (cost=0.00..1523122.73 rows=999974 width=4)
         Merge Cond: ("outer".id = ("inner".id + 1))
         Filter: ("outer".id IS NULL)
         ->  Index Scan using id_test_pkey on id_test t1  (cost=0.00..18455.71 rows=999974 width=4)
         ->  Index Scan using id_test_id1_index on id_test t2  (cost=0.00..1482167.60 rows=999974 width=4)
(6 rows)

...the only drawback is having to keep two indexes instead of just one.
But for large tables I think it is really worth it.

For my test case, the times are (1-1000000 range with 26 missing
rows):
NOT EXISTS -- 670ms
NOT IN -- 1800ms
indexed FULL OUTER -- 267ms

   Regards,
       Dawid

PS: Does it qualify for General Bits? ;-)))