Re: Receiving many more rows than expected - Mailing list pgsql-general

From Vincent de Phily
Subject Re: Receiving many more rows than expected
Date
Msg-id 2153573.NEZ8etDJMq@moltowork
Whole thread Raw
In response to Re: Receiving many more rows than expected  (Adrian Klaver <adrian.klaver@aklaver.com>)
Responses Re: Receiving many more rows than expected  (Adrian Klaver <adrian.klaver@aklaver.com>)
List pgsql-general
On Thursday 08 May 2014 06:30:39 Adrian Klaver wrote:
> On 05/08/2014 04:09 AM, Vincent de Phily wrote:
> > Hello,
> >
> > I'm processing records in batches using PG 9.1.12, python 2.7, and psycopg
>
> > 2.5.2 :
> Comments in the code below:
> > def enqueue_loop(q):
> >      curs = DB_HANDLER.cursor()
> >      query = """UPDATE foo SET processing = 't' WHERE id IN
> >
> >             (SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC
> >             LIMIT
> >
> >              5000 FOR UPDATE)  RETURNING *"""
>
>         # Where is this query actually run?

Sorry, I edited my code too much, it's actually executed at the first line of
the loop, I added it below.

> >      while DO_LOOP: #the whole program eventually stops if this is false
>
>         # What cause DO_LOOP to go false?

Either when receiving a signal from the OS (registered with
"signal.signal(signal.SIGINT, stop_main)") or when the topmost try-catch-
reinitialize-retry loop has caught a quick sucession of exceptions.

DO_LOOP is tested in a few places where we can make a clean exit. A cronjob
will restart the process if it is not or badly running.

> >          curs.execute(query)
> >          results = curs.fetchall()
> >          rlen = len(results)
> >
> >          if rlen > 0:
> >              LOG.debug("Fetched %d rows", rlen)
>
>             # What do you see in LOG for rlen values?

The histogram shows a large amount of small values, progressively becoming
rarer for bigger values, up to value 5000 which is very frequent again
(depending on the day, between 0.5 and 5% of queries return the maximum number
of rows), then a handfull freak values of ~9k, 11k, 15k, 33k, 46k, 107k, etc
which cause my problems.

> >          if rlen == 0:
> >              # [...] wait for notification...
> >              continue
> >
> >          # [...] Enqueue batch and let other threads process it.
> >          # [...] Those threads will eventually delete the processed rows
> >          from
> >          #       the foo table.
> >
> > The problem is that sometimes (once every few days at about 2-300K queries
> > per day) I get many more rows than the max 5000 I asked for (I've seen up
> > to 25k). And I'm getting timeouts and other problems as a result.
> >
> > The id column is your typical primary key integer with a unique index.
> > I've
> > checked the problematic cases and there are no id gaps or duplicate rows.
> > There are multiple threads in the program, but only the main thread is
> > running enqueue_loop(). I'm not sure if this is a server or a driver
> > issue.
> >
> >
> > Any idea ? Thanks.

--
Vincent de Phily


pgsql-general by date:

Previous
From: "Anand Kumar, Karthik"
Date:
Subject: Re: Oracle to PostgreSQL replication
Next
From: Adrian Klaver
Date:
Subject: Re: Receiving many more rows than expected