Thread: Receiving many more rows than expected
Hello, I'm processing records in batches using PG 9.1.12, python 2.7, and psycopg 2.5.2 : def enqueue_loop(q): curs = DB_HANDLER.cursor() query = """UPDATE foo SET processing = 't' WHERE id IN (SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC LIMIT 5000 FOR UPDATE) RETURNING *""" while DO_LOOP: #the whole program eventually stops if this is false results = curs.fetchall() rlen = len(results) if rlen > 0: LOG.debug("Fetched %d rows", rlen) if rlen == 0: # [...] wait for notification... continue # [...] Enqueue batch and let other threads process it. # [...] Those threads will eventually delete the processed rows from # the foo table. The problem is that sometimes (once every few days at about 2-300K queries per day) I get many more rows than the max 5000 I asked for (I've seen up to 25k). And I'm getting timeouts and other problems as a result. The id column is your typical primary key integer with a unique index. I've checked the problematic cases and there are no id gaps or duplicate rows. There are multiple threads in the program, but only the main thread is running enqueue_loop(). I'm not sure if this is a server or a driver issue. Any idea ? Thanks. -- Vincent de Phily
<div class="moz-cite-prefix">On 05/08/2014 02:09 PM, Vincent de Phily wrote:<br /></div><blockquote cite="mid:2347827.4qtYvNo0j6@moltowork"type="cite"><pre wrap="">The problem is that sometimes (once every few days at about2-300K queries per day) I get many more rows than the max 5000 I asked for (I've seen up to 25k). And I'm getting timeouts and other problems as a result. </pre></blockquote> I would bet you have another copy of the code running without the limit. Maybe on a cron job.<br /> Toprove this, modify the log statement slightly and see if this statement is really returning more then 5000 rows.<br /><prewrap="">LOG.debug("1 - Fetched %d rows", rlen) or similar. </pre><br />
On 05/08/2014 04:09 AM, Vincent de Phily wrote: > Hello, > > I'm processing records in batches using PG 9.1.12, python 2.7, and psycopg > 2.5.2 : Comments in the code below: > > def enqueue_loop(q): > curs = DB_HANDLER.cursor() > query = """UPDATE foo SET processing = 't' WHERE id IN > (SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC LIMIT > 5000 FOR UPDATE) RETURNING *""" # Where is this query actually run? > while DO_LOOP: #the whole program eventually stops if this is false # What cause DO_LOOP to go false? > results = curs.fetchall() > rlen = len(results) > if rlen > 0: > LOG.debug("Fetched %d rows", rlen) # What do you see in LOG for rlen values? > if rlen == 0: > # [...] wait for notification... > continue > # [...] Enqueue batch and let other threads process it. > # [...] Those threads will eventually delete the processed rows from > # the foo table. > > > The problem is that sometimes (once every few days at about 2-300K queries per > day) I get many more rows than the max 5000 I asked for (I've seen up to 25k). > And I'm getting timeouts and other problems as a result. > > The id column is your typical primary key integer with a unique index. I've > checked the problematic cases and there are no id gaps or duplicate rows. > There are multiple threads in the program, but only the main thread is running > enqueue_loop(). I'm not sure if this is a server or a driver issue. > > > Any idea ? Thanks. > > -- Adrian Klaver adrian.klaver@aklaver.com
On Thursday 08 May 2014 06:30:39 Adrian Klaver wrote: > On 05/08/2014 04:09 AM, Vincent de Phily wrote: > > Hello, > > > > I'm processing records in batches using PG 9.1.12, python 2.7, and psycopg > > > 2.5.2 : > Comments in the code below: > > def enqueue_loop(q): > > curs = DB_HANDLER.cursor() > > query = """UPDATE foo SET processing = 't' WHERE id IN > > > > (SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC > > LIMIT > > > > 5000 FOR UPDATE) RETURNING *""" > > # Where is this query actually run? Sorry, I edited my code too much, it's actually executed at the first line of the loop, I added it below. > > while DO_LOOP: #the whole program eventually stops if this is false > > # What cause DO_LOOP to go false? Either when receiving a signal from the OS (registered with "signal.signal(signal.SIGINT, stop_main)") or when the topmost try-catch- reinitialize-retry loop has caught a quick sucession of exceptions. DO_LOOP is tested in a few places where we can make a clean exit. A cronjob will restart the process if it is not or badly running. > > curs.execute(query) > > results = curs.fetchall() > > rlen = len(results) > > > > if rlen > 0: > > LOG.debug("Fetched %d rows", rlen) > > # What do you see in LOG for rlen values? The histogram shows a large amount of small values, progressively becoming rarer for bigger values, up to value 5000 which is very frequent again (depending on the day, between 0.5 and 5% of queries return the maximum number of rows), then a handfull freak values of ~9k, 11k, 15k, 33k, 46k, 107k, etc which cause my problems. > > if rlen == 0: > > # [...] wait for notification... > > continue > > > > # [...] Enqueue batch and let other threads process it. > > # [...] Those threads will eventually delete the processed rows > > from > > # the foo table. > > > > The problem is that sometimes (once every few days at about 2-300K queries > > per day) I get many more rows than the max 5000 I asked for (I've seen up > > to 25k). And I'm getting timeouts and other problems as a result. > > > > The id column is your typical primary key integer with a unique index. > > I've > > checked the problematic cases and there are no id gaps or duplicate rows. > > There are multiple threads in the program, but only the main thread is > > running enqueue_loop(). I'm not sure if this is a server or a driver > > issue. > > > > > > Any idea ? Thanks. -- Vincent de Phily
On 05/08/2014 03:11 PM, Vincent de Phily wrote: > On Thursday 08 May 2014 06:30:39 Adrian Klaver wrote: >> On 05/08/2014 04:09 AM, Vincent de Phily wrote: >>> Hello, >>> > > DO_LOOP is tested in a few places where we can make a clean exit. A cronjob > will restart the process if it is not or badly running. > >>> curs.execute(query) >>> results = curs.fetchall() >>> rlen = len(results) >>> >>> if rlen > 0: >>> LOG.debug("Fetched %d rows", rlen) >> >> # What do you see in LOG for rlen values? > > The histogram shows a large amount of small values, progressively becoming > rarer for bigger values, up to value 5000 which is very frequent again > (depending on the day, between 0.5 and 5% of queries return the maximum number > of rows), then a handfull freak values of ~9k, 11k, 15k, 33k, 46k, 107k, etc > which cause my problems. So just to be clear this the value for rlen and that value and log message only appear in the loop above? As Sim suggested, it might be worth it to add a unique id to the above message to prove or not that the values are coming from where you think they are. > >>> if rlen == 0: >>> # [...] wait for notification... >>> continue >>> >>> # [...] Enqueue batch and let other threads process it. >>> # [...] Those threads will eventually delete the processed rows >>> from >>> # the foo table. >>> >>> The problem is that sometimes (once every few days at about 2-300K queries >>> per day) I get many more rows than the max 5000 I asked for (I've seen up >>> to 25k). And I'm getting timeouts and other problems as a result. And there is no instance of the UPDATE query that is unconstrained and a code path to that query? Or as Sim suggested another copy of this code without the LIMIT? Also what happens in the rlen == 0 case? >>> Any idea ? Thanks. > -- Adrian Klaver adrian.klaver@aklaver.com
On Thursday 08 May 2014 16:05:59 Sim Zacks wrote: > On 05/08/2014 02:09 PM, Vincent de Phily wrote: > The problem is that sometimes (once every few days at about 2-300K queries > per day) I get many more rows than the max 5000 I asked for (I've seen up > to 25k). And I'm getting timeouts and other problems as a result. > I would bet you have another copy of the code running without the limit. > Maybe on a cron job. To prove this, modify the log statement slightly and > see if this statement is really returning more then 5000 rows. LOG.debug("1 > - Fetched %d rows", rlen) or similar. I assure you I don't :/ The program is only 350 lines including plenty of good comments, it'd be hard to miss a rogue query. Exhaustive list of queries: - pg_try_advisory_lock to make sure that only one instance is running at a time - set processing='f' during startup - delete from foo where id in (fully_processed_records) - listen for the notification from an "on insert" trigger on table foo - select highest and lowest record in foo for monitoring purpose -- Vincent de Phily
On Thursday 08 May 2014 16:56:25 Adrian Klaver wrote: > On 05/08/2014 03:11 PM, Vincent de Phily wrote: > > On Thursday 08 May 2014 06:30:39 Adrian Klaver wrote: > >> On 05/08/2014 04:09 AM, Vincent de Phily wrote: > > The histogram shows a large amount of small values, progressively becoming > > rarer for bigger values, up to value 5000 which is very frequent again > > (depending on the day, between 0.5 and 5% of queries return the maximum > > number of rows), then a handfull freak values of ~9k, 11k, 15k, 33k, 46k, > > 107k, etc which cause my problems. > > So just to be clear this the value for rlen and that value and log > message only appear in the loop above? Yes, and the problematic query is only executed in this loop. For clarity's sake I had left out bits of the code concering the notification. In case it changes anything, this is the uncut (but still anonimized) function: def enqueue_loop(q): global DO_LOOP poll = select.poll() poll.register(DB_HANDLER.fileno(), select.POLLIN) curs = DB_HANDLER.cursor() query = """UPDATE foo SET processing = 't' WHERE id IN (SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC LIMIT %d FOR UPDATE) RETURNING *""" % (conf_getint('DEFAULT', 'push_count', 5000),) while DO_LOOP: # Fetch data and pop notifications. curs.execute(query) results = curs.fetchall() rlen = len(results) nlen = len(DB_HANDLER.notifies) while DB_HANDLER.notifies: DB_HANDLER.notifies.pop() if rlen > 0 or nlen > 0: LOG.debug("Fetched %d rows, poped %d notifs, %d batches in queue", rlen, nlen, q.qsize()) # If we get nothing, wait for notifications. Try again anyway once a # minute to make sure that connection is ok. if rlen == 0: LOG.debug("Waiting for a notification") poll.poll(60000) continue # Got a batch of rows, enqueue them. results.sort() # Sort is needed here (compares the 1st element of the # tuple) because we do not have the guarantee that # "UPDATE ... (SELECT ORDER BY ..) RETURNING *" returns # ordered rows. q.put(results) Before you grow suspicious of that conf_getint, the config is loaded once at program startup, and the overlarge results hapen together with normal results without having left the loop. Just in case I'm now logging the query string anyway; who knows... > >>> if rlen == 0: > >>> # [...] wait for notification... > >>> continue > >>> > >>> # [...] Enqueue batch and let other threads process it. > >>> # [...] Those threads will eventually delete the processed > >>> rows > >>> from > >>> # the foo table. > >>> > >>> The problem is that sometimes (once every few days at about 2-300K > >>> queries > >>> per day) I get many more rows than the max 5000 I asked for (I've seen > >>> up > >>> to 25k). And I'm getting timeouts and other problems as a result. > > And there is no instance of the UPDATE query that is unconstrained and a > code path to that query? > > Or as Sim suggested another copy of this code without the LIMIT? Again, no. I'm now logging the query string just in case, but I can't imagine that the value of the limit would change at runtime unless I have somehow managed to corrupt the memory of a pure python program. > Also what happens in the rlen == 0 case? See the unabridged code above. We wait up to 1 minute for a notification to arrive, and then we restart the loop. -- Vincent de Phily
Vincent de Phily <vincent.dephily@mobile-devices.fr> writes: > In case it changes anything, this is the uncut (but still anonimized) > function: > query = """UPDATE foo SET processing = 't' WHERE id IN > (SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC LIMIT %d > FOR UPDATE) > RETURNING *""" % (conf_getint('DEFAULT', 'push_count', 5000),) Well, of course this view of things exposes a relevant failure mode you hadn't mentioned: maybe sometimes the conf_getint() call returns something other than 5000? regards, tom lane
On Friday 09 May 2014 07:01:32 Tom Lane wrote: > Vincent de Phily <vincent.dephily@mobile-devices.fr> writes: > > In case it changes anything, this is the uncut (but still anonimized) > > > > function: > > query = """UPDATE foo SET processing = 't' WHERE id IN > > > > (SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC > > LIMIT %d > > > > FOR UPDATE) > > > > RETURNING *""" % (conf_getint('DEFAULT', 'push_count', 5000),) > > Well, of course this view of things exposes a relevant failure mode > you hadn't mentioned: maybe sometimes the conf_getint() call returns > something other than 5000? True. But I've commented already that I'd be very surprised (and wouldn't know how to begin) if that value was faulty (even though it would explain things nicely), because * It is parsed once at program start (using python's ConfigParser library) * It has the correct value of 5000 in most cases (as demonstrated by the frequency of number of rows returned) * There is no sign that I exited the loop (and therefore got the opportunity to change the value of the query) before I start receiving overlong results. Still, I agree it's suspicious, so I'm now logging the query string whenever I get over 5000 results (hardcoded). We'll see next time it happens. -- Vincent de Phily
On 05/09/2014 05:36 AM, Vincent de Phily wrote: > On Friday 09 May 2014 07:01:32 Tom Lane wrote: >> Vincent de Phily <vincent.dephily@mobile-devices.fr> writes: >>> In case it changes anything, this is the uncut (but still anonimized) >>> >>> function: >>> query = """UPDATE foo SET processing = 't' WHERE id IN >>> >>> (SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC >>> LIMIT %d >>> >>> FOR UPDATE) >>> >>> RETURNING *""" % (conf_getint('DEFAULT', 'push_count', 5000),) >> >> Well, of course this view of things exposes a relevant failure mode >> you hadn't mentioned: maybe sometimes the conf_getint() call returns >> something other than 5000? > > True. But I've commented already that I'd be very surprised (and wouldn't know > how to begin) if that value was faulty (even though it would explain things > nicely), because > * It is parsed once at program start (using python's ConfigParser library) What is parsed? > * It has the correct value of 5000 in most cases (as demonstrated by the > frequency of number of rows returned) Yes, but those are not the ones of interest. > * There is no sign that I exited the loop (and therefore got the opportunity > to change the value of the query) before I start receiving overlong results. Not sure I follow, You are passing a function as a parameter, it would be re-run each time the query was run in: curs.execute(query) FYI, the psycopg2 docs recommend you not use the parameter passing method above as it is susceptible to SQL injection: http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries > > Still, I agree it's suspicious, so I'm now logging the query string whenever I > get over 5000 results (hardcoded). We'll see next time it happens. > -- Adrian Klaver adrian.klaver@aklaver.com
On 05/09/2014 01:45 AM, Vincent de Phily wrote: > On Thursday 08 May 2014 16:56:25 Adrian Klaver wrote: >> On 05/08/2014 03:11 PM, Vincent de Phily wrote: >>> On Thursday 08 May 2014 06:30:39 Adrian Klaver wrote: >>>> On 05/08/2014 04:09 AM, Vincent de Phily wrote: > > Before you grow suspicious of that conf_getint, the config is loaded once at > program startup, and the overlarge results hapen together with normal results > without having left the loop. Just in case I'm now logging the query string > anyway; who knows... As proof of concept that the value can change in the loop: In [17]: l = [1, 2, 3, 4, 5] In [18]: def getNumber(): return random.choice(l) ....: In [19]: ct = 0 In [20]: while ct < 5: s = 'Lucky number is %d' % (getNumber(),) ct += 1 print s ....: Lucky number is 5 Lucky number is 5 Lucky number is 4 Lucky number is 3 Lucky number is 2 -- Adrian Klaver adrian.klaver@aklaver.com
On Friday 09 May 2014 06:52:33 Adrian Klaver wrote: > On 05/09/2014 05:36 AM, Vincent de Phily wrote: > > On Friday 09 May 2014 07:01:32 Tom Lane wrote: > >> Vincent de Phily <vincent.dephily@mobile-devices.fr> writes: > >>> In case it changes anything, this is the uncut (but still anonimized) > >>> > >>> function: > >>> query = """UPDATE foo SET processing = 't' WHERE id IN > >>> > >>> (SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC > >>> LIMIT %d > >>> > >>> FOR UPDATE) > >>> > >>> RETURNING *""" % (conf_getint('DEFAULT', 'push_count', > >>> 5000),) > >> > >> Well, of course this view of things exposes a relevant failure mode > >> you hadn't mentioned: maybe sometimes the conf_getint() call returns > >> something other than 5000? > > > > True. But I've commented already that I'd be very surprised (and wouldn't > > know how to begin) if that value was faulty (even though it would explain > > things nicely), because > > * It is parsed once at program start (using python's ConfigParser library) > > What is parsed? The ini-style config files which contain the 'push_count' setting in the 'DEFAULT' section : def conf_getint(section, option, default='required'): return conf_get(section, option, default, lambda(c):int(c)) def conf_get(section, option, default='required', treat=lambda(c):c): try: return treat(CONFIG.get(section, option)) except Exception, e: if default == 'required': raise Exception("Can't get required value for %s %s: %s" % (section, option, e)) else: LOG.warning("Using default value %s for %s %s: %s", default, section, option, e) return default if __name__ == '__main__': # [...] configfiles = sys.argv[1:] CONFIG = ConfigParser.SafeConfigParser() CONFIG.read(configfiles) > > * It has the correct value of 5000 in most cases (as demonstrated by the > > frequency of number of rows returned) > > Yes, but those are not the ones of interest. In know, but it shows that I'm not always getting a wild value to begin with. I'm getting the expected/configured/default value of 5000, not some overflowed integer nonsense. > > * There is no sign that I exited the loop (and therefore got the > > opportunity> > > to change the value of the query) before I start receiving overlong > > results. > Not sure I follow, You are passing a function as a parameter, it > would be re-run each time the query was run in: > > curs.execute(query) No, curs.execute(query) does not re-run conf_getint(). The "query" variable is a standard python string that already got formated. And since that formating is done outside the loop, I need to exit the loop in order to re-format the query string with a possibly messed-up value. But the only way to exit-and-reenter that loop is to get an exception, which would show up in the logs. And since I don't see an exception in the logs at those times, I must have stayed in that loop, with the same value of "query" I have had since the begining. > FYI, the psycopg2 docs recommend you not use the parameter passing > method above as it is susceptible to SQL injection: > > http://initd.org/psycopg/docs/usage.html#passing-parameters-to-sql-queries I know, but * The fact that I'm using %d, that I coerced my parameter to be an int, and that this parameter is not user-controled make things safe. * I do not want to reformat and refetch the value each time I run execute(), as my query is the same for the whole lifetime of the program, so it would be a waste of CPU. * I could use psycopg's mogrify() instead of python's %, but there's simply no need in this case. I promess that I do when it's needed :) Thanks to all for taking an interest so far, this bug is... weird. -- Vincent de Phily
On Friday 09 May 2014 08:01:47 Adrian Klaver wrote: > On 05/09/2014 01:45 AM, Vincent de Phily wrote: > > Before you grow suspicious of that conf_getint, the config is loaded once > > at program startup, and the overlarge results hapen together with normal > > results without having left the loop. Just in case I'm now logging the > > query string anyway; who knows... > > As proof of concept that the value can change in the loop: > > > In [17]: l = [1, 2, 3, 4, 5] > > In [18]: def getNumber(): > return random.choice(l) > ....: > > In [19]: ct = 0 > > In [20]: while ct < 5: > s = 'Lucky number is %d' % (getNumber(),) > ct += 1 > print s > ....: > Lucky number is 5 > Lucky number is 5 > Lucky number is 4 > Lucky number is 3 > Lucky number is 2 Sure, but that's not what I'm doing at all. My "str" % (somefunc(),) is executed outside the loop, not inside it. This is python, not lisp. The variable "query" is a plain string, not a reference to how that string was created. -- Vincent de Phily
Vincent de Phily wrote > On Friday 09 May 2014 06:52:33 Adrian Klaver wrote: >> On 05/09/2014 05:36 AM, Vincent de Phily wrote: >> > On Friday 09 May 2014 07:01:32 Tom Lane wrote: >> >> Vincent de Phily < > vincent.dephily@ > > writes: >> >>> In case it changes anything, this is the uncut (but still anonimized) >> >>> >> >>> function: >> >>> query = """UPDATE foo SET processing = 't' WHERE id IN >> >>> >> >>> (SELECT id FROM foo WHERE processing = 'f' ORDER BY id >> ASC >> >>> LIMIT %d >> >>> >> >>> FOR UPDATE) >> >>> >> >>> RETURNING *""" % (conf_getint('DEFAULT', 'push_count', >> >>> 5000),) >> >> > > Thanks to all for taking an interest so far, this bug is... weird. This seems to likely be the same, still open, bug reported previously: No Number Assigned: http://www.postgresql.org/message-id/CANCipfpfzoYnOz5jj=UZ70_R=CwDHv36dqWSpwsi27vpm1z5sA@mail.gmail.com #8464 http://www.postgresql.org/message-id/E1VN53g-0002Iy-Il@wrigleys.postgresql.org #8470 is referenced in the first thread as well...though that is specifically a performance issue and not a query bug. The recommended work-around is to move the sub-query using the "FOR UPDATE" into a CTE. David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Receiving-many-more-rows-than-expected-tp5803179p5803406.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
On 05/09/2014 08:14 AM, Vincent de Phily wrote: > On Friday 09 May 2014 08:01:47 Adrian Klaver wrote: >> On 05/09/2014 01:45 AM, Vincent de Phily wrote: >>> Before you grow suspicious of that conf_getint, the config is loaded once >>> at program startup, and the overlarge results hapen together with normal >>> results without having left the loop. Just in case I'm now logging the >>> query string anyway; who knows... >> >> As proof of concept that the value can change in the loop: >> >> >> In [17]: l = [1, 2, 3, 4, 5] >> >> In [18]: def getNumber(): >> return random.choice(l) >> ....: >> >> In [19]: ct = 0 >> >> In [20]: while ct < 5: >> s = 'Lucky number is %d' % (getNumber(),) >> ct += 1 >> print s >> ....: >> Lucky number is 5 >> Lucky number is 5 >> Lucky number is 4 >> Lucky number is 3 >> Lucky number is 2 > > Sure, but that's not what I'm doing at all. My "str" % (somefunc(),) is > executed outside the loop, not inside it. This is python, not lisp. The > variable "query" is a plain string, not a reference to how that string was > created. My mistake, what I get for trying to throw something together while heading out the door :( > > -- Adrian Klaver adrian.klaver@aklaver.com
On 05/09/2014 08:04 AM, Vincent de Phily wrote: > On Friday 09 May 2014 06:52:33 Adrian Klaver wrote: > > > Thanks to all for taking an interest so far, this bug is... weird. Yes weird. I tried to replicate your query to see if I could see anything similar. Tried it on 9.0.17 and 9.3.4 and could not get it to fail. Did hard coding the limit value change anything? > > -- Adrian Klaver adrian.klaver@aklaver.com
On 05/09/2014 08:36 AM, David G Johnston wrote: > Vincent de Phily wrote >> On Friday 09 May 2014 06:52:33 Adrian Klaver wrote: >>> On 05/09/2014 05:36 AM, Vincent de Phily wrote: >>>> On Friday 09 May 2014 07:01:32 Tom Lane wrote: >>>>> Vincent de Phily < > >> vincent.dephily@ > >> > writes: >>>>>> In case it changes anything, this is the uncut (but still anonimized) >>>>>> >>>>>> function: >>>>>> query = """UPDATE foo SET processing = 't' WHERE id IN >>>>>> >>>>>> (SELECT id FROM foo WHERE processing = 'f' ORDER BY id >>> ASC >>>>>> LIMIT %d >>>>>> >>>>>> FOR UPDATE) >>>>>> >>>>>> RETURNING *""" % (conf_getint('DEFAULT', 'push_count', >>>>>> 5000),) >>>>> >> >> Thanks to all for taking an interest so far, this bug is... weird. > > This seems to likely be the same, still open, bug reported previously: > > No Number Assigned: > http://www.postgresql.org/message-id/CANCipfpfzoYnOz5jj=UZ70_R=CwDHv36dqWSpwsi27vpm1z5sA@mail.gmail.com I ran Olivers pgbug.py against 9.0.17 and 9.3.4 and can confirm that I see the same error. When I turned off autovacuum on 9.3 I did not see the error which also confirms Olivers observation. Then I rewrote the program as a plpythonu function and ran it in the 9.0.17 instance. It ran successfully. It would seem this is either a psycopg2 issue or a client/server connection problem. Still trying to figure where to go from here. > > #8464 > http://www.postgresql.org/message-id/E1VN53g-0002Iy-Il@wrigleys.postgresql.org > > #8470 is referenced in the first thread as well...though that is > specifically a performance issue and not a query bug. > > The recommended work-around is to move the sub-query using the "FOR UPDATE" > into a CTE. > > David J. > > > > > -- > View this message in context: http://postgresql.1045698.n5.nabble.com/Receiving-many-more-rows-than-expected-tp5803179p5803406.html > Sent from the PostgreSQL - general mailing list archive at Nabble.com. > > -- Adrian Klaver adrian.klaver@aklaver.com
On Friday 09 May 2014 16:53:49 Adrian Klaver wrote: > On 05/09/2014 08:04 AM, Vincent de Phily wrote: > > Thanks to all for taking an interest so far, this bug is... weird. > > Yes weird. I tried to replicate your query to see if I could see > anything similar. Tried it on 9.0.17 and 9.3.4 and could not get it to fail. > > Did hard coding the limit value change anything? I didn't hardcode the formating, just the test to display more debug when the bug happens: > if rlen != curs.rowcount or rlen > 5000: > LOG.error("Fetched strange number of rows %d %d %s", rlen, curs.rowcount, > query) And I got 2 instances of the bug during the weekend. Annoyingly, the debug showed as expected that rlen == curs.rowcount and that the query does ask for 5000 records. So the querystring formating is apparently not at fault. I also verified that none of the returned row ids overlapped (as expected). So... Still not solved :( I'm running out of ideas, but there are still two things I want to try: * Logging all queries on the server side until I catch at least one instance of the bug. Is there a way to log the number of rows affected by a query ? * Create a new cursor for each execute(), in case there somehow is some leftover state between two execute() calls. However I did a number of code changes today; they should be unrelated but I'll let them run on their own for a while to verify that. -- Vincent de Phily
Did you try rewriting the query to avoid using an IN expression? UPDATE foo SET processing = 't' FROM ( SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC LIMIT 5000 FOR UPDATE ) src WHERE foo.id = src.id; The workaround I mentioned above said that a CTE was needed but I'm thinking that a simply FROM would be just as effective. Otherwise: UPDATE foo SET processing = 't' FROM ( WITH ids_to_update AS ( SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC LIMIT 5000 FOR UPDATE ) SELECT id FROM ids_to_update ) src WHERE foo.id = src.id; David J. -- View this message in context: http://postgresql.1045698.n5.nabble.com/Receiving-many-more-rows-than-expected-tp5803179p5803687.html Sent from the PostgreSQL - general mailing list archive at Nabble.com.
On Friday 09 May 2014 08:36:04 David G Johnston wrote: > This seems to likely be the same, still open, bug reported previously: > > No Number Assigned: > http://www.postgresql.org/message-id/CANCipfpfzoYnOz5jj=UZ70_R=CwDHv36dqWSpw > si27vpm1z5sA@mail.gmail.com > > #8464 > http://www.postgresql.org/message-id/E1VN53g-0002Iy-Il@wrigleys.postgresql.o > rg > > #8470 is referenced in the first thread as well...though that is > specifically a performance issue and not a query bug. > > The recommended work-around is to move the sub-query using the "FOR UPDATE" > into a CTE. Thanks for those pointers, it certainly looks like the same issue (the only difference being the size of the limit) and there has been a good amount of discussion there. I'll try the CTE workaround, although that leaves a bad taste in my mouth. From the discussions in the 1st thread, I wonder wether raising the isolation level to repeatable read would also fix the issue ? In any case, testing will take time because the bug triggers less than once a day and I haven't yet managed to reproduce it locally. ps: sorry I'm only seeing your answer now, it helps if you cc me when answering the list. -- Vincent de Phily
On Monday 12 May 2014 10:10:36 David G Johnston wrote: > Did you try rewriting the query to avoid using an IN expression? > > UPDATE foo SET processing = 't' > FROM ( > SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC LIMIT 5000 FOR > UPDATE > ) src > WHERE foo.id = src.id; > > The workaround I mentioned above said that a CTE was needed but I'm thinking > that a simply FROM would be just as effective. Otherwise: > > UPDATE foo SET processing = 't' > FROM ( > WITH ids_to_update AS ( > SELECT id FROM foo WHERE processing = 'f' ORDER BY id ASC LIMIT 5000 FOR > UPDATE > ) > SELECT id FROM ids_to_update > ) src > WHERE foo.id = src.id; As it happens, I've done a fair bit of refactoring in my code (but not the actual query), and now I cannot reproduce the bug anymore :/ The refactoring had to do with taking status queries to a different connection, and changing the timing of calling the problematic query and interruption by other threads, to increase throughput. -- Vincent de Phily