Thread: Default ordering option
Hi,
In this documentation, it is said:
If sorting is not chosen, the rows will be returned in an unspecified order. The actual order in that case will depend on the scan and join plan types and the order on disk, but it must not be relied on.
I would like to know if there is any way to change that to have a "real" random behaviour.
My use case:
At Doctolib, we do a lot of automatic tests.
Sometimes, people forgot to order their queries. Most of the time, there is no real problem on production. Let say, we display a user list without order.
When a developer writes a test for this feature, he will create 2 users A and B, then assert that they are displayed "[A, B]".
99% of the time the test will be ok, but sometimes, the displayed list will be "[B,A]", and the test will fail.
One solution could be to ensure random order with an even distribution, so that such failing test would be detected quicker.
Is that possible? Maybe with a plugin?
Thanks,
Cyril
On 7/23/19 8:43 AM, Cyril Champier wrote: > Hi, > > In this documentation > <https://www.postgresql.org/docs/9.1/queries-order.html>, it is said: > > If sorting is not chosen, the rows will be returned in an > unspecified order. The actual order in that case will depend on the > scan and join plan types and the order on disk, but it must not be > relied on. > > > I would like to know if there is any way to change that to have a "real" > random behaviour. > > My use case: > At Doctolib, we do a lot of automatic tests. > Sometimes, people forgot to order their queries. Most of the time, there > is no real problem on production. Let say, we display a user list > without order. > When a developer writes a test for this feature, he will create 2 users > A and B, then assert that they are displayed "[A, B]". > 99% of the time the test will be ok, but sometimes, the displayed list > will be "[B,A]", and the test will fail. > > One solution could be to ensure random order with an even distribution, > so that such failing test would be detected quicker. > > Is that possible? Maybe with a plugin? Not that I know of. A possible solution given below: test_(postgres)> insert into t1 values (1, 'dog'), (2, 'cat'), (3, 'fish'); INSERT 0 3 test_(postgres)> select * from t1 ; a | b ---+------ 1 | dog 2 | cat 3 | fish (3 rows) test_(postgres)> update t1 set b = 'dogfish' where a =1; UPDATE 1 test_(postgres)> select * from t1 ; a | b ---+--------- 2 | cat 3 | fish 1 | dogfish (3 rows) An UPDATE reorders the rows. Maybe throw an UPDATE into the test after creating the users to force an 'out of order' result? > > Thanks, > Cyril > > > -- Adrian Klaver adrian.klaver@aklaver.com
On 7/24/19 2:23 AM, Adrian Klaver wrote: > On 7/23/19 8:43 AM, Cyril Champier wrote: >> Hi, >> >> In this documentation <https://www.postgresql.org/docs/9.1/queries-order.html>, it is said: >> >> If sorting is not chosen, the rows will be returned in an >> unspecified order. The actual order in that case will depend on the >> scan and join plan types and the order on disk, but it must not be >> relied on. >> >> >> I would like to know if there is any way to change that to have a "real" random behaviour. >> >> My use case: >> At Doctolib, we do a lot of automatic tests. >> Sometimes, people forgot to order their queries. Most of the time, there is no real problem on production. Let say, wedisplay a user list without order. >> When a developer writes a test for this feature, he will create 2 users A and B, then assert that they are displayed "[A,B]". >> 99% of the time the test will be ok, but sometimes, the displayed list will be "[B,A]", and the test will fail. >> >> One solution could be to ensure random order with an even distribution, so that such failing test would be detected quicker. >> >> Is that possible? Maybe with a plugin? > > Not that I know of. > > A possible solution given below: > > test_(postgres)> insert into t1 values (1, 'dog'), (2, 'cat'), (3, 'fish'); > INSERT 0 3 > > test_(postgres)> select * from t1 ; > a | b > ---+------ > 1 | dog > 2 | cat > 3 | fish > (3 rows) > > test_(postgres)> update t1 set b = 'dogfish' where a =1; > UPDATE 1 > > test_(postgres)> select * from t1 ; > a | b > ---+--------- > 2 | cat > 3 | fish > 1 | dogfish > (3 rows) > > An UPDATE reorders the rows. Maybe throw an UPDATE into the test after creating the users to force an 'out of order' result? An UPDATE without changing any values should have the same effect, e.g. : UPDATE t1 SET b = b WHERE a = 1; Something like this WITH x AS (SELECT * FROM t1 ORDER BY a DESC) UPDATE t1 t SET a = t.a FROM x WHERE t.a = x.a would shuffle the rows into reverse order, which might be enough to catch out any missing ORDER BY (this assumes nothing else will touch the table and reorder it before the test is run). You could also order by RANDOM() but there would be a chance the rows would end up in sequential order. Regards Ian Barwick -- Ian Barwick https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Thanks for your answers.
Unfortunately the update trick only seems to work under certain conditions.
I do this to shuffle my patients table:
UPDATE "patients"
SET "updated_at" = NOW()
WHERE "patients"."id" = (SELECT "patients"."id" FROM "patients" ORDER BY random() LIMIT 1)
SET "updated_at" = NOW()
WHERE "patients"."id" = (SELECT "patients"."id" FROM "patients" ORDER BY random() LIMIT 1)
Then indeed, this query returns different order:
SELECT *
FROM "patients"
FROM "patients"
But this one (because it use an index?) always returns values in the same order:
SELECT "id"
FROM "patients"
And for the other suggestion, I cannot blindly add 'ORDER BY random()' to every select,
because of the incompatibility with distinct and union, and the way we use our orm.
On Wed, Jul 24, 2019 at 3:54 AM Ian Barwick <ian.barwick@2ndquadrant.com> wrote:
On 7/24/19 2:23 AM, Adrian Klaver wrote:
> On 7/23/19 8:43 AM, Cyril Champier wrote:
>> Hi,
>>
>> In this documentation <https://www.postgresql.org/docs/9.1/queries-order.html>, it is said:
>>
>> If sorting is not chosen, the rows will be returned in an
>> unspecified order. The actual order in that case will depend on the
>> scan and join plan types and the order on disk, but it must not be
>> relied on.
>>
>>
>> I would like to know if there is any way to change that to have a "real" random behaviour.
>>
>> My use case:
>> At Doctolib, we do a lot of automatic tests.
>> Sometimes, people forgot to order their queries. Most of the time, there is no real problem on production. Let say, we display a user list without order.
>> When a developer writes a test for this feature, he will create 2 users A and B, then assert that they are displayed "[A, B]".
>> 99% of the time the test will be ok, but sometimes, the displayed list will be "[B,A]", and the test will fail.
>>
>> One solution could be to ensure random order with an even distribution, so that such failing test would be detected quicker.
>>
>> Is that possible? Maybe with a plugin?
>
> Not that I know of.
>
> A possible solution given below:
>
> test_(postgres)> insert into t1 values (1, 'dog'), (2, 'cat'), (3, 'fish');
> INSERT 0 3
>
> test_(postgres)> select * from t1 ;
> a | b
> ---+------
> 1 | dog
> 2 | cat
> 3 | fish
> (3 rows)
>
> test_(postgres)> update t1 set b = 'dogfish' where a =1;
> UPDATE 1
>
> test_(postgres)> select * from t1 ;
> a | b
> ---+---------
> 2 | cat
> 3 | fish
> 1 | dogfish
> (3 rows)
>
> An UPDATE reorders the rows. Maybe throw an UPDATE into the test after creating the users to force an 'out of order' result?
An UPDATE without changing any values should have the same effect, e.g. :
UPDATE t1 SET b = b WHERE a = 1;
Something like this
WITH x AS (SELECT * FROM t1 ORDER BY a DESC)
UPDATE t1 t
SET a = t.a
FROM x
WHERE t.a = x.a
would shuffle the rows into reverse order, which might be enough to catch
out any missing ORDER BY (this assumes nothing else will touch the table
and reorder it before the test is run).
You could also order by RANDOM() but there would be a chance the rows would
end up in sequential order.
Regards
Ian Barwick
--
Ian Barwick https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 7/24/19 1:45 AM, Cyril Champier wrote: > Thanks for your answers. > Unfortunately the update trick only seems to work under certain conditions. > > I do this to shuffle my patients table: > UPDATE "patients" > SET "updated_at" = NOW() > WHERE "patients"."id" = (SELECT "patients"."id" FROM "patients" ORDER BY > random() LIMIT 1) > > Then indeed, this query returns different order: > SELECT * > FROM "patients" > > But this one (because it use an index?) always returns values in the > same order: > SELECT "id" > FROM "patients" Hmm, I don't see that: test=# \d t1 Table "public.t1" Column | Type | Collation | Nullable | Default --------+-------------------+-----------+----------+--------- a | integer | | not null | b | character varying | | | Indexes: "t1_pkey" PRIMARY KEY, btree (a) test=# select * from t1; a | b ---+--------- 2 | cat 3 | fish 1 | dogfish (3 rows) test=# select a from t1; a --- 2 3 1 (3 rows) Are you sure there is nothing going on between the first and second queries e.g. ROLLBACK? > > > > And for the other suggestion, I cannot blindly add 'ORDER BY random()' > to every select, > because of the incompatibility with distinct and union, and the way we > use our orm. > Are you talking about the production or test queries above? -- Adrian Klaver adrian.klaver@aklaver.com
Indeed, you are right, I do my test in pure sql and via ruby ActiveRecord, and I must had been confused,
the behaviour is correct in sql, it must have been a cache thing in ActiveRecord that prevented the reordering.
But meanwhile, I tested on our whole CI, and it took twice the normal time with updates to shuffle DB :(
For the union, I speak about production code like this:
"select count(*) from (#{directory_doctors_query_sql} union all #{profiles_query_sql}) as doctors"
In the to_sql, we cannot detect that we will be injected into a union.
So I cannot blindly add the random in the to_sql method.
On Wed, Jul 24, 2019 at 4:48 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 7/24/19 1:45 AM, Cyril Champier wrote:
> Thanks for your answers.
> Unfortunately the update trick only seems to work under certain conditions.
>
> I do this to shuffle my patients table:
> UPDATE "patients"
> SET "updated_at" = NOW()
> WHERE "patients"."id" = (SELECT "patients"."id" FROM "patients" ORDER BY
> random() LIMIT 1)
>
> Then indeed, this query returns different order:
> SELECT *
> FROM "patients"
>
> But this one (because it use an index?) always returns values in the
> same order:
> SELECT "id"
> FROM "patients"
Hmm, I don't see that:
test=# \d t1
Table "public.t1"
Column | Type | Collation | Nullable | Default
--------+-------------------+-----------+----------+---------
a | integer | | not null |
b | character varying | | |
Indexes:
"t1_pkey" PRIMARY KEY, btree (a)
test=# select * from t1;
a | b
---+---------
2 | cat
3 | fish
1 | dogfish
(3 rows)
test=# select a from t1;
a
---
2
3
1
(3 rows)
Are you sure there is nothing going on between the first and second
queries e.g. ROLLBACK?
>
>
>
> And for the other suggestion, I cannot blindly add 'ORDER BY random()'
> to every select,
> because of the incompatibility with distinct and union, and the way we
> use our orm.
>
Are you talking about the production or test queries above?
--
Adrian Klaver
adrian.klaver@aklaver.com
On 7/24/19 8:22 AM, Cyril Champier wrote: > Indeed, you are right, I do my test in pure sql and via ruby > ActiveRecord, and I must had been confused, > the behaviour is correct in sql, it must have been a cache thing in > ActiveRecord that prevented the reordering. > But meanwhile, I tested on our whole CI, and it took twice the normal > time with updates to shuffle DB :( Well there extra operations so that is to be expected. The question is whether everything needs to be shuffled? It would seem you only need to do that for those tests that are expecting a set order. I went back and read your original post and am somewhat confused about that test. You said: "Sometimes, people forgot to order their queries. Most of the time, there is no real problem on production...." If order is not an issue in the production code why test for it in the test code? > > For the union, I speak about production code like this: > "select count(*) from (#{directory_doctors_query_sql} union all > #{profiles_query_sql}) as doctors" > In the to_sql, we cannot detect that we will be injected into a union. > So I cannot blindly add the random in the to_sql method. The below would not be affected by an order by in any case as the count() would be the same: "select count(*) from (#{directory_doctors_query_sql} union all #{profiles_query_sql}) as doctors" If you did want to use order by random() could you not just tack it on the end?: "... as doctors order by random()" > > > > > > > -- Adrian Klaver adrian.klaver@aklaver.com
On 2019-07-23 17:43, Cyril Champier wrote: > In this documentation > <https://www.postgresql.org/docs/9.1/queries-order.html>, it is said: > > If sorting is not chosen, the rows will be returned in an > unspecified order. The actual order in that case will depend on the > scan and join plan types and the order on disk, but it must not be > relied on. > > > I would like to know if there is any way to change that to have a "real" > random behaviour. It might be an interesting exercise to implement this as a post-parsing hook. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Adrian:
If order is not an issue in the production code why test for it in the
test code?
In many cases, it would not be a problem in tests if we had an unordered array comparison helper.
But in other cases, it is a production issue.
In ruby ActiveRecord for exemple, you can do `Patient.find_by(last_name: 'champier')`,
which translates to `SELECT "patients".* FROM "patients" WHERE "patients"."last_name" = 'champier' LIMIT 1`.
If last_name is not unique, the returned record will be random.
So yes, everything as to be randomized, because the sources are multiples and the consequences can vary to a dramatic production bug, a failed CI 1% of the time, or to a useless test assertion.
Peter:
It might be an interesting exercise to implement this as a post-parsing
hook.
I known nothing about that, but that sounds interesting, do you have any documentation pointer to help me implement that?
On Wed, Jul 24, 2019 at 10:36 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 2019-07-23 17:43, Cyril Champier wrote:
> In this documentation
> <https://www.postgresql.org/docs/9.1/queries-order.html>, it is said:
>
> If sorting is not chosen, the rows will be returned in an
> unspecified order. The actual order in that case will depend on the
> scan and join plan types and the order on disk, but it must not be
> relied on.
>
>
> I would like to know if there is any way to change that to have a "real"
> random behaviour.
It might be an interesting exercise to implement this as a post-parsing
hook.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2019-07-25 09:43, Cyril Champier wrote: > It might be an interesting exercise to implement this as a post-parsing > hook. > > > I known nothing about that, but that sounds interesting, do you have any > documentation pointer to help me implement that? Look for post_parse_analyze_hook. Walk the parsed query tree, look for queries without ordering clause and manufacture one. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Peter:
So I would need to create a pg extension encapsulating this hook callback?
If this is the case, it seems it will be much more complicated than expected, and I wont be able to do it :(
But thanks for the suggestion anyway.
On Thu, Jul 25, 2019 at 12:21 PM Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
On 2019-07-25 09:43, Cyril Champier wrote:
> It might be an interesting exercise to implement this as a post-parsing
> hook.
>
>
> I known nothing about that, but that sounds interesting, do you have any
> documentation pointer to help me implement that?
Look for post_parse_analyze_hook. Walk the parsed query tree, look for
queries without ordering clause and manufacture one.
--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 7/25/19 12:43 AM, Cyril Champier wrote: > *Adrian*: > > If order is not an issue in the production code why test for it in the > test code? > > > In many cases, it would not be a problem in tests if we had an unordered > array comparison helper. > But in other cases, it is a production issue. > In ruby ActiveRecord for exemple, you can do `Patient.find_by(last_name: > 'champier')`, > which translates to `SELECT "patients".* FROM "patients" WHERE > "patients"."last_name" = 'champier' LIMIT 1`. > If last_name is not unique, the returned record will be random. Are you really looking for a pseudo-random name? If so would not(warning not a Ruby developer, so below is tentative): Patient.where(["last_name = :last_name", {last_name: "champier"}]).order('RANDOM()').first work better? If not why not use something that returns all possible matches? > > So yes, everything as to be randomized, because the sources are > multiples and the consequences can vary to a dramatic production bug, a > failed CI 1% of the time, or to a useless test assertion. One way I can think of doing this is write a script that walks through your tables in the test db and does an UPDATE across the rows. It is going to add time to your tests, but then I believe that is going to be the case for anything you do. Or you could look at something I have never tried, fuzzy testing. As a starting point: https://www.guru99.com/fuzz-testing.html Maybe other folks have suggestions on tools you could use for fuzzy testing. > > > *Peter*: > > It might be an interesting exercise to implement this as a post-parsing > hook. > > I known nothing about that, but that sounds interesting, do you have any > documentation pointer to help me implement that? > > > > On Wed, Jul 24, 2019 at 10:36 PM Peter Eisentraut > <peter.eisentraut@2ndquadrant.com > <mailto:peter.eisentraut@2ndquadrant.com>> wrote: > > On 2019-07-23 17:43, Cyril Champier wrote: > > In this documentation > > <https://www.postgresql.org/docs/9.1/queries-order.html>, it is said: > > > > If sorting is not chosen, the rows will be returned in an > > unspecified order. The actual order in that case will depend > on the > > scan and join plan types and the order on disk, but it must > not be > > relied on. > > > > > > I would like to know if there is any way to change that to have a > "real" > > random behaviour. > > It might be an interesting exercise to implement this as a post-parsing > hook. > > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services > -- Adrian Klaver adrian.klaver@aklaver.com
Adrian:
Are you really looking for a pseudo-random name?
No, the code I pasted was an existing production bug: the last_name should have been unique, so the selected patient would always be the same.
This should have been detected in tests, but since the order was "almost always the same", our test was green 99% of the time, so we discarded it as flaky.
Fuzzy testing could be an option, but this would go too far, as for Peter extension suggestion.
We have huge existing codebase with more than 10K tests, and I do not want to modify our whole testing strategy.
Meanwhile, I went for an ORM patch (ActiveRecord) and forbid usages that can workaround it.
If you are interested, here is a gist:
I wish there was a simple flag to activate in PostgreSQL to do that!
Thanks for your ideas!
On Thu, Jul 25, 2019 at 4:55 PM Adrian Klaver <adrian.klaver@aklaver.com> wrote:
On 7/25/19 12:43 AM, Cyril Champier wrote:
> *Adrian*:
>
> If order is not an issue in the production code why test for it in the
> test code?
>
>
> In many cases, it would not be a problem in tests if we had an unordered
> array comparison helper.
> But in other cases, it is a production issue.
> In ruby ActiveRecord for exemple, you can do `Patient.find_by(last_name:
> 'champier')`,
> which translates to `SELECT "patients".* FROM "patients" WHERE
> "patients"."last_name" = 'champier' LIMIT 1`.
> If last_name is not unique, the returned record will be random.
Are you really looking for a pseudo-random name?
If so would not(warning not a Ruby developer, so below is tentative):
Patient.where(["last_name = :last_name", {last_name:
"champier"}]).order('RANDOM()').first
work better?
If not why not use something that returns all possible matches?
>
> So yes, everything as to be randomized, because the sources are
> multiples and the consequences can vary to a dramatic production bug, a
> failed CI 1% of the time, or to a useless test assertion.
One way I can think of doing this is write a script that walks through
your tables in the test db and does an UPDATE across the rows. It is
going to add time to your tests, but then I believe that is going to be
the case for anything you do. Or you could look at something I have
never tried, fuzzy testing. As a starting point:
https://www.guru99.com/fuzz-testing.html
Maybe other folks have suggestions on tools you could use for fuzzy
testing.
>
>
> *Peter*:
>
> It might be an interesting exercise to implement this as a post-parsing
> hook.
>
> I known nothing about that, but that sounds interesting, do you have any
> documentation pointer to help me implement that?
>
>
>
> On Wed, Jul 24, 2019 at 10:36 PM Peter Eisentraut
> <peter.eisentraut@2ndquadrant.com
> <mailto:peter.eisentraut@2ndquadrant.com>> wrote:
>
> On 2019-07-23 17:43, Cyril Champier wrote:
> > In this documentation
> > <https://www.postgresql.org/docs/9.1/queries-order.html>, it is said:
> >
> > If sorting is not chosen, the rows will be returned in an
> > unspecified order. The actual order in that case will depend
> on the
> > scan and join plan types and the order on disk, but it must
> not be
> > relied on.
> >
> >
> > I would like to know if there is any way to change that to have a
> "real"
> > random behaviour.
>
> It might be an interesting exercise to implement this as a post-parsing
> hook.
>
> --
> Peter Eisentraut http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
>
--
Adrian Klaver
adrian.klaver@aklaver.com
On Fri, Jul 26, 2019 at 9:53 AM Cyril Champier <cyril.champier@doctolib.com> wrote: > > Adrian: > >> Are you really looking for a pseudo-random name? > > > No, the code I pasted was an existing production bug: the last_name should have been unique, so the selected patient wouldalways be the same. > This should have been detected in tests, but since the order was "almost always the same", our test was green 99% of thetime, so we discarded it as flaky. If the filter should return at most 1 row, why put a LIMIT in the first place? Even with a forced random() you won't get a failure every time, while asserting there's at most 1 row returned is guaranteed to fail?
Julien,
Because it's production code generated by our ORM for this command: `Patient.find_by(last_name: 'champier')`.
Of course this was not intended by the developer that though the last_name was unique.
On Fri, Jul 26, 2019 at 10:10 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
On Fri, Jul 26, 2019 at 9:53 AM Cyril Champier
<cyril.champier@doctolib.com> wrote:
>
> Adrian:
>
>> Are you really looking for a pseudo-random name?
>
>
> No, the code I pasted was an existing production bug: the last_name should have been unique, so the selected patient would always be the same.
> This should have been detected in tests, but since the order was "almost always the same", our test was green 99% of the time, so we discarded it as flaky.
If the filter should return at most 1 row, why put a LIMIT in the
first place? Even with a forced random() you won't get a failure
every time, while asserting there's at most 1 row returned is
guaranteed to fail?
On 2019-07-26 09:52:52 +0200, Cyril Champier wrote: > No, the code I pasted was an existing production bug: the last_name > should have been unique, so the selected patient would always be the > same. This should have been detected in tests, but since the order was > "almost always the same", our test was green 99% of the time, so we > discarded it as flaky. > > Fuzzy testing could be an option, but this would go too far, as for > Peter extension suggestion. We have huge existing codebase with more > than 10K tests, and I do not want to modify our whole testing > strategy. > > Meanwhile, I went for an ORM patch (ActiveRecord) and forbid usages > that can workaround it. Another idea: How do ypu prepare your test data? Do you have a (possibly large) test database or do you populate a test database with test-specific data in a fixture? If you do the latter, you might be able insert the data in random order. hp -- _ | Peter J. Holzer | we build much bigger, better disasters now |_|_) | | because we have much more sophisticated | | | hjp@hjp.at | management tools. __/ | http://www.hjp.at/ | -- Ross Anderson <https://www.edge.org/>