Re: [patch] Proposal for \crosstabview in psql - Mailing list pgsql-hackers

From Daniel Verite
Subject Re: [patch] Proposal for \crosstabview in psql
Date
Msg-id d4186a6f-4e25-4c62-bd65-4cf3929c21c6@mm
Whole thread Raw
In response to Re: [patch] Proposal for \rotate in psql  ("Daniel Verite" <daniel@manitou-mail.org>)
Responses Re: [patch] Proposal for \crosstabview in psql
Re: [patch] Proposal for \crosstabview in psql
List pgsql-hackers
   Hi,

Here's an updated patch that replaces sorted arrays by AVL binary trees
when gathering distinct values for the columns involved in the pivot.
The change is essential for large resultsets. For instance,
it allows to process a query like this (10 million rows x 10 columns):

select x,(random()*10)::int, (random()*1000)::int from
  generate_series(1,10000000) as x
 \crosstabview

which takes about 30 seconds to run and display on my machine with the
attached patch. That puts it seemingly in the same ballpark than
the equivalent test with the server-side crosstab().

With the previous iterations of the patch, this test would never end,
even with much smaller sets, as the execution time of the 1st step
grew exponentially with the number of distinct keys.
The exponential effect starts to be felt at about 10k values on my low-end
CPU,
and from there quickly becomes problematic.

As a client-side display feature, processing millions of rows like in
the query above does not necessarily make sense, it's pushing the
envelope, but stalling way below 100k rows felt lame, so I'm happy to get
rid of that limitation.

However, there is another one. The above example does not need or request
an additional sort step, but if it did, sorting more than 65535 entries in
the vertical header would error out, because values are shipped as
parameters to PQexecParams(), which only accepts that much.
To avoid the problem, when the rows in the output "grid" exceed 2^16 and
they need to be sorted, the user must  let the sort being driven by ORDER BY
beforehand in the query, knowing that the pivot will keep the original
ordering intact in the vertical header.

I'm still thinking about extending this based on Pavel's diff for the
"label" column, so that
 \crosstabview [+|-]colV[:colSortH] [+|-]colH[:colSortH]
would mean to use colV/H as grid headers but sort them according
to colSortV/H.
I prefer that syntax over adding more parameters, and also I'd like
to have it work in both V and H directions.

Aside from the AVL trees, there are a few other minor changes in that
patch:
- move non-exportable structs from the .h to the .c
- move code in common.c to respect alphabetical ordering
- if vertical sort is requested, add explicit check against more than 65535
  params instead of letting the sort query fail
- report all failure cases of the sort query
- rename sortColumns to serverSort and use less the term "columns" in
  comments and variables.


Best regards,
--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite

Attachment

pgsql-hackers by date:

Previous
From: Jeff Janes
Date:
Subject: Re: GIN data corruption bug(s) in 9.6devel
Next
From: Jeff Janes
Date:
Subject: Re: Using quicksort for every external sort run