Thread: [Patch] Data type cache to speed up query tool

[Patch] Data type cache to speed up query tool

From: aiht
Hi pgadmin hackers,

I have been having trouble with the speed of the pgAdmin query tool on
remote links, and I'm sure I'm not the only one.
The delay is due to loading the data type information for every column:
 > QUERY  : Scalar query (localhost:50119): SELECT format_type(oid,-1) as typname FROM pg_type WHERE oid = 1114
 > QUERY  : Scalar query (localhost:50119): SELECT CASE WHEN typbasetype=0 THEN oid else typbasetype END AS basetype FROM pg_type WHERE oid=1114

I have been playing around with a fix for quite a while, but I have
finally sat myself down and put it on github
(https://github.com/aiht/pgadmin3/commit/80a896e674bf68cc67aa3dc0dd425db9aa7372aa)
(patch attached, too).
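
To give the gist without opening the commit: the idea is a small per-connection cache keyed by type OID, so the two queries above are sent at most once per OID. Roughly like this (a standalone, simplified sketch using plain std:: types and a stand-in for the query call; the real code goes through the existing connection class and wxString):

#include <functional>
#include <string>
#include <unordered_map>

// Stand-in for "run a scalar query on this connection and return the result
// as a string"; the real code uses the existing connection object instead.
using ScalarQuery = std::function<std::string(const std::string &sql)>;

struct TypeInfo
{
    std::string typname;   // result of format_type(oid,-1)
    long        basetype;  // oid, or typbasetype for domains
};

// One cache per connection: each type OID is looked up on the server at most once.
class TypeCache
{
public:
    const TypeInfo &Get(long oid, const ScalarQuery &query)
    {
        auto it = cache.find(oid);
        if (it != cache.end())
            return it->second;  // cache hit: no round trip to the server

        TypeInfo info;
        info.typname = query(
            "SELECT format_type(oid,-1) AS typname FROM pg_type WHERE oid = "
            + std::to_string(oid));
        info.basetype = std::stol(query(
            "SELECT CASE WHEN typbasetype=0 THEN oid ELSE typbasetype END AS basetype"
            " FROM pg_type WHERE oid=" + std::to_string(oid)));

        return cache.emplace(oid, info).first->second;
    }

private:
    std::unordered_map<long, TypeInfo> cache;
};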

I'm still trying to make sure it does exactly what I expect, so the
debug logging is quite heavy, but I wanted to ask a few questions to see
if I am on the right track.
If this code is useful to the project, I'll trim the logging down so it
isn't in the way.

Code conventions:
* Does the pg prefix for a class name have a specific meaning and if so,
am I using it wrongly?
* I have copied the i prefix on some methods, but I don't really
understand what it's for; I was thinking perhaps 'internal'?

Further suggestions:
* This code supports pre-caching all types in one go, but currently only
loads them one by one. Maybe this should be an option.
* When a pgConn is Duplicate()d, its cache is copied for the new
connection to use, but as far as I can see this is only called when
doing File->New Window in the query tool. If more pgConns are
constructed by Duplicate() then caching may be more efficient.
* Quite a few other parts of the code do their own type-info loading; at
least a few of them could share in this cache. However, the other code
does not load types one by one so they do not have the same latency
issues as the query tool.

Possible problems:
* Threading? As far as I have seen, the code doesn't tend to use
locking. The query threads just avoid accessing each other's objects
while they are active. I have not added any locking, but each cache
object belongs to a single connection, and is only used from code that
was already using that connection, so I think it is okay.
* The cache is currently not flushed, because I was not sure when to
trigger it. The main object tree should probably flush its cache when
the view is refreshed, but I am not sure about what kind of changes the
server may make to existing types that require a flush.
* I have not tested this on Postgres variants (EDB etc.) or on old
versions. I have not tested the current version on Windows (although I
did test an old version of the same code long ago).

I have not yet patched the MSVS project file, but the only change
required should be to add the pgTypeCache.{h,cpp} files.


I hope it helps!
Harun



Re: [Patch] Data type cache to speed up query tool

From: Dave Page
Hi

On Sat, Nov 24, 2012 at 8:55 AM, aiht <aihtdikh@gmail.com> wrote:
> Hi pgadmin hackers,
>
> I have been having trouble with the speed of the pgAdmin query tool on
> remote links, and I'm sure I'm not the only one.
> The delay is due to loading the data type information for every column:
>> QUERY  : Scalar query (localhost:50119): SELECT format_type(oid,-1) as
>> typname FROM pg_type WHERE oid = 1114
>> QUERY  : Scalar query (localhost:50119): SELECT CASE WHEN typbasetype=0
>> THEN oid else typbasetype END AS basetype FROM pg_type WHERE oid=1114
>
> I have been playing around with a fix for quite a while, but I have finally
> sat myself down and put it on github
> (https://github.com/aiht/pgadmin3/commit/80a896e674bf68cc67aa3dc0dd425db9aa7372aa)
> (patch attached, too).

This is a nice idea, but can't we use/extend the pgDatatype class?
That's already intended to handle type caching, albeit not between
connections, if memory serves (I'm not sure that's a good idea anyway,
as one session may see a different set of types to another at any
given point in time).

> Code conventions:
> * Does the pg prefix for a class name have a specific meaning and if so, am
> I using it wrongly?

It's used to denote classes related to Postgres (any fork). We have
some edb* and some gp* classes as well, for things specific to
EnterpriseDB's Postgres Plus Advanced Server and Greenplum database.
The original thinking was that eventually we might also support some
other databases (e.g. odbc*, my* etc) to some extent.

> * I have copied the i prefix on some methods, but I don't really understand
> what it's for; I was thinking perhaps 'internal'?

Yes - it's used for setters that are supposed to be private. I
wouldn't be surprised if some of them are public now though (they
could be fixed if you're feeling bored).

> Further suggestions:
> * This code supports pre-caching all types in one go, but currently only
> loads them one by one. Maybe this should be an option.

It should probably pre-cache them all. That'll be a lot cheaper.
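
Something along these lines (an untested sketch with plain std:: containers, not the wx classes) would fill the whole map in one round trip, so the per-column lookups never have to touch the server at all:

#include <string>
#include <unordered_map>
#include <vector>

// One row per pg_type entry, as returned by a single query such as:
//   SELECT oid,
//          format_type(oid,-1) AS typname,
//          CASE WHEN typbasetype=0 THEN oid ELSE typbasetype END AS basetype
//     FROM pg_type;
struct TypeRow
{
    long        oid;
    std::string typname;
    long        basetype;
};

// Fill the per-connection cache in one round trip; afterwards the query tool's
// per-column type lookups are purely local.
void PreCacheAll(std::unordered_map<long, TypeRow> &cache,
                 const std::vector<TypeRow> &rows)
{
    for (const TypeRow &row : rows)
        cache[row.oid] = row;
}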

> * When a pgConn is Duplicate()d, its cache is copied for the new connection
> to use, but as far as I can see this is only called when doing File->New
> Window in the query tool. If more pgConns are constructed by Duplicate()
> then caching may be more efficient.

Not so sure about that, as mentioned above.

> * Quite a few other parts of the code do their own type-info loading; at
> least a few of them could share in this cache. However, the other code does
> not load types one by one so they do not have the same latency issues as the
> query tool.

Yeah, but it would give us a common interface, and overall would
improve efficiency.

> Possible problems:
> * Threading? As far as I have seen, the code doesn't tend to use locking.
> The query threads just avoid accessing each other's objects while they are
> active. I have not added any locking, but each cache object belongs to a
> single connection, and is only used from code that was already using that
> connection, so I think it is okay.

Yes, it should be.

> * The cache is currently not flushed, because I was not sure when to trigger
> it. The main object tree should probably flush its cache when the view is
> refreshed, but I am not sure about what kind of changes the server may make
> to existing types that require a flush.

Agreed on the tree. Not sure about the general case either though.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [Patch] Data type cache to speed up query tool

From: aiht
Hi Dave,

On 24/11/12 23:21, Dave Page wrote:
> This is a nice idea, but can't we use/extend the pgDatatype class?
> That's already intended to handle type caching, albeit not between
> connections, if memory serves (I'm not sure that's a good idea anyway,
> as one session may see a different set of types to another at any
> given point in time).
Good idea; I was being overly conservative, replacing only like for like.
I'm converting the cache to use this class so it is more generally
applicable. It's more efficient, too: there's no need to load every typmod
variant from the server.
I only had a few hours to work on it so far, so there's no updated patch
yet, but the WIP is on GitHub (aiht/pgadmin3) if you're interested.
>> * I have copied the i prefix on some methods, but I don't really understand
>> what it's for; I was thinking perhaps 'internal'?
> Yes - it's used for setters that are supposed to be private. I
> wouldn't be surprised if some of them are public now though (they
> could be fixed if you're feeling bored).
Thanks. I'm not quite that bored right now, sorry. Maybe later :)
> It should probably pre-cache them all. That'll be a lot cheaper.
Agreed. I have also added support for single-record loading in the
DatatypeReader so that new types (created after the cache was filled)
can still be loaded later.
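
In outline the miss path is just: if an OID isn't in the pre-filled map, fetch that one row with the same query restricted to that oid and remember it. A simplified, standalone sketch (not the actual DatatypeReader code):

#include <functional>
#include <string>
#include <unordered_map>

struct CachedType
{
    std::string typname;
    long        basetype;
};

// Stand-in for running the single-row variant of the pre-cache query
// (i.e. the same SELECT over pg_type with "WHERE oid = <oid>"), returning
// false if no such type exists.
using LoadOneType = std::function<bool(long oid, CachedType &out)>;

// Look up a type; on a miss (e.g. a type created after the cache was filled),
// load just that record and add it to the cache.
const CachedType *GetType(std::unordered_map<long, CachedType> &cache,
                          long oid, const LoadOneType &loadOne)
{
    auto it = cache.find(oid);
    if (it != cache.end())
        return &it->second;

    CachedType row;
    if (!loadOne(oid, row))
        return nullptr;          // unknown OID even on the server

    return &cache.emplace(oid, row).first->second;
}
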
>> * When a pgConn is Duplicate()d, its cache is copied for the new connection
>> to use, but as far as I can see this is only called when doing File->New
>> Window in the query tool. If more pgConns are constructed by Duplicate()
>> then caching may be more efficient.
> Not so sure about that, as mentioned above.
Yeah, now that I've looked closer it doesn't seem so easy to Duplicate()
other connections. It seems safe in the query tool because the new
window uses the same host/db/etc. But perhaps I am missing some points
about e.g. schema search paths?
>
>> * Quite a few other parts of the code do their own type-info loading; at
>> least a few of them could share in this cache. However, the other code does
>> not load types one by one so they do not have the same latency issues as the
>> query tool.
> Yeah, but it would give us a common interface, and overall would
> improve efficiency.
Agreed; still worthwhile.

Thanks,
Harun



Re: [Patch] Data type cache to speed up query tool

From: Dave Page
Hi

On Mon, Nov 26, 2012 at 1:12 AM, aiht <aihtdikh@gmail.com> wrote:
>
>
>>> * When a pgConn is Duplicate()d, its cache is copied for the new
>>> connection
>>> to use, but as far as I can see this is only called when doing File->New
>>> Window in the query tool. If more pgConns are constructed by Duplicate()
>>> then caching may be more efficient.
>>
>> Not so sure about that, as mentioned above.
>
> Yeah, now that I've looked closer it doesn't seem so easy to Duplicate()
> other connections. It seems safe in the query tool because the new window
> uses the same host/db/etc. But perhaps I am missing some points about e.g.
> schema search paths?

It should be safe - we're only duplicating the connection parameters
and then opening a new connection with them. We're not copying session
state or anything like that.

--
Dave Page
Blog: http://pgsnake.blogspot.com
Twitter: @pgsnake

EnterpriseDB UK: http://www.enterprisedb.com
The Enterprise PostgreSQL Company