pgpro_scout

pgpro_scout — detect sensitive information in a database

Synopsis

pgpro_scout config [option...]

pgpro_scout inspect [option...]

Description #

pgpro_scout is a command-line utility to detect sensitive (confidential) information in a database. pgpro_scout looks for sensitive information in database column names and field values. The search requires a specialized dictionary that the utility allows you to create.

pgpro_scout does not aim to find all the sensitive information, but only performs data intelligence to determine whether such information exists and show sensitive data samples.

pgpro_scout is provided with Postgres Pro Standard as a separate pre-built package pgpro_scout (for the detailed installation instructions, see Chapter 16).

The dictionary is a text file whose lines have the key-value-comment format. Where:

  • key is a string with the conventional name of a search parameter.

  • value defines a condition that a database field must meet to be considered sensitive. Can be a regular expression or an array of string values.

  • comment is a text description of the search condition to be visible in search results.

pgpro_scout generates a report file with information on sample occurrences of the sensitive information found. Refer to the section called “pgpro_scout Report” for details.

pgpro_scout Usage #

To perform the database inspection for occurrences of sensitive information, you need to create the dictionary and launch the search passing a list of database objects to look through.

Creating the pgpro_scout Dictionary #

To create a base dictionary, run the config command. You can update this dictionary and extend it with new search conditions if needed. Here is an example of the dictionary:

scout:
  column_names:
    matchers:
      - key: phones
        match_values: ["phones", "phone"]
        comment: "Phone number fields"
  column_values:
    disabled: true # optional
    case_sensitive: false # optional
    matchers:
      - key: surnames
        disabled: false # optional
        case_sensitive: true # optional
        match_values: ["Holmes", "Watson"]
      - key: phones_data
        pattern: ^((8|\+7)[\- ]?)?(\(?\d{3}\)?[\- ]?)?[\d\- ]{7,10}$
        comment: "Phone numbers"
exclude:
  schemas: ['pgpro_sfile_data', 'dbms_lob', 'profile', 'information_schema', 'pg_catalog', 'pg_toast']
  column_types: ['timestamp with time zone']

Dictionary Structure #

The dictionary consists of the following sections:

  • scout — contains the description of search conditions.

  • exclude — lists database schemas and field types to be excluded from the search.

These types of search conditions can be specified:

  • column_names — field names.

  • column_values — field values.

A matchers element contains a list of search conditions. Each matchers element may contain the following parameters:

  • key — the search key.

  • match_values — the list of strings to look for in the database fields. Do not specify together with pattern.

  • pattern — a regular expression for search in the database fields. Do not specify together with match_values.

  • case_sensitive — a boolean value to define whether the search is case-sensitive if match_values is specified.

  • disabled — a boolean value to define whether to exclude the condition from the search.

  • comment — the comment to be shown in the report file.

Specifying Database Objects to Look through #

To specify the objects to look through, use the --search-path option when launching the search. --search-path takes a string that contains comma-separated substrings with database objects to restrict the search to:

  • To restrict the search to a database schema, specify a substring with the schema name.

  • To restrict the search to an individual table or tables, specify a substring like this: SCHEMA.TABLE or SCHEMA.{TABLE1, TABLE2}.

  • To restrict the search to an individual column or columns of a certain table, specify a substring like this: SCHEMA.TABLE.COLUMN or SCHEMA.TABLE.{COLUMN1, COLUMN2}.

Table I.1. Examples of Specifying Search Objects

--search-path ValueSearch Range
publicAll tables (with all columns) in the public schema.
public.{products, sales}products and sales tables (with all columns) in the public schema.
public.products.{id, email}id and email columns of the products table in the public schema.
public.productsOnly the public.products table.
public.products,events,holds.card.idThe public.products table, all tables in the events schema, and the id column of the holds.card table.

Launching the Search of Sensitive Information #

To launch the search of sensitive information with the available dictionary, run the inspect command specifying --search path as explained in the section called “Specifying Database Objects to Look through”.

Command-Line Reference #

This section describes pgpro_scout commands. Optional parameters are enclosed in square brackets.

config #

pgpro_scout config [--out|-o string] [--override|-O] [--help|-h]

Generates a base pgpro_scout dictionary. The meaning of the options is as follows:

--h
--help

Output the command reference information.

-O
--override

Override the dictionary file if it exists.

-o filepath
--out filepath

Path to the dictionary file.

inspect #

pgpro_scout inspect [--batch-size|-b uint] [--config-file|-f string]
[--dbname|-d string] [--disable-comment] [--hidden-sample] [--host|-H string]
[--order|-O string] [--output|-o string] [--password|-W string]
[--port|-p uint16] [--report-buffer uint16] [--rows-number|-n uint16]
[--sample-size uint16] [--search-path|-s string]
[--threads|-j int] [--username|-U string] [--help|-h]

Launches the search of sensitive information. The meaning of the options is as follows:

-b size
--batch-size size #

Number of database rows to be processed by one query for each task.

Default: 1000.

-f filepath
--config-file filepath #

Path to the dictionary file.

--d name
--dbname name #

Database name to connect to.

--disable-comment #

Create the report file without comments.

--h string
--help string #

Output the command reference information.

--hidden-sample #

Mask found samples in the report. For example: Br**te will be output instead of Bronte.

--H name
--host name #

Database host name or IP address.

Default: localhost.

-O search_order
--order search_order #

The order of looking through rows. Possible values:

  • tail — get last rows

  • head — get first rows

  • random — get random rows

Default: head.

-o filepath
--output filepath #

Path to the report file.

--W string
--password string #

Password to connect to the database.

--p name
--port name #

Database port.

Default: 5432.

--report-buffer size #

Size of the report buffer.

Default: 64

-n number
--rows-number number #

Number of rows to look through in each table.

Default: 10000.

--sample-size number #

Number of found samples to be included in the report for each table. If set to 0, samples of found rows will not be included in the report.

Default: 5.

-s filepath
--search-path filepath #

List of schemas, tables, and fields to look through. See the section called “Specifying Database Objects to Look through” for details.

Default: 5.

-j filepath
--threads filepath #

Number of threads used to run the utility.

Default: 14.

--U name
--username name #

Username to connect to the database.

pgpro_scout Report #

A csv file is created as a result of pgpro_scout search. This file provides the following information:

  • Date and time of the search completion.

  • Report field headings.

  • Search results.

Report Structure #

The search results contain the following fields:

Table I.2. Report Fields

NameDescription
locationFull name of a table, that is <schema name>.<table_name>, with the occurrence found.
columnColumn of the table.
keyKey of the search condition from the dictionary.
commentComment of the search condition from the dictionary.
sampleSample occurrence found. If the search is run with --sample-size = 0, this field is not included in the report. To mask the contents of this field, enable --hidden-sample when launching the search.

The report will contain not more than --sample-size lines for each table.

Example #

The following is an example of the report file:

#report time: 2025-06-16T11:10:58+03:00
location,column,key,comment, sample
public.expected,phone,phones,Phone numbers,
events.holds,user_surname,surname,Last names, Holmes
events.holds,user_surname,surname,Last names, Watson