21.6. Sampling Functions #

TABLESAMPLE (sampling_method(percentage | rows)) #

Samples a subset of rows from an analytical table or query result. This is useful for analyzing large datasets by working with representative samples, improving query performance for exploratory data analysis.

Example 21.20. 

-- Sample 10% of rows from a table
SELECT * FROM large_table TABLESAMPLE SYSTEM(10);

-- Sample approximately 1000 rows
SELECT * FROM events TABLESAMPLE SYSTEM(1000 ROWS);

-- Sample from data lake files
SELECT * FROM read_parquet('s3://datalake/**/*.parquet') TABLESAMPLE SYSTEM(5);

-- Use sampling for quick data profiling
SELECT
    region,
    COUNT(*) as sample_count,
    AVG(revenue) as avg_revenue
FROM sales_data TABLESAMPLE SYSTEM(2)
GROUP BY region;

-- Sample from joins for performance
SELECT c.name, COUNT(o.id) as order_count
FROM customers c
JOIN orders o TABLESAMPLE SYSTEM(10) ON c.id = o.customer_id
GROUP BY c.name;

Sampling methods:

  • SYSTEM: Random sampling at the storage level (faster, approximate percentage).

  • BERNOULLI: Row-by-row random sampling (slower, exact percentage).

Example 21.21. 

-- System sampling (recommended for large tables)
SELECT * FROM huge_table TABLESAMPLE SYSTEM(1);

-- Bernoulli sampling (exact percentage)
SELECT * FROM medium_table TABLESAMPLE BERNOULLI(5);

Use cases:

  • Data exploration: Quick analysis of large datasets.

  • Performance testing: Test queries on sample data.

  • Data profiling: Understand data distribution patterns.

  • ETL operations development: Develop pipelines on sample data.

  • Quality checks: Validate data quality on samples.

Required parameters:

Name

Type

Description

sampling_method

keyword

SYSTEM or BERNOULLI

percentage

numeric

Percentage of rows to sample (0-100)

Optional parameters:

Name

Type

Description

rows

integer

Approximate number of rows to sample (use with ROWS keyword)