Skip to content

Data Comparison

PondPilot’s data comparison feature helps you find differences between tables, queries, or any combination. Compare schema and data using multiple algorithms optimized for different scenarios.

  1. Right-click a table in the Data Explorer
  2. Select “Compare”
  3. Choose the second data source
  4. Configure comparison options
  1. Press Ctrl+K to open Spotlight
  2. Search for “Compare”
  3. Select “New Comparison”
  4. Choose both data sources

Compare two tables from your data sources:

Source A: sales_2023
Source B: sales_2024

Compare results of two different queries:

-- Source A
SELECT * FROM orders WHERE status = 'completed'
-- Source B
SELECT * FROM orders WHERE status = 'shipped'

Compare a table against a query result:

Source A: customers (table)
Source B: SELECT * FROM customers WHERE active = true (query)

Compare tables from different databases:

Source A: local_db.users
Source B: remote_db.users

Specify how to match rows between sources:

Single Key:

Join on: id

Composite Key:

Join on: customer_id, order_date

Key Mapping (when column names differ):

Source A: user_id → Source B: customer_id

Map columns with different names:

Source ASource B
full_namename
created_atcreation_date
amtamount

Exclude columns from comparison:

  • Timestamps that always differ
  • Auto-generated IDs
  • Audit columns

Apply filters to narrow comparison scope:

Common Filter (applies to both sources):

WHERE region = 'US' AND year = 2024

Separate Filters (different filter per source):

-- Source A
WHERE status = 'active'
-- Source B
WHERE is_active = true
ModeDescription
StrictExact value matching (type-sensitive)
CoerceType conversion before comparison

Use Coerce when comparing:

  • String “123” vs Integer 123
  • Date strings vs Date objects
  • Decimal vs Float

PondPilot offers multiple algorithms optimized for different scenarios:

Let PondPilot choose the best algorithm based on:

  • Dataset size
  • Available memory
  • Key distribution

Best for large datasets with memory constraints.

How it works:

  1. Hashes rows into buckets
  2. Compares buckets incrementally
  3. Streams results without loading all data

Best when:

  • Datasets exceed available memory
  • You need streaming results
  • Memory efficiency is critical

Best for moderate datasets with simple comparisons.

How it works:

  1. Performs a full outer join on keys
  2. Compares matched rows column by column
  3. Identifies unmatched rows

Best when:

  • Datasets fit in memory
  • You need complete comparison
  • Key relationships are clear

Best for quick validation of large datasets.

How it works:

  1. Takes a random sample from each source
  2. Compares the samples
  3. Extrapolates differences

Best when:

  • You need a quick estimate
  • Full comparison is too slow
  • Approximate results are acceptable
MetricDescription
Rows in A onlyRecords that exist only in Source A
Rows in B onlyRecords that exist only in Source B
Matching rowsRecords identical in both sources
Different rowsRecords with same key but different values
Total differencesSum of all differences

Before comparing data, PondPilot analyzes schemas:

FindingDescription
✓ Column matchColumn exists in both with same type
⚠️ Type mismatchColumn exists in both with different types
✗ Missing in AColumn only in Source B
✗ Missing in BColumn only in Source A

Drill into specific differences:

Row ID: 1234
Column: status
Source A: "pending"
Source B: "completed"

Large comparisons show real-time progress:

StageDescription
CountingCounting rows in each source
SplittingDividing data into buckets
InsertingProcessing comparison buckets
FinalizingGenerating final results

Progress includes:

  • Processed row count
  • Difference count so far
  • Bucket completion (for hash algorithm)
  • Estimated completion

Comparison results are stored in PondPilot’s system database:

-- Query stored comparison results
SELECT * FROM system.comparison_results
WHERE comparison_id = 'abc123';

Results persist across sessions until manually cleared.

  1. Select Sources

    • Choose Source A (table or query)
    • Choose Source B (table or query)
  2. Configure Join Keys

    • Select columns that uniquely identify rows
    • Map keys if names differ
  3. Map Columns

    • Match columns between sources
    • Exclude unnecessary columns
  4. Apply Filters (optional)

    • Filter to specific data subsets
    • Use common or separate filters
  5. Choose Algorithm

    • Use Auto-Select for best results
    • Or manually select based on needs
  6. Execute Comparison

    • Monitor progress
    • Review results
  7. Analyze Results

    • Check summary statistics
    • Drill into differences
    • Export findings
  • Index your join keys - Faster matching
  • Filter early - Reduce data volume before comparison
  • Use sampling first - Validate approach on small sample
  • Verify join keys - Wrong keys cause false differences
  • Check data types - Use Coerce mode for type mismatches
  • Exclude noise - Filter out timestamps and audit columns
  • Use Hash-Bucket - Memory efficient for millions of rows
  • Filter aggressively - Compare subsets when possible
  • Monitor progress - Cancel if taking too long
  • Verify join keys are correct
  • Check for data type mismatches
  • Ensure keys exist in both sources
  • Switch to Hash-Bucket algorithm
  • Apply filters to reduce dataset size
  • Compare in smaller batches
  • Use Sampling for quick estimate
  • Add filters to reduce scope
  • Check if join keys are indexed
  • Check column mappings are correct
  • Verify compare mode (Strict vs Coerce)
  • Look for trailing spaces or case differences