Data Comparison
PondPilot’s data comparison feature helps you find differences between two tables, two queries, or a table and a query. It compares both schema and data, using algorithms optimized for different scenarios.
Starting a Comparison
Method 1: From Data Explorer
- Right-click a table in the Data Explorer
- Select “Compare”
- Choose the second data source
- Configure comparison options
Method 2: From Spotlight
- Press Ctrl+K to open Spotlight
- Search for “Compare”
- Select “New Comparison”
- Choose both data sources
Comparison Types
Table vs Table
Compare two tables from your data sources:
Source A: sales_2023
Source B: sales_2024
Query vs Query
Compare results of two different queries:
-- Source A
SELECT * FROM orders WHERE status = 'completed'
-- Source B
SELECT * FROM orders WHERE status = 'shipped'
Table vs Query
Compare a table against a query result:
Source A: customers (table)
Source B: SELECT * FROM customers WHERE active = true (query)
Cross-Database
Compare tables from different databases:
Source A: local_db.users
Source B: remote_db.users
Configuration Options
Join Keys
Specify how to match rows between sources:
Single Key:
Join on: id
Composite Key:
Join on: customer_id, order_date
Key Mapping (when column names differ):
Source A: user_id → Source B: customer_id
Column Mapping
Map columns with different names:
| Source A | Source B |
|---|---|
| full_name | name |
| created_at | creation_date |
| amt | amount |
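Key and column mappings amount to renaming one source's columns so both sides use the same names before rows are matched. The following is a minimal Python sketch of that idea, assuming rows are plain dictionaries. The `rename_to_source_a` helper and the sample row are hypothetical and are not part of PondPilot.

```python
# Hypothetical illustration only; this is not PondPilot's implementation.
# Maps Source B column names onto Source A names so rows can be compared.
COLUMN_MAP_B_TO_A = {
    "customer_id": "user_id",        # key mapping
    "name": "full_name",             # column mappings
    "creation_date": "created_at",
    "amount": "amt",
}


def rename_to_source_a(row_b: dict, mapping: dict) -> dict:
    """Return a copy of a Source B row keyed by Source A column names."""
    # Columns without a mapping keep their original name.
    return {mapping.get(col, col): value for col, value in row_b.items()}


row_b = {"customer_id": 7, "name": "Ada", "creation_date": "2024-01-05", "amount": 19.5}
row_a_shaped = rename_to_source_a(row_b, COLUMN_MAP_B_TO_A)
```

After renaming, a Source B row can be looked up and compared using Source A's column names.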
Column Exclusion
Exclude columns from comparison:
- Timestamps that always differ
- Auto-generated IDs
- Audit columns
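Conceptually, an excluded column is dropped from both rows before their values are compared, so it can never produce a difference. A small Python sketch under that assumption follows; the column names and the `without_excluded` helper are hypothetical.

```python
# Hypothetical noisy columns: a timestamp, a generated ID, an audit field.
EXCLUDED = {"updated_at", "row_id", "modified_by"}


def without_excluded(row: dict, excluded: set) -> dict:
    """Drop excluded columns so they cannot produce differences."""
    return {col: value for col, value in row.items() if col not in excluded}


a = {"id": 1, "status": "active", "updated_at": "2024-03-01T10:00:00"}
b = {"id": 1, "status": "active", "updated_at": "2024-03-02T08:30:00"}

# The raw rows differ only in the timestamp; once it is excluded they match.
rows_match = without_excluded(a, EXCLUDED) == without_excluded(b, EXCLUDED)
```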
Filtering
Apply filters to narrow comparison scope:
Common Filter (applies to both sources):
WHERE region = 'US' AND year = 2024
Separate Filters (different filter per source):
-- Source A
WHERE status = 'active'
-- Source B
WHERE is_active = true
Compare Mode
| Mode | Description |
|---|---|
| Strict | Exact value matching (type-sensitive) |
| Coerce | Values are converted to a common type before comparison |
Use Coerce when comparing:
- String “123” vs Integer 123
- Date strings vs Date objects
- Decimal vs Float
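The difference between the two modes can be sketched in Python. This is only an illustration: PondPilot's actual coercion rules are not described here, so the normalization below (numbers to `Decimal`, ISO date strings to `date`) is an assumption, and `strict_equal`, `coerce_equal`, and `_normalize` are hypothetical names.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def strict_equal(a, b) -> bool:
    """Strict: values must have the same type and the same value."""
    return type(a) is type(b) and a == b


def _normalize(value):
    """Bring a value to a common comparable form (assumed rules, see lead-in)."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float, Decimal)):
        return Decimal(str(value))          # compare numbers by decimal value
    if isinstance(value, str):
        text = value.strip()
        try:
            return Decimal(text)            # "123" -> Decimal("123")
        except InvalidOperation:
            pass
        try:
            return date.fromisoformat(text)  # "2024-01-05" -> date
        except ValueError:
            return value                     # leave other strings untouched
    return value


def coerce_equal(a, b) -> bool:
    """Coerce: normalize both sides, then compare."""
    return _normalize(a) == _normalize(b)
```

With these definitions, `strict_equal("123", 123)` is false while `coerce_equal("123", 123)` is true, which mirrors the String vs Integer case listed above.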
Comparison Algorithms
PondPilot offers multiple algorithms optimized for different scenarios:
Auto-Select (Recommended)
Let PondPilot choose the best algorithm based on:
- Dataset size
- Available memory
- Key distribution
Hash-Bucket Algorithm
Best for large datasets with memory constraints.
How it works:
- Hashes rows into buckets
- Compares buckets incrementally
- Streams results without loading all data
Best when:
- Datasets exceed available memory
- You need streaming results
- Memory efficiency is critical
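The general hash-bucket idea can be sketched in Python as follows. This is not PondPilot's implementation: the function names, the CRC32 hash, and the bucket count are assumptions, and the sketch keeps buckets in memory for brevity, whereas a memory-constrained implementation would store each bucket on disk or in a temporary table and load one pair at a time.

```python
import zlib
from collections import defaultdict


def bucket_of(key, n_buckets: int) -> int:
    """Stable bucket index: the same key always lands in the same bucket."""
    return zlib.crc32(repr(key).encode("utf-8")) % n_buckets


def split_into_buckets(rows, key_col, n_buckets):
    """Group rows by bucket index, indexed by key inside each bucket."""
    buckets = defaultdict(dict)
    for row in rows:
        buckets[bucket_of(row[key_col], n_buckets)][row[key_col]] = row
    return buckets


def hash_bucket_diff(rows_a, rows_b, key_col, n_buckets=4):
    """Yield (key, status) for every non-matching key, one bucket at a time."""
    buckets_a = split_into_buckets(rows_a, key_col, n_buckets)
    buckets_b = split_into_buckets(rows_b, key_col, n_buckets)
    for i in range(n_buckets):
        a, b = buckets_a.get(i, {}), buckets_b.get(i, {})
        # Matching keys share a bucket, so each pair is compared in isolation.
        for key in a.keys() | b.keys():
            if key not in b:
                yield key, "only_in_a"
            elif key not in a:
                yield key, "only_in_b"
            elif a[key] != b[key]:
                yield key, "different"


rows_a = [{"id": 1, "v": "x"}, {"id": 2, "v": "y"}, {"id": 3, "v": "z"}]
rows_b = [{"id": 2, "v": "y"}, {"id": 3, "v": "changed"}, {"id": 4, "v": "w"}]
diffs = dict(hash_bucket_diff(rows_a, rows_b, "id"))
```

Because the generator yields results as each bucket finishes, a caller can display differences before the whole comparison is done.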
Join Algorithm
Best for moderate datasets with simple comparisons.
How it works:
- Performs a full outer join on keys
- Compares matched rows column by column
- Identifies unmatched rows
Best when:
- Datasets fit in memory
- You need complete comparison
- Key relationships are clear
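The three steps above can be sketched in Python with dictionaries standing in for tables. This is an illustrative sketch rather than PondPilot's implementation; `join_diff` is a hypothetical name, and the sketch assumes both sources have the same non-key columns.

```python
def join_diff(rows_a, rows_b, key_cols):
    """Classify every key as only_in_a, only_in_b, match, or different."""

    def index(rows):
        # Composite keys become tuples, e.g. (customer_id, order_date).
        return {tuple(r[k] for k in key_cols): r for r in rows}

    a, b = index(rows_a), index(rows_b)
    result = {}
    for key in a.keys() | b.keys():  # union of keys acts as a full outer join
        if key not in b:
            result[key] = ("only_in_a", [])
        elif key not in a:
            result[key] = ("only_in_b", [])
        else:
            # Compare matched rows column by column, skipping the key columns.
            changed = sorted(
                c for c in a[key]
                if c not in key_cols and a[key][c] != b[key].get(c)
            )
            result[key] = ("different", changed) if changed else ("match", [])
    return result


rows_a = [
    {"customer_id": 1, "order_date": "2024-01-01", "total": 10},
    {"customer_id": 2, "order_date": "2024-01-02", "total": 20},
    {"customer_id": 3, "order_date": "2024-01-03", "total": 30},
]
rows_b = [
    {"customer_id": 2, "order_date": "2024-01-02", "total": 20},
    {"customer_id": 3, "order_date": "2024-01-03", "total": 35},
    {"customer_id": 4, "order_date": "2024-01-04", "total": 40},
]
result = join_diff(rows_a, rows_b, ["customer_id", "order_date"])
```

Both sources are fully indexed before any comparison happens, which is why this approach suits datasets that fit in memory.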
Sampling Algorithm
Best for quick validation of large datasets.
How it works:
- Takes a random sample from each source
- Compares the samples
- Extrapolates differences
Best when:
- You need a quick estimate
- Full comparison is too slow
- Approximate results are acceptable
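One way to make samples from two sources comparable is to select keys by hash, so the same keys are picked on both sides. The Python sketch below takes that approach and scales the sampled difference count up to the full dataset. How PondPilot actually samples and extrapolates is not specified here, so the hash-based selection, the linear scaling, and the names `in_sample` and `estimate_differences` are all assumptions.

```python
import zlib


def in_sample(key, rate_percent: int) -> bool:
    """Deterministically keep roughly rate_percent% of keys in both sources."""
    return zlib.crc32(repr(key).encode("utf-8")) % 100 < rate_percent


def estimate_differences(rows_a, rows_b, key_col, rate_percent=10) -> float:
    """Estimate the total number of differing keys from a key-hash sample."""
    if not 1 <= rate_percent <= 100:
        raise ValueError("rate_percent must be between 1 and 100")
    a = {r[key_col]: r for r in rows_a if in_sample(r[key_col], rate_percent)}
    b = {r[key_col]: r for r in rows_b if in_sample(r[key_col], rate_percent)}
    # A key missing on one side compares as None and counts as a difference.
    sampled_diffs = sum(1 for k in a.keys() | b.keys() if a.get(k) != b.get(k))
    # Scale the sampled count up to the full dataset.
    return sampled_diffs * 100 / rate_percent


rows_a = [{"id": i, "v": i} for i in range(1000)]
rows_b = [{"id": i, "v": i if i % 10 else -1} for i in range(1000)]
estimate = estimate_differences(rows_a, rows_b, "id", rate_percent=10)
```

The estimate is approximate by design; with `rate_percent=100` every key is sampled and the result equals the exact difference count.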
Understanding Results
Summary Statistics
| Metric | Description |
|---|---|
| Rows in A only | Records that exist only in Source A |
| Rows in B only | Records that exist only in Source B |
| Matching rows | Records identical in both sources |
| Different rows | Records with the same key but different values |
| Total differences | Sum of all differences |
Schema Comparison
Before comparing data, PondPilot analyzes schemas:
| Finding | Description |
|---|---|
| ✓ Column match | Column exists in both sources with the same type |
| ⚠️ Type mismatch | Column exists in both sources with different types |
| ✗ Missing in A | Column only in Source B |
| ✗ Missing in B | Column only in Source A |
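The four findings in the table can be reproduced with a short Python sketch that compares two column-to-type mappings. The `compare_schemas` function, the finding labels, and the sample schemas are hypothetical and only illustrate the classification.

```python
def compare_schemas(schema_a: dict, schema_b: dict) -> dict:
    """Classify each column name found in either schema."""
    findings = {}
    for col in sorted(schema_a.keys() | schema_b.keys()):
        if col not in schema_a:
            findings[col] = "missing_in_a"    # column only in Source B
        elif col not in schema_b:
            findings[col] = "missing_in_b"    # column only in Source A
        elif schema_a[col] != schema_b[col]:
            findings[col] = "type_mismatch"
        else:
            findings[col] = "match"
    return findings


schema_a = {"id": "INTEGER", "amount": "DECIMAL(10,2)", "created_at": "TIMESTAMP"}
schema_b = {"id": "INTEGER", "amount": "DOUBLE", "notes": "VARCHAR"}
findings = compare_schemas(schema_a, schema_b)
```

A type mismatch such as the `amount` column here is the kind of case where Coerce mode may be worth considering.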
Difference Details
Drill into specific differences:
Row ID: 1234
Column: status
Source A: "pending"
Source B: "completed"
Progress Tracking
Large comparisons show real-time progress:
| Stage | Description |
|---|---|
| Counting | Counting rows in each source |
| Splitting | Dividing data into buckets |
| Inserting | Processing comparison buckets |
| Finalizing | Generating final results |
Progress includes:
- Processed row count
- Difference count so far
- Bucket completion (for hash algorithm)
- Estimated completion
Result Storage
Comparison results are stored in PondPilot’s system database:
-- Query stored comparison results
SELECT * FROM system.comparison_results
WHERE comparison_id = 'abc123';
Results persist across sessions until manually cleared.
Comparison Workflow
Step-by-Step
- Select Sources
  - Choose Source A (table or query)
  - Choose Source B (table or query)
- Configure Join Keys
  - Select columns that uniquely identify rows
  - Map keys if names differ
- Map Columns
  - Match columns between sources
  - Exclude unnecessary columns
- Apply Filters (optional)
  - Filter to specific data subsets
  - Use common or separate filters
- Choose Algorithm
  - Use Auto-Select for best results
  - Or manually select based on needs
- Execute Comparison
  - Monitor progress
  - Review results
- Analyze Results
  - Check summary statistics
  - Drill into differences
  - Export findings
Tips & Best Practices
Performance
- Index your join keys - Faster matching
- Filter early - Reduce data volume before comparison
- Use sampling first - Validate approach on small sample
Accuracy
- Verify join keys - Wrong keys cause false differences
- Check data types - Use Coerce mode for type mismatches
- Exclude noise - Filter out timestamps and audit columns
Large Datasets
- Use Hash-Bucket - Memory efficient for millions of rows
- Filter aggressively - Compare subsets when possible
- Monitor progress - Cancel if taking too long
Troubleshooting
“No matching rows found”
- Verify join keys are correct
- Check for data type mismatches
- Ensure keys exist in both sources
“Out of memory”
- Switch to Hash-Bucket algorithm
- Apply filters to reduce dataset size
- Compare in smaller batches
“Comparison taking too long”
- Use Sampling for quick estimate
- Add filters to reduce scope
- Check if join keys are indexed
“Unexpected differences”
- Check column mappings are correct
- Verify compare mode (Strict vs Coerce)
- Look for trailing spaces or case differences
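When two text values look the same but are reported as different, it can help to check whether whitespace or letter case explains the mismatch. The Python sketch below does that for a single pair of values; `explain_text_difference` is a hypothetical helper, not a PondPilot function.

```python
def explain_text_difference(a: str, b: str) -> str:
    """Describe why two text values compare as different."""
    if a == b:
        return "identical"
    if a.strip() == b.strip():
        return "whitespace only"        # e.g. a trailing space
    if a.casefold() == b.casefold():
        return "case only"              # e.g. "Pending" vs "pending"
    if a.strip().casefold() == b.strip().casefold():
        return "whitespace and case"
    return "real difference"
```

For example, `explain_text_difference("pending", "pending ")` reports a whitespace-only difference, which suggests trimming the column in a query before comparing.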