jaccard distance vs. vanilla sql joins – a discussion

Getting your Trinity Audio player ready…

If billing records and inventory records already share a clean common key, then a conventional SQL join is usually the right tool.

Jaccard does not replace the join. It answers a different question.

A conventional join asks:

Does billing record X match inventory record Y on the shared key?

Jaccard asks:

How much do the two sets overlap overall?

So the advantage of Jaccard appears when you want a degree of agreement, not merely a yes/no match.


1. Conventional join: exact record matching

Suppose both tables have product_id.

SELECT
    b.invoice_id,
    b.product_id,
    b.quantity_billed,
    i.quantity_on_hand
FROM billing b
JOIN inventory i
    ON b.product_id = i.product_id;

This tells you:

These billing rows have matching inventory rows.

That is excellent for:

invoice reconciliation
foreign-key validation
matching billed products to stocked products
finding missing product IDs
joining quantities, prices, dates, SKUs, descriptions

A join is direct, deterministic, and auditable.


2. Jaccard: overlap between sets

Now imagine you want to compare the set of products billed this month against the set of products currently in inventory.

Billing products:

A, B, C, D, E

Inventory products:

B, C, D, E, F, G

Shared products:

B, C, D, E

Combined unique products:

A, B, C, D, E, F, G

So:

Jaccard similarity = 4 / 7 = 0.571
Jaccard distance   = 1 - 0.571 = 0.429

That gives you a summary score:

Billing and inventory overlap by 57.1%.

A join would show the matching rows.
Jaccard summarizes the relationship between the two populations.


3. The advantage: Jaccard measures reconciliation quality

A join gives you matches.

Jaccard gives you a metric.

That metric can be used to say:

This warehouse's billed products are 92% aligned with inventory.
This region's billed products are only 61% aligned with inventory.
This month is less aligned than last month.
This vendor's invoices do not resemble the inventory records they should match.

So Jaccard is useful for comparative quality control.


4. Example: compare billed SKUs to inventory SKUs by warehouse

Suppose:

billing(invoice_id, warehouse_id, sku, quantity_billed)
inventory(warehouse_id, sku, quantity_on_hand)

You can compute Jaccard similarity by warehouse:

WITH
billing_skus AS (
    SELECT DISTINCT warehouse_id, sku
    FROM billing
),
inventory_skus AS (
    SELECT DISTINCT warehouse_id, sku
    FROM inventory
),
intersection AS (
    SELECT
        b.warehouse_id,
        COUNT(*) AS intersection_count
    FROM billing_skus b
    JOIN inventory_skus i
        ON b.warehouse_id = i.warehouse_id
       AND b.sku = i.sku
    GROUP BY b.warehouse_id
),
union_set AS (
    SELECT warehouse_id, sku FROM billing_skus
    UNION
    SELECT warehouse_id, sku FROM inventory_skus
),
union_count AS (
    SELECT
        warehouse_id,
        COUNT(*) AS union_count
    FROM union_set
    GROUP BY warehouse_id
)
SELECT
    u.warehouse_id,
    COALESCE(i.intersection_count, 0) AS intersection_count,
    u.union_count,
    COALESCE(i.intersection_count, 0) * 1.0 / NULLIF(u.union_count, 0) AS jaccard_similarity,
    1.0 - COALESCE(i.intersection_count, 0) * 1.0 / NULLIF(u.union_count, 0) AS jaccard_distance
FROM union_count u
LEFT JOIN intersection i
    ON u.warehouse_id = i.warehouse_id
ORDER BY jaccard_similarity ASC;

This would rank warehouses from least aligned to most aligned.


5. Why this can be better than just counting join failures

A conventional reconciliation might say:

15 billing SKUs are missing from inventory.

But that number lacks context.

Missing 15 SKUs out of 20 is terrible.
Missing 15 SKUs out of 20,000 is minor.

Jaccard normalizes the mismatch:

intersection / union

So it gives you a proportional measure of overlap.

That makes it useful for:

ranking problem areas
trend analysis
threshold alerts
data quality dashboards
vendor comparison
warehouse comparison
monthly reconciliation scores
duplicate/system migration validation

6. But Jaccard also hides detail

Jaccard is not a replacement for the actual join because it compresses information.

A Jaccard score of 0.80 tells you the sets mostly overlap, but it does not tell you:

which SKUs are missing
which quantities disagree
which prices are wrong
which invoice lines are duplicated
which dates are inconsistent

So the normal workflow is:

1. Use Jaccard to score the alignment.
2. Use joins and anti-joins to investigate the actual exceptions.

For example:

-- Billed SKUs missing from inventory
SELECT DISTINCT
    b.warehouse_id,
    b.sku
FROM billing b
LEFT JOIN inventory i
    ON b.warehouse_id = i.warehouse_id
   AND b.sku = i.sku
WHERE i.sku IS NULL;

And the reverse:

-- Inventory SKUs with no billing activity
SELECT DISTINCT
    i.warehouse_id,
    i.sku
FROM inventory i
LEFT JOIN billing b
    ON i.warehouse_id = b.warehouse_id
   AND i.sku = b.sku
WHERE b.sku IS NULL;

Those tell you the actual exceptions.


7. Where Jaccard becomes especially valuable

Jaccard is most useful when comparing sets of related things, not individual rows.

For example:

ComparisonSet being compared
Warehouse vs warehouseSKUs carried
Vendor invoice vs receiving recordSKUs listed
Billing period vs inventory snapshotProduct population
Customer order vs shipmentItems expected vs shipped
ERP system vs migrated systemRecord keys present
Product catalog vs billing historyActive billable SKUs
One month vs anotherSKU activity pattern

In those cases, Jaccard gives you a clean similarity score.


8. The plain-English distinction

A SQL join says:

Show me the matching records.

Jaccard says:

Tell me how similar these two record populations are.

The join is the microscope.
Jaccard is the dashboard gauge.

The join answers:

What matches?

Jaccard answers:

How much of the total universe matches?

So if your billing and inventory tables have a reliable shared key, use joins for operational reconciliation. Add Jaccard when you want a summary metric of overlap, especially across warehouses, vendors, months, systems, product categories, or migration batches.


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *