|
Getting your Trinity Audio player ready...
|
- Tokenization vs. vectorization: Tokenization chops text into tokens for neural nets; vectorization turns whole objects (sentences/docs) into coordinates in a semantic space.
- LLM vs. vector DB embeddings: LLM token embeddings live inside the model’s ANN (learned with backprop). Vector DB embeddings sit outside in an index and are searched with Approximate Nearest Neighbor (ANN) algorithms.
- “Appropriate terrain” question: The semantic terrain (vector space) is only “right” if the embedding model’s training matches your domain/task—verify with retrieval quality and downstream accuracy.
- Oracle’s move (“vectorizing the DB”): Oracle 23ai adds a native
VECTORtype, vector indexes, and distance functions/operators directly into SQL, so you can do semantic queries alongside classic relational/SQL. That lets you build RAG-style and anomaly/mismatch checks inside the relational database—no separate vector DB required.
Hypothetical Telco Example — Schema → Sample Data → Vectorization → Mismatch Query
Scenario: A telco provides “Dedicated Internet Access (DIA) 1 Gbps” to Acme Corp.
Goal: Use semantic (vector) similarity plus SQL logic to detect mismatches between what is billed and what is actually provisioned in inventory.
Assumptions about Oracle 23ai features used below:
VECTOR(dim)data type for embedding columnsVECTOR_EMBEDDING(model_name, text)to compute embeddings inside the DB (you can also load precomputed vectors from your app)CREATE VECTOR INDEX … ORGANIZATION INMEMORY NEIGHBOR GRAPH DISTANCE COSINEfor ANN search- Distance helpers like
COSINE_DISTANCE(vec1, vec2)and/or operator equivalents- If your environment uses different function names, swap them accordingly—the pattern stays the same.
1) Create a minimal Telco schema
-- Customers
CREATE TABLE customers (
customer_id NUMBER PRIMARY KEY,
customer_name VARCHAR2(200)
);
-- Services purchased by customer (commercial “product-level” view)
CREATE TABLE services (
service_id NUMBER PRIMARY KEY,
customer_id NUMBER REFERENCES customers(customer_id),
service_name VARCHAR2(200),
service_desc CLOB,
service_vec VECTOR(1536) -- dimension should match your embedding model
);
-- Inventory actually provisioned (technical assets/lines/ports/circuits)
CREATE TABLE inventory_assets (
asset_id NUMBER PRIMARY KEY,
customer_id NUMBER REFERENCES customers(customer_id),
asset_type VARCHAR2(100),
asset_desc CLOB,
asset_vec VECTOR(1536)
);
-- Billing records (what we charge for a period)
CREATE TABLE billing_records (
bill_id NUMBER PRIMARY KEY,
customer_id NUMBER REFERENCES customers(customer_id),
service_id NUMBER REFERENCES services(service_id),
period_start DATE,
period_end DATE,
plan_code VARCHAR2(50),
qty NUMBER,
amount_usd NUMBER,
detail_text CLOB,
detail_vec VECTOR(1536)
);
2) Create vector indexes (ANN) for fast similarity
-- Services vectors
CREATE VECTOR INDEX svc_vec_hnsw ON services(service_vec)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;
-- Inventory vectors
CREATE VECTOR INDEX inv_vec_hnsw ON inventory_assets(asset_vec)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;
-- Billing vectors (optional, useful for free-text line items)
CREATE VECTOR INDEX bill_vec_hnsw ON billing_records(detail_vec)
ORGANIZATION INMEMORY NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;
3) Seed sample data (Acme Corp + DIA service)
INSERT INTO customers (customer_id, customer_name)
VALUES (101, 'Acme Corp');
-- One DIA service sold to Acme (1 Gbps dedicated internet w/ static IP block)
INSERT INTO services (service_id, customer_id, service_name, service_desc)
VALUES (
2001, 101,
'DIA 1Gbps - HQ',
'Dedicated Internet Access 1 Gbps to Acme HQ (123 Main St). Includes /29 static IPv4, enterprise CPE, and managed handoff.'
);
-- Inventory actually provisioned (assume 1 circuit, correct; only 1 CPE)
INSERT INTO inventory_assets (asset_id, customer_id, asset_type, asset_desc)
VALUES (3001, 101, 'Circuit', 'Fiber DIA circuit to HQ, committed 1 Gbps, SLA 99.99%');
INSERT INTO inventory_assets (asset_id, customer_id, asset_type, asset_desc)
VALUES (3002, 101, 'CPE', 'Managed router at HQ for DIA 1 Gbps handoff');
-- Billing for the month: (INTENTIONAL MISMATCH)
-- They’re billing for qty=2 DIA circuits, but only 1 circuit is in inventory.
INSERT INTO billing_records (
bill_id, customer_id, service_id, period_start, period_end,
plan_code, qty, amount_usd, detail_text
) VALUES (
4001, 101, 2001, DATE '2025-08-01', DATE '2025-08-31',
'DIA-1G', 2, 3000.00,
'Monthly charge for Dedicated Internet Access 1 Gbps at HQ with static IPs'
);
4) Vectorize (embed) the text fields in-database
Replace
'telco_embed_v1'with the actual model name you have registered (e.g., an ONNX embedding). If you bring vectors from an external service, you’dUPDATE ... SET service_vec = :bind_vectorinstead.
-- Services → embeddings
UPDATE services
SET service_vec = VECTOR_EMBEDDING('telco_embed_v1', service_desc)
WHERE service_vec IS NULL;
-- Inventory → embeddings
UPDATE inventory_assets
SET asset_vec = VECTOR_EMBEDDING('telco_embed_v1', asset_desc)
WHERE asset_vec IS NULL;
-- Billing line items → embeddings (optional but helpful for free-text matching)
UPDATE billing_records
SET detail_vec = VECTOR_EMBEDDING('telco_embed_v1', detail_text)
WHERE detail_vec IS NULL;
5) Build a semantic match between inventory and services
We’ll map each inventory asset to its nearest service by meaning (same customer), then count how many provisioned assets look like they “belong” to that service.
-- For each inventory asset, find the nearest service (same customer) by cosine distance
WITH inv_to_service AS (
SELECT
i.asset_id,
i.customer_id,
s.service_id,
COSINE_DISTANCE(i.asset_vec, s.service_vec) AS dist,
ROW_NUMBER() OVER (
PARTITION BY i.asset_id
ORDER BY COSINE_DISTANCE(i.asset_vec, s.service_vec)
) AS rn
FROM inventory_assets i
JOIN services s
ON s.customer_id = i.customer_id
)
SELECT * FROM inv_to_service WHERE rn = 1;
You can persist this mapping into a table (e.g., inv_service_map) if you want:
CREATE TABLE inv_service_map AS
SELECT asset_id, customer_id, service_id, dist
FROM (
SELECT
i.asset_id,
i.customer_id,
s.service_id,
COSINE_DISTANCE(i.asset_vec, s.service_vec) AS dist,
ROW_NUMBER() OVER (
PARTITION BY i.asset_id
ORDER BY COSINE_DISTANCE(i.asset_vec, s.service_vec)
) rn
FROM inventory_assets i
JOIN services s
ON s.customer_id = i.customer_id
)
WHERE rn = 1;
(Optional) Add a threshold—e.g., only accept matches with dist < 0.25.
6) Mismatch query: billing vs. provisioned (semantic) count
This query:
- takes the services a customer is billed for,
- counts how many inventory assets semantically map to each service (nearest in vector space),
- flags mismatches where
billed_qty != provisioned_qty.
-- Parameter for our customer and period
VAR p_customer_id NUMBER; EXEC :p_customer_id := 101;
VAR p_start DATE; EXEC :p_start := DATE '2025-08-01';
VAR p_end DATE; EXEC :p_end := DATE '2025-08-31';
WITH inv_counts AS (
SELECT
m.service_id,
COUNT(*) AS provisioned_qty
FROM inv_service_map m
-- Optional semantic-quality guardrail:
-- WHERE m.dist < 0.25
GROUP BY m.service_id
),
billed AS (
SELECT
b.service_id,
SUM(b.qty) AS billed_qty
FROM billing_records b
WHERE b.customer_id = :p_customer_id
AND b.period_start = :p_start
AND b.period_end = :p_end
GROUP BY b.service_id
)
SELECT
s.service_id,
s.service_name,
NVL(b.billed_qty, 0) AS billed_qty,
NVL(ic.provisioned_qty, 0) AS provisioned_qty,
(NVL(b.billed_qty, 0) - NVL(ic.provisioned_qty, 0)) AS qty_difference
FROM services s
LEFT JOIN billed b ON b.service_id = s.service_id
LEFT JOIN inv_counts ic ON ic.service_id = s.service_id
WHERE s.customer_id = :p_customer_id
-- Only show mismatches (or comment out to see all):
AND NVL(b.billed_qty, 0) <> NVL(ic.provisioned_qty, 0)
ORDER BY qty_difference DESC;
Expected result with our sample data:
service_id = 2001(“DIA 1Gbps – HQ”):billed_qty = 2(we billed for 2 circuits)provisioned_qty = 1(only 1 circuit maps semantically; the CPE typically wouldn’t be counted as an additional circuit—if you want to count assets by type, see the variant below)qty_difference = +1→ flagged mismatch
Variant: Only count specific asset types (e.g., circuits) as “provisioned units”
If your definition of “unit” is “circuit,” filter inventory to asset_type = 'Circuit' before counting:
WITH inv_to_service AS (
SELECT
i.asset_id,
i.customer_id,
s.service_id,
i.asset_type,
COSINE_DISTANCE(i.asset_vec, s.service_vec) AS dist,
ROW_NUMBER() OVER (
PARTITION BY i.asset_id
ORDER BY COSINE_DISTANCE(i.asset_vec, s.service_vec)
) AS rn
FROM inventory_assets i
JOIN services s
ON s.customer_id = i.customer_id
),
inv_counts AS (
SELECT
service_id,
COUNT(*) AS provisioned_qty
FROM inv_to_service
WHERE rn = 1
AND asset_type = 'Circuit' -- <— domain rule here
-- AND dist < 0.25 -- optional semantic threshold
GROUP BY service_id
),
billed AS (
SELECT service_id, SUM(qty) AS billed_qty
FROM billing_records
WHERE customer_id = :p_customer_id
AND period_start = :p_start
AND period_end = :p_end
GROUP BY service_id
)
SELECT
s.service_id,
s.service_name,
NVL(b.billed_qty, 0) AS billed_qty,
NVL(ic.provisioned_qty, 0) AS provisioned_qty,
(NVL(b.billed_qty, 0) - NVL(ic.provisioned_qty, 0)) AS qty_difference
FROM services s
LEFT JOIN billed b ON b.service_id = s.service_id
LEFT JOIN inv_counts ic ON ic.service_id = s.service_id
WHERE s.customer_id = :p_customer_id
AND NVL(b.billed_qty, 0) <> NVL(ic.provisioned_qty, 0)
ORDER BY qty_difference DESC;
7) Bonus: No service_id? Drive it from a natural language query
If a billing line has vague text (or you just want to check a scenario by description), embed the query and locate the nearest customer service first—then run the mismatch check for that inferred service:
-- 1) Get a query vector for an analyst’s description
VAR qvec VECTOR(1536);
EXEC :qvec := VECTOR_EMBEDDING('telco_embed_v1', '1 Gbps dedicated internet at Acme HQ');
-- 2) Find the nearest Acme service by meaning
WITH svc_ranked AS (
SELECT
s.service_id,
s.service_name,
COSINE_DISTANCE(s.service_vec, :qvec) AS dist,
ROW_NUMBER() OVER (ORDER BY COSINE_DISTANCE(s.service_vec, :qvec)) AS rn
FROM services s
WHERE s.customer_id = :p_customer_id
)
SELECT * FROM svc_ranked WHERE rn = 1; -- take this service_id and reuse in the mismatch query
What you now have
- A single database (relational + vector) that:
- Stores classic telco tables and semantic embeddings.
- Uses ANN vector indexes to map inventory assets to services by meaning (robust to naming drift).
- Uses standard SQL grouping/joins to surface billing vs. provisioned mismatches.
If you tell me your real embedding model + dimension and the exact “unit” definition you reconcile (circuits, ports, licenses, etc.), I’ll tailor the DDL, distance thresholds, and filters to match your production rules.
Leave a Reply