vectorizing data for use with an orale 23ai application in telco revenue recovery

Getting your Trinity Audio player ready...

  • Tokenization vs. vectorization: Tokenization chops text into tokens for neural nets; vectorization turns whole objects (sentences/docs) into coordinates in a semantic space.
  • LLM vs. vector DB embeddings: LLM token embeddings live inside the model’s ANN (learned with backprop). Vector DB embeddings sit outside in an index and are searched with Approximate Nearest Neighbor (ANN) algorithms.
  • “Appropriate terrain” question: The semantic terrain (vector space) is only “right” if the embedding model’s training matches your domain/task—verify with retrieval quality and downstream accuracy.
  • Oracle’s move (“vectorizing the DB”): Oracle 23ai adds a native VECTOR type, vector indexes, and distance functions/operators directly into SQL, so you can do semantic queries alongside classic relational/SQL. That lets you build RAG-style and anomaly/mismatch checks inside the relational database—no separate vector DB required.

Hypothetical Telco Example — Schema → Sample Data → Vectorization → Mismatch Query

Scenario: A telco provides “Dedicated Internet Access (DIA) 1 Gbps” to Acme Corp.
Goal: Use semantic (vector) similarity plus SQL logic to detect mismatches between what is billed and what is actually provisioned in inventory.

Assumptions about Oracle 23ai features used below:

  • VECTOR(dim) data type for embedding columns
  • VECTOR_EMBEDDING(model_name, text) to compute embeddings inside the DB (you can also load precomputed vectors from your app)
  • CREATE VECTOR INDEX … ORGANIZATION INMEMORY NEIGHBOR GRAPH DISTANCE COSINE for ANN search
  • Distance helpers like COSINE_DISTANCE(vec1, vec2) and/or operator equivalents
  • If your environment uses different function names, swap them accordingly—the pattern stays the same.

1) Create a minimal Telco schema

-- Customers
CREATE TABLE customers (
  customer_id   NUMBER PRIMARY KEY,
  customer_name VARCHAR2(200)
);

-- Services purchased by customer (commercial “product-level” view)
CREATE TABLE services (
  service_id    NUMBER PRIMARY KEY,
  customer_id   NUMBER REFERENCES customers(customer_id),
  service_name  VARCHAR2(200),
  service_desc  CLOB,
  service_vec   VECTOR(1536)  -- dimension should match your embedding model
);

-- Inventory actually provisioned (technical assets/lines/ports/circuits)
CREATE TABLE inventory_assets (
  asset_id      NUMBER PRIMARY KEY,
  customer_id   NUMBER REFERENCES customers(customer_id),
  asset_type    VARCHAR2(100),
  asset_desc    CLOB,
  asset_vec     VECTOR(1536)
);

-- Billing records (what we charge for a period)
CREATE TABLE billing_records (
  bill_id       NUMBER PRIMARY KEY,
  customer_id   NUMBER REFERENCES customers(customer_id),
  service_id    NUMBER REFERENCES services(service_id),
  period_start  DATE,
  period_end    DATE,
  plan_code     VARCHAR2(50),
  qty           NUMBER,
  amount_usd    NUMBER,
  detail_text   CLOB,
  detail_vec    VECTOR(1536)
);

2) Create vector indexes (ANN) for fast similarity

-- Services vectors
CREATE VECTOR INDEX svc_vec_hnsw ON services(service_vec)
  ORGANIZATION INMEMORY NEIGHBOR GRAPH
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;

-- Inventory vectors
CREATE VECTOR INDEX inv_vec_hnsw ON inventory_assets(asset_vec)
  ORGANIZATION INMEMORY NEIGHBOR GRAPH
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;

-- Billing vectors (optional, useful for free-text line items)
CREATE VECTOR INDEX bill_vec_hnsw ON billing_records(detail_vec)
  ORGANIZATION INMEMORY NEIGHBOR GRAPH
  DISTANCE COSINE
  WITH TARGET ACCURACY 95;

3) Seed sample data (Acme Corp + DIA service)

INSERT INTO customers (customer_id, customer_name)
VALUES (101, 'Acme Corp');

-- One DIA service sold to Acme (1 Gbps dedicated internet w/ static IP block)
INSERT INTO services (service_id, customer_id, service_name, service_desc)
VALUES (
  2001, 101,
  'DIA 1Gbps - HQ',
  'Dedicated Internet Access 1 Gbps to Acme HQ (123 Main St). Includes /29 static IPv4, enterprise CPE, and managed handoff.'
);

-- Inventory actually provisioned (assume 1 circuit, correct; only 1 CPE)
INSERT INTO inventory_assets (asset_id, customer_id, asset_type, asset_desc)
VALUES (3001, 101, 'Circuit', 'Fiber DIA circuit to HQ, committed 1 Gbps, SLA 99.99%');

INSERT INTO inventory_assets (asset_id, customer_id, asset_type, asset_desc)
VALUES (3002, 101, 'CPE', 'Managed router at HQ for DIA 1 Gbps handoff');

-- Billing for the month: (INTENTIONAL MISMATCH)
-- They’re billing for qty=2 DIA circuits, but only 1 circuit is in inventory.
INSERT INTO billing_records (
  bill_id, customer_id, service_id, period_start, period_end,
  plan_code, qty, amount_usd, detail_text
) VALUES (
  4001, 101, 2001, DATE '2025-08-01', DATE '2025-08-31',
  'DIA-1G', 2, 3000.00,
  'Monthly charge for Dedicated Internet Access 1 Gbps at HQ with static IPs'
);

4) Vectorize (embed) the text fields in-database

Replace 'telco_embed_v1' with the actual model name you have registered (e.g., an ONNX embedding). If you bring vectors from an external service, you’d UPDATE ... SET service_vec = :bind_vector instead.

-- Services → embeddings
UPDATE services
SET service_vec = VECTOR_EMBEDDING('telco_embed_v1', service_desc)
WHERE service_vec IS NULL;

-- Inventory → embeddings
UPDATE inventory_assets
SET asset_vec = VECTOR_EMBEDDING('telco_embed_v1', asset_desc)
WHERE asset_vec IS NULL;

-- Billing line items → embeddings (optional but helpful for free-text matching)
UPDATE billing_records
SET detail_vec = VECTOR_EMBEDDING('telco_embed_v1', detail_text)
WHERE detail_vec IS NULL;

5) Build a semantic match between inventory and services

We’ll map each inventory asset to its nearest service by meaning (same customer), then count how many provisioned assets look like they “belong” to that service.

-- For each inventory asset, find the nearest service (same customer) by cosine distance
WITH inv_to_service AS (
  SELECT
    i.asset_id,
    i.customer_id,
    s.service_id,
    COSINE_DISTANCE(i.asset_vec, s.service_vec) AS dist,
    ROW_NUMBER() OVER (
      PARTITION BY i.asset_id
      ORDER BY COSINE_DISTANCE(i.asset_vec, s.service_vec)
    ) AS rn
  FROM inventory_assets i
  JOIN services s
    ON s.customer_id = i.customer_id
)
SELECT * FROM inv_to_service WHERE rn = 1;

You can persist this mapping into a table (e.g., inv_service_map) if you want:

CREATE TABLE inv_service_map AS
SELECT asset_id, customer_id, service_id, dist
FROM (
  SELECT
    i.asset_id,
    i.customer_id,
    s.service_id,
    COSINE_DISTANCE(i.asset_vec, s.service_vec) AS dist,
    ROW_NUMBER() OVER (
      PARTITION BY i.asset_id
      ORDER BY COSINE_DISTANCE(i.asset_vec, s.service_vec)
    ) rn
  FROM inventory_assets i
  JOIN services s
    ON s.customer_id = i.customer_id
)
WHERE rn = 1;

(Optional) Add a threshold—e.g., only accept matches with dist < 0.25.

6) Mismatch query: billing vs. provisioned (semantic) count

This query:

  • takes the services a customer is billed for,
  • counts how many inventory assets semantically map to each service (nearest in vector space),
  • flags mismatches where billed_qty != provisioned_qty.
-- Parameter for our customer and period
VAR p_customer_id NUMBER; EXEC :p_customer_id := 101;
VAR p_start DATE;        EXEC :p_start := DATE '2025-08-01';
VAR p_end DATE;          EXEC :p_end   := DATE '2025-08-31';

WITH inv_counts AS (
  SELECT
    m.service_id,
    COUNT(*) AS provisioned_qty
  FROM inv_service_map m
  -- Optional semantic-quality guardrail:
  -- WHERE m.dist < 0.25
  GROUP BY m.service_id
),
billed AS (
  SELECT
    b.service_id,
    SUM(b.qty) AS billed_qty
  FROM billing_records b
  WHERE b.customer_id = :p_customer_id
    AND b.period_start = :p_start
    AND b.period_end   = :p_end
  GROUP BY b.service_id
)
SELECT
  s.service_id,
  s.service_name,
  NVL(b.billed_qty, 0)      AS billed_qty,
  NVL(ic.provisioned_qty, 0) AS provisioned_qty,
  (NVL(b.billed_qty, 0) - NVL(ic.provisioned_qty, 0)) AS qty_difference
FROM services s
LEFT JOIN billed     b  ON b.service_id = s.service_id
LEFT JOIN inv_counts ic ON ic.service_id = s.service_id
WHERE s.customer_id = :p_customer_id
  -- Only show mismatches (or comment out to see all):
  AND NVL(b.billed_qty, 0) <> NVL(ic.provisioned_qty, 0)
ORDER BY qty_difference DESC;

Expected result with our sample data:

  • service_id = 2001 (“DIA 1Gbps – HQ”):
    • billed_qty = 2 (we billed for 2 circuits)
    • provisioned_qty = 1 (only 1 circuit maps semantically; the CPE typically wouldn’t be counted as an additional circuit—if you want to count assets by type, see the variant below)
    • qty_difference = +1flagged mismatch

Variant: Only count specific asset types (e.g., circuits) as “provisioned units”

If your definition of “unit” is “circuit,” filter inventory to asset_type = 'Circuit' before counting:

WITH inv_to_service AS (
  SELECT
    i.asset_id,
    i.customer_id,
    s.service_id,
    i.asset_type,
    COSINE_DISTANCE(i.asset_vec, s.service_vec) AS dist,
    ROW_NUMBER() OVER (
      PARTITION BY i.asset_id
      ORDER BY COSINE_DISTANCE(i.asset_vec, s.service_vec)
    ) AS rn
  FROM inventory_assets i
  JOIN services s
    ON s.customer_id = i.customer_id
),
inv_counts AS (
  SELECT
    service_id,
    COUNT(*) AS provisioned_qty
  FROM inv_to_service
  WHERE rn = 1
    AND asset_type = 'Circuit'      -- <— domain rule here
    -- AND dist < 0.25               -- optional semantic threshold
  GROUP BY service_id
),
billed AS (
  SELECT service_id, SUM(qty) AS billed_qty
  FROM billing_records
  WHERE customer_id = :p_customer_id
    AND period_start = :p_start
    AND period_end   = :p_end
  GROUP BY service_id
)
SELECT
  s.service_id,
  s.service_name,
  NVL(b.billed_qty, 0)       AS billed_qty,
  NVL(ic.provisioned_qty, 0) AS provisioned_qty,
  (NVL(b.billed_qty, 0) - NVL(ic.provisioned_qty, 0)) AS qty_difference
FROM services s
LEFT JOIN billed     b  ON b.service_id = s.service_id
LEFT JOIN inv_counts ic ON ic.service_id = s.service_id
WHERE s.customer_id = :p_customer_id
  AND NVL(b.billed_qty, 0) <> NVL(ic.provisioned_qty, 0)
ORDER BY qty_difference DESC;

7) Bonus: No service_id? Drive it from a natural language query

If a billing line has vague text (or you just want to check a scenario by description), embed the query and locate the nearest customer service first—then run the mismatch check for that inferred service:

-- 1) Get a query vector for an analyst’s description
VAR qvec VECTOR(1536);
EXEC :qvec := VECTOR_EMBEDDING('telco_embed_v1', '1 Gbps dedicated internet at Acme HQ');

-- 2) Find the nearest Acme service by meaning
WITH svc_ranked AS (
  SELECT
    s.service_id,
    s.service_name,
    COSINE_DISTANCE(s.service_vec, :qvec) AS dist,
    ROW_NUMBER() OVER (ORDER BY COSINE_DISTANCE(s.service_vec, :qvec)) AS rn
  FROM services s
  WHERE s.customer_id = :p_customer_id
)
SELECT * FROM svc_ranked WHERE rn = 1;  -- take this service_id and reuse in the mismatch query

What you now have

  • A single database (relational + vector) that:
    • Stores classic telco tables and semantic embeddings.
    • Uses ANN vector indexes to map inventory assets to services by meaning (robust to naming drift).
    • Uses standard SQL grouping/joins to surface billing vs. provisioned mismatches.

If you tell me your real embedding model + dimension and the exact “unit” definition you reconcile (circuits, ports, licenses, etc.), I’ll tailor the DDL, distance thresholds, and filters to match your production rules.


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *