Test RDB Creation
Following the TVRE guide, I’ve created a simple test relational database (RDB) schema for a “Students” table. This mirrors the guide’s “Employees” example but uses a different dataset to demonstrate the concept. The schema includes a primary key (ID) and attributes (Name, Age, Major).
Schema
- Table: Students
- Columns:
  - ID (Primary Key, Integer)
  - Name (String)
  - Age (Integer)
  - Major (String)
Sample Data
| ID | Name | Age | Major |
|---|---|---|---|
| 1 | John | 20 | Computer Science |
| 2 | Alice | 22 | Biology |
| 3 | Bob | 21 | Physics |
This data is stored in a standard RDB format, where traditional SQL queries can perform exact matches, aggregates, or joins (if multiple tables existed), but cannot inherently infer semantic relationships like similarity between majors (e.g., “Computer Science” being related to “Engineering”).
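As a concrete sketch of that exact-match behavior, the table can be built with Python's standard-library `sqlite3` (an in-memory database; the schema follows the Students table above):

```python
import sqlite3

# In-memory database mirroring the Students schema above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Students (ID INTEGER PRIMARY KEY, Name TEXT, Age INTEGER, Major TEXT)"
)
conn.executemany(
    "INSERT INTO Students VALUES (?, ?, ?, ?)",
    [(1, "John", 20, "Computer Science"),
     (2, "Alice", 22, "Biology"),
     (3, "Bob", 21, "Physics")],
)

# SQL handles exact matches and aggregates...
exact = conn.execute("SELECT Name FROM Students WHERE Major = 'Physics'").fetchall()
print(exact)  # [('Bob',)]

# ...but a "semantic" query for a related-but-absent major finds nothing.
semantic = conn.execute("SELECT Name FROM Students WHERE Major = 'Engineering'").fetchall()
print(semantic)  # []
```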
Mapping the RDB to a Test LLM via TVRE
TVRE maps the RDB to a simulated LLM by:
- Token Mapping: Primary keys (IDs) are hashed to discrete token IDs, simulating an LLM’s vocabulary tokens. These can be injected into LLM prompts for retrieval or reasoning.
- Vector Embeddings: Each attribute is serialized (e.g., “Major: Biology”) and mapped to a 5-dimensional vector (simulated via random generation with a seed for reproducibility, as in the guide). In a real implementation, an embedding model such as Sentence Transformers would generate high-dimensional embeddings (e.g., 768D). A composite vector per row is the average of its attribute vectors.
- Matrices: Vectors are organized into matrices per attribute type (e.g., a 3×5 matrix for Majors), enabling matrix math for operations like similarity searches.
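A minimal sketch of this three-step mapping in Python with NumPy. The 4-digit MD5 hash is an illustrative stand-in for a real tokenizer (so its values will not match the sample tokens below), and the seeded 5-D random vectors stand in for real embeddings; with seed 42, John's vectors happen to reproduce the sample values shown later:

```python
import hashlib

import numpy as np

rows = [
    (1, "John", 20, "Computer Science"),
    (2, "Alice", 22, "Biology"),
    (3, "Bob", 21, "Physics"),
]

# Token mapping: hash each primary key to a stable 4-digit token ID.
# (Illustrative only: a real LLM tokenizer assigns its own vocabulary IDs.)
tokens = {pk: int(hashlib.md5(str(pk).encode()).hexdigest(), 16) % 9000 + 1000
          for pk, *_ in rows}

# Vector embeddings: one seeded 5-D random vector per serialized attribute,
# a stand-in for an embedding model such as Sentence Transformers.
np.random.seed(42)
vectors = {}
for pk, name, age, major in rows:
    attrs = {
        "Name": np.random.rand(5),
        "Age": np.random.rand(5),
        "Major": np.random.rand(5),
    }
    # Composite vector per row = average of its attribute vectors.
    attrs["Composite"] = np.mean(list(attrs.values()), axis=0)
    vectors[pk] = attrs

# Matrices: stack vectors per attribute type (a 3x5 matrix for Majors).
major_matrix = np.vstack([vectors[pk]["Major"] for pk, *_ in rows])
print(major_matrix.shape)  # (3, 5)
```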
Token IDs (Simulated LLM Tokens)
- ID 1 → Token 5133
- ID 2 → Token 3293
- ID 3 → Token 8730
Example Attribute Vectors (Per Row)
For brevity, here’s a sample for ID 1 (John):
- Name: [0.37454012 0.95071431 0.73199394 0.59865848 0.15601864]
- Age: [0.15599452 0.05808361 0.86617615 0.60111501 0.70807258]
- Major: [0.02058449 0.96990985 0.83244264 0.21233911 0.18182497]
- Composite: [0.18370638 0.65956926 0.81020424 0.4707042 0.34863873]
Similar vectors exist for IDs 2 and 3. These are stacked into matrices (e.g., Major Matrix is 3×5 with rows for each student’s major vector).
Integration with a test LLM: The token IDs allow prompts like “Retrieve details for token 5133,” while vectors enable semantic operations. The LLM can then reason over retrieved data, augmenting RDB queries with inferred context.
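The retrieval half of this integration can be sketched as a reverse lookup from token ID to RDB row, whose fields are then injected into a prompt (token values are taken from the mapping above; the prompt template itself is illustrative):

```python
# Reverse lookup: simulated token ID -> student row (values from the mapping above).
token_to_row = {
    5133: {"ID": 1, "Name": "John", "Age": 20, "Major": "Computer Science"},
    3293: {"ID": 2, "Name": "Alice", "Age": 22, "Major": "Biology"},
    8730: {"ID": 3, "Name": "Bob", "Age": 21, "Major": "Physics"},
}

def build_prompt(token_id: int) -> str:
    """Resolve a token to its RDB row and embed the fields in an LLM prompt."""
    row = token_to_row[token_id]
    return (f"Retrieve details for token {token_id}: "
            f"{row['Name']}, age {row['Age']}, majoring in {row['Major']}. "
            f"Reason over this record in context.")

print(build_prompt(5133))
```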
Use Cases: Inferring Relationships with TVRE (Beyond SQL/RDB)
TVRE uses matrix math (e.g., cosine similarity) on vectors to infer semantic relationships that pure SQL/RDB cannot, as SQL relies on exact matches or predefined rules—no inherent support for similarity or fuzzy inference. Cosine similarity is computed as:
$$\text{similarity} = \frac{\mathbf{q} \cdot \mathbf{v}}{\|\mathbf{q}\| \, \|\mathbf{v}\|}$$

where $\mathbf{q}$ is the query vector and $\mathbf{v}$ is a row vector from the matrix.
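The formula translates directly into NumPy; this sketch scores a query vector against every row of a matrix at once (the 2-D example vectors are arbitrary):

```python
import numpy as np

def cosine_similarity(q: np.ndarray, M: np.ndarray) -> np.ndarray:
    """Cosine similarity between query vector q and each row of matrix M."""
    return (M @ q) / (np.linalg.norm(M, axis=1) * np.linalg.norm(q))

# Identical direction -> 1.0; orthogonal -> 0.0.
M = np.array([[1.0, 0.0], [0.0, 1.0]])
q = np.array([1.0, 0.0])
print(cosine_similarity(q, M))  # [1. 0.]
```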
Here are four use cases, with results from simulated computations. (Note: Similarities are based on random vectors for demo; real LLM embeddings would capture true semantics.)
Use Case 1: Infer Students with Majors Similar to “Engineering”
- Query: Find majors semantically similar to “Engineering” (e.g., for recommending related fields).
- TVRE Inference: Embed the query as a vector and compute cosine similarity against the Major matrix. This infers fuzzy matches (e.g., “Computer Science” or “Physics” as related tech/science fields).
- Why SQL Can’t: SQL requires exact matches (e.g., `WHERE Major = 'Engineering'`) or manual keyword rules; it has no automatic semantic similarity.
- Results (sorted by similarity):
| ID | Major | Similarity |
|---|---|---|
| 2 | Biology | 0.9813 |
| 3 | Physics | 0.6956 |
| 1 | Computer Science | 0.6433 |

Inferred Relationship: Biology is deemed most similar (in this simulation). An LLM prompt could then use the top token (3293 for ID 2): “Describe career paths for token 3293 in engineering contexts.”
Use Case 2: Infer Students with Names Similar to “Eve”
- Query: Find names semantically similar to “Eve” (e.g., for fuzzy search in large datasets).
- TVRE Inference: Embed the query and compare via cosine similarity on the Name matrix, capturing phonetic or contextual similarities.
- Why SQL Can’t: SQL supports partial matches (e.g., `LIKE '%Eve%'`), but not semantic similarity (e.g., “Alice” as a similar short female name).
- Results (sorted by similarity):
| ID | Name | Similarity |
|---|---|---|
| 2 | Alice | 0.9451 |
| 1 | John | 0.9115 |
| 3 | Bob | 0.7806 |

Inferred Relationship: Alice is most similar. This could infer social groupings, with an LLM prompt: “Generate a story involving token 3293 (Alice) and Eve.”
Use Case 3: Infer Overall Similar Profiles to a New Student
- Query: Find students similar to a new profile (Name: Charlie, Age: 23, Major: Mathematics), e.g., for peer matching.
- TVRE Inference: Generate a composite vector for the new profile and compute similarity against the Composite matrix, holistically comparing all attributes.
- Why SQL Can’t: SQL could filter by ranges (e.g., `WHERE Age BETWEEN 20 AND 25`), but cannot compute weighted multi-attribute semantic similarity without custom scoring.
- Results (sorted by similarity):
| ID | Name | Age | Major | Similarity |
|---|---|---|---|---|
| 2 | Alice | 22 | Biology | 0.9035 |
| 1 | John | 20 | Computer Science | 0.7730 |
| 3 | Bob | 21 | Physics | 0.6813 |

Inferred Relationship: Alice’s profile is most similar overall. An LLM could infer: “Recommend collaborations for token 3293 with the new student in math-bio intersections.”
Use Case 4: LLM-Augmented Relationship Inference
- Query: Combine semantic search with LLM reasoning, e.g., using the top result from Use Case 1.
- TVRE Inference: After vector-based selection, inject the token ID into an LLM prompt for deeper reasoning (e.g., infer career overlaps).
- Why SQL Can’t: SQL retrieves data but can’t generate natural language inferences or chain to external models without integration layers.
- Example: For the top match (ID 2, Biology similar to Engineering), use token 3293. Simulated LLM Prompt: “Describe the student associated with token 3293 and infer how their biology background relates to engineering fields like bioengineering.”
- Inferred Relationship: This enables discovering implicit links (e.g., Biology → Bioengineering), which SQL alone couldn’t hypothesize. In practice, the LLM outputs reasoned text based on the embedded context.
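Putting the pieces together, the flow of Use Case 1 can be sketched end to end: embed, rank by cosine similarity, then inject the top match's token into a prompt. The seeded random vectors here are stand-ins for real embeddings (a different seed than the guide's samples, so this ranking will not match the tables above), and the prompt template is illustrative:

```python
import numpy as np

np.random.seed(0)  # illustrative vectors only, not the guide's seed
students = [(1, "Computer Science", 5133), (2, "Biology", 3293), (3, "Physics", 8730)]

# Simulated 5-D embeddings: one row per major, plus the query "Engineering".
major_matrix = np.random.rand(len(students), 5)
query_vec = np.random.rand(5)

# Rank rows by cosine similarity to the query vector.
sims = (major_matrix @ query_vec) / (
    np.linalg.norm(major_matrix, axis=1) * np.linalg.norm(query_vec))
order = np.argsort(sims)[::-1]

for i in order:
    pk, major, token = students[i]
    print(f"ID {pk} ({major}): {sims[i]:.4f}")

# Inject the top match's token into an LLM prompt for deeper reasoning.
top_pk, top_major, top_token = students[order[0]]
prompt = (f"Describe the student associated with token {top_token} and infer "
          f"how their {top_major} background relates to engineering fields.")
print(prompt)
```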