A More Complex Example of Cosine Similarity: Bridging Relational Databases and Vector Embeddings


To provide a more complex example of cosine similarity in the context of bridging relational databases and vector embeddings, I’ll expand on the original discussion from the article. This example simulates a realistic scenario with a larger dataset, multiple attributes contributing to embeddings, and a more intricate use case. The goal is to demonstrate how cosine similarity can be used for advanced semantic tasks, such as clustering employees based on their roles, departments, skills, and performance metrics, while leveraging the principles of normalization, SQL inference, and Token-Vector Relational Embedding (TVRE) training.

Scenario Overview

Imagine a company with a normalized relational database containing:

  • Employees (ID, Name, Role, Salary, YearsExperience)
  • Departments (ID, Name)
  • Skills (ID, EmployeeID, SkillName)
  • PerformanceReviews (ID, EmployeeID, Rating, ReviewYear)

The dataset is larger (10 employees, 3 departments, multiple skills, and performance ratings from 2023–2025). The use case is to identify clusters of similar employees for team formation, using cosine similarity on TVRE-like embeddings. These embeddings combine:

  • Role vectors (semantic similarity, e.g., “Data Scientist” ≈ “Machine Learning Engineer”).
  • Department vectors (e.g., “Engineering” ≈ “R&D”).
  • Skill vectors (e.g., “Python” and “R” are close).
  • Normalized salary and experience (scaled to [0,1]).
  • Performance rating averages (weighted by recency).

This setup is more complex because it:

  • Integrates multiple tables (requiring SQL joins for inference).
  • Uses higher-dimensional embeddings (5D instead of 3D).
  • Incorporates weighted attributes and temporal data (recent performance matters more).
  • Applies cosine similarity to cluster employees, simulating a real-world AI-driven task.

Environment Setup

We’ll simulate this in Python using SQLite for the normalized database, NumPy for vector operations, and SciPy for cosine similarity. The embeddings are constructed as follows:

  • Role Vectors: Predefined 5D vectors based on semantic proximity (e.g., tech roles cluster together).
  • Department Vectors: 5D vectors reflecting department similarity.
  • Skill Vectors: Averaged across an employee’s skills, with predefined vectors for each skill.
  • Salary and Experience: Normalized to [0,1] based on dataset min-max.
  • Performance: Weighted average of ratings (2025: 0.5, 2024: 0.3, 2023: 0.2) normalized to [0,1].
  • Composite Embedding: Weighted combination (Role: 0.4, Department: 0.3, Skills: 0.2, Salary+Experience: 0.1, Performance: 0.1).

Cosine similarity is computed between all employee pairs to identify clusters, which are visualized in a heatmap to show similarity patterns.
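The all-pairs computation can be sketched in a few lines of NumPy, using this example’s illustrative 5D composite embeddings for Alice, Bob, and Carol (sample values, not real data):

```python
import numpy as np

# Illustrative 5D composite embeddings from this example (not real data).
employees = ["Alice", "Bob", "Carol"]
E = np.array([
    [0.870, 0.650, 0.210, 0.260, 0.170],  # Alice
    [0.830, 0.610, 0.230, 0.280, 0.190],  # Bob
    [0.450, 0.720, 0.610, 0.510, 0.430],  # Carol
])

# Row-normalize, then one matrix product yields every pairwise cosine at once.
E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
S = E_norm @ E_norm.T  # symmetric, with 1.0 on the diagonal

print(np.round(S, 3))
```

With these rounded sample vectors the Alice–Bob entry lands near 0.999 rather than the article’s illustrative 0.992, which presumably came from unrounded embeddings; the cluster structure (Alice–Bob close, Carol distant) is the same.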

Simulated Dataset

Tables (simplified for brevity):

  • Employees: 10 employees with roles like Data Scientist, Software Engineer, Product Manager, etc.
  • Departments: Engineering, R&D, Marketing.
  • Skills: Each employee has 2–4 skills (e.g., Python, SQL, Leadership).
  • PerformanceReviews: Ratings (1–5) for 2023–2025.
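This normalized schema can be stood up in-memory with SQLite. A minimal sketch, with table and column names taken from this example and only the three illustrated employees loaded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (3NF) schema: each fact lives in exactly one table.
cur.executescript("""
CREATE TABLE Departments (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Employees (
    ID INTEGER PRIMARY KEY, Name TEXT, Role TEXT,
    Salary REAL, YearsExperience REAL,
    DepartmentID INTEGER REFERENCES Departments(ID)
);
CREATE TABLE Skills (
    ID INTEGER PRIMARY KEY,
    EmployeeID INTEGER REFERENCES Employees(ID),
    SkillName TEXT
);
CREATE TABLE PerformanceReviews (
    ID INTEGER PRIMARY KEY,
    EmployeeID INTEGER REFERENCES Employees(ID),
    Rating INTEGER, ReviewYear INTEGER
);
""")

cur.executemany("INSERT INTO Departments VALUES (?, ?)",
                [(1, "Engineering"), (2, "R&D"), (3, "Marketing")])
cur.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Alice", "Data Scientist", 120000, 5, 1),
    (2, "Bob", "Software Engineer", 110000, 4, 2),
    (3, "Carol", "Product Manager", 115000, 6, 3),
])
cur.executemany("INSERT INTO Skills (EmployeeID, SkillName) VALUES (?, ?)",
                [(1, "Python"), (1, "SQL"), (2, "Python"), (2, "Java")])
cur.executemany(
    "INSERT INTO PerformanceReviews (EmployeeID, Rating, ReviewYear) VALUES (?, ?, ?)",
    [(1, 4, 2025), (1, 3, 2024), (2, 5, 2025)])
conn.commit()
```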

Sample Data (subset for illustration):

Employees (with department names resolved):

| EmployeeID | Name  | Role              | Salary | YearsExperience | Department  |
|------------|-------|-------------------|--------|-----------------|-------------|
| 1          | Alice | Data Scientist    | 120000 | 5               | Engineering |
| 2          | Bob   | Software Engineer | 110000 | 4               | R&D         |
| 3          | Carol | Product Manager   | 115000 | 6               | Marketing   |

Skills:

| EmployeeID | SkillName |
|------------|-----------|
| 1          | Python    |
| 1          | SQL       |
| 2          | Python    |
| 2          | Java      |

PerformanceReviews:

| EmployeeID | Rating | ReviewYear |
|------------|--------|------------|
| 1          | 4      | 2025       |
| 1          | 3      | 2024       |
| 2          | 5      | 2025       |

Generating Embeddings

  1. Normalization: The schema is in 3NF, with separate tables for employees, departments, skills, and reviews to eliminate redundancy.
  2. SQL Inference: A join query combines data:
   SELECT e.Name, e.Role, e.Salary, e.YearsExperience, d.Name AS DepartmentName,
          GROUP_CONCAT(DISTINCT s.SkillName) AS Skills, AVG(pr.Rating) AS AvgRating
   FROM Employees e
   JOIN Departments d ON e.DepartmentID = d.ID
   LEFT JOIN Skills s ON e.ID = s.EmployeeID
   LEFT JOIN PerformanceReviews pr ON e.ID = pr.EmployeeID
   GROUP BY e.ID;

This infers relationships and aggregates skills/performance.
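One subtlety worth flagging: LEFT JOINing both Skills and PerformanceReviews fans out the row count (each skill row pairs with each review row), so the skill list needs DISTINCT to avoid repeats, while AVG is unaffected here because every rating is duplicated equally. A compact sketch with hypothetical in-memory data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (ID INTEGER PRIMARY KEY, Name TEXT, DepartmentID INTEGER);
CREATE TABLE Departments (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Skills (EmployeeID INTEGER, SkillName TEXT);
CREATE TABLE PerformanceReviews (EmployeeID INTEGER, Rating INTEGER, ReviewYear INTEGER);
INSERT INTO Departments VALUES (1, 'Engineering');
INSERT INTO Employees VALUES (1, 'Alice', 1);
INSERT INTO Skills VALUES (1, 'Python'), (1, 'SQL');
INSERT INTO PerformanceReviews VALUES (1, 4, 2025), (1, 3, 2024);
""")

# 2 skills x 2 reviews = 4 joined rows; DISTINCT collapses the skill list,
# and AVG(Rating) still comes out to (4 + 3) / 2 = 3.5.
row = conn.execute("""
    SELECT e.Name, d.Name AS DepartmentName,
           GROUP_CONCAT(DISTINCT s.SkillName) AS Skills,
           AVG(pr.Rating) AS AvgRating
    FROM Employees e
    JOIN Departments d ON e.DepartmentID = d.ID
    LEFT JOIN Skills s ON e.ID = s.EmployeeID
    LEFT JOIN PerformanceReviews pr ON e.ID = pr.EmployeeID
    GROUP BY e.ID
""").fetchone()

print(row)
```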

  3. TVRE Simulation:
  • Role vectors (e.g., Data Scientist: [0.9, 0.8, 0.2, 0.3, 0.1], Software Engineer: [0.85, 0.75, 0.25, 0.35, 0.15]).
  • Department vectors (e.g., Engineering: [0.8, 0.7, 0.1, 0.2, 0.3]).
  • Skill vectors (e.g., Python: [0.9, 0.2, 0.8, 0.1, 0.3]).
  • Salary/Experience normalized (e.g., 120000 → 0.8, 5 years → 0.625).
  • Performance weighted (e.g., Alice: (4 × 0.5 + 3 × 0.3 + 0 × 0.2) / 5 = 0.58; she has no 2023 review, so that term contributes zero).
  • Composite embedding: Combine the weighted component vectors and scalar features into a single 5D vector per employee (a weighted sum rather than a literal concatenation, so the dimensionality stays at 5).
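The recency weighting can be made concrete. A sketch, assuming (as the 0.58 figure for Alice implies) that a missing year contributes zero:

```python
# Recency weights from this example; a missing year contributes 0.
WEIGHTS = {2025: 0.5, 2024: 0.3, 2023: 0.2}
MAX_RATING = 5  # ratings are on a 1-5 scale

def weighted_performance(ratings_by_year: dict[int, int]) -> float:
    """Recency-weighted rating, normalized to [0, 1]."""
    total = sum(w * ratings_by_year.get(year, 0) for year, w in WEIGHTS.items())
    return total / MAX_RATING

# Alice has reviews only for 2025 (rating 4) and 2024 (rating 3):
alice = weighted_performance({2025: 4, 2024: 3})
print(round(alice, 2))  # 0.58
```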

Sample Embeddings (5D):

  • Alice: [0.870, 0.650, 0.210, 0.260, 0.170]
  • Bob: [0.830, 0.610, 0.230, 0.280, 0.190]
  • Carol: [0.450, 0.720, 0.610, 0.510, 0.430]
  • … (up to 10 employees).

Cosine Similarity Calculation

Cosine similarity is computed for all pairs using:
cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

This produces a 10×10 similarity matrix. Because all embedding components here are non-negative, the values fall in [0, 1] (in general, cosine similarity ranges over [−1, 1]).
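The formula can be checked directly against the sample embeddings. A NumPy sketch (note that SciPy’s `scipy.spatial.distance.cosine` returns the cosine *distance*, i.e. one minus this similarity):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(A, B) = (A . B) / (||A|| ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

alice = np.array([0.870, 0.650, 0.210, 0.260, 0.170])
bob = np.array([0.830, 0.610, 0.230, 0.280, 0.190])

sim = cosine_similarity(alice, bob)
print(round(sim, 3))
```

With these rounded sample vectors the result lands near 0.999 rather than the article’s illustrative 0.992, which presumably reflects the unrounded embeddings.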

Sample Similarity Matrix (subset):

|       | Alice | Bob   | Carol |
|-------|-------|-------|-------|
| Alice | 1.000 | 0.992 | 0.765 |
| Bob   | 0.992 | 1.000 | 0.771 |
| Carol | 0.765 | 0.771 | 1.000 |

Clustering and Interpretation

Using a threshold (e.g., similarity ≥ 0.95), we identify clusters:

  • Cluster 1 (Tech): Alice, Bob, Dave, Emma (Data Scientists/Engineers in Engineering/R&D, Python/SQL skills, high ratings).
  • Cluster 2 (Non-Tech): Carol, Frank (Product Managers in Marketing, Leadership skills).
  • Cluster 3 (Mixed): Grace, Henry (Analysts with hybrid skills).

This clustering reflects semantic similarity (tech vs. non-tech) driven by TVRE embeddings, not just explicit SQL filters.
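Threshold clustering over the similarity matrix amounts to finding connected components of the "similar enough" graph. A minimal sketch with the three sample employees, using union-find; at a 0.95 threshold Alice and Bob merge while Carol stays alone:

```python
import numpy as np

employees = ["Alice", "Bob", "Carol"]
S = np.array([
    [1.000, 0.992, 0.765],
    [0.992, 1.000, 0.771],
    [0.765, 0.771, 1.000],
])
THRESHOLD = 0.95

# Union-find over pairs whose similarity meets the threshold.
parent = list(range(len(employees)))

def find(i: int) -> int:
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

for i in range(len(employees)):
    for j in range(i + 1, len(employees)):
        if S[i, j] >= THRESHOLD:
            parent[find(i)] = find(j)

clusters: dict[int, list[str]] = {}
for i, name in enumerate(employees):
    clusters.setdefault(find(i), []).append(name)

print(list(clusters.values()))  # [['Alice', 'Bob'], ['Carol']]
```

On the full 10×10 matrix the same loop recovers the tech, non-tech, and mixed groupings described above.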

Visualization

To display the similarity matrix, I’ll create a heatmap where the x and y axes are employee names and color intensity encodes cosine similarity (darker = higher similarity); a sequential color scale makes the clusters easiest to read. The configuration below uses the Chart.js matrix chart format and carries each employee pair once (the upper triangle, plus the unit diagonal).

{
  "type": "matrix",
  "data": {
    "labels": {
      "x": ["Alice", "Bob", "Carol", "Dave", "Emma", "Frank", "Grace", "Henry", "Ivy", "Jack"],
      "y": ["Alice", "Bob", "Carol", "Dave", "Emma", "Frank", "Grace", "Henry", "Ivy", "Jack"]
    },
    "datasets": [{
      "label": "Cosine Similarity",
      "data": [
        {"x": "Alice", "y": "Alice", "v": 1.000},
        {"x": "Alice", "y": "Bob", "v": 0.992},
        {"x": "Alice", "y": "Carol", "v": 0.765},
        {"x": "Alice", "y": "Dave", "v": 0.987},
        {"x": "Alice", "y": "Emma", "v": 0.980},
        {"x": "Alice", "y": "Frank", "v": 0.750},
        {"x": "Alice", "y": "Grace", "v": 0.870},
        {"x": "Alice", "y": "Henry", "v": 0.865},
        {"x": "Alice", "y": "Ivy", "v": 0.910},
        {"x": "Alice", "y": "Jack", "v": 0.905},
        {"x": "Bob", "y": "Bob", "v": 1.000},
        {"x": "Bob", "y": "Carol", "v": 0.771},
        {"x": "Bob", "y": "Dave", "v": 0.995},
        {"x": "Bob", "y": "Emma", "v": 0.989},
        {"x": "Bob", "y": "Frank", "v": 0.755},
        {"x": "Bob", "y": "Grace", "v": 0.875},
        {"x": "Bob", "y": "Henry", "v": 0.870},
        {"x": "Bob", "y": "Ivy", "v": 0.915},
        {"x": "Bob", "y": "Jack", "v": 0.910},
        {"x": "Carol", "y": "Carol", "v": 1.000},
        {"x": "Carol", "y": "Dave", "v": 0.780},
        {"x": "Carol", "y": "Emma", "v": 0.775},
        {"x": "Carol", "y": "Frank", "v": 0.950},
        {"x": "Carol", "y": "Grace", "v": 0.820},
        {"x": "Carol", "y": "Henry", "v": 0.815},
        {"x": "Carol", "y": "Ivy", "v": 0.790},
        {"x": "Carol", "y": "Jack", "v": 0.785},
        {"x": "Dave", "y": "Dave", "v": 1.000},
        {"x": "Dave", "y": "Emma", "v": 0.993},
        {"x": "Dave", "y": "Frank", "v": 0.760},
        {"x": "Dave", "y": "Grace", "v": 0.880},
        {"x": "Dave", "y": "Henry", "v": 0.875},
        {"x": "Dave", "y": "Ivy", "v": 0.920},
        {"x": "Dave", "y": "Jack", "v": 0.915},
        {"x": "Emma", "y": "Emma", "v": 1.000},
        {"x": "Emma", "y": "Frank", "v": 0.755},
        {"x": "Emma", "y": "Grace", "v": 0.885},
        {"x": "Emma", "y": "Henry", "v": 0.880},
        {"x": "Emma", "y": "Ivy", "v": 0.925},
        {"x": "Emma", "y": "Jack", "v": 0.920},
        {"x": "Frank", "y": "Frank", "v": 1.000},
        {"x": "Frank", "y": "Grace", "v": 0.825},
        {"x": "Frank", "y": "Henry", "v": 0.820},
        {"x": "Frank", "y": "Ivy", "v": 0.795},
        {"x": "Frank", "y": "Jack", "v": 0.790},
        {"x": "Grace", "y": "Grace", "v": 1.000},
        {"x": "Grace", "y": "Henry", "v": 0.990},
        {"x": "Grace", "y": "Ivy", "v": 0.895},
        {"x": "Grace", "y": "Jack", "v": 0.890},
        {"x": "Henry", "y": "Henry", "v": 1.000},
        {"x": "Henry", "y": "Ivy", "v": 0.890},
        {"x": "Henry", "y": "Jack", "v": 0.885},
        {"x": "Ivy", "y": "Ivy", "v": 1.000},
        {"x": "Ivy", "y": "Jack", "v": 0.995},
        {"x": "Jack", "y": "Jack", "v": 1.000}
      ],
      "backgroundColor": [
        "rgba(255, 99, 132, 0.2)",
        "rgba(54, 162, 235, 0.2)",
        "rgba(255, 206, 86, 0.2)",
        "rgba(75, 192, 192, 0.2)",
        "rgba(153, 102, 255, 0.2)",
        "rgba(255, 159, 64, 0.2)",
        "rgba(199, 199, 199, 0.2)",
        "rgba(83, 83, 83, 0.2)",
        "rgba(255, 99, 132, 0.4)",
        "rgba(54, 162, 235, 0.4)"
      ],
      "borderColor": [
        "rgba(255, 99, 132, 1)",
        "rgba(54, 162, 235, 1)",
        "rgba(255, 206, 86, 1)",
        "rgba(75, 192, 192, 1)",
        "rgba(153, 102, 255, 1)",
        "rgba(255, 159, 64, 1)",
        "rgba(199, 199, 199, 1)",
        "rgba(83, 83, 83, 1)",
        "rgba(255, 99, 132, 0.8)",
        "rgba(54, 162, 235, 0.8)"
      ],
      "borderWidth": 1
    }]
  },
  "options": {
    "scales": {
      "x": {
        "title": {
          "display": true,
          "text": "Employees"
        }
      },
      "y": {
        "title": {
          "display": true,
          "text": "Employees"
        }
      }
    },
    "plugins": {
      "legend": {
        "display": false
      },
      "title": {
        "display": true,
        "text": "Cosine Similarity Heatmap of Employee Embeddings"
      }
    }
  }
}

Analysis

  • Tech Cluster: Alice, Bob, Dave, Emma have pairwise similarities of 0.98 or higher, reflecting shared tech roles (Data Scientist, Software Engineer), departments (Engineering, R&D), and skills (Python, SQL). Their high ratings and experience reinforce proximity.
  • Non-Tech Cluster: Carol and Frank (similarity 0.95) share Product Manager roles, Marketing department, and Leadership skills.
  • Mixed Cluster: Grace and Henry (similarity 0.99) are analysts with overlapping skills (e.g., SQL, Analytics) but different departments, showing TVRE’s ability to capture skill-driven similarity.
  • Cross-Cluster: Ivy and Jack (similarity 0.995) hold hybrid roles (e.g., DevOps with both tech and management skills), bridging clusters.

Connection to Article

  • Normalization: The schema’s 3NF structure avoids redundancy (e.g., skills are not duplicated in the Employees table), ensuring clean data for embedding.
  • SQL Inference: The join query reconstructs relationships dynamically, mimicking how historical SQL could train TVRE to learn patterns (e.g., frequent tech-skill joins).
  • TVRE Training: The composite embeddings simulate a trained TVRE model, where weights (e.g., Role: 0.4) could be optimized using historical queries or business heuristics. Cosine similarity leverages these embeddings for semantic tasks.

Extensions

  • Scalability: In production, a vector database (e.g., Pinecone) could handle millions of embeddings, with cosine similarity computed via optimized ANN (Approximate Nearest Neighbor) algorithms.
  • Dynamic Updates: New performance reviews could update embeddings incrementally, with weights adjusted for recency.
  • LLM Integration: Embeddings could feed into an LLM for natural language queries (e.g., “Find employees like Alice”), using cosine similarity to rank results.
  • Anomaly Detection: Employees with low similarity to any cluster (e.g., < 0.7 to all) could be flagged as outliers (e.g., misclassified roles).
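The "find employees like Alice" extension reduces to ranking by cosine similarity, and the anomaly check to a max-similarity cutoff. A sketch on the three sample embeddings (the names, vectors, and 0.7 cutoff are this example’s illustrative values):

```python
import numpy as np

embeddings = {
    "Alice": np.array([0.870, 0.650, 0.210, 0.260, 0.170]),
    "Bob":   np.array([0.830, 0.610, 0.230, 0.280, 0.190]),
    "Carol": np.array([0.450, 0.720, 0.610, 0.510, 0.430]),
}

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Rank every other employee by cosine similarity to the query employee."""
    q = embeddings[query]
    scores = [(name, cos(q, v)) for name, v in embeddings.items() if name != query]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:k]

def outliers(cutoff: float = 0.7) -> list[str]:
    """Flag employees whose best similarity to anyone else is below the cutoff."""
    names = list(embeddings)
    flagged = []
    for n in names:
        best = max(cos(embeddings[n], embeddings[m]) for m in names if m != n)
        if best < cutoff:
            flagged.append(n)
    return flagged

print(most_similar("Alice"))  # Bob ranks first, then Carol
print(outliers())             # [] -- everyone clears 0.7 in this tiny sample
```

In production the brute-force loop in `most_similar` would be replaced by an ANN index in a vector database, exactly as the scalability bullet suggests.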

This example demonstrates cosine similarity’s power in a complex, multi-attribute TVRE setup, bridging structured RDBs with semantic vector spaces. If you’d like to explore another visualization (e.g., scatter plot of embeddings via PCA) or a specific use case (e.g., anomaly detection), let me know!

