A More Complex Example of Cosine Similarity: Bridging Relational Databases and Vector Embeddings


To provide a more complex example of cosine similarity in the context of bridging relational databases and vector embeddings, I’ll expand on the original discussion from the article. This example simulates a realistic scenario with a larger dataset, multiple attributes contributing to embeddings, and a more intricate use case. The goal is to demonstrate how cosine similarity can be used for advanced semantic tasks, such as clustering employees based on their roles, departments, skills, and performance metrics, while leveraging the principles of normalization, SQL inference, and Token-Vector Relational Embedding (TVRE) training.

Scenario Overview

Imagine a company with a normalized relational database containing:

  • Employees (ID, Name, Role, Salary, YearsExperience)
  • Departments (ID, Name)
  • Skills (ID, EmployeeID, SkillName)
  • PerformanceReviews (ID, EmployeeID, Rating, ReviewYear)

The dataset is larger (10 employees, 3 departments, multiple skills, and performance ratings from 2023–2025). The use case is to identify clusters of similar employees for team formation, using cosine similarity on TVRE-like embeddings. These embeddings combine:

  • Role vectors (semantic similarity, e.g., “Data Scientist” ≈ “Machine Learning Engineer”).
  • Department vectors (e.g., “Engineering” ≈ “R&D”).
  • Skill vectors (e.g., “Python” and “R” are close).
  • Normalized salary and experience (scaled to [0,1]).
  • Performance rating averages (weighted by recency).

This setup is more complex because it:

  • Integrates multiple tables (requiring SQL joins for inference).
  • Uses higher-dimensional embeddings (5D instead of 3D).
  • Incorporates weighted attributes and temporal data (recent performance matters more).
  • Applies cosine similarity to cluster employees, simulating a real-world AI-driven task.

Environment Setup

We’ll simulate this in Python using SQLite for the normalized database, NumPy for vector operations, and SciPy for cosine similarity. The embeddings are constructed as follows:

  • Role Vectors: Predefined 5D vectors based on semantic proximity (e.g., tech roles cluster together).
  • Department Vectors: 5D vectors reflecting department similarity.
  • Skill Vectors: Averaged across an employee’s skills, with predefined vectors for each skill.
  • Salary and Experience: Normalized to [0,1] based on dataset min-max.
  • Performance: Weighted average of ratings (2025: 0.5, 2024: 0.3, 2023: 0.2) normalized to [0,1].
  • Composite Embedding: Weighted combination (Role: 0.4, Department: 0.3, Skills: 0.2, Salary+Experience: 0.1, Performance: 0.1).

Cosine similarity is computed between all employee pairs to identify clusters, which are visualized in a heatmap to show similarity patterns.
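The all-pairs computation can be sketched in a few lines of NumPy, using this example’s illustrative 5D composite embeddings for Alice, Bob, and Carol (sample values, not real data):

```python
import numpy as np

# Illustrative 5D composite embeddings from this example (not real data).
employees = ["Alice", "Bob", "Carol"]
E = np.array([
    [0.870, 0.650, 0.210, 0.260, 0.170],  # Alice
    [0.830, 0.610, 0.230, 0.280, 0.190],  # Bob
    [0.450, 0.720, 0.610, 0.510, 0.430],  # Carol
])

# Row-normalize, then one matrix product yields every pairwise cosine at once.
E_norm = E / np.linalg.norm(E, axis=1, keepdims=True)
S = E_norm @ E_norm.T  # symmetric, with 1.0 on the diagonal

print(np.round(S, 3))
```

With these rounded sample vectors the Alice–Bob entry lands near 0.999 rather than the article’s illustrative 0.992, which presumably came from unrounded embeddings; the cluster structure (Alice–Bob close, Carol distant) is the same.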

Simulated Dataset

Tables (simplified for brevity):

  • Employees: 10 employees with roles like Data Scientist, Software Engineer, Product Manager, etc.
  • Departments: Engineering, R&D, Marketing.
  • Skills: Each employee has 2–4 skills (e.g., Python, SQL, Leadership).
  • PerformanceReviews: Ratings (1–5) for 2023–2025.
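This normalized schema can be stood up in-memory with SQLite. A minimal sketch, with table and column names taken from this example and only the three illustrated employees loaded:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (3NF) schema: each fact lives in exactly one table.
cur.executescript("""
CREATE TABLE Departments (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Employees (
    ID INTEGER PRIMARY KEY, Name TEXT, Role TEXT,
    Salary REAL, YearsExperience REAL,
    DepartmentID INTEGER REFERENCES Departments(ID)
);
CREATE TABLE Skills (
    ID INTEGER PRIMARY KEY,
    EmployeeID INTEGER REFERENCES Employees(ID),
    SkillName TEXT
);
CREATE TABLE PerformanceReviews (
    ID INTEGER PRIMARY KEY,
    EmployeeID INTEGER REFERENCES Employees(ID),
    Rating INTEGER, ReviewYear INTEGER
);
""")

cur.executemany("INSERT INTO Departments VALUES (?, ?)",
                [(1, "Engineering"), (2, "R&D"), (3, "Marketing")])
cur.executemany("INSERT INTO Employees VALUES (?, ?, ?, ?, ?, ?)", [
    (1, "Alice", "Data Scientist", 120000, 5, 1),
    (2, "Bob", "Software Engineer", 110000, 4, 2),
    (3, "Carol", "Product Manager", 115000, 6, 3),
])
cur.executemany("INSERT INTO Skills (EmployeeID, SkillName) VALUES (?, ?)",
                [(1, "Python"), (1, "SQL"), (2, "Python"), (2, "Java")])
cur.executemany(
    "INSERT INTO PerformanceReviews (EmployeeID, Rating, ReviewYear) VALUES (?, ?, ?)",
    [(1, 4, 2025), (1, 3, 2024), (2, 5, 2025)])
conn.commit()
```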

Sample Data (subset for illustration):

Employees (with department names resolved):

| EmployeeID | Name  | Role              | Salary | YearsExperience | Department  |
|------------|-------|-------------------|--------|-----------------|-------------|
| 1          | Alice | Data Scientist    | 120000 | 5               | Engineering |
| 2          | Bob   | Software Engineer | 110000 | 4               | R&D         |
| 3          | Carol | Product Manager   | 115000 | 6               | Marketing   |

Skills:

| EmployeeID | SkillName |
|------------|-----------|
| 1          | Python    |
| 1          | SQL       |
| 2          | Python    |
| 2          | Java      |

PerformanceReviews:

| EmployeeID | Rating | ReviewYear |
|------------|--------|------------|
| 1          | 4      | 2025       |
| 1          | 3      | 2024       |
| 2          | 5      | 2025       |

Generating Embeddings

  1. Normalization: The schema is in 3NF, with separate tables for employees, departments, skills, and reviews to eliminate redundancy.
  2. SQL Inference: A join query combines data:
   SELECT e.Name, e.Role, e.Salary, e.YearsExperience, d.Name AS DepartmentName,
          GROUP_CONCAT(DISTINCT s.SkillName) AS Skills, AVG(pr.Rating) AS AvgRating
   FROM Employees e
   JOIN Departments d ON e.DepartmentID = d.ID
   LEFT JOIN Skills s ON e.ID = s.EmployeeID
   LEFT JOIN PerformanceReviews pr ON e.ID = pr.EmployeeID
   GROUP BY e.ID;

This infers relationships and aggregates skills/performance.
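One subtlety worth flagging: LEFT JOINing both Skills and PerformanceReviews fans out the row count (each skill row pairs with each review row), so the skill list needs DISTINCT to avoid repeats, while AVG is unaffected here because every rating is duplicated equally. A compact sketch with hypothetical in-memory data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employees (ID INTEGER PRIMARY KEY, Name TEXT, DepartmentID INTEGER);
CREATE TABLE Departments (ID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Skills (EmployeeID INTEGER, SkillName TEXT);
CREATE TABLE PerformanceReviews (EmployeeID INTEGER, Rating INTEGER, ReviewYear INTEGER);
INSERT INTO Departments VALUES (1, 'Engineering');
INSERT INTO Employees VALUES (1, 'Alice', 1);
INSERT INTO Skills VALUES (1, 'Python'), (1, 'SQL');
INSERT INTO PerformanceReviews VALUES (1, 4, 2025), (1, 3, 2024);
""")

# 2 skills x 2 reviews = 4 joined rows; DISTINCT collapses the skill list,
# and AVG(Rating) still comes out to (4 + 3) / 2 = 3.5.
row = conn.execute("""
    SELECT e.Name, d.Name AS DepartmentName,
           GROUP_CONCAT(DISTINCT s.SkillName) AS Skills,
           AVG(pr.Rating) AS AvgRating
    FROM Employees e
    JOIN Departments d ON e.DepartmentID = d.ID
    LEFT JOIN Skills s ON e.ID = s.EmployeeID
    LEFT JOIN PerformanceReviews pr ON e.ID = pr.EmployeeID
    GROUP BY e.ID
""").fetchone()

print(row)
```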

  3. TVRE Simulation:
  • Role vectors (e.g., Data Scientist: [0.9, 0.8, 0.2, 0.3, 0.1], Software Engineer: [0.85, 0.75, 0.25, 0.35, 0.15]).
  • Department vectors (e.g., Engineering: [0.8, 0.7, 0.1, 0.2, 0.3]).
  • Skill vectors (e.g., Python: [0.9, 0.2, 0.8, 0.1, 0.3]).
  • Salary/Experience normalized (e.g., 120000 → 0.8, 5 years → 0.625).
  • Performance weighted (e.g., Alice: (4 × 0.5 + 3 × 0.3 + 0 × 0.2) / 5 = 0.58; she has no 2023 review, so that term contributes zero).
  • Composite embedding: Combine the weighted component vectors and scalar features into a single 5D vector per employee (a weighted sum rather than a literal concatenation, so the dimensionality stays at 5).
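The recency weighting can be made concrete. A sketch, assuming (as the 0.58 figure for Alice implies) that a missing year contributes zero:

```python
# Recency weights from this example; a missing year contributes 0.
WEIGHTS = {2025: 0.5, 2024: 0.3, 2023: 0.2}
MAX_RATING = 5  # ratings are on a 1-5 scale

def weighted_performance(ratings_by_year: dict[int, int]) -> float:
    """Recency-weighted rating, normalized to [0, 1]."""
    total = sum(w * ratings_by_year.get(year, 0) for year, w in WEIGHTS.items())
    return total / MAX_RATING

# Alice has reviews only for 2025 (rating 4) and 2024 (rating 3):
alice = weighted_performance({2025: 4, 2024: 3})
print(round(alice, 2))  # 0.58
```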

Sample Embeddings (5D):

  • Alice: [0.870, 0.650, 0.210, 0.260, 0.170]
  • Bob: [0.830, 0.610, 0.230, 0.280, 0.190]
  • Carol: [0.450, 0.720, 0.610, 0.510, 0.430]
  • … (up to 10 employees).

Cosine Similarity Calculation

Cosine similarity is computed for all pairs using:
cosine_similarity(A, B) = (A · B) / (‖A‖ ‖B‖)

This produces a 10×10 similarity matrix. Because all embedding components here are non-negative, the values fall in [0, 1] (in general, cosine similarity ranges over [−1, 1]).
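The formula can be checked directly against the sample embeddings. A NumPy sketch (note that SciPy’s `scipy.spatial.distance.cosine` returns the cosine *distance*, i.e. one minus this similarity):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """cos(A, B) = (A . B) / (||A|| ||B||)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

alice = np.array([0.870, 0.650, 0.210, 0.260, 0.170])
bob = np.array([0.830, 0.610, 0.230, 0.280, 0.190])

sim = cosine_similarity(alice, bob)
print(round(sim, 3))
```

With these rounded sample vectors the result lands near 0.999 rather than the article’s illustrative 0.992, which presumably reflects the unrounded embeddings.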

Sample Similarity Matrix (subset):

|       | Alice | Bob   | Carol |
|-------|-------|-------|-------|
| Alice | 1.000 | 0.992 | 0.765 |
| Bob   | 0.992 | 1.000 | 0.771 |
| Carol | 0.765 | 0.771 | 1.000 |

Clustering and Interpretation

Using a threshold (e.g., similarity ≥ 0.95), we identify clusters:

  • Cluster 1 (Tech): Alice, Bob, Dave, Emma (Data Scientists/Engineers in Engineering/R&D, Python/SQL skills, high ratings).
  • Cluster 2 (Non-Tech): Carol, Frank (Product Managers in Marketing, Leadership skills).
  • Cluster 3 (Mixed): Grace, Henry (Analysts with hybrid skills).

This clustering reflects semantic similarity (tech vs. non-tech) driven by TVRE embeddings, not just explicit SQL filters.
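Threshold clustering over the similarity matrix amounts to finding connected components of the "similar enough" graph. A minimal sketch with the three sample employees, using union-find; at a 0.95 threshold Alice and Bob merge while Carol stays alone:

```python
import numpy as np

employees = ["Alice", "Bob", "Carol"]
S = np.array([
    [1.000, 0.992, 0.765],
    [0.992, 1.000, 0.771],
    [0.765, 0.771, 1.000],
])
THRESHOLD = 0.95

# Union-find over pairs whose similarity meets the threshold.
parent = list(range(len(employees)))

def find(i: int) -> int:
    while parent[i] != i:
        parent[i] = parent[parent[i]]  # path halving
        i = parent[i]
    return i

for i in range(len(employees)):
    for j in range(i + 1, len(employees)):
        if S[i, j] >= THRESHOLD:
            parent[find(i)] = find(j)

clusters: dict[int, list[str]] = {}
for i, name in enumerate(employees):
    clusters.setdefault(find(i), []).append(name)

print(list(clusters.values()))  # [['Alice', 'Bob'], ['Carol']]
```

On the full 10×10 matrix the same loop recovers the tech, non-tech, and mixed groupings described above.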

Visualization

To display the similarity matrix, I’ll create a heatmap where the x and y axes are employee names and color intensity encodes cosine similarity (darker = higher similarity); a sequential color scale makes the clusters easiest to read. The configuration below uses the Chart.js matrix chart format and carries each employee pair once (the upper triangle, plus the unit diagonal).

{
  "type": "matrix",
  "data": {
    "labels": {
      "x": ["Alice", "Bob", "Carol", "Dave", "Emma", "Frank", "Grace", "Henry", "Ivy", "Jack"],
      "y": ["Alice", "Bob", "Carol", "Dave", "Emma", "Frank", "Grace", "Henry", "Ivy", "Jack"]
    },
    "datasets": [{
      "label": "Cosine Similarity",
      "data": [
        {"x": "Alice", "y": "Alice", "v": 1.000},
        {"x": "Alice", "y": "Bob", "v": 0.992},
        {"x": "Alice", "y": "Carol", "v": 0.765},
        {"x": "Alice", "y": "Dave", "v": 0.987},
        {"x": "Alice", "y": "Emma", "v": 0.980},
        {"x": "Alice", "y": "Frank", "v": 0.750},
        {"x": "Alice", "y": "Grace", "v": 0.870},
        {"x": "Alice", "y": "Henry", "v": 0.865},
        {"x": "Alice", "y": "Ivy", "v": 0.910},
        {"x": "Alice", "y": "Jack", "v": 0.905},
        {"x": "Bob", "y": "Bob", "v": 1.000},
        {"x": "Bob", "y": "Carol", "v": 0.771},
        {"x": "Bob", "y": "Dave", "v": 0.995},
        {"x": "Bob", "y": "Emma", "v": 0.989},
        {"x": "Bob", "y": "Frank", "v": 0.755},
        {"x": "Bob", "y": "Grace", "v": 0.875},
        {"x": "Bob", "y": "Henry", "v": 0.870},
        {"x": "Bob", "y": "Ivy", "v": 0.915},
        {"x": "Bob", "y": "Jack", "v": 0.910},
        {"x": "Carol", "y": "Carol", "v": 1.000},
        {"x": "Carol", "y": "Dave", "v": 0.780},
        {"x": "Carol", "y": "Emma", "v": 0.775},
        {"x": "Carol", "y": "Frank", "v": 0.950},
        {"x": "Carol", "y": "Grace", "v": 0.820},
        {"x": "Carol", "y": "Henry", "v": 0.815},
        {"x": "Carol", "y": "Ivy", "v": 0.790},
        {"x": "Carol", "y": "Jack", "v": 0.785},
        {"x": "Dave", "y": "Dave", "v": 1.000},
        {"x": "Dave", "y": "Emma", "v": 0.993},
        {"x": "Dave", "y": "Frank", "v": 0.760},
        {"x": "Dave", "y": "Grace", "v": 0.880},
        {"x": "Dave", "y": "Henry", "v": 0.875},
        {"x": "Dave", "y": "Ivy", "v": 0.920},
        {"x": "Dave", "y": "Jack", "v": 0.915},
        {"x": "Emma", "y": "Emma", "v": 1.000},
        {"x": "Emma", "y": "Frank", "v": 0.755},
        {"x": "Emma", "y": "Grace", "v": 0.885},
        {"x": "Emma", "y": "Henry", "v": 0.880},
        {"x": "Emma", "y": "Ivy", "v": 0.925},
        {"x": "Emma", "y": "Jack", "v": 0.920},
        {"x": "Frank", "y": "Frank", "v": 1.000},
        {"x": "Frank", "y": "Grace", "v": 0.825},
        {"x": "Frank", "y": "Henry", "v": 0.820},
        {"x": "Frank", "y": "Ivy", "v": 0.795},
        {"x": "Frank", "y": "Jack", "v": 0.790},
        {"x": "Grace", "y": "Grace", "v": 1.000},
        {"x": "Grace", "y": "Henry", "v": 0.990},
        {"x": "Grace", "y": "Ivy", "v": 0.895},
        {"x": "Grace", "y": "Jack", "v": 0.890},
        {"x": "Henry", "y": "Henry", "v": 1.000},
        {"x": "Henry", "y": "Ivy", "v": 0.890},
        {"x": "Henry", "y": "Jack", "v": 0.885},
        {"x": "Ivy", "y": "Ivy", "v": 1.000},
        {"x": "Ivy", "y": "Jack", "v": 0.995},
        {"x": "Jack", "y": "Jack", "v": 1.000}
      ],
      "backgroundColor": [
        "rgba(255, 99, 132, 0.2)",
        "rgba(54, 162, 235, 0.2)",
        "rgba(255, 206, 86, 0.2)",
        "rgba(75, 192, 192, 0.2)",
        "rgba(153, 102, 255, 0.2)",
        "rgba(255, 159, 64, 0.2)",
        "rgba(199, 199, 199, 0.2)",
        "rgba(83, 83, 83, 0.2)",
        "rgba(255, 99, 132, 0.4)",
        "rgba(54, 162, 235, 0.4)"
      ],
      "borderColor": [
        "rgba(255, 99, 132, 1)",
        "rgba(54, 162, 235, 1)",
        "rgba(255, 206, 86, 1)",
        "rgba(75, 192, 192, 1)",
        "rgba(153, 102, 255, 1)",
        "rgba(255, 159, 64, 1)",
        "rgba(199, 199, 199, 1)",
        "rgba(83, 83, 83, 1)",
        "rgba(255, 99, 132, 0.8)",
        "rgba(54, 162, 235, 0.8)"
      ],
      "borderWidth": 1
    }]
  },
  "options": {
    "scales": {
      "x": {
        "title": {
          "display": true,
          "text": "Employees"
        }
      },
      "y": {
        "title": {
          "display": true,
          "text": "Employees"
        }
      }
    },
    "plugins": {
      "legend": {
        "display": false
      },
      "title": {
        "display": true,
        "text": "Cosine Similarity Heatmap of Employee Embeddings"
      }
    }
  }
}

Analysis

  • Tech Cluster: Alice, Bob, Dave, Emma have pairwise similarities of 0.98 or higher, reflecting shared tech roles (Data Scientist, Software Engineer), departments (Engineering, R&D), and skills (Python, SQL). Their high ratings and experience reinforce proximity.
  • Non-Tech Cluster: Carol and Frank (similarity 0.95) share Product Manager roles, Marketing department, and Leadership skills.
  • Mixed Cluster: Grace and Henry (similarity 0.99) are analysts with overlapping skills (e.g., SQL, Analytics) but different departments, showing TVRE’s ability to capture skill-driven similarity.
  • Cross-Cluster: Ivy and Jack (similarity 0.995) hold hybrid roles (e.g., DevOps with both tech and management skills), bridging clusters.

Connection to Article

  • Normalization: The schema’s 3NF structure avoids redundancy (e.g., skills are not duplicated in the Employees table), ensuring clean data for embedding.
  • SQL Inference: The join query reconstructs relationships dynamically, mimicking how historical SQL could train TVRE to learn patterns (e.g., frequent tech-skill joins).
  • TVRE Training: The composite embeddings simulate a trained TVRE model, where weights (e.g., Role: 0.4) could be optimized using historical queries or business heuristics. Cosine similarity leverages these embeddings for semantic tasks.

Extensions

  • Scalability: In production, a vector database (e.g., Pinecone) could handle millions of embeddings, with cosine similarity computed via optimized ANN (Approximate Nearest Neighbor) algorithms.
  • Dynamic Updates: New performance reviews could update embeddings incrementally, with weights adjusted for recency.
  • LLM Integration: Embeddings could feed into an LLM for natural language queries (e.g., “Find employees like Alice”), using cosine similarity to rank results.
  • Anomaly Detection: Employees with low similarity to any cluster (e.g., < 0.7 to all) could be flagged as outliers (e.g., misclassified roles).
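The "find employees like Alice" extension reduces to ranking by cosine similarity, and the anomaly check to a max-similarity cutoff. A sketch on the three sample embeddings (the names, vectors, and 0.7 cutoff are this example’s illustrative values):

```python
import numpy as np

embeddings = {
    "Alice": np.array([0.870, 0.650, 0.210, 0.260, 0.170]),
    "Bob":   np.array([0.830, 0.610, 0.230, 0.280, 0.190]),
    "Carol": np.array([0.450, 0.720, 0.610, 0.510, 0.430]),
}

def cos(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(query: str, k: int = 2) -> list[tuple[str, float]]:
    """Rank every other employee by cosine similarity to the query employee."""
    q = embeddings[query]
    scores = [(name, cos(q, v)) for name, v in embeddings.items() if name != query]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:k]

def outliers(cutoff: float = 0.7) -> list[str]:
    """Flag employees whose best similarity to anyone else is below the cutoff."""
    names = list(embeddings)
    flagged = []
    for n in names:
        best = max(cos(embeddings[n], embeddings[m]) for m in names if m != n)
        if best < cutoff:
            flagged.append(n)
    return flagged

print(most_similar("Alice"))  # Bob ranks first, then Carol
print(outliers())             # [] -- everyone clears 0.7 in this tiny sample
```

In production the brute-force loop in `most_similar` would be replaced by an ANN index in a vector database, exactly as the scalability bullet suggests.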

This example demonstrates cosine similarity’s power in a complex, multi-attribute TVRE setup, bridging structured RDBs with semantic vector spaces. If you’d like to explore another visualization (e.g., scatter plot of embeddings via PCA) or a specific use case (e.g., anomaly detection), let me know!

