# Rel2Vec Case Study

### Key Points

– Used the Northwind database to show how relational data turns into machine learning embeddings.

– Process involves extracting data, building contexts from orders, generating embeddings, and using them for recommendations.

– Example: Recommended products to customer ALFKI based on similar product embeddings.

### Data Extraction

We start with the Northwind database, a sample e-commerce dataset with tables for customers, orders, order details, and products. We identify relationships like customers linked to orders and orders to products.

### Context Construction

Each order is treated as a context where the customer and ordered products co-occur. For example:

– Order 10248 includes customer ALFKI and products 11 (Queens Camilla Tea) and 42 (Singaporean Hokkien Fried Mee).

### Embedding Generation

We use a Word2Vec-like model to train embeddings, treating each order as a document with customer and product IDs as “words.” This captures relationships, like ALFKI being related to products 11 and 42.

### Application: Recommendation

For customer ALFKI, we find products with similar embeddings to 11 and 42, recommending product 29 (Thüringer Rostbratwurst) based on calculated distances.

### Surprising Detail: Graph-Based Approach

Interestingly, the process could also use graph embeddings like Node2Vec, treating the data as a bipartite graph of customers and products, potentially capturing more nuanced relationships.

### Comprehensive Analysis of Converting Relational Data to Machine Learning Embeddings

This analysis details the process of transforming relational database data into machine learning (ML)-friendly embeddings, using the Northwind database as a case example. It aligns with the Rel2Vec system described at [https://lfyadda.com/from-relational-databases-to-machine-learning-embeddings-a-system-for-converting-oracle-data-into-relationship-tables/](https://lfyadda.com/from-relational-databases-to-machine-learning-embeddings-a-system-for-converting-oracle-data-into-relationship-tables/), which focuses on extracting relationship structures for ML applications.

#### Background and Motivation

Relational databases, such as those using Oracle, store data in tables with defined relationships, often through foreign keys and joins. For ML, this data needs to be converted into numerical representations (embeddings) that capture semantic and relational information, enabling tasks like clustering, classification, and recommendation. The Rel2Vec system proposes a method to bridge this gap, leveraging co-occurrence and graph-based techniques, inspired by models like Word2Vec and Node2Vec.

#### Case Example: Northwind Database

The Northwind database, a sample e-commerce dataset, provides a practical context. It includes tables for **Customers**, **Orders**, **OrderDetails**, and **Products**, with relationships such as customers linked to orders via `CustomerID`, and orders linked to products via `OrderDetails`. We use sample data modeled on this database to illustrate the process.

##### Data Extraction

First, we connect to the Northwind database and extract the schema:

– **Customers**: Contains fields like `CustomerID`, `CompanyName`, `ContactName`.

– **Orders**: Includes `OrderID`, `CustomerID`, `OrderDate`.

– **OrderDetails**: Links `OrderID` to `ProductID`, with fields like `Quantity` and `UnitPrice`.

– **Products**: Includes `ProductID`, `ProductName`, `CategoryID`.

We identify join paths:

– Customers → Orders (via `CustomerID`).

– Orders → OrderDetails (via `OrderID`).

– OrderDetails → Products (via `ProductID`).

This step mirrors the Rel2Vec system’s data extraction phase, reading schema and determining relationships.
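
As a concrete sketch of the extraction step, the query below walks the join path and emits one `(OrderID, CustomerID, ProductID)` row per order line. It assumes a local SQLite copy of Northwind named `northwind.db` with a table called `OrderDetails` (some distributions name it `Order Details`, with a space); adapt the connection and names to your environment.

```python
import sqlite3

# Extraction sketch: assumes a local SQLite copy of Northwind named
# "northwind.db" and a table called OrderDetails (some distributions
# use "Order Details" with a space).
conn = sqlite3.connect("northwind.db")

# Follow the join path Orders -> OrderDetails and emit one
# (OrderID, CustomerID, ProductID) row per order line.
rows = conn.execute("""
    SELECT o.OrderID, o.CustomerID, od.ProductID
    FROM Orders o
    JOIN OrderDetails od ON od.OrderID = o.OrderID
    ORDER BY o.OrderID
""").fetchall()

for order_id, customer_id, product_id in rows[:5]:
    print(order_id, customer_id, product_id)
```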

##### Context Construction

Next, we construct contexts by treating each order as a unit where entities (customers and products) co-occur. For each order, we collect the customer ID and all product IDs associated with it. Using sample data:

| OrderID | CustomerID | ProductIDs | Notes |
|---------|------------|------------|-------|
| 10248 | ALFKI | 11, 42 | ALFKI ordered Queens Camilla Tea and Singaporean Hokkien Fried Mee |
| 10280 | ANATR | 7, 13 | ANATR ordered Uncle Bob’s Organic Dried Pears and Konbu |
| 10281 | ANTON | 29, 30 | ANTON ordered Thüringer Rostbratwurst and Nordic Sauces |

Product names are retrieved from the Products table for clarity:

– Product 11: Queens Camilla Tea

– Product 42: Singaporean Hokkien Fried Mee

– Product 7: Uncle Bob’s Organic Dried Pears

– Product 13: Konbu

– Product 29: Thüringer Rostbratwurst

– Product 30: Nordic Sauces

This step aligns with Rel2Vec’s context construction, identifying entities and collecting co-occurrence information, treating orders as contexts.
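
A minimal sketch of this grouping, assuming the `(OrderID, CustomerID, ProductID)` rows from the extraction step; the `P` prefix on product IDs is our own convention to keep customer and product tokens distinct:

```python
from collections import defaultdict

# Sample rows in the (OrderID, CustomerID, ProductID) shape produced by
# the extraction step above.
rows = [
    (10248, "ALFKI", 11), (10248, "ALFKI", 42),
    (10280, "ANATR", 7),  (10280, "ANATR", 13),
    (10281, "ANTON", 29), (10281, "ANTON", 30),
]

# One context per order: the customer token followed by product tokens.
contexts = defaultdict(list)
for order_id, customer_id, product_id in rows:
    if customer_id not in contexts[order_id]:
        contexts[order_id].append(customer_id)
    contexts[order_id].append(f"P{product_id}")  # "P" avoids ID collisions

for order_id, tokens in sorted(contexts.items()):
    print(order_id, tokens)
# 10248 ['ALFKI', 'P11', 'P42'], 10280 ['ANATR', 'P7', 'P13'], ...
```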

##### Embedding Generation

We generate embeddings using a Word2Vec-like model, treating each order as a document with customer and product IDs as “words.” For example:

– Document for Order 10248: [ALFKI, 11, 42]

– Document for Order 10280: [ANATR, 7, 13]

– Document for Order 10281: [ANTON, 29, 30]

The model, such as Skip-gram or CBOW, learns embeddings such that entities appearing in the same order have similar representations. For instance, ALFKI’s embedding will be close to those of products 11 and 42, capturing their relationship.
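
A training sketch using gensim's `Word2Vec` (our choice of library, not one the source prescribes; any Skip-gram/CBOW implementation works). A three-order corpus is far too small for meaningful vectors and is shown only to make the mechanics concrete:

```python
from gensim.models import Word2Vec

# One "sentence" per order: customer token plus product tokens.
sentences = [
    ["ALFKI", "P11", "P42"],
    ["ANATR", "P7", "P13"],
    ["ANTON", "P29", "P30"],
]

# Skip-gram (sg=1); the window spans a whole order so every entity in an
# order serves as context for every other.
model = Word2Vec(sentences, vector_size=16, window=5,
                 min_count=1, sg=1, epochs=200)

# Entities that co-occur in orders end up near each other.
print(model.wv.most_similar("ALFKI", topn=3))
```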

Rel2Vec also mentions graph-based methods like Node2Vec or DeepWalk, which could treat the data as a bipartite graph:

– Nodes: Customers (e.g., ALFKI, ANATR) and Products (e.g., 11, 42).

– Edges: Between customers and products if the customer ordered the product.

Random walks on this graph would generate sequences (e.g., ALFKI → 11 → another customer who also ordered product 11 → one of their products), and embeddings would be trained on these sequences, potentially capturing more nuanced relationships. However, for simplicity, we focus on the co-occurrence approach.
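
For completeness, here is a minimal DeepWalk-style sketch of the graph variant: uniform random walks on the bipartite graph (Node2Vec would additionally bias the walks with its return and in-out parameters). The walk sequences would then be fed to Word2Vec exactly like the order contexts; note that walks only hop between customers when they share products, which the tiny sample below does not:

```python
import random

# Bipartite edges: customer -> product for every order line.
edges = [("ALFKI", "P11"), ("ALFKI", "P42"),
         ("ANATR", "P7"),  ("ANATR", "P13"),
         ("ANTON", "P29"), ("ANTON", "P30")]

# Build an undirected adjacency list.
adj = {}
for u, v in edges:
    adj.setdefault(u, []).append(v)
    adj.setdefault(v, []).append(u)

def random_walk(start, length=4):
    """Uniform random walk (DeepWalk); Node2Vec would bias these choices."""
    walk = [start]
    while len(walk) < length:
        walk.append(random.choice(adj[walk[-1]]))
    return walk

# Several walks per node become the training "sentences" for Word2Vec.
walks = [random_walk(node) for node in adj for _ in range(10)]
print(walks[0])  # e.g. ['ALFKI', 'P11', 'ALFKI', 'P42']
```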

For illustration, assume simplified 2D embeddings:

– ALFKI: [0.5, 0.5]

– Product 11: [0.6, 0.4]

– Product 42: [0.4, 0.6]

– Product 7: [0.1, 0.9]

– Product 13: [0.2, 0.8]

– Product 29: [0.5, 0.5]

– Product 30: [0.8, 0.2]

These embeddings are hypothetical but demonstrate the concept.

##### Export and Integration

The embeddings are stored, e.g., in an `EntityEmbeddings` table with fields:

– `EntityType`: Customer or Product

– `EntityID`: ALFKI, 11, etc.

– `Vector`: The embedding vector, e.g., [0.5, 0.5] for ALFKI.

Rel2Vec suggests providing APIs or SQL extensions for querying, such as finding similar products to a customer.
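
A storage sketch following the `EntityEmbeddings` layout above, using SQLite with the vector serialized as JSON text (one simple option among many; an array column or a dedicated vector store would also work):

```python
import json
import sqlite3

conn = sqlite3.connect("northwind.db")

# EntityEmbeddings table with the fields listed above; the vector is
# serialized as JSON text for simplicity.
conn.execute("""
    CREATE TABLE IF NOT EXISTS EntityEmbeddings (
        EntityType TEXT NOT NULL,
        EntityID   TEXT NOT NULL,
        Vector     TEXT NOT NULL,
        PRIMARY KEY (EntityType, EntityID)
    )
""")
conn.execute(
    "INSERT OR REPLACE INTO EntityEmbeddings VALUES (?, ?, ?)",
    ("Customer", "ALFKI", json.dumps([0.5, 0.5])),
)
conn.commit()

# Reading a vector back is a SELECT plus json.loads.
row = conn.execute(
    "SELECT Vector FROM EntityEmbeddings WHERE EntityID = 'ALFKI'"
).fetchone()
print(json.loads(row[0]))  # [0.5, 0.5]
```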

##### Application: Recommendation System

To demonstrate utility, we use embeddings for recommending products to customer ALFKI:

1. ALFKI has ordered products 11 and 42.

2. Find products similar to 11 and 42 using cosine similarity or Euclidean distance:

   – Distance between Product 11 ([0.6, 0.4]) and others:

     – Product 7: sqrt((0.6-0.1)^2 + (0.4-0.9)^2) ≈ 0.707

     – Product 13: sqrt((0.6-0.2)^2 + (0.4-0.8)^2) ≈ 0.566

     – Product 29: sqrt((0.6-0.5)^2 + (0.4-0.5)^2) ≈ 0.141

     – Product 30: sqrt((0.6-0.8)^2 + (0.4-0.2)^2) ≈ 0.283

   – Distance between Product 42 ([0.4, 0.6]) and others:

     – Product 7: sqrt((0.4-0.1)^2 + (0.6-0.9)^2) ≈ 0.424

     – Product 13: sqrt((0.4-0.2)^2 + (0.6-0.8)^2) ≈ 0.283

     – Product 29: sqrt((0.4-0.5)^2 + (0.6-0.5)^2) ≈ 0.141

     – Product 30: sqrt((0.4-0.8)^2 + (0.6-0.2)^2) ≈ 0.566

3. Product 29 is closest to both 11 and 42 (distance ≈ 0.141), so recommend Product 29 (Thüringer Rostbratwurst) to ALFKI.

This recommendation leverages the learned relationships, aligning with Rel2Vec’s applications like recommendations and anomaly detection.
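
The whole recommendation step fits in a few lines. This sketch reproduces the worked example above with the hypothetical 2D vectors, ranking candidates by their average Euclidean distance to ALFKI's purchases:

```python
import math

# Hypothetical 2D embeddings from the example above.
embeddings = {
    11: [0.6, 0.4], 42: [0.4, 0.6],
    7:  [0.1, 0.9], 13: [0.2, 0.8],
    29: [0.5, 0.5], 30: [0.8, 0.2],
}

ordered = {11, 42}                       # products ALFKI already bought
candidates = set(embeddings) - ordered   # everything else is a candidate

# Rank candidates by mean Euclidean distance to ALFKI's purchases.
scores = {
    p: sum(math.dist(embeddings[p], embeddings[q]) for q in ordered) / len(ordered)
    for p in candidates
}
best = min(scores, key=scores.get)
print(best, round(scores[best], 3))  # 29 0.141, matching the table above
```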

#### Evaluation and Future Considerations

Rel2Vec plans to evaluate on metrics like similarity quality (cosine distance) and downstream tasks, comparing with baselines like one-hot encoding or Node2Vec. Our example is simplified, but in practice, larger datasets would yield more robust embeddings. Future work includes handling complex schemas (multi-hop joins), incorporating temporal data (e.g., `OrderDate`), and integrating with ML frameworks like TensorFlow or PyTorch.

#### Conclusion

This case example demonstrates how relational data from the Northwind database can be transformed into ML embeddings using a process akin to Rel2Vec. By extracting relationships, constructing contexts, generating embeddings, and applying them to recommendations, we bridge the gap between relational databases and ML pipelines, enabling data-driven insights.

### Key Citations

– [Rel2Vec System Converting Oracle Data into Relationship Tables](https://lfyadda.com/from-relational-databases-to-machine-learning-embeddings-a-system-for-converting-oracle-data-into-relationship-tables/)

– [Northwind Database Sample Data](https://northwinddatabase.codeplex.com/)

