Rel2Vec – another case study


From Relational Databases to Machine Learning Embeddings: A System for Converting Oracle Data into Relationship Tables


Abstract

Modern machine learning (ML) methods often rely on numerical “embeddings” to capture relationships between entities. While these methods are well-established in the text domain—e.g., Word2Vec for word embeddings—relational databases remain an untapped source of structured relationships. This paper proposes a system to transform a relational Oracle database into a set of tables and embeddings that represent entity relationships in a manner conducive to standard ML workflows. By leveraging concepts akin to co-occurrence in text (extended to entity relationships and joins), we demonstrate how to generate rich, meaningful embeddings that reveal insights, enable recommendations, and facilitate data-driven discovery.


1. Introduction

Relational databases like Oracle store vast amounts of structured data in tables linked by foreign keys and joins. These links capture valuable relational knowledge, such as which customers purchased which products, or which employees manage specific departments. However, standard machine learning pipelines typically require numerical vector representations of data (commonly referred to as embeddings).

In contrast to textual embeddings (e.g., Word2Vec) that exploit word co-occurrence in sentences, there has been less focus on systematically extracting relationship information from relational databases into embeddings. This paper proposes a novel system—referred to hereafter as Rel2Vec—that:

  1. Extracts data from a relational database (Oracle).
  2. Transforms the relationships (foreign keys, joins, row co-occurrences) into a format analogous to “word contexts.”
  3. Generates embeddings for entities (rows) and attributes (column values), suitable for downstream ML tasks such as clustering, classification, recommendation, and anomaly detection.

2. Background and Related Work

2.1 Word2Vec and Embeddings

Word2Vec’s primary contribution was showing how local co-occurrence (words within a window of text) can be used to learn meaningful vector representations. These representations capture semantic relationships such as king – man + woman ≈ queen. Translating this idea to relational data involves redefining “context” to mean shared relationships or joint appearances in tables.
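
As a brief illustration of this analogy property (separate from Rel2Vec itself), the sketch below uses gensim and its downloadable glove-wiki-gigaword-50 pretrained vectors; both the library and the model name are assumptions of convenience, not part of the proposed system.

```python
# Minimal sketch of the vector-arithmetic analogy, assuming gensim is installed
# and its pretrained "glove-wiki-gigaword-50" vectors can be downloaded.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")  # small pretrained word vectors

# king - man + woman ~= queen, expressed as a nearest-neighbour query
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# Typically returns 'queen' (or a close variant) as the nearest vector.
```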

2.2 Graph Embeddings

Graph embedding techniques (e.g., Node2Vec, DeepWalk, TransE) have gained traction for capturing relationships in networks or knowledge graphs. Relational data can be seen as a special case of a graph, where each table row is a node, and foreign keys or joins define edges. These techniques often rely on random walks or local neighborhoods to identify which nodes appear together and can produce node embeddings that capture structural similarity.

2.3 Database Embeddings

Some recent approaches treat each row in a database as if it were a “sentence,” and each attribute as a “word,” or they treat entire joins as contexts. Such frameworks remain relatively underexplored and are often proprietary solutions in large enterprises. Our proposal seeks to formalize this process into a systematic pipeline.


3. System Overview

3.1 Architecture

Our proposed Rel2Vec system consists of four main stages:

  1. Data Extraction
    • Connect to the Oracle database.
    • Read schema metadata (tables, columns, foreign keys).
    • Determine how tables are linked (join paths); a short metadata-extraction sketch follows this list.
  2. Context Construction
    • Identify “entities” (such as rows) that should receive embeddings (e.g., Customers, Products, Orders).
    • Establish contexts by collecting “co-occurrence” information. For instance, a given Customer entity co-occurs with Orders they have placed; an Order co-occurs with the Products involved.
  3. Embedding Generation
    • Create input sequences or adjacency lists representing these relationships. This is analogous to treating each row or each join result as a “sentence” of entities/attributes.
    • Train embeddings using either a Word2Vec-like approach (CBOW, Skip-gram) or graph-based methods (Node2Vec, DeepWalk).
    • Use a softmax output layer, or efficient approximations of it such as negative sampling or hierarchical softmax, to turn raw scores into probabilities and compute the training loss.
  4. Export and Integration
    • Store the learned embeddings in a separate table or a specialized data store.
    • Provide utilities (APIs or SQL extensions) to query, visualize, or further process these embeddings (e.g., “Find me the most similar Product embeddings to product X”).
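
To make Stage 1 concrete, the following minimal sketch reads foreign-key relationships from Oracle's standard data dictionary views (USER_CONSTRAINTS and USER_CONS_COLUMNS) using the python-oracledb driver. The connection details are placeholders and the query is a single-schema simplification, a sketch rather than a definitive implementation.

```python
# Stage 1 sketch: discover foreign-key join paths from Oracle's data dictionary.
# Assumes the python-oracledb driver and placeholder credentials/DSN.
import oracledb

FK_SQL = """
SELECT fk.table_name        AS child_table,
       fk_col.column_name   AS child_column,
       pk.table_name        AS parent_table,
       pk_col.column_name   AS parent_column
FROM   user_constraints  fk
JOIN   user_constraints  pk     ON pk.constraint_name = fk.r_constraint_name
JOIN   user_cons_columns fk_col ON fk_col.constraint_name = fk.constraint_name
JOIN   user_cons_columns pk_col ON pk_col.constraint_name = pk.constraint_name
                               AND pk_col.position = fk_col.position
WHERE  fk.constraint_type = 'R'
"""

def extract_join_paths(user, password, dsn):
    """Return (child_table, child_column, parent_table, parent_column) tuples
    describing every foreign-key link in the current schema."""
    with oracledb.connect(user=user, password=password, dsn=dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(FK_SQL)
            return cur.fetchall()

# Example usage (placeholder connection details):
# for child_t, child_c, parent_t, parent_c in extract_join_paths("scott", "tiger", "localhost/XEPDB1"):
#     print(f"{child_t}.{child_c} -> {parent_t}.{parent_c}")
```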

3.2 Data Models

To clarify the transformation, consider a simple retail database schema:

  • Customers
    • CustomerID, Name, City
  • Orders
    • OrderID, CustomerID, OrderDate
  • OrderItems
    • OrderID, ProductID, Quantity
  • Products
    • ProductID, ProductName, Category

Rel2Vec will identify relationships (e.g., Customers join to Orders via CustomerID, Orders join to OrderItems via OrderID, etc.). By traversing these relationships, the system constructs “context windows” that capture which entities (e.g., which customers, which products) co-occur in the same order. These co-occurrences are then fed into a model (like Skip-gram with negative sampling) to produce embeddings.
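
The following self-contained sketch shows how such context windows could be assembled for the retail schema above. The toy rows and the build_contexts helper are illustrative assumptions standing in for data extracted from Oracle in Stage 1.

```python
# Context construction sketch: one "sentence" per order, containing the
# customer token plus one token per product in that order.
from collections import defaultdict

# Toy rows standing in for the Orders and OrderItems tables.
orders      = [{"OrderID": 1, "CustomerID": 42}, {"OrderID": 2, "CustomerID": 42}]
order_items = [{"OrderID": 1, "ProductID": 100}, {"OrderID": 1, "ProductID": 200},
               {"OrderID": 2, "ProductID": 250}]

def build_contexts(orders, order_items):
    """Group products by order, then emit one token sequence per order."""
    items_by_order = defaultdict(list)
    for item in order_items:
        items_by_order[item["OrderID"]].append(f"PRODUCT_{item['ProductID']}")
    contexts = []
    for order in orders:
        tokens = [f"CUSTOMER_{order['CustomerID']}"] + items_by_order[order["OrderID"]]
        contexts.append(tokens)
    return contexts

print(build_contexts(orders, order_items))
# [['CUSTOMER_42', 'PRODUCT_100', 'PRODUCT_200'], ['CUSTOMER_42', 'PRODUCT_250']]
```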


4. Example Pipeline

  1. Extracting Data
    • A script connects to Oracle, queries all rows from Customers, Orders, OrderItems, Products.
    • It extracts foreign key relationships: Orders(CustomerID) → Customers(CustomerID), OrderItems(OrderID) → Orders(OrderID), etc.
  2. Building Co-occurrence Vectors
    • For each OrderID, find the CustomerID and the ProductID entries (via OrderItems).
    • Construct a sequence or “mini-batch” that says: (CustomerID=42), (ProductID=100), (ProductID=200), (ProductID=250) all appear together.
    • Repeat for every order in the dataset.
  3. Embedding Model
    • The system uses a Skip-gram approach, where the presence of one entity in a context helps predict the presence of related entities.
    • For efficiency, negative sampling is used in place of the full softmax to estimate probabilities and compute gradients.
    • The result is a trained embedding matrix, e.g., Embedding(CustomerID=42) = [0.11, −0.38, 0.07, …].
  4. Storing and Querying Embeddings
    • After training, the system writes each entity’s embedding to a new table, EntityEmbeddings, with columns like EntityType, EntityID, and a vector (or separate columns) for each dimension.
    • A user can query, for example, the top 5 products most similar to ProductID=100 by computing cosine similarity on the embedding vectors.
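
The sketch below ties steps 3 and 4 together under simplifying assumptions: gensim's Word2Vec is used as a stand-in for the Skip-gram-with-negative-sampling model described above, toy sentences replace real order contexts, and the "export" is simply printed rather than written to an EntityEmbeddings table.

```python
# Steps 3-4 sketch: train Skip-gram embeddings with negative sampling (via
# gensim, assumed installed) on order-level contexts, then run a similarity query.
from gensim.models import Word2Vec

sentences = [["CUSTOMER_42", "PRODUCT_100", "PRODUCT_200", "PRODUCT_250"],
             ["CUSTOMER_7",  "PRODUCT_100", "PRODUCT_200"]]

model = Word2Vec(
    sentences,
    vector_size=16,   # embedding dimension
    window=5,         # every token in an order is context for every other
    sg=1,             # Skip-gram
    negative=5,       # negative sampling instead of the full softmax
    min_count=1,
    epochs=50,
)

# Step 4 (illustrative only): one row per entity, here printed instead of
# written to an EntityEmbeddings table.
for token in model.wv.index_to_key:
    entity_type, entity_id = token.split("_", 1)
    print(entity_type, entity_id, model.wv[token][:4], "...")

# "Top 5 products most similar to ProductID=100" via cosine similarity:
print(model.wv.most_similar("PRODUCT_100", topn=5))
```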

5. Experimental Evaluation

We plan to evaluate Rel2Vec on both synthetic and real-world data:

  • Metrics
    • Similarity Quality: Do related entities have high cosine similarity?
    • Downstream Task Performance: Classification (e.g., predicting product categories) or recommendation tasks (e.g., next product to buy).
    • Scalability: How efficiently does the system handle large tables with millions of rows?
  • Baselines
    • Traditional one-hot encoding or attribute-based features.
    • Graph-based embedding methods (Node2Vec).
    • Plain Word2Vec applied to naively linearized table data.

6. Discussion and Future Work

  1. Complex Schemas
    • Real-world Oracle databases often have complex foreign key relationships. Adapting Rel2Vec to navigate multi-hop joins remains a challenge.
    • We plan to explore hierarchical or iterative embedding generation that incrementally captures deeper relationships.
  2. Temporal and Continuous Data
    • Incorporating timestamps, numerical columns, and other continuous features into embedding training is an open question.
    • We propose enhancements such as weighting contexts by time intervals or normalizing attributes (e.g., purchase amounts).
  3. Integration with ML Pipelines
    • Storing embeddings in a universal format (e.g., parquet files or in-database tables) can make it easier to feed them into popular ML frameworks (TensorFlow, PyTorch, scikit-learn).
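
As a rough sketch of point 3, the snippet below writes embeddings to a parquet file with pandas (pyarrow assumed installed); the export_embeddings helper and the gensim-style keyed-vector argument are illustrative assumptions, not a fixed interface of Rel2Vec.

```python
# Export sketch: one row per entity (EntityType, EntityID, Vector), written to
# parquet so downstream ML frameworks can load the embeddings directly.
import pandas as pd

def export_embeddings(wv, path="entity_embeddings.parquet"):
    """Write embeddings from a gensim-style keyed-vector object to parquet."""
    rows = []
    for token in wv.index_to_key:
        entity_type, entity_id = token.split("_", 1)
        rows.append({"EntityType": entity_type,
                     "EntityID": entity_id,
                     "Vector": wv[token].tolist()})
    pd.DataFrame(rows).to_parquet(path, index=False)

# Example usage with the model trained in the earlier sketch:
# export_embeddings(model.wv)   # then read the parquet file from any ML pipeline
```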

7. Conclusion

This paper has proposed Rel2Vec, a system for converting an Oracle relational database into ML-friendly embeddings by systematically extracting relationship structures and modeling them in a manner analogous to Word2Vec co-occurrences. By doing so, it addresses a gap between structured database information and modern embedding-centric machine learning pipelines. Our approach opens doors to novel applications—such as advanced recommendations, anomaly detection, and semantic exploration—within enterprise environments that rely heavily on Oracle and other relational databases.

Keywords: Relational Databases, Oracle, Word2Vec, Entity Embeddings, Graph Embeddings, Softmax, Machine Learning, Data Integration


References

  1. Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
  2. Grover, A., & Leskovec, J. (2016). Node2Vec: Scalable Feature Learning for Networks. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
  3. Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating Embeddings for Modeling Multi-relational Data. Advances in Neural Information Processing Systems, 26.
