From Relational Databases to Machine Learning Embeddings: A System for Converting Oracle Data into Relationship Tables
Abstract
Modern machine learning (ML) methods often rely on numerical “embeddings” to capture relationships between entities. While these methods are well-established in the text domain—e.g., Word2Vec for word embeddings—relational databases remain a largely untapped source of structured relationships. This paper proposes a system to transform a relational Oracle database into a set of tables and embeddings that represent entity relationships in a manner conducive to standard ML workflows. By leveraging concepts akin to co-occurrence in text (extended to entity relationships and joins), we demonstrate how to generate rich, meaningful embeddings that reveal insights, enable recommendations, and facilitate data-driven discovery.
1. Introduction
Relational databases like Oracle store vast amounts of structured data in tables linked by foreign keys and joins. These links capture valuable relational knowledge, such as which customers purchased which products, or which employees manage specific departments. However, standard machine learning pipelines typically require numerical vector representations of data (commonly referred to as embeddings).
In contrast to textual embeddings (e.g., Word2Vec) that exploit word co-occurrence in sentences, there has been less focus on systematically extracting relationship information from relational databases into embeddings. This paper proposes a novel system—referred to hereafter as Rel2Vec—that:
- Extracts data from a relational database (Oracle).
- Transforms the relationships (foreign keys, joins, row co-occurrences) into a format analogous to “word contexts.”
- Generates embeddings for entities (rows) and attributes (column values), suitable for downstream ML tasks such as clustering, classification, recommendation, and anomaly detection.
2. Background and Related Work
2.1 Word2Vec and Embeddings
Word2Vec’s primary contribution was showing how local co-occurrence (words within a window of text) can be used to learn meaningful vector representations. These representations capture semantic relationships such as king – man + woman ≈ queen. Translating this idea to relational data involves redefining “context” to mean shared relationships or joint appearances in tables.
2.2 Graph Embeddings
Graph embedding techniques (e.g., Node2Vec, DeepWalk, TransE) have gained traction for capturing relationships in networks or knowledge graphs. Relational data can be seen as a special case of a graph, where each table row is a node, and foreign keys or joins define edges. These techniques often rely on random walks or local neighborhoods to identify which nodes appear together and can produce node embeddings that capture structural similarity.
2.3 Database Embeddings
Some recent approaches treat each row in a database as if it were a “sentence,” and each attribute as a “word,” or they treat entire joins as contexts. Such frameworks remain relatively underexplored and are often proprietary solutions in large enterprises. Our proposal seeks to formalize this process into a systematic pipeline.
3. System Overview
3.1 Architecture
Our proposed Rel2Vec system consists of four main stages:
- Data Extraction
  - Connect to the Oracle database.
  - Read schema metadata (tables, columns, foreign keys).
  - Determine how tables are linked (join paths). A Python sketch of this metadata-reading step appears after this list.
- Context Construction
  - Identify “entities” (such as rows) that should receive embeddings (e.g., Customers, Products, Orders).
  - Establish contexts by collecting “co-occurrence” information. For instance, a given Customer entity co-occurs with the Orders they have placed; an Order co-occurs with the Products involved.
- Embedding Generation
  - Create input sequences or adjacency lists representing these relationships. This is analogous to treating each row or each join result as a “sentence” of entities/attributes.
  - Train embeddings using either a Word2Vec-like approach (CBOW, skip-gram) or graph-based methods (Node2Vec, DeepWalk).
  - Use a softmax output layer, or an efficient approximation to it (negative sampling, hierarchical softmax), to turn raw scores into probability estimates and compute the training loss; the objective is written out after this list.
- Export and Integration
  - Store the learned embeddings in a separate table or a specialized data store.
  - Provide utilities (APIs or SQL extensions) to query, visualize, or further process these embeddings (e.g., “Find me the most similar Product embeddings to product X”).
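To ground the Data Extraction stage, here is a minimal sketch that reads foreign-key metadata from Oracle’s data dictionary with the python-oracledb driver. The connection parameters are placeholders, and querying the USER_CONSTRAINTS/USER_CONS_COLUMNS views is one standard way (among several) to recover join paths:

```python
# A minimal sketch of the Data Extraction stage, assuming the
# python-oracledb driver and access to the USER_* data dictionary views.
# Connection details (user, password, dsn) are placeholders.
import oracledb

conn = oracledb.connect(user="scott", password="...", dsn="dbhost/orclpdb1")

# Standard data-dictionary query: list every foreign-key column together
# with the primary/unique-key column it references.
FK_SQL = """
SELECT fk_col.table_name,
       fk_col.column_name,
       pk.table_name      AS ref_table,
       pk_col.column_name AS ref_column
FROM   user_constraints  fk
JOIN   user_cons_columns fk_col ON fk_col.constraint_name = fk.constraint_name
JOIN   user_constraints  pk     ON pk.constraint_name     = fk.r_constraint_name
JOIN   user_cons_columns pk_col ON pk_col.constraint_name = pk.constraint_name
                               AND pk_col.position        = fk_col.position
WHERE  fk.constraint_type = 'R'
"""

with conn.cursor() as cur:
    cur.execute(FK_SQL)
    # Each row describes one edge in the schema graph, e.g.
    # ('ORDERS', 'CUSTOMERID', 'CUSTOMERS', 'CUSTOMERID')
    join_paths = cur.fetchall()

for table, column, ref_table, ref_column in join_paths:
    print(f"{table}({column}) -> {ref_table}({ref_column})")
```

Each returned row is one edge in the schema graph, which the Context Construction stage then traverses.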
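The Embedding Generation stage can reuse the skip-gram-with-negative-sampling objective from Word2Vec essentially unchanged, with entities playing the role of words. As a reference point (this is the standard formulation from the Word2Vec literature, not something specific to Rel2Vec), each observed (entity, context-entity) pair (e, c) contributes

\[
\mathcal{L}(e, c) = \log \sigma\!\left(\mathbf{v}_{c}'^{\top}\mathbf{v}_{e}\right) + \sum_{i=1}^{k} \mathbb{E}_{n_i \sim P_n}\!\left[\log \sigma\!\left(-\mathbf{v}_{n_i}'^{\top}\mathbf{v}_{e}\right)\right],
\]

where \( \mathbf{v}_e \) is the input embedding of the center entity, \( \mathbf{v}_c' \) the output embedding of the observed context entity, \( \sigma \) the logistic sigmoid, and the \( n_i \) are \( k \) negative entities drawn from a noise distribution \( P_n \) (in the text setting, typically the unigram distribution raised to the 3/4 power). Maximizing this objective pulls co-occurring entities together and pushes random pairs apart, which is exactly the behavior the context windows are designed to exploit.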
3.2 Data Models
To clarify the transformation, consider a simple retail database schema:
- Customers: CustomerID, Name, City
- Orders: OrderID, CustomerID, OrderDate
- OrderItems: OrderID, ProductID, Quantity
- Products: ProductID, ProductName, Category
Rel2Vec will identify relationships (e.g., Customers join to Orders via CustomerID, Orders join to OrderItems via OrderID, etc.). By traversing these relationships, the system constructs “context windows” that capture which entities (e.g., which customers, which products) co-occur in the same order. These co-occurrences are then fed into a model (such as skip-gram with negative sampling) to produce embeddings.
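To make the window construction concrete for this retail schema, the following sketch (a hypothetical illustration; table and column names follow the schema above, and conn is the open python-oracledb connection from the extraction sketch) joins Orders to OrderItems and groups the rows into one context window per order:

```python
# Sketch of context-window construction for the retail schema above.
# `conn` is an open python-oracledb connection (see Section 3.1 sketch).
from collections import defaultdict

CONTEXT_SQL = """
SELECT o.OrderID, o.CustomerID, oi.ProductID
FROM   Orders o
JOIN   OrderItems oi ON oi.OrderID = o.OrderID
"""

sentences = defaultdict(list)
with conn.cursor() as cur:
    for order_id, customer_id, product_id in cur.execute(CONTEXT_SQL):
        window = sentences[order_id]
        cust_token = f"Customer:{customer_id}"
        if cust_token not in window:      # one customer token per order
            window.append(cust_token)
        window.append(f"Product:{product_id}")

# Each value is one "context window", e.g.
# ['Customer:42', 'Product:100', 'Product:200', 'Product:250']
corpus = list(sentences.values())
```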
4. Example Pipeline
- Extracting Data
  - A script connects to Oracle and queries all rows from Customers, Orders, OrderItems, and Products.
  - It extracts foreign key relationships: Orders(CustomerID) → Customers(CustomerID), OrderItems(OrderID) → Orders(OrderID), etc.
- Building Co-occurrence Vectors
  - For each OrderID, find the CustomerID and the ProductID entries (via OrderItems).
  - Construct a sequence or “mini-batch” recording that (CustomerID=42), (ProductID=100), (ProductID=200), and (ProductID=250) all appear together.
  - Repeat for every order in the dataset.
- Embedding Model
  - The system uses a skip-gram approach, where the presence of one entity in a context helps predict the presence of related entities.
  - Negative sampling, an efficient approximation to the full softmax, is applied to estimate probabilities and compute gradients.
  - The result is a trained embedding matrix, e.g., Embedding(CustomerID=42) = [0.11, −0.38, 0.07, …]. A training sketch follows this list.
- Storing and Querying Embeddings
  - After training, the system writes each entity’s embedding to a new table, EntityEmbeddings, with columns like EntityType, EntityID, and a vector (or separate columns) for each dimension.
  - A user can query, for example, the top 5 products most similar to ProductID=100 by computing cosine similarity on the embedding vectors. An export-and-query sketch also follows this list.
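As one possible realization of the embedding step, the sketch below trains skip-gram embeddings with negative sampling over the per-order windows (the corpus list from the Section 3.2 sketch) using the gensim library; gensim and the specific hyperparameters are illustrative choices, not part of the proposal:

```python
# Training sketch: skip-gram with negative sampling over the order
# "sentences" built earlier. gensim is one convenient implementation;
# any Word2Vec-style trainer would serve.
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=corpus,   # e.g. [['Customer:42', 'Product:100', ...], ...]
    vector_size=64,     # embedding dimensionality
    window=50,          # large window: every entity in an order co-occurs
    min_count=1,        # keep rare entities
    sg=1,               # 1 = skip-gram
    negative=10,        # negative sampling instead of full softmax
    epochs=20,
)

vec = model.wv["Customer:42"]   # e.g. array([ 0.11, -0.38,  0.07, ...])
```

The oversized window makes every entity in an order act as context for every other; for very large orders, sampling entity pairs directly may be preferable.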
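And for the export-and-query step: the sketch below writes the vectors to the EntityEmbeddings table (here serialized as JSON text per row, one of the storage options mentioned above; the table is assumed to already exist) and answers the “top 5 products similar to ProductID=100” query with cosine similarity in NumPy:

```python
# Export-and-query sketch, continuing from the training sketch above.
import json
import numpy as np

# Write each entity's vector to EntityEmbeddings (assumed to exist with
# columns EntityType, EntityID, Vector). JSON text is one simple encoding.
rows = [
    (key.split(":")[0], key.split(":")[1], json.dumps(model.wv[key].tolist()))
    for key in model.wv.index_to_key
]
with conn.cursor() as cur:
    cur.executemany(
        "INSERT INTO EntityEmbeddings (EntityType, EntityID, Vector) "
        "VALUES (:1, :2, :3)",
        rows,
    )
conn.commit()

# Top 5 products most similar to ProductID=100, by cosine similarity.
products = [k for k in model.wv.index_to_key if k.startswith("Product:")]
M = np.array([model.wv[k] for k in products])           # (n_products, dim)
q = model.wv["Product:100"]
scores = (M @ q) / (np.linalg.norm(M, axis=1) * np.linalg.norm(q))
top5 = [products[i] for i in np.argsort(-scores)[1:6]]  # index 0 is the query itself
print(top5)
```

gensim’s built-in model.wv.most_similar("Product:100", topn=5) would return the same ranking; the explicit NumPy version is shown to make the cosine computation visible.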
5. Experimental Evaluation
We plan to evaluate Rel2Vec on both synthetic and real-world data:
- Metrics
  - Similarity Quality: Do related entities have high cosine similarity?
  - Downstream Task Performance: Classification (e.g., predicting product categories) or recommendation tasks (e.g., next product to buy).
  - Scalability: How efficiently does the system handle large tables with millions of rows?
- Baselines
  - Traditional one-hot encoding or attribute-based features.
  - Graph-based embedding methods (e.g., Node2Vec).
  - Plain Word2Vec applied to a naive linearization of the data.
6. Discussion and Future Work
- Complex Schemas
  - Real-world Oracle databases often have complex foreign key relationships. Adapting Rel2Vec to navigate multi-hop joins remains a challenge.
  - We plan to explore hierarchical or iterative embedding generation that incrementally captures deeper relationships.
- Temporal and Continuous Data
  - Incorporating timestamps, numerical columns, and other continuous features into embedding training is an open question.
  - We propose enhancements such as weighting contexts by time intervals or normalizing attributes (e.g., purchase amounts).
- Integration with ML Pipelines
  - Storing embeddings in a universal format (e.g., Parquet files or in-database tables) can make it easier to feed them into popular ML frameworks (TensorFlow, PyTorch, scikit-learn); a brief export sketch follows this list.
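Continuing the sketches from Section 4, persisting the trained vectors to Parquet takes only a few lines with pandas (the file name and two-column layout are illustrative, and DataFrame.to_parquet requires the pyarrow or fastparquet package):

```python
# Sketch: persist the trained embeddings as a Parquet file so any
# downstream framework can load them. File name and layout are illustrative.
import pandas as pd

df = pd.DataFrame(
    {
        "entity": model.wv.index_to_key,
        "vector": [model.wv[k].tolist() for k in model.wv.index_to_key],
    }
)
df.to_parquet("entity_embeddings.parquet", index=False)
```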
7. Conclusion
This paper has proposed Rel2Vec, a system for converting an Oracle relational database into ML-friendly embeddings by systematically extracting relationship structures and modeling them in a manner analogous to Word2Vec co-occurrences. By doing so, it addresses a gap between structured database information and modern embedding-centric machine learning pipelines. Our approach opens doors to novel applications—such as advanced recommendations, anomaly detection, and semantic exploration—within enterprise environments that rely heavily on Oracle and other relational databases.
Keywords: Relational Databases, Oracle, Word2Vec, Entity Embeddings, Graph Embeddings, Softmax, Machine Learning, Data Integration
References
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient Estimation of Word Representations in Vector Space. arXiv preprint arXiv:1301.3781.
- Grover, A., & Leskovec, J. (2016). Node2Vec: Scalable Feature Learning for Networks. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
- Bordes, A., Usunier, N., Garcia-Duran, A., Weston, J., & Yakhnenko, O. (2013). Translating Embeddings for Modeling Multi-relational Data. Advances in Neural Information Processing Systems, 26.