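As a solutions architect with more than twenty years of experience in relational database systems, I recently began exploring MariaDB's new Vector Edition to see whether it could address some of the AI data challenges we face. At first glance it looks quite convincing, especially in how it could bring AI capabilities right into a regular database setup. However, I wanted to test it with a simple use case to see how it performs in practice.
In this article, I share my hands-on experience and observations about MariaDB's vector capabilities from running a simple use case. Specifically, I load sample customer reviews into MariaDB and run a quick similarity search to find related reviews.
Environment Setup
My experiment started with setting up a Docker container using the latest MariaDB release (11.6), which includes vector capabilities.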
# Pull the latest release
docker pull quay.io/mariadb-foundation/mariadb-devel:11.6-vector-preview
# Start the container: set the root password and publish port 3306 so a client on the host can connect
docker run -d --name mariadb_vector -p 3306:3306 -e MYSQL_ROOT_PASSWORD=<replace_password> quay.io/mariadb-foundation/mariadb-devel:11.6-vector-preview
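Now, let's create a table and populate it with some sample customer reviews, each with a sentiment score and an embedding of the review text. To generate the text embeddings, I used SentenceTransformer, a tool that lets you work with pre-trained models. Specifically, I chose a model called paraphrase-MiniLM-L6-v2, which maps our customer reviews into a 384-dimensional space.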
import mysql.connector
import numpy as np
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('paraphrase-MiniLM-L6-v2')
# I already have a database created with a name vectordb
connection = mysql.connector.connect(
host="localhost",
user="root",
password="<password>", # Replace me
database="vectordb"
)
cursor = connection.cursor()
# Create a table to store customer reviews with sentiment score and embeddings.
# Column names match the INSERT and SELECT statements used later in this article.
# Vector index syntax may differ between releases; check the INDEX clause against the documentation for your version.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS customer_reviews (
        id INT PRIMARY KEY AUTO_INCREMENT,
        product_id INT,
        review_text TEXT,
        sentiment_score FLOAT,
        review_embedding BLOB,
        INDEX vector_idx (review_embedding) USING HNSW
    ) ENGINE=ColumnStore;
""")
# Sample reviews
reviews = [
(1, "This product exceeded my expectations. Highly recommended!", 0.9),
(1, "Decent quality, but pricey.", 0.6),
(2, "Terrible experience. The product does not work.", 0.1),
(2, "Average product, ok ok", 0.5),
(3, "Absolutely love it! Best purchase I have made this year.", 1.0)
]
# Load sample reviews into vector DB
for product_id, review_text, sentiment_score in reviews:
    # Encode the review text into a 384-dimensional embedding
    embedding = model.encode(review_text)
    cursor.execute(
        "INSERT INTO customer_reviews (product_id, review_text, sentiment_score, review_embedding) VALUES (%s, %s, %s, %s)",
        (product_id, review_text, sentiment_score, embedding.tobytes()))
connection.commit()
# The connection stays open because the similarity query below reuses this cursor.
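The `embedding.tobytes()` call above stores each vector as raw bytes. As a rough illustration of what ends up in the column, the sketch below packs a list of floats as little-endian 32-bit values with the standard `struct` module and unpacks them again. The helper names `pack_vector` and `unpack_vector` are made up for this example, and the byte layout is an assumption based on NumPy's `float32` output on little-endian machines, not something taken from the MariaDB documentation.

```python
import struct

def pack_vector(values):
    """Pack floats as consecutive little-endian float32 values (4 bytes each)."""
    return struct.pack(f"<{len(values)}f", *values)

def unpack_vector(blob):
    """Inverse of pack_vector: read the bytes back into a list of floats."""
    count = len(blob) // 4
    return list(struct.unpack(f"<{count}f", blob))

# A 384-dimensional embedding, like the ones the model produces, takes 384 * 4 bytes.
blob = pack_vector([0.0] * 384)
print(len(blob))

# Values exactly representable in float32 survive the round trip unchanged.
print(unpack_vector(pack_vector([0.5, -1.25, 2.0])))
```

A helper like `unpack_vector` can be handy for spot-checking that what comes back from the table matches what was inserted.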
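Now, let's use MariaDB's vector capabilities to find similar reviews. This is like asking, "Which other customers said something similar to this review?" In the example below, I look for the top 2 reviews most similar to a customer review that says "I am super satisfied!". To do this, I use one of the vector functions available in the latest release (VEC_Distance_Euclidean).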
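# Convert the target customer review into vector
target_review_embedding = model.encode("I am super satisfied!")
# Find top 2 similar reviews using MariaDB's VEC_Distance_Euclidean function.
# The function returns a distance, so smaller values mean more similar reviews.
cursor.execute("""
    SELECT review_text, sentiment_score, VEC_Distance_Euclidean(review_embedding, %s) AS distance
    FROM customer_reviews
    ORDER BY distance
    LIMIT %s
""", (target_review_embedding.tobytes(), 2))
similar_reviews = cursor.fetchall()
# Print each match, then release the cursor and connection
for review_text, sentiment_score, distance in similar_reviews:
    print(f"{distance:.4f} | {sentiment_score} | {review_text}")
cursor.close()
connection.close()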
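For intuition about what the query ranks on, here is a minimal pure-Python sketch of Euclidean distance and a brute-force top-k search. The three-dimensional vectors and the `top_k` helper are invented for illustration; real embeddings from the model have 384 dimensions, and this sketch says nothing about how MariaDB computes or indexes distances internally.

```python
import math

def euclidean_distance(a, b):
    """Straight-line (L2) distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(target, candidates, k):
    """Return the k (label, distance) pairs closest to target, nearest first."""
    scored = [(label, euclidean_distance(target, vec)) for label, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1])[:k]

# Toy 3-dimensional "embeddings" (made-up numbers).
candidates = [
    ("review A", [1.0, 0.0, 0.0]),
    ("review B", [0.0, 1.0, 0.0]),
    ("review C", [0.9, 0.1, 0.0]),
]
target = [0.8, 0.2, 0.0]

for label, distance in top_k(target, candidates, 2):
    print(label, round(distance, 3))
```

Sorting every row by distance like this is what an approximate index is meant to avoid on larger tables, at the cost of possibly missing some true nearest neighbors.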
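Observations
- It is very easy to set up, and we can keep structured data (such as product IDs and sentiment scores), unstructured data (review text), and their vector representations all in a single table.
- I like that it combines SQL syntax with vector operations, which makes it easy to pick up for teams already familiar with relational databases. The full list of vector functions supported in this release is available in the MariaDB documentation.
- The HNSW index improved the performance of similarity search queries on the larger datasets I tested.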
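As one hedged illustration of mixing ordinary SQL predicates with a vector function, the sketch below builds a parameterized query that can restrict the search to a single product before ranking by distance. The `build_similar_reviews_query` helper is hypothetical, the column names follow the INSERT statement used earlier, and how such a filter interacts with the vector index is not something this sketch covers.

```python
def build_similar_reviews_query(embedding_bytes, limit, product_id=None):
    """Build a parameterized similarity query, optionally limited to one product.

    Returns (sql, params) in the %s placeholder style used by mysql.connector.
    """
    lines = [
        "SELECT review_text, sentiment_score,",
        "       VEC_Distance_Euclidean(review_embedding, %s) AS distance",
        "FROM customer_reviews",
    ]
    params = [embedding_bytes]
    if product_id is not None:
        # Ordinary relational filter on a structured column
        lines.append("WHERE product_id = %s")
        params.append(product_id)
    lines.append("ORDER BY distance")
    lines.append("LIMIT %s")
    params.append(limit)
    return "\n".join(lines), tuple(params)

# Placeholder bytes stand in for a real 384-dimensional float32 embedding.
sql, params = build_similar_reviews_query(b"\x00" * 1536, 2, product_id=1)
print(sql)
print(len(params))
```

With an open cursor, the result would be passed along as `cursor.execute(sql, params)`.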
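Conclusion
Overall, I am impressed! MariaDB's Vector Edition will simplify certain AI-driven architectures. It bridges the gap between the traditional database world and the evolving demands of AI tools. In the coming months, I look forward to seeing how this technology matures and how the community adopts it in real-world applications.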
Source:
https://dzone.com/articles/mariadb-vector-edition-hands-on-review