E-commerce Review Analysis with Text Embeddings

Roles & Stack

Data Scientist, ML Engineer

Python, OpenAI API, ChromaDB, scikit-learn, Matplotlib, t-SNE

Impact

Semantic clustering of 23,000+ customer reviews
2D visualization of high-dimensional embeddings
Automated sentiment pattern detection
Product category clustering analysis

Year

2025

The Situation: Millions of Reviews, No Insights

E-commerce platforms accumulate massive volumes of customer reviews. These reviews carry signals about product quality, fit issues, and customer sentiment, but reading thousands of them by hand does not scale.

The Women's Clothing E-Commerce Reviews dataset presented a perfect testbed: 23,000+ reviews with ratings, product categories, and recommendation flags. The goal was to uncover patterns that keyword searches and simple aggregations miss.

23,000+ Customer Reviews, Hidden Sentiment Patterns, Beyond Keyword Analysis

The Problem: Text is High-Dimensional

Traditional approaches like word frequency analysis or sentiment scoring reduce rich text to simplistic metrics. They miss nuance: a 3-star review saying 'fabric is nice but runs small' contains different information than 'average quality, fine for the price.'

The challenge was to represent reviews in a way that captures semantic meaning, so that similar opinions cluster together regardless of the exact words used.

Nuance Lost in Simple Metrics, Semantic Similarity Challenge, High-Dimensional Text Data

The Solution: Embedding + Dimensionality Reduction

I built a pipeline that converts each review into a 1536-dimensional vector using OpenAI's text-embedding-3-small model. These vectors capture the 'meaning' of the text: similar reviews have similar vectors, even if they use different words.
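One common way to make "similar vectors" concrete is cosine similarity. The sketch below uses tiny made-up 4-dimensional vectors in place of real embeddings; the vectors, their values, and the helper name `cosine_similarity` are illustrative assumptions, not output from the model or code from the original pipeline.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy vectors standing in for review embeddings
runs_small = np.array([0.9, 0.1, 0.0, 0.2])      # "fabric is nice but runs small"
too_tight = np.array([0.8, 0.2, 0.1, 0.3])       # "had to size up, very tight"
fine_for_price = np.array([0.1, 0.9, 0.3, 0.0])  # "average quality, fine for the price"

sim_sizing = cosine_similarity(runs_small, too_tight)
sim_mixed = cosine_similarity(runs_small, fine_for_price)
print(f"sizing vs sizing: {sim_sizing:.3f}")
print(f"sizing vs price:  {sim_mixed:.3f}")
```

In this toy setup the two sizing-related vectors score much closer to 1.0 than the sizing/price pair, which is the property the pipeline relies on.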

To visualize these vectors, I applied t-SNE (t-distributed Stochastic Neighbor Embedding) to reduce 1536 dimensions down to 2D, creating a scatter plot where clusters represent groups of semantically similar reviews.

OpenAI text-embedding-3-small, t-SNE for Visualization, Semantic Clustering

Data Pipeline: Loading & Preprocessing

The embedding pipeline starts by loading and cleaning the dataset with pandas, dropping reviews that have no text:

python

import openai
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load the reviews and drop rows with no review text
df = pd.read_csv('womens_clothing_e-commerce_reviews.csv')
df = df.dropna(subset=['Review Text'])
print(f"Loaded {len(df):,} reviews")

# Take the first 1,000 reviews for visualization (full dataset is 23k+ rows)
reviews = df['Review Text'].tolist()[:1000]
ratings = df['Rating'].tolist()[:1000]
print(f"Sampled {len(reviews)} reviews for embedding")

Pandas Data Loading, Missing Value Handling, Smart Sampling Strategy
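The slice above takes the first 1,000 rows, which may not be representative if the file happens to be ordered in some way. A possible alternative, sketched here on a small synthetic frame, is to draw a random sample from each rating level. The column names match the dataset; the helper `sample_per_rating` and the toy data are assumptions for illustration.

```python
import pandas as pd

def sample_per_rating(df: pd.DataFrame, n_per_rating: int, seed: int = 42) -> pd.DataFrame:
    """Randomly draw up to n_per_rating reviews from each rating level."""
    parts = []
    for _, group in df.groupby("Rating"):
        k = min(n_per_rating, len(group))  # small groups contribute all their rows
        parts.append(group.sample(n=k, random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Synthetic stand-in for the real dataset (illustrative values only)
toy = pd.DataFrame({
    "Review Text": [f"review {i}" for i in range(12)],
    "Rating": [5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 1],
})

sampled = sample_per_rating(toy, n_per_rating=2)
print(sampled["Rating"].value_counts().sort_index())
```

Capping each rating level keeps rare low ratings visible in the plot instead of being swamped by the more common high ratings.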

Embedding Generation: OpenAI API Integration

Each review is converted to a 1536-dimensional vector using OpenAI's text-embedding-3-small model. The function includes type hints and handles the API response structure:

python

def get_embedding(text: str) -> list[float]:
    """Generate embedding vector for a text string."""
    # Assumes OPENAI_API_KEY is set in the environment
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Generate embeddings with progress tracking
embeddings = []
for i, review in enumerate(reviews):
    emb = get_embedding(review)
    embeddings.append(emb)
    if (i + 1) % 100 == 0:
        print(f"Processed {i + 1}/{len(reviews)} reviews")

# Stack into a (n_reviews, 1536) matrix
embeddings = np.array(embeddings)
print(f"Embedding matrix shape: {embeddings.shape}")

1536-Dimensional Vectors, Progress Tracking, NumPy Array Conversion
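The loop above makes one request per review. The embeddings endpoint also accepts a list of strings as `input`, so sending reviews in batches is one way to cut the number of round trips. The sketch below keeps the batching logic separate from the API call so it can run offline; `embed_in_batches`, `fake_embed_batch`, and the batch size are assumptions for illustration, not part of the original pipeline.

```python
from typing import Callable, Sequence

EmbedFn = Callable[[Sequence[str]], list[list[float]]]

def embed_in_batches(texts: Sequence[str], embed_batch: EmbedFn,
                     batch_size: int = 100) -> list[list[float]]:
    """Embed texts in fixed-size batches, preserving input order."""
    vectors: list[list[float]] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        vectors.extend(embed_batch(batch))
    return vectors

def fake_embed_batch(batch: Sequence[str]) -> list[list[float]]:
    """Offline stand-in for the API: a 2-number 'embedding' per text."""
    return [[float(len(text)), float(text.count(" "))] for text in batch]

demo_reviews = ["runs small", "love the fabric", "fine for the price", "too tight", "great"]
demo_vectors = embed_in_batches(demo_reviews, fake_embed_batch, batch_size=2)
print(len(demo_vectors), demo_vectors[0])
```

With the real client, `embed_batch` could plausibly wrap `openai.embeddings.create(model="text-embedding-3-small", input=list(batch))` and return `[item.embedding for item in response.data]`; treat that wiring as an untested assumption.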

Visualization: t-SNE Projection

t-SNE reduces the 1536-dimensional embeddings to 2D while preserving local neighborhood structure, so semantically similar reviews tend to land near each other in the plot:

python

# Apply t-SNE dimensionality reduction
print("Running t-SNE projection...")
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    max_iter=1000  # called n_iter in scikit-learn releases before 1.5
)
embeddings_2d = tsne.fit_transform(embeddings)

# Create visualization with rating-based coloring
plt.figure(figsize=(14, 10))
scatter = plt.scatter(
    embeddings_2d[:, 0],
    embeddings_2d[:, 1],
    c=ratings,
    cmap='RdYlGn',  # red = low rating, green = high rating
    alpha=0.6,
    s=50
)
plt.colorbar(scatter, label='Customer Rating (1-5)')
plt.title('E-commerce Review Embeddings')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.savefig('review_clusters.png', dpi=300)
print("Saved visualization to review_clusters.png")

2D Projection from 1536D, Rating-Based Coloring, Publication-Quality Export
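t-SNE layouts depend on the perplexity setting, so comparing a few values before reading meaning into a plot is a reasonable sanity check. The sketch below does this on synthetic Gaussian blobs standing in for embeddings; the data, the dimensions, and the perplexity values are assumptions chosen to keep the example fast.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic groups of 30 points in 16 dimensions (stand-ins for embeddings)
group_a = rng.normal(loc=0.0, scale=0.5, size=(30, 16))
group_b = rng.normal(loc=3.0, scale=0.5, size=(30, 16))
points = np.vstack([group_a, group_b])

projections = {}
for perplexity in (5, 15, 25):  # each must be smaller than the number of samples
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=42)
    projections[perplexity] = tsne.fit_transform(points)
    print(perplexity, projections[perplexity].shape)
```

If a grouping shows up across several perplexity values, it is more likely to reflect structure in the data than an artifact of one setting.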

Insights & Impact

The visualization revealed distinct clusters: positive reviews about fit and quality grouped together, while complaints about sizing formed a separate cluster. This kind of insight could drive product improvements. For example, if 'runs small' complaints cluster together, the sizing chart is a likely candidate for review.
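The clusters described above come from inspecting the plot. One way to make such groupings explicit is to run k-means on the embedding vectors themselves rather than on the 2D t-SNE output, since t-SNE distorts distances. This is a minimal sketch on synthetic data; the two-cluster setup and all values are assumptions, not results from the review dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Synthetic stand-ins: 40 "sizing complaint" vectors and 40 "positive fit" vectors
sizing = rng.normal(loc=-2.0, scale=0.3, size=(40, 8))
positive = rng.normal(loc=2.0, scale=0.3, size=(40, 8))
vectors = np.vstack([sizing, positive])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(vectors)

# Count how many points landed in each cluster
counts = np.bincount(labels, minlength=2)
print("cluster sizes:", counts.tolist())
```

On real embeddings the number of clusters is not known in advance, so it would need to be chosen by inspection or a metric such as the silhouette score.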

The methodology is applicable beyond e-commerce: support ticket analysis, social media monitoring, competitive intelligence. The embedding approach captures meaning that keyword-based methods miss, enabling deeper understanding of unstructured text data.

Actionable Product Insights, Sizing Issue Detection, Transferable Methodology, Semantic Understanding at Scale

