Roles & Stack
Impact
- Semantic clustering of 23,000+ customer reviews
- 2D visualization of high-dimensional embeddings
- Automated sentiment pattern detection
- Product category clustering analysis
Year
2025
Category
AI & Machine Learning
Deliverables
- Text Embedding Pipeline
- t-SNE Visualization System
- ChromaDB Vector Storage
- Interactive Analysis Notebook
The Situation: Millions of Reviews, No Insights
E-commerce platforms accumulate massive volumes of customer reviews. These reviews contain valuable signals about product quality, fit issues, and customer sentiment, but reading thousands of them manually does not scale.
The Women's Clothing E-Commerce Reviews dataset presented a perfect testbed: 23,000+ reviews with ratings, product categories, and recommendation flags. The goal was to uncover patterns that keyword searches and simple aggregations miss.
The Problem: Text is High-Dimensional
Traditional approaches like word frequency analysis or sentiment scoring reduce rich text to simplistic metrics. They miss nuance: a 3-star review saying 'fabric is nice but runs small' contains different information than 'average quality, fine for the price.'
The challenge was to represent reviews in a way that captures semantic meaning—so that similar opinions cluster together naturally, regardless of the exact words used.
The Solution: Embedding + Dimensionality Reduction
I built a pipeline that converts each review into a high-dimensional vector using OpenAI's embedding model. These vectors capture the 'meaning' of the text—similar reviews have similar vectors, even if they use different words.
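To make "similar vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare embeddings. The three-dimensional vectors are toy values invented for illustration, not real model output, and `cosine_similarity` is a hypothetical helper rather than part of the original pipeline.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for embeddings (assumed values, illustration only).
runs_small = [0.9, 0.1, 0.0]   # "runs small"
too_tight = [0.8, 0.2, 0.1]    # "too tight in the shoulders"
love_color = [0.0, 0.2, 0.9]   # "love the color"

# The two sizing complaints score closer to each other than to the color comment.
print(cosine_similarity(runs_small, too_tight) > cosine_similarity(runs_small, love_color))  # prints True
```

With real embeddings the same comparison would run on 1536-dimensional vectors; only the length of the lists changes.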
To visualize these vectors, I applied t-SNE (t-distributed Stochastic Neighbor Embedding) to reduce 1536 dimensions down to 2D, creating a scatter plot where clusters represent groups of semantically similar reviews.
Data Pipeline: Loading & Preprocessing
The embedding pipeline starts with loading and cleaning the dataset. I use pandas for data manipulation, drop reviews with missing text, and keep the first 1,000 rows for embedding:
```python
import openai
import pandas as pd
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Load and clean the reviews dataset
df = pd.read_csv('womens_clothing_e-commerce_reviews.csv')
df = df.dropna(subset=['Review Text'])
print(f"Loaded {len(df):,} reviews")

# Sample for visualization (full dataset is 23k+ rows)
reviews = df['Review Text'].tolist()[:1000]
ratings = df['Rating'].tolist()[:1000]
print(f"Sampled {len(reviews)} reviews for embedding")
```
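The `[:1000]` slice in the loading code keeps the first 1,000 rows, which may not be representative if the file happens to be ordered (for example by product). As an alternative, here is a sketch of a seeded random sample that keeps texts and ratings aligned. `sample_pairs` is a hypothetical helper, not part of the original notebook.

```python
import random


def sample_pairs(texts, ratings, k, seed=42):
    """Return k randomly chosen (text, rating) pairs as two aligned lists."""
    if len(texts) != len(ratings):
        raise ValueError("texts and ratings must have the same length")
    k = min(k, len(texts))  # never ask for more rows than exist
    rng = random.Random(seed)  # fixed seed makes the sample reproducible
    indices = sorted(rng.sample(range(len(texts)), k))  # keep original row order
    return [texts[i] for i in indices], [ratings[i] for i in indices]


# Toy data standing in for the review and rating columns.
toy_texts = [f"review {i}" for i in range(20)]
toy_ratings = [i % 5 + 1 for i in range(20)]
sampled_texts, sampled_ratings = sample_pairs(toy_texts, toy_ratings, k=5)
print(len(sampled_texts), len(sampled_ratings))  # prints 5 5
```

When the data is still in a DataFrame, `df.sample(n=1000, random_state=42)` achieves the same thing in one call.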
Embedding Generation: OpenAI API Integration
Each review is converted to a 1536-dimensional vector using OpenAI's text-embedding-3-small model. The function includes type hints and handles the API response structure:
```python
def get_embedding(text: str) -> list[float]:
    """Generate embedding vector for a text string."""
    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Generate embeddings with progress tracking
embeddings = []
for i, review in enumerate(reviews):
    emb = get_embedding(review)
    embeddings.append(emb)
    if (i + 1) % 100 == 0:
        print(f"Processed {i + 1}/{len(reviews)} reviews")

embeddings = np.array(embeddings)
print(f"Embedding matrix shape: {embeddings.shape}")
```
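The loop above makes one API request per review. The embeddings endpoint also accepts a list of strings as `input`, so a batched variant can cut the number of round trips. The sketch below is one possible shape for that. `chunked`, `embed_in_batches`, and `openai_embed_batch` are hypothetical names, and the batch size of 100 is an assumption, not a tuned value.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]


def chunked(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_in_batches(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[Vector]],
    batch_size: int = 100,
) -> list[Vector]:
    """Embed texts batch by batch, preserving input order."""
    vectors: list[Vector] = []
    for batch in chunked(texts, batch_size):
        result = embed_batch(batch)
        if len(result) != len(batch):
            raise RuntimeError("embedder returned a different number of vectors")
        vectors.extend(result)
    return vectors


def openai_embed_batch(batch: Sequence[str]) -> list[Vector]:
    """Embed one batch with the same model the pipeline uses."""
    import openai  # imported lazily so the helpers above work without the package

    response = openai.embeddings.create(
        model="text-embedding-3-small",
        input=list(batch),
    )
    # Sort by index defensively so vectors line up with the input texts.
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]
```

In the pipeline this would replace the loop with something like `embeddings = np.array(embed_in_batches(reviews, openai_embed_batch))`. Passing the embedder in as a function also makes the batching logic testable without network access.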
Visualization: t-SNE Projection
t-SNE reduces the 1536-dimensional embeddings to 2D while preserving local neighborhood structure. Reviews that are semantically similar will cluster together in the visualization:
```python
# Apply t-SNE dimensionality reduction
print("Running t-SNE projection...")
tsne = TSNE(
    n_components=2,
    perplexity=30,
    random_state=42,
    max_iter=1000  # named n_iter in scikit-learn versions before 1.5
)
embeddings_2d = tsne.fit_transform(embeddings)

# Create visualization with rating-based coloring
plt.figure(figsize=(14, 10))
scatter = plt.scatter(
    embeddings_2d[:, 0],
    embeddings_2d[:, 1],
    c=ratings,
    cmap='RdYlGn',
    alpha=0.6,
    s=50
)
plt.colorbar(scatter, label='Customer Rating (1-5)')
plt.title('E-commerce Review Embeddings')
plt.xlabel('t-SNE Dimension 1')
plt.ylabel('t-SNE Dimension 2')
plt.savefig('review_clusters.png', dpi=300)
print("Saved visualization to review_clusters.png")
```
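Since t-SNE aims to preserve local neighborhoods, one rough sanity check is to measure how many of each point's nearest neighbors in the original space are still its nearest neighbors in 2D. The sketch below is a simple pure-Python version of that idea. `nearest_neighbors` and `neighbor_overlap` are hypothetical helpers, not something the original notebook ran, and the brute-force search is only practical on a small subset.

```python
import math


def nearest_neighbors(points, k):
    """For each point, return the set of indices of its k nearest other points."""
    result = []
    for i, p in enumerate(points):
        # Brute force: distance from point i to every other point.
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        result.append({j for _, j in dists[:k]})
    return result


def neighbor_overlap(high_dim, low_dim, k=5):
    """Average fraction of k nearest neighbors shared between the two spaces."""
    if len(high_dim) != len(low_dim):
        raise ValueError("both point sets must have the same number of points")
    if not 1 <= k < len(high_dim):
        raise ValueError("k must be at least 1 and smaller than the number of points")
    high = nearest_neighbors(high_dim, k)
    low = nearest_neighbors(low_dim, k)
    shared = sum(len(h & l) for h, l in zip(high, low))
    return shared / (k * len(high_dim))
```

A value near 1.0 suggests the 2D layout keeps local structure, while a value near 0 suggests the visible clusters should be read with caution. scikit-learn also ships a related `trustworthiness` score in `sklearn.manifold` that scales better than this sketch.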
Insights & Impact
The visualization revealed distinct clusters: positive reviews about fit and quality grouped together, while complaints about sizing formed a separate cluster. This kind of insight can drive product improvements: if 'runs small' complaints cluster together, the sizing chart likely needs updating.
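To move from eyeballing the plot to numbers, each review can be assigned a cluster label and the ratings summarized per cluster. The sketch below assumes the labels already exist, for example from running scikit-learn's `KMeans(...).fit_predict(embeddings)`, a step the original analysis does not describe. `summarize_clusters` is a hypothetical helper and the data is a toy example.

```python
from collections import defaultdict
from statistics import mean


def summarize_clusters(labels, ratings):
    """Return size and mean rating for each cluster label."""
    if len(labels) != len(ratings):
        raise ValueError("labels and ratings must have the same length")
    grouped = defaultdict(list)
    for label, rating in zip(labels, ratings):
        grouped[label].append(rating)
    return {
        label: {"size": len(values), "mean_rating": mean(values)}
        for label, values in sorted(grouped.items())
    }


# Toy labels and ratings (assumed values, illustration only).
toy_labels = [0, 0, 1, 1, 1]
toy_ratings = [5, 4, 2, 1, 3]
for label, stats in summarize_clusters(toy_labels, toy_ratings).items():
    print(label, stats["size"], stats["mean_rating"])
```

A cluster with a low mean rating is a natural place to start reading individual reviews, such as the sizing complaints described above.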
The methodology is applicable beyond e-commerce: support ticket analysis, social media monitoring, competitive intelligence. The embedding approach captures meaning that keyword-based methods miss, enabling deeper understanding of unstructured text data.