Project: Dish Recommendation System with Collaborative Filtering and Clustering
A machine learning project exploring two models—collaborative filtering and ingredient-based clustering—for restaurant dish recommendations.
Overview
This project implements two distinct models for dish recommendation using real-world user rating datasets:
- Model 1: Memory-based collaborative filtering (user-user)
- Model 2: Hard clustering of dishes (ingredient-based, K-medoids)
Both models were evaluated on their ability to predict user ratings and generate ranked recommendations, using real datasets of user–dish ratings and dish ingredient features.
Features
- Handles sparse and noisy real-world data
- Compares collaborative and content-based approaches
- Evaluation on multiple ranking metrics
- Command-line interface for reproducible experiments
Running the Code
Model 1 (Collaborative Filtering)
```
python3 MODELPART1.py dishes.csv user_ratings_train.json user_ratings_test.json
```
- `dishes.csv`: Contains dish metadata (dish IDs, ingredients)
- `user_ratings_train.json`: User–dish ratings for training
- `user_ratings_test.json`: User–dish ratings for testing
Model 2 (K-Medoids Clustering)
```
python3 MODELPART2.py dishes.csv user_ratings_test.json 5
```
- The final argument is the number of clusters (e.g., 5). An illustrative sketch of the assumed input formats follows.
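For reference, the inputs might look roughly like this (an illustrative sketch only; the exact column layout and JSON keys expected by MODELPART1.py and MODELPART2.py may differ):

```python
# Illustrative loader for the assumed input formats -- the column layout and
# JSON keys shown here are guesses, not the scripts' actual parsing logic.
import csv
import json

# dishes.csv: one row per dish, e.g. a dish ID followed by its ingredients
with open("dishes.csv", newline="") as f:
    dishes = list(csv.reader(f))        # e.g. ["42", "tomato", "basil", "mozzarella"]

# user_ratings_*.json: per-user dish ratings on the 1-5 scale
with open("user_ratings_train.json") as f:
    train_ratings = json.load(f)        # e.g. {"user_7": {"42": 4, "13": 2}}
```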
Methodology
Model 1: Collaborative Filtering
- Supervised learning using train/test split
- Computes per-user average ratings
- Predicts ratings for test users from similar users' histories (cosine similarity on rating vectors); see the sketch after this list
- If no overlapping ratings, defaults to the user’s average rating from training data
- Clamps predicted ratings to the valid interval [1, 5]
- Metrics computed: MAE, Precision@n, Recall@n
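A minimal sketch of the prediction step described above (placeholder names; not the exact code in MODELPART1.py):

```python
import numpy as np

def predict_rating(target_user, dish, ratings, user_means):
    """Predict target_user's rating for dish via user-user collaborative filtering.

    ratings: dict user -> {dish: rating} from the training file
    user_means: dict user -> average training rating
    """
    sims, neighbor_ratings = [], []
    target = ratings.get(target_user, {})
    for other, other_ratings in ratings.items():
        if other == target_user or dish not in other_ratings:
            continue
        # Cosine similarity over the dishes both users have rated
        common = set(target) & set(other_ratings)
        if not common:
            continue
        a = np.array([target[d] for d in common], dtype=float)
        b = np.array([other_ratings[d] for d in common], dtype=float)
        sims.append(a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b)))
        neighbor_ratings.append(other_ratings[dish])

    if not sims:
        # No overlapping ratings: fall back to the user's training average
        # (3.0 = scale midpoint, used only if the user is entirely unseen)
        pred = user_means.get(target_user, 3.0)
    else:
        pred = np.average(neighbor_ratings, weights=sims)
    return float(np.clip(pred, 1.0, 5.0))   # clamp to the valid interval [1, 5]
```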
Observations
- Works well with dense data; struggles with sparse user–item overlap.
- Average MAE ≈ 0.6 for typical datasets.
- Performance (especially recall) improves as training data grows.
- Limited by lack of shared history between users.
Model 2: Ingredient-based Clustering
- Content-based: Dishes are clustered by ingredient similarity (binary vectors, custom similarity metric).
- Uses K-Medoids (robust to outliers, binary-friendly) with customizable number of clusters.
- Clusters are evaluated for tightness and separation via within-cluster and between-cluster sums of squared errors (WC_SSE, BC_SSE).
- A user's test ratings are predicted as the mean rating within the dish's assigned cluster.
- Recommendations are made by selecting each user's highest-rated clusters (see the sketch after this list).
- Metrics computed: MAE, Precision@n, Recall@n
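A compact sketch of the approach above. The project's similarity metric is custom and not specified here, so Jaccard distance on binary ingredient vectors is used purely as a stand-in, with a plain alternating K-medoids; the prediction helper averages the user's own known ratings inside the target dish's cluster, which is one reading of "mean rating within the assigned cluster":

```python
import numpy as np

def jaccard_distances(X):
    """Pairwise Jaccard distance between binary ingredient vectors (rows of X)."""
    X = np.asarray(X, dtype=int)
    inter = X @ X.T                                      # |A ∩ B|
    row = X.sum(axis=1)
    union = row[:, None] + row[None, :] - inter          # |A ∪ B|
    with np.errstate(divide="ignore", invalid="ignore"):
        D = 1.0 - inter / union
    return np.nan_to_num(D)                              # empty vs. empty -> distance 0

def k_medoids(D, k, n_iter=100, seed=0):
    """Plain alternating K-medoids on a precomputed distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(D), size=k, replace=False)
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)        # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members):
                # New medoid = the member with the smallest total distance to its cluster
                costs = D[np.ix_(members, members)].sum(axis=1)
                new_medoids[c] = members[np.argmin(costs)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return np.argmin(D[:, medoids], axis=1), medoids

def predict(user_ratings, dish_cluster, target_dish, scale_midpoint=3.0):
    """Predict a rating as the mean of the user's known ratings in the dish's cluster."""
    c = dish_cluster[target_dish]
    in_cluster = [r for d, r in user_ratings.items() if dish_cluster.get(d) == c]
    return float(np.mean(in_cluster)) if in_cluster else scale_midpoint
```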
Observations
- MAE values improve substantially as K increases (more clusters = finer grouping).
- The model outperforms a mode-rating baseline on all datasets.
- Recommendations are better aligned with users' actual taste trends.
- Using the cluster's mean rating as its label gives the best results.
Results Summary
| Metric | Model 1 (500) | Model 1 (2000) | Model 2, K=50 (500) | Model 2, K=50 (2000) |
|---|---|---|---|---|
| MAE | 0.63 | 0.58 | 0.34 | 0.34 |
| Prec@10 | 0.89 | 0.91 | 0.95 | 0.95 |
| Recall@10 | 0.43 | 0.44 | 0.46 | 0.46 |
- Model 2 consistently outperforms Model 1 on all metrics, especially with more clusters and data.
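For completeness, the reported metrics can be computed along these lines (a sketch; treating a dish as relevant when its actual rating is at least 4 is an assumed definition, not necessarily the one used in the report):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between actual and predicted ratings."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def precision_recall_at_n(actual, predicted, n=10, rel_threshold=4.0):
    """Precision@n / Recall@n for one user; actual and predicted map dish -> rating."""
    top_n = sorted(predicted, key=predicted.get, reverse=True)[:n]   # top-n recommendations
    relevant = {d for d, r in actual.items() if r >= rel_threshold}  # assumed relevance rule
    hits = len(relevant & set(top_n))
    precision = hits / len(top_n) if top_n else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```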
Insights
- Collaborative filtering struggles with sparse user–dish overlap, especially in cold-start scenarios.
- Clustering by dish ingredients (content-based) captures meaningful food similarities, producing better predictions and recommendations.
- Higher cluster counts (K) generally improve model performance by capturing finer-grained food preferences.
Please see the class report for additional insights: Report
How to Use
- Plug in your own CSV/JSON files with dish and user-rating data.
- Choose the model appropriate for your data density (collaborative vs. content-based).
- Tune the number of clusters (`K`) for best results in Model 2; a small K-sweep sketch follows this list.
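A quick way to tune K is to sweep a few values and compare the printed metrics, for example (assuming MODELPART2.py reports its metrics on stdout):

```python
# Hypothetical sweep: rerun Model 2 for several cluster counts and collect output.
import subprocess

for k in (5, 10, 25, 50):
    result = subprocess.run(
        ["python3", "MODELPART2.py", "dishes.csv", "user_ratings_test.json", str(k)],
        capture_output=True, text=True, check=True,
    )
    print(f"--- K = {k} ---")
    print(result.stdout.strip())
```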