Project: Document Ranking using TF-IDF and Okapi BM25
A comparative analysis of TF-IDF and Okapi BM25 on real-world information retrieval tasks.
Overview
This project implements and analyzes two popular document ranking algorithms in Information Retrieval: TF-IDF and Okapi BM25 (Galago). The main goal is to understand their differences in ranking, precision, recall, scalability, and performance on real datasets.
Key Features
- Custom Python implementation of TF-IDF with stopword removal, stemming, and cosine similarity.
- Integration of Galago’s Okapi BM25 model for large-scale probabilistic ranking.
- Comprehensive evaluation on both small and full text corpora.
- Automatic generation of retrieval outputs and formal evaluation with standard IR metrics (Precision, Recall, F1).
Approach
- TF-IDF Implementation
- Performs keyword matching using cosine similarity.
- Applies stopword removal, stemming (Krovetz), and ignores punctuation.
- Treats queries and documents alike in vector space.
- Okapi BM25 (Galago)
- Probabilistic ranking; considers term frequencies and document length.
- Ranks documents based on the probability of term occurrence in relevant vs. non-relevant documents.
- More efficient for large data due to multi-threading and top-k retrieval.
Results & Evaluation
Precision, Recall, and F1-score Comparison
- On a full corpus with all queries:
- TF-IDF:
Precision = 0.010, Recall = 0.926, F1 = 0.019 - Okapi BM25:
Precision = 0.013, Recall = 0.879, F1 = 0.025
- TF-IDF:
-
For a single random query (q1: “what articles exist which deal with tss time sharing system an operating system for ibm computers”):
Metric TF-IDF Okapi BM25 Precision 0.003 0.005 Recall 1.0 1.0 F1-score 0.005 0.009 Both models retrieved all relevant documents, but Okapi BM25 showed higher precision and F1-score.
Ranking Analysis
- Okapi BM25 consistently ranks relevant documents higher than TF-IDF in almost every case.
- Example: Document 1410 ranked 4th in Okapi but 40th in TF-IDF.
Scalability and Usability
- Okapi BM25 is notably faster and better suited for large corpora.
- TF-IDF can become computationally expensive as vocabulary or document count increases.
- BM25 accommodates larger vocabulary, handles document length, and does not penalize queries with more terms.
Interesting Discoveries
- Krovetz Stemmer (dictionary-based) helps reduce vocabulary and improve matches.
- Stopword removal and stemming can dramatically cut computation time.
- Practical note: Galago BM25 model returns results much faster and is robust to the addition of new query terms.
Sample Commands Used
- TF-IDF output evaluation:
galago eval --judgments=... --baseline=tfidfoutput.txt
- Okapi BM25 output evaluation:
galago batch-search --scorer=bm25 > outputokapi.txt galago eval --judgments=... --baseline=outputokapi.txt
Conclusion
- Okapi BM25 outperforms TF-IDF for both ranking and efficiency on large corpora.
- TF-IDF still offers strong recall and can be effective for smaller or well-defined datasets.
- This analysis provides practical insights for choosing the right document ranking approach in real-world search engines.