Task-Clif: Query Re-Ranking System
An advanced Information Retrieval project that demonstrates how to improve the relevance of search results in the biomedical domain. This system processes natural language queries, performs searches on PubMed, and applies different re-ranking algorithms to optimize the classification of the retrieved documents.
đ§© Features
- Multiple Re-Ranking Strategies: Implements and compares various re-ranking algorithms, from classic models like TF-IDF and BM25 to advanced semantic models like PubMedBERT.
- AI-Powered Keyword Extraction: Uses Googleâs Gemini API to analyze the userâs query and extract the most relevant keywords, improving the precision of the initial search.
- Automated Evaluation Pipeline: Includes scripts to calculate and analyze standard Information Retrieval performance metrics (like P@k, MAP, MRR), allowing for a quantitative evaluation of each model.
- PubMed Integration: Connects directly to the PubMed database via its API to retrieve scientific articles and relevant biomedical literature.
- Configurable System: Allows for easy switching between datasets (train/test), query types, and re-ranking models through a
.envconfiguration file, facilitating experimentation.
đĄ Technologies Used
- Python: As the primary language for the systemâs logic and data analysis.
- Google Gemini API: For natural language processing and keyword extraction.
- PubMed API: For retrieving documents from the PubMed repository.
- Re-Ranking Models:
- TF-IDF: A classic vector model that scores relevance based on word frequency.
- BM25: A state-of-the-art probabilistic ranking algorithm, a standard in many modern search engines for its effectiveness.
- PubMedBERT: A language model based on Transformers (BERT) pre-trained specifically on biomedical texts to understand context and semantics.
- Pandas & Scikit-learn: For data manipulation and the calculation of evaluation metrics.
đŻ Objective
The objective of this project is to address a fundamental challenge in search systems: the initial results returned by a general search engine (like PubMed) are not always the most useful for the user. This is the essence of the re-ranking task in Information Retrieval.
This project demonstrates how to build a âsecond-passâ system that takes an initial list of results and reorders it by applying more sophisticated relevance models. On a theoretical level, two approaches are explored:
- Lexical Relevance (BM25): This model doesnât just count words; it calculates a relevance score that considers term frequency within the document and across the entire collection, avoiding bias towards longer documents. It is an industry standard for its robustness.
- Semantic Relevance (PubMedBERT): Unlike lexical models, PubMedBERT understands the meaning behind the words. Having been trained on millions of biomedical texts, it can determine if a document is relevant to a query even if they donât share the exact same keywords, capturing the userâs intent much more deeply.
The goal is to illustrate how these re-ranking techniques can dramatically improve the quality of search results, delivering more precise and relevant information to the end-user.
đ Repository
You can find the full source code and setup instructions for this demo in the official GitHub repository.