Project Description |
This project focuses on the development of a Natural Language Processing (NLP) based email classifier designed to support job seekers in managing recruitment-related communication. Using Scikit-learn pipelines, the system automatically categorizes incoming emails into key classes such as Invitation, Rejection, Confirmation, or Other. The solution demonstrates the use of machine learning, TF-IDF vectorization, and multiple model architectures (Naive Bayes, Logistic Regression, and a Two-Stage Pipeline). It highlights both the practical application of NLP techniques and the implementation of clean, modular Python code for a real-world use case. |
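The single-stage models named above can be sketched as follows. This is a minimal illustration, not the project's code: the `build_pipeline` helper, the model keys, and the eight toy emails are hypothetical, and the Two-Stage Pipeline is left out because the description does not define its stages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy, hand-written examples (hypothetical; not the project's dataset).
EMAILS = [
    "We would like to invite you to an interview next week",
    "Please join us for a technical interview on Monday",
    "Unfortunately we will not be moving forward with your application",
    "We regret to inform you that the position has been filled",
    "Thank you for applying, we have received your application",
    "This email confirms that your application was submitted",
    "Weekly newsletter: top ten productivity tips",
    "Your invoice for this month is attached",
]
LABELS = [
    "Invitation", "Invitation",
    "Rejection", "Rejection",
    "Confirmation", "Confirmation",
    "Other", "Other",
]


def build_pipeline(model_name: str) -> Pipeline:
    """Return an unfitted TF-IDF + classifier pipeline (hypothetical helper)."""
    classifiers = {
        "naive_bayes": MultinomialNB(),
        "logistic_regression": LogisticRegression(max_iter=1000),
    }
    return Pipeline([
        # TF-IDF turns raw email text into a sparse weighted term matrix.
        ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
        ("clf", classifiers[model_name]),
    ])


nb_model = build_pipeline("naive_bayes").fit(EMAILS, LABELS)
lr_model = build_pipeline("logistic_regression").fit(EMAILS, LABELS)

new_email = ["We invite you to an interview"]
nb_prediction = nb_model.predict(new_email)[0]
lr_prediction = lr_model.predict(new_email)[0]
print(nb_prediction, lr_prediction)
```

Keeping the vectorizer and the classifier in one `Pipeline` means the same object handles raw text at both training and prediction time, so the vocabulary learned during `fit` is reused automatically.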
Problem | Benefits of Solving |
Email overload during job search | Candidates often receive dozens of emails (spam, confirmations, rejections, invitations); this library sorts them into categories automatically. |
Manual categorization is inefficient | An NLP-based classifier automates the sorting and saves the user time. |
The user needs an easy way to interact with the library | An interface class gives the user a simple way to interact with the application. |
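One possible shape for such an interface class is sketched below. The class name `EmailClassifierInterface`, its method names, and the model keys are assumptions made for illustration; the project does not specify them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


class EmailClassifierInterface:
    """Hypothetical facade: pick a model by name, train it, classify text."""

    def __init__(self):
        # Factories, so each select_model() call starts from an unfitted model.
        self._factories = {
            "naive_bayes": lambda: make_pipeline(
                TfidfVectorizer(), MultinomialNB()
            ),
            "logistic_regression": lambda: make_pipeline(
                TfidfVectorizer(), LogisticRegression(max_iter=1000)
            ),
        }
        self._model = None
        self._trained = False

    def available_models(self):
        return sorted(self._factories)

    def select_model(self, name):
        if name not in self._factories:
            raise ValueError(
                f"Unknown model {name!r}; choose from {self.available_models()}"
            )
        self._model = self._factories[name]()
        self._trained = False
        return self

    def train(self, texts, labels):
        if self._model is None:
            raise RuntimeError("Call select_model() before train().")
        self._model.fit(texts, labels)
        self._trained = True
        return self

    def classify(self, text):
        if not self._trained:
            raise RuntimeError("Call train() before classify().")
        # predict() expects an iterable of documents, so wrap the single text.
        return self._model.predict([text])[0]


# Usage with toy data (hypothetical examples, not the project's dataset).
texts = [
    "We would like to invite you to an interview",
    "Unfortunately we will not proceed with your application",
    "We have received your application",
    "Weekly newsletter with productivity tips",
]
labels = ["Invitation", "Rejection", "Confirmation", "Other"]

ui = EmailClassifierInterface()
ui.select_model("naive_bayes").train(texts, labels)
result = ui.classify("You are invited to an interview on Friday")
print(result)
```

Returning `self` from `select_model` and `train` allows the calls to be chained, which keeps a command-line or notebook session short.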
Indicative Schedule |
Project Start Date: 01.10.2025 |
Indicative Project End Date: 01.11.2025 |
Solution Scope |
During implementation the developer will:
- Collect and prepare a dataset of emails labeled into four categories: Invitation, Rejection, Confirmation, Other.
- Preprocess email text and apply NLP vectorization techniques (TF-IDF).
- Develop and compare three classification approaches: Naive Bayes, Logistic Regression, and a Two-Stage Pipeline.
- Implement object-oriented code structure with reusable classes for training and inference.
- Build a simple interface to allow users to select the model and test predictions on custom input.
- Evaluate performance using accuracy, precision, recall, and F1-score with stratified train-test splits.
- Prepare the software documentation for the project.
|
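The evaluation step in the scope list could look roughly like the sketch below. It assumes a toy dataset of four hand-written emails per category, a 50/50 split, and macro averaging; none of these choices come from the project itself.

```python
from collections import Counter

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Toy dataset (hypothetical): four short emails per category.
samples = {
    "Invitation": [
        "We would like to invite you to an interview",
        "You are invited to a technical interview",
        "Please schedule your interview with our team",
        "We invite you to the final interview round",
    ],
    "Rejection": [
        "Unfortunately we will not proceed with your application",
        "We regret to inform you that you were not selected",
        "We have decided to move forward with other candidates",
        "Unfortunately the position has been filled",
    ],
    "Confirmation": [
        "We have received your application",
        "This confirms that your application was submitted",
        "Thank you for applying, your application is under review",
        "Your application has been received successfully",
    ],
    "Other": [
        "Weekly newsletter with productivity tips",
        "Your invoice for this month is attached",
        "Reminder: update your account password",
        "Big discount on all courses this weekend",
    ],
}
texts = [text for group in samples.values() for text in group]
labels = [label for label, group in samples.items() for _ in group]

# stratify=labels keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=42
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Macro averaging weights every class equally, regardless of its size.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
report = {
    "accuracy": accuracy_score(y_test, y_pred),
    "precision": precision,
    "recall": recall,
    "f1": f1,
}
print(Counter(y_test))
print(report)
```

With a dataset this small the scores themselves carry little meaning; the point of the sketch is the stratified split and the four metrics computed on the held-out portion.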
Project Assumptions |
- The dataset will consist of a mix of emails covering all four categories.
- The system will be trained and tested on a limited dataset but structured for scalability.
- Users will interact with the classifier through a simple interface.
- Scikit-learn will be the main ML framework, chosen for interpretability and simplicity.
- No integration with live email inboxes will be developed; the scope is limited to classification only.
|
Key Risks | Rating |
Insufficient dataset size | High |
Class imbalance | Medium |
Overfitting on synthetic data | Medium |