Project Scope

Project Description
This project develops a Natural Language Processing (NLP) email classifier that helps job seekers manage recruitment-related communication. Using Scikit-learn pipelines, the system automatically categorizes incoming emails into four classes: Invitation, Rejection, Confirmation, or Other. The solution demonstrates TF-IDF vectorization and three model architectures (Naive Bayes, Logistic Regression, and a Two-Stage Pipeline). It highlights both the practical application of NLP techniques and the implementation of clean, modular Python code for a real-world use case.
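A minimal sketch of how the TF-IDF pipelines described above could be assembled, assuming scikit-learn's `Pipeline`, `TfidfVectorizer`, `MultinomialNB`, and `LogisticRegression`. The step names, the model keys, and the toy emails are illustrative assumptions, not the project's actual code or dataset. The Two-Stage Pipeline is left out because its design is not specified here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy, hand-written examples for illustration only (not the project dataset).
TOY_EMAILS = [
    "We would like to invite you to an interview next week",
    "Please join us for an interview on Monday morning",
    "Unfortunately we will not be moving forward with your application",
    "We regret to inform you that the position has been filled",
    "We confirm that we have received your application",
    "This email confirms that your application was submitted",
    "Weekly newsletter with career tips and articles",
    "Your account password was changed successfully",
]
TOY_LABELS = [
    "Invitation", "Invitation",
    "Rejection", "Rejection",
    "Confirmation", "Confirmation",
    "Other", "Other",
]

# One pipeline per model: raw text -> TF-IDF features -> classifier.
pipelines = {
    "naive_bayes": Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True)),
        ("clf", MultinomialNB()),
    ]),
    "logistic_regression": Pipeline([
        ("tfidf", TfidfVectorizer(lowercase=True)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
}

# Fit each pipeline on the toy data and classify one unseen email.
sample = "Can you attend an interview on Friday?"
predictions = {}
for name, pipeline in pipelines.items():
    pipeline.fit(TOY_EMAILS, TOY_LABELS)
    predictions[name] = pipeline.predict([sample])[0]
    print(f"{name}: {predictions[name]}")
```

Keeping the vectorizer inside the pipeline means the vocabulary is learned only from the data passed to `fit`, so the same object can be reused for both training and inference.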
Problem | Benefits of Solving
Email overload during job search | Candidates often receive dozens of emails (spam, confirmations, rejections, invitations), which this library can categorize automatically.
Manual categorization is inefficient | NLP lets the user set up an automatic classifier, saving time.
Users need an easy way to interact with the library | An interface class lets the user interact with the application easily.
Indicative Schedule
Project Start Date: 01.10.2025
Indicative Project End Date: 01.11.2025
Solution Scope
During implementation the developer will:
  • Collect and prepare a dataset of emails labeled into four categories: Invitation, Rejection, Confirmation, Other.
  • Preprocess email text and apply NLP vectorization techniques (TF-IDF).
  • Develop and compare three classification approaches: Naive Bayes, Logistic Regression, and a Two-Stage Pipeline.
  • Implement object-oriented code structure with reusable classes for training and inference.
  • Build a simple interface to allow users to select the model and test predictions on custom input.
  • Evaluate performance using accuracy, precision, recall, and F1-score with stratified train-test splits.
  • Prepare the project's software documentation.
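The evaluation step listed above could look roughly like the following sketch. It assumes scikit-learn's `train_test_split` with `stratify`, plus `accuracy_score` and `precision_recall_fscore_support` with macro averaging. The `evaluate` helper, the 25% test size, the choice of macro averaging, and the toy emails are assumptions made for illustration, so the printed scores say nothing about the real system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data for illustration only: four hand-written emails per category.
TOY_DATA = {
    "Invitation": [
        "We would like to invite you to an interview next week",
        "Please join us for an interview on Monday morning",
        "Are you available for a phone interview tomorrow",
        "You are invited to the second interview round",
    ],
    "Rejection": [
        "Unfortunately we will not be moving forward with your application",
        "We regret to inform you that the position has been filled",
        "We have decided to proceed with other candidates",
        "Unfortunately your profile does not match our requirements",
    ],
    "Confirmation": [
        "We confirm that we have received your application",
        "This email confirms that your application was submitted",
        "Thank you for applying, your application has been received",
        "Your application for the position was received successfully",
    ],
    "Other": [
        "Weekly newsletter with career tips and articles",
        "Your account password was changed successfully",
        "New jobs matching your saved search are available",
        "Please verify your email address to activate the account",
    ],
}
texts = [text for emails in TOY_DATA.values() for text in emails]
labels = [label for label, emails in TOY_DATA.items() for _ in emails]

# Stratified split: each category keeps the same share in train and test.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

candidates = {
    "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
    "logistic_regression": make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)
    ),
}


def evaluate(model, X_train, y_train, X_test, y_test):
    """Fit the model and return accuracy plus macro precision, recall, F1."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": float(accuracy_score(y_test, y_pred)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }


results = {
    name: evaluate(model, X_train, y_train, X_test, y_test)
    for name, model in candidates.items()
}
for name, metrics in results.items():
    print(name, {key: round(value, 2) for key, value in metrics.items()})
```

Macro averaging weights every category equally, which is one reasonable choice when class imbalance is a listed risk; weighted averaging would be an alternative.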
Project Assumptions
  1. The dataset will consist of a mix of emails covering all four categories.
  2. The system will be trained and tested on a limited dataset but structured for scalability.
  3. Users will interact with the classifier through a simple interface.
  4. Scikit-learn will be the main ML framework, chosen for interpretability and simplicity.
  5. No integration with live email inboxes will be developed; the scope is limited to classification only.
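As a rough illustration of assumptions 3 and 4, the interface could be a small class that selects a Scikit-learn model by name, trains it, and classifies custom input. The class name `EmailClassifierInterface`, its methods, and the model keys are hypothetical, and the training emails are toy examples rather than the project dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


class EmailClassifierInterface:
    """Hypothetical facade: choose a model by name, train it, classify text."""

    # Factories create a fresh, unfitted classifier for each instance.
    MODEL_FACTORIES = {
        "naive_bayes": lambda: MultinomialNB(),
        "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    }

    def __init__(self, model_name="naive_bayes"):
        if model_name not in self.MODEL_FACTORIES:
            options = ", ".join(sorted(self.MODEL_FACTORIES))
            raise ValueError(f"Unknown model '{model_name}'. Choose from: {options}")
        self.model_name = model_name
        classifier = self.MODEL_FACTORIES[model_name]()
        self._pipeline = make_pipeline(TfidfVectorizer(), classifier)
        self._is_trained = False

    def train(self, texts, labels):
        """Fit the TF-IDF vectorizer and the classifier on labeled emails."""
        self._pipeline.fit(texts, labels)
        self._is_trained = True
        return self

    def classify(self, email_text):
        """Return the predicted category for a single email text."""
        if not self._is_trained:
            raise RuntimeError("Call train() before classify().")
        return str(self._pipeline.predict([email_text])[0])


# Toy training data for illustration only.
TOY_EMAILS = [
    "We would like to invite you to an interview next week",
    "Please join us for an interview on Monday morning",
    "Unfortunately we will not be moving forward with your application",
    "We regret to inform you that the position has been filled",
    "We confirm that we have received your application",
    "This email confirms that your application was submitted",
    "Weekly newsletter with career tips and articles",
    "Your account password was changed successfully",
]
TOY_LABELS = [
    "Invitation", "Invitation",
    "Rejection", "Rejection",
    "Confirmation", "Confirmation",
    "Other", "Other",
]

interface = EmailClassifierInterface("logistic_regression").train(TOY_EMAILS, TOY_LABELS)
prediction = interface.classify("Can you attend an interview on Friday?")
print(prediction)
```

Because the class only accepts text passed in by the caller, it stays within the stated scope: nothing here connects to a live inbox.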
Key Risks | Rating
Insufficient dataset size | High
Class imbalance | Medium
Overfitting on synthetic data | Medium