Project Scope

Project Description
This project presents a production-oriented Natural Language Processing (NLP) email classification system that helps job seekers organize and interpret recruitment-related email. Built with Python and Scikit-learn, the application automatically classifies emails into categories such as Invitation, Rejection, Confirmation, or Non job-hunt related, reducing the manual sorting effort during a job search.
The system supports two complementary classification architectures: a three-stage binary pipeline, in which emails are progressively filtered through specialized models (job-related detection → confirmation detection → invitation vs rejection); and a multiclass pipeline, which predicts all classes directly with a single model. Both approaches are implemented as modular Scikit-learn pipelines with TF-IDF vectorization and a flexible classifier registry that supports a wide range of algorithms, including Naive Bayes, Logistic Regression, SVMs, ensemble methods, and meta-classifiers such as VotingClassifier and StackingClassifier.
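The two architectures described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the names `CLASSIFIER_REGISTRY`, `build_pipeline`, and `ThreeStageClassifier` are hypothetical, the registry holds only three of the algorithms mentioned, and the toy emails are invented purely to make the example runnable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

LABELS = ("Invitation", "Rejection", "Confirmation", "Other")

# Hypothetical registry: short name -> factory returning an unfitted classifier.
CLASSIFIER_REGISTRY = {
    "naive_bayes": MultinomialNB,
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "linear_svm": LinearSVC,
}


def build_pipeline(classifier_name):
    """Return an unfitted TF-IDF + classifier pipeline for the given registry name."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", CLASSIFIER_REGISTRY[classifier_name]()),
    ])


class ThreeStageClassifier:
    """Cascade: job-related? -> confirmation? -> invitation vs rejection."""

    def __init__(self, classifier_name="logistic_regression"):
        self.job_related = build_pipeline(classifier_name)
        self.confirmation = build_pipeline(classifier_name)
        self.invite_vs_reject = build_pipeline(classifier_name)

    def fit(self, texts, labels):
        # Stage 1 sees every email: 1 = job-related, 0 = other.
        self.job_related.fit(texts, [int(lab != "Other") for lab in labels])
        # Stage 2 sees only job-related emails: 1 = confirmation, 0 = not.
        job = [(t, lab) for t, lab in zip(texts, labels) if lab != "Other"]
        self.confirmation.fit(
            [t for t, _ in job], [int(lab == "Confirmation") for _, lab in job]
        )
        # Stage 3 sees only the remaining invitations and rejections.
        rest = [(t, lab) for t, lab in job if lab != "Confirmation"]
        self.invite_vs_reject.fit([t for t, _ in rest], [lab for _, lab in rest])
        return self

    def predict_one(self, text):
        if not self.job_related.predict([text])[0]:
            return "Other"
        if self.confirmation.predict([text])[0]:
            return "Confirmation"
        return str(self.invite_vs_reject.predict([text])[0])


# Invented toy emails, two per class, only to exercise the code.
toy_texts = [
    "We would like to invite you to an interview next week",
    "Please join us for a technical interview on Monday",
    "Unfortunately we will not proceed with your application",
    "We regret to inform you that the position has been filled",
    "Thank you for applying, we have received your application",
    "This email confirms that your application was submitted",
    "Huge discounts on shoes this weekend only",
    "Your monthly newsletter with cooking recipes",
]
toy_labels = [
    "Invitation", "Invitation", "Rejection", "Rejection",
    "Confirmation", "Confirmation", "Other", "Other",
]

sample = "We are pleased to invite you to an interview"
multiclass = build_pipeline("naive_bayes").fit(toy_texts, toy_labels)
cascade = ThreeStageClassifier("logistic_regression").fit(toy_texts, toy_labels)
multiclass_prediction = str(multiclass.predict([sample])[0])
cascade_prediction = cascade.predict_one(sample)
```

In this sketch each cascade stage is trained only on the emails that would reach it, so later stages never see the classes already filtered out; the multiclass pipeline trades that specialization for a single model and a single prediction call.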
Problem | Benefits of Solving
Email overload during job search | Candidates often receive dozens of emails (spam, confirmations, rejections, invitations) that this library can categorize automatically.
Manual categorization is inefficient | NLP lets the user set up an automatic classifier, saving time.
More than one classifier type is needed for flexibility | The application architecture will support multiple classifiers.
The user needs an easy way to interact with the library | An interface class will give the user a simple way to interact with the application.
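One possible shape for such an interface class is sketched below. The class name `EmailClassifierInterface`, its methods, and the helper `default_interface` are all hypothetical; the sketch assumes each registered model is a Scikit-learn estimator that accepts raw text, such as a TF-IDF pipeline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


class EmailClassifierInterface:
    """Hypothetical facade: pick a model by name, train it, classify text."""

    def __init__(self, models):
        self._models = dict(models)   # name -> unfitted text estimator
        self._active_name = None
        self._trained = set()

    def available_models(self):
        return sorted(self._models)

    def select_model(self, name):
        if name not in self._models:
            raise KeyError(f"Unknown model {name!r}; choose from {self.available_models()}")
        self._active_name = name

    def train(self, texts, labels):
        self._active().fit(texts, labels)
        self._trained.add(self._active_name)

    def classify(self, text):
        model = self._active()
        if self._active_name not in self._trained:
            raise RuntimeError("Train the selected model before classifying.")
        return str(model.predict([text])[0])

    def _active(self):
        if self._active_name is None:
            raise RuntimeError("Call select_model() first.")
        return self._models[self._active_name]


def default_interface():
    """Build an interface with two illustrative TF-IDF pipelines."""
    return EmailClassifierInterface({
        "naive_bayes": make_pipeline(TfidfVectorizer(), MultinomialNB()),
        "logistic_regression": make_pipeline(
            TfidfVectorizer(), LogisticRegression(max_iter=1000)
        ),
    })
```

A caller would typically run `select_model(...)`, then `train(texts, labels)`, then `classify("some email text")`; keeping model selection separate from training lets the same labeled data be reused across several classifiers.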
Indicative Schedule
Project Start Date: 01.10.2025
Project End Date: 31.01.2026
Solution Scope
During implementation the developer will:
  • Collect and prepare a dataset of emails labeled into four categories: Invitation, Rejection, Confirmation, Other.
  • Preprocess email text and apply NLP vectorization techniques (TF-IDF).
  • Develop and compare three classification approaches.
  • Implement object-oriented code structure with reusable classes for training and inference.
  • Build a simple interface to allow users to select the model and test predictions on custom input.
  • Evaluate performance using accuracy, precision, recall, and F1-score with stratified train-test splits.
  • Prepare the project's software documentation.
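The evaluation step in the list above could look roughly like the following sketch. The function name `evaluate`, the macro averaging, the 25% test share, and the choice of Logistic Regression are assumptions made for illustration; the project may use different settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline


def evaluate(texts, labels, test_size=0.25, seed=42):
    """Train on a stratified split and return accuracy, precision, recall, F1."""
    # stratify=labels keeps the class proportions equal in both splits.
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)

    # Macro averaging weights every class equally, regardless of its size.
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, y_pred, average="macro", zero_division=0
    )
    return {
        "accuracy": float(accuracy_score(y_test, y_pred)),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }
```

Stratification requires at least two examples per class and a test split large enough to hold one example of each class, which is worth keeping in mind given the limited dataset assumed for this project.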
Project Assumptions
  1. The dataset will consist of a mix of emails covering all categories.
  2. The system will be trained and tested on a limited dataset but structured for scalability.
  3. Users will interact with the classifier through a simple interface.
  4. Scikit-learn will be the main ML framework, chosen for interpretability and simplicity.
  5. No integration with live email inboxes will be developed; the scope is limited to classification only.
Key Risks | Rating
Insufficient dataset size | High
Class imbalance | Medium
Overfitting on synthetic data | Medium