| Project Description |
| This project presents a production-oriented Natural Language Processing (NLP) email classification system that helps job seekers organize and interpret recruitment-related email. Built with Python and Scikit-learn, the application automatically classifies emails into categories such as Invitation, Rejection, Confirmation, or Non job-hunt related, reducing the manual effort of sorting mail during a job search. The system supports two complementary classification architectures: a three-stage binary pipeline, in which emails are progressively filtered through specialized models (job-related detection → confirmation detection → invitation vs. rejection); and a multiclass pipeline, which predicts all classes directly with a single model. Both approaches are implemented as modular Scikit-learn pipelines with TF-IDF vectorization and a flexible classifier registry supporting a wide range of algorithms, including Naive Bayes, Logistic Regression, SVMs, ensemble methods, and meta-classifiers such as VotingClassifier and StackingClassifier. |
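The three-stage architecture described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the names `CLASSIFIER_REGISTRY`, `build_pipeline`, and `ThreeStageClassifier`, the label string "Other", and the toy emails are all hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical registry: maps a short name to a factory for a fresh estimator.
CLASSIFIER_REGISTRY = {
    "naive_bayes": lambda: MultinomialNB(),
    "logistic_regression": lambda: LogisticRegression(max_iter=1000),
    "linear_svm": lambda: LinearSVC(),
}


def build_pipeline(name):
    """Return an unfitted TF-IDF + classifier pipeline for a registry name."""
    return Pipeline([
        ("tfidf", TfidfVectorizer()),
        ("clf", CLASSIFIER_REGISTRY[name]()),
    ])


class ThreeStageClassifier:
    """Cascade: job-related? -> confirmation? -> invitation vs. rejection."""

    def __init__(self, classifier_name="logistic_regression"):
        self.job_related = build_pipeline(classifier_name)
        self.confirmation = build_pipeline(classifier_name)
        self.outcome = build_pipeline(classifier_name)

    def fit(self, texts, labels):
        # Stage 1 sees every email: job-related (1) vs. everything else (0).
        self.job_related.fit(texts, [int(l != "Other") for l in labels])
        # Stage 2 sees only job-related emails: confirmation (1) vs. not (0).
        job = [(t, l) for t, l in zip(texts, labels) if l != "Other"]
        self.confirmation.fit(
            [t for t, _ in job], [int(l == "Confirmation") for _, l in job]
        )
        # Stage 3 sees only the remainder: Invitation vs. Rejection.
        rest = [(t, l) for t, l in job if l != "Confirmation"]
        self.outcome.fit([t for t, _ in rest], [l for _, l in rest])
        return self

    def predict_one(self, text):
        if not self.job_related.predict([text])[0]:
            return "Other"
        if self.confirmation.predict([text])[0]:
            return "Confirmation"
        return str(self.outcome.predict([text])[0])


# Hypothetical toy data, far too small for a meaningful model.
texts = [
    "We would like to invite you to an interview next week",
    "Please join us for a technical interview on Monday",
    "Unfortunately we will not be moving forward with your application",
    "We regret to inform you that the position has been filled",
    "Thank you for applying, we have received your application",
    "This email confirms that your application was submitted",
    "Huge discounts on shoes this weekend only",
    "Your monthly electricity bill is now available",
]
labels = [
    "Invitation", "Invitation", "Rejection", "Rejection",
    "Confirmation", "Confirmation", "Other", "Other",
]

cascade = ThreeStageClassifier("logistic_regression").fit(texts, labels)
print(cascade.predict_one("We invite you to an interview"))
```

Keeping the registry as a mapping of factories means each stage gets its own fresh estimator, and swapping algorithms only requires changing the name passed to the constructor.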
| Problem |
Benefits of Solving |
| Email overload during job search |
Candidates often receive dozens of emails (spam, confirmations, rejections, invitations), which this library can categorize automatically. |
| Manual categorization is inefficient |
Using NLP allows the user to set up an automatic classifier, saving time. |
| More than one classifier type is needed for flexibility. |
The application architecture will support multiple classifiers. |
| Users need an easy way to interact with the library. |
An interface class will give users a simple way to interact with the application. |
| Indicative Schedule |
| Project Start Date: 01.10.2025 |
Project End Date: 31.01.2026 |
| Solution Scope |
During implementation the developer will:
- Collect and prepare a dataset of emails labeled into four categories: Invitation, Rejection, Confirmation, Other.
- Preprocess email text and apply NLP vectorization techniques (TF-IDF).
- Develop and compare three classification approaches.
- Implement object-oriented code structure with reusable classes for training and inference.
- Build a simple interface to allow users to select the model and test predictions on custom input.
- Evaluate performance using accuracy, precision, recall, and F1-score with stratified train-test splits.
- Prepare software documentation for the project.
|
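The evaluation step listed above (stratified train-test split, then accuracy, precision, recall, and F1-score) might look like the sketch below for the single-model multiclass pipeline. The `SAMPLES` corpus is hypothetical and far too small to give meaningful scores; the choice of Logistic Regression and macro averaging is likewise an assumption for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: four short emails per category.
SAMPLES = {
    "Invitation": [
        "We would like to invite you to an interview",
        "Please schedule your interview with our team",
        "You are invited to a technical interview",
        "Can you join an interview call on Friday",
    ],
    "Rejection": [
        "Unfortunately we will not proceed with your application",
        "We regret to inform you the role has been filled",
        "We decided to move forward with other candidates",
        "Unfortunately you were not selected for this position",
    ],
    "Confirmation": [
        "We have received your application",
        "This confirms your application was submitted",
        "Thank you for applying, your application is received",
        "Your application has been successfully submitted",
    ],
    "Other": [
        "Your electricity bill is ready",
        "Weekend sale on shoes and jackets",
        "Your package has been shipped",
        "Reset your streaming service password",
    ],
}
texts = [t for group in SAMPLES.values() for t in group]
labels = [label for label, group in SAMPLES.items() for _ in group]

# stratify=labels keeps the class proportions equal in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.5, stratify=labels, random_state=42
)

model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression(max_iter=1000)),
])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
# Macro averaging weights every class equally, which matters under imbalance.
precision, recall, f1, _ = precision_recall_fscore_support(
    y_test, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

Macro-averaged scores are one reasonable choice here because the Key Risks list class imbalance; weighted or per-class reporting would be equally valid alternatives.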
| Project Assumptions |
- The dataset will consist of a mix of emails covering all categories.
- The system will be trained and tested on a limited dataset but structured for scalability.
- Users will interact with the classifier through a simple interface.
- Scikit-learn will be the main ML framework, chosen for interpretability and simplicity.
- No integration with live email inboxes will be developed; the scope is limited to classification only.
|
| Key Risks |
Rating |
| Insufficient dataset size |
High |
| Class imbalance |
Medium |
| Overfitting on synthetic data |
Medium |