| Project Description |
| Cloud Data Extractor (CDE) is a modular Python library that standardizes and simplifies retrieving structured data from Google Cloud Firestore and transforming it into analytics-ready formats. The project addresses the recurring need for a reusable, reliable, and testable data extraction layer that can serve machine learning pipelines, analytics workflows, and application backends. Instead of coupling applications directly to the Firestore SDK, CDE introduces an abstraction layer that handles authentication, connection management, collection retrieval, error handling, and data transformation. |
| Problem | Benefits of Solving |
| Firestore requires setup that can be complicated for users. | The library will make connecting an application to GCP Firestore easier and more user-friendly. |
| Firestore lacks a built-in way to convert documents to pandas DataFrames. | The library will provide methods that allow users to convert fetched data into pandas DataFrames and other useful data structures. |
| There is no easy mechanism to reuse Firestore data across different applications. | Using this library will spare users from duplicating Firestore connection code across projects. |
| Users need an easy way to interact with the library. | An interface class will give users a simple way to interact with the library. |
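The document-to-DataFrame conversion named in the table above could look roughly like the sketch below. It assumes each fetched document exposes an `id` attribute and a `to_dict()` method, as Firestore's `DocumentSnapshot` does. The helper names `snapshots_to_records` and `records_to_dataframe` and the `id_field` parameter are hypothetical, introduced only for illustration. pandas is imported lazily so the flattening step can be tried without it installed.

```python
from typing import Any, Dict, Iterable, List


def snapshots_to_records(
    snapshots: Iterable[Any], id_field: str = "id"
) -> List[Dict[str, Any]]:
    """Flatten document snapshots into plain dicts, one per document.

    Hypothetical helper: each snapshot only needs an `id` attribute and a
    `to_dict()` method, which Firestore's DocumentSnapshot provides.
    """
    records = []
    for snap in snapshots:
        # to_dict() returns None when the document does not exist.
        data = snap.to_dict() or {}
        # The document ID wins if the payload already has a field
        # named `id_field`.
        records.append({**data, id_field: snap.id})
    return records


def records_to_dataframe(records: List[Dict[str, Any]]):
    """Build a DataFrame from the flattened records (hypothetical helper)."""
    import pandas as pd  # lazy import: only this final step needs pandas

    # Fields missing from some records become NaN in the result.
    return pd.DataFrame.from_records(records)
```

Keeping the flattening step free of pandas means it can be unit-tested with simple stand-in objects, and the same records could later feed other output formats.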
| Indicative Schedule |
| Project Start Date: 29.11.2025 | Project End Date: 27.04.2026 |
| Solution Scope |
During implementation the developer will:
- Design the overall library structure using a modern Python package layout.
- Implement a credential-loading mechanism that securely reads Google service account keys and validates them.
- Implement high-level data extraction functions.
- Ensure compatibility with multiple Firestore projects, allowing users to specify project IDs on initialization.
- Add error handling and custom exception classes for clearer debugging and user-friendly messages.
- Implement example scripts that show the library in action.
- Prepare comprehensive documentation for the project.
- Write tests for the library's methods.
- Ensure the library follows clean code standards, including docstrings and modular design.
|
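As a rough illustration of the credential-loading and error-handling items above, the sketch below validates a service account key file before any client is created. The exception names (`CDEError`, `CredentialsError`), the function `load_service_account_info`, and the set of fields checked are assumptions made for this example, not a settled design.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Fields found in a standard Google service account key file.
REQUIRED_KEY_FIELDS = ("type", "project_id", "private_key", "client_email")


class CDEError(Exception):
    """Hypothetical base class for all library errors."""


class CredentialsError(CDEError):
    """Raised when a service account key file is missing or malformed."""


def load_service_account_info(path: str) -> Dict[str, Any]:
    """Read a key file and return its contents after basic validation."""
    key_path = Path(path)
    if not key_path.is_file():
        raise CredentialsError(f"Key file not found: {key_path}")
    try:
        info = json.loads(key_path.read_text(encoding="utf-8"))
    except ValueError as exc:  # covers invalid JSON and bad encoding
        raise CredentialsError(f"Key file is not valid JSON: {key_path}") from exc
    if not isinstance(info, dict):
        raise CredentialsError("Key file must contain a JSON object")
    # Report every missing or empty field at once for easier debugging.
    missing = [name for name in REQUIRED_KEY_FIELDS if not info.get(name)]
    if missing:
        raise CredentialsError(f"Key file is missing fields: {', '.join(missing)}")
    if info["type"] != "service_account":
        raise CredentialsError(
            f"Expected type 'service_account', got {info['type']!r}"
        )
    return info
```

The validated dictionary could then be passed to `google.oauth2.service_account.Credentials.from_service_account_info`, and the resulting credentials to `google.cloud.firestore.Client(project=..., credentials=...)`, where the `project` argument would cover the multi-project item in the list.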
| Project Assumptions |
- The user possesses valid Google Cloud Platform (GCP) credentials, specifically a service account JSON with Firestore read permissions.
- A Firestore database exists in the target GCP project and contains at least one collection.
- The user knows their GCP project ID and Firestore collection names.
- Internet access is available when using the library to connect to Firestore.
- The library will be run in a Python 3.9+ environment.
- Documents within a Firestore collection share a reasonably consistent set of fields and types.
- Each Firestore collection is small enough to fit into memory when converted to a DataFrame.
|
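Two of the assumptions above, consistent fields and in-memory size, can be checked cheaply before a DataFrame is built. The following is a minimal sketch, assuming documents have already been flattened into dictionaries. `take_bounded`, `find_missing_fields`, and `DatasetTooLargeError` are hypothetical names introduced for illustration.

```python
from collections import Counter
from itertools import islice
from typing import Any, Dict, Iterable, List

_SENTINEL = object()  # marks "no more items" when probing the iterator


class DatasetTooLargeError(Exception):
    """Hypothetical error for collections that exceed the configured cap."""


def take_bounded(items: Iterable[Any], max_items: int) -> List[Any]:
    """Materialize at most `max_items` items; raise if the source has more."""
    iterator = iter(items)
    taken = list(islice(iterator, max_items))
    # Pull one extra item to detect overflow without loading the rest.
    if next(iterator, _SENTINEL) is not _SENTINEL:
        raise DatasetTooLargeError(f"More than {max_items} documents returned")
    return taken


def find_missing_fields(records: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Count, per field, how many records lack it.

    Fields present in every record are omitted, so an empty result means
    all records share the same set of keys.
    """
    rows = list(records)
    seen: Counter = Counter()
    for row in rows:
        seen.update(row.keys())
    total = len(rows)
    return {name: total - count for name, count in seen.items() if count < total}
```

A caller might wrap the document stream in `take_bounded` with a cap chosen for the available memory, then log the output of `find_missing_fields` so that schema drift is visible before analysis begins.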
| Key Risks | Rating |
| Invalid or misconfigured GCP credentials provided by users | High |
| Insufficient Firestore service permissions | Medium |
| Network issues while using the library | Medium |
| Datasets too large to fit into memory | Medium |
| Firestore schema inconsistency | Medium |