This project is a Python-based ETL pipeline built for a data engineering scenario in which a multinational company needs up-to-date information on the world's 10 largest banks by market capitalization.
The project extracts data from a public web source, transforms it using exchange rate data, and stores the processed results in both CSV format and a relational database for querying by international stakeholders.
🧩 Business Scenario
The data engineer's task is to compile a list of the largest global banks ranked by market capitalization (in USD billions) and convert those values into GBP, EUR, and INR using the provided exchange rate data.
The processed data must be accessible locally as a CSV file and stored in a database table to allow managers from different countries to query market capitalization values in their local currency.

🔄 ETL Pipeline Overview
📥 Extract
Bank market capitalization data is extracted by scraping a structured HTML table from a public webpage using requests and BeautifulSoup.
The extracted data includes bank names and market capitalization values in USD, which are parsed and normalized into a pandas DataFrame.
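
The extraction step could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the column names (Name, MC_USD_Billion), and the assumed table layout (rank, bank name, market cap in the first three cells of each row) are hypothetical and would need adjusting to the real page.

```python
import pandas as pd
from bs4 import BeautifulSoup


def parse_banks_table(html: str) -> pd.DataFrame:
    """Parse the first HTML table into bank names and USD market caps.

    Assumes (hypothetically) that each data row holds rank, name and
    market cap in its first three <td> cells.
    """
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table")
    if table is None:
        raise ValueError("No <table> element found in the page")

    rows = []
    for tr in table.find_all("tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if len(cells) < 3:
            continue  # header rows use <th>, so they yield no <td> cells
        rows.append(
            {
                "Name": cells[1],
                # Strip thousands separators before converting to float
                "MC_USD_Billion": float(cells[2].replace(",", "")),
            }
        )
    return pd.DataFrame(rows, columns=["Name", "MC_USD_Billion"])


def extract(url: str) -> pd.DataFrame:
    """Download the page at `url` and parse its first table."""
    # Imported here so the parser above can be used without network access
    import requests

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return parse_banks_table(response.text)
```

Keeping the parsing separate from the download makes it possible to check the parser against a saved HTML snippet without hitting the network.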


🔧 Transform
The transformation phase enriches the dataset by converting USD market capitalization values into GBP, EUR, and INR, using exchange rate information loaded from an external CSV file.
All converted values are rounded to the same precision so that figures are consistent across currencies.
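
A minimal sketch of the conversion, assuming the exchange rate CSV has two columns named Currency and Rate and that values are rounded to two decimal places. Both the column names and the precision are assumptions for illustration, as are the function and output column names.

```python
import pandas as pd


def load_rates(csv_source) -> dict:
    """Read exchange rates into a {currency: rate} mapping.

    Assumes (hypothetically) a CSV with 'Currency' and 'Rate' columns.
    """
    rates = pd.read_csv(csv_source)
    return rates.set_index("Currency")["Rate"].to_dict()


def transform(df: pd.DataFrame, rates: dict) -> pd.DataFrame:
    """Add one market-cap column per currency, converted from USD."""
    out = df.copy()  # leave the extracted DataFrame untouched
    for currency, rate in rates.items():
        column = f"MC_{currency}_Billion"
        # Two decimal places is an assumed precision
        out[column] = (out["MC_USD_Billion"] * rate).round(2)
    return out
```

Because the loop iterates over whatever currencies the rates file contains, supporting another currency would only require adding a row to that file.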


📤 Load
The transformed dataset is:
- Saved locally as a CSV file for downstream usage
- Loaded into a SQLite database as a relational table to support structured queries and analytics
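
Both load targets can be written with pandas alone. The sketch below is an illustration under assumptions: the function names and the if_exists="replace" policy (overwrite the table on each run) are choices made for this example and are not confirmed by the project.

```python
import sqlite3

import pandas as pd


def load_to_csv(df: pd.DataFrame, path) -> None:
    """Write the transformed data to a CSV file without the index column."""
    df.to_csv(path, index=False)


def load_to_db(df: pd.DataFrame, conn: sqlite3.Connection, table_name: str) -> None:
    """Write the transformed data to a SQLite table.

    if_exists="replace" drops and recreates the table on every run,
    which is an assumed policy suited to a full refresh.
    """
    df.to_sql(table_name, conn, if_exists="replace", index=False)
```

A typical call would open the connection with sqlite3.connect("some_file.db"), pass it to load_to_db, and close it once querying is finished.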


🗄 Database & Querying
The project creates a local SQLite database and stores the processed data in a dedicated table.
Multiple SQL queries are executed to validate and explore the dataset, including:
- Retrieving all records
- Calculating average market capitalization values
- Sorting banks by market capitalization in different currencies
- Limiting result sets for reporting purposes
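
The query types listed above might look like the following. The table name (Largest_banks), the column names, and the run_query helper are hypothetical and shown only to illustrate the pattern.

```python
import sqlite3


def run_query(conn: sqlite3.Connection, query: str):
    """Execute a read-only query and return (column_names, rows)."""
    cursor = conn.execute(query)
    columns = [description[0] for description in cursor.description]
    return columns, cursor.fetchall()


# Example statements; table and column names are assumptions
QUERIES = {
    "all_records": "SELECT * FROM Largest_banks",
    "average_gbp": "SELECT AVG(MC_GBP_Billion) FROM Largest_banks",
    "top_5_by_usd": (
        "SELECT Name, MC_USD_Billion FROM Largest_banks "
        "ORDER BY MC_USD_Billion DESC LIMIT 5"
    ),
}


def run_all(conn: sqlite3.Connection) -> None:
    """Print each query followed by its result rows."""
    for label, query in QUERIES.items():
        columns, rows = run_query(conn, query)
        print(f"-- {label}: {query}")
        print(columns)
        for row in rows:
            print(row)
```

Printing the statement next to its result keeps the console output self-explanatory when several queries run back to back.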

📝 Logging & Monitoring
A logging mechanism records the progress of each ETL stage (extraction, transformation, loading, and querying) with timestamps, providing traceability and easier debugging of the pipeline execution.
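
One simple way to get timestamped progress entries is to append lines to a text file. The sketch below assumes that approach; the function name, default file name, and timestamp format are illustrative choices, and Python's standard logging module would be a reasonable alternative.

```python
from datetime import datetime


def log_progress(message: str, log_file: str = "etl_log.txt") -> None:
    """Append a timestamped progress message to the log file."""
    timestamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    # Append mode keeps entries from earlier runs
    with open(log_file, "a", encoding="utf-8") as handle:
        handle.write(f"{timestamp} : {message}\n")
```

Calling log_progress at the start and end of each stage (for example, log_progress("Extraction complete")) produces a chronological trace of the run.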

