Junior Data Engineer - Data Extraction & Pipelines
Role Title:
Junior Data Engineer – Data Extraction & ETL Pipelines
Specialization:
Web Scraping, API Integration, ETL Pipelines
Experience Level:
0–1 years (Open to fresh graduates and early-career professionals)
Location:
Pinnacle Business Park, Andheri East, Mumbai
Shift:
9 hours
Role Overview
The Junior Data Engineer will support the development and maintenance of automated data extraction and ingestion pipelines across multiple e-commerce and quick-commerce marketplaces such as Amazon, Flipkart, Myntra, Nykaa, Zepto, and Swiggy Instamart.
This role focuses on hands-on implementation: learning web scraping, API integrations, and ETL pipelines, and ensuring data quality before data is loaded into PostgreSQL and Delta Lake. The candidate will work closely with senior engineers to understand scalable pipeline design and production-grade data workflows.
Key Responsibilities
Data Extraction & Ingestion
· Assist in building and maintaining automated data extraction pipelines using REST APIs, Selenium-based browser automation, HTML parsing with XPath / CSS selectors, and request-based scraping (headers, cookies, tokens).
· Extract structured and semi-structured data from multiple marketplaces.
ETL Pipeline Development
· Support reusable ETL pipelines that serve multiple brands and multiple marketplaces.
· Perform data transformation, normalization, and enrichment using Python.
· Load processed data into PostgreSQL and Delta Lake under guidance.
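To give a sense of what transformation and normalization can look like in practice, here is a minimal Python sketch. The field names (`sku`, `title`, `price`) and the function `normalize_record` are hypothetical and chosen only for illustration; actual schemas and rules are defined by the team.

```python
import re
from decimal import Decimal


def normalize_record(raw, marketplace):
    """Map one raw scraped record onto a common schema (hypothetical fields)."""
    # Pull the first numeric token out of text such as "₹1,299.00" or "Rs. 1,299".
    match = re.search(r"\d[\d,]*(?:\.\d+)?", raw.get("price") or "")
    price = Decimal(match.group().replace(",", "")) if match else None
    return {
        "marketplace": marketplace.strip().lower(),
        "sku": raw["sku"].strip().upper(),
        # Collapse repeated whitespace inside the title.
        "title": " ".join((raw.get("title") or "").split()),
        "price": price,
    }
```

Using `Decimal` rather than `float` keeps prices exact, and returning `None` for an unparseable price leaves the decision about bad rows to the validation stage.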
Data Validation & Quality Checks
· Implement basic SQL-based validation checks.
· Assist in data hygiene checks using PostgreSQL and Trino.
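As an illustration of a basic SQL-based validation check, the sketch below counts rows that break simple rules. It is written under stated assumptions: the table `product_prices` and its columns are hypothetical, and Python's built-in `sqlite3` stands in for PostgreSQL so the example runs without a database server.

```python
import sqlite3

# Each query counts rows that violate one rule (hypothetical table and columns).
VALIDATION_CHECKS = {
    "missing_price": "SELECT COUNT(*) FROM product_prices WHERE price IS NULL",
    "negative_price": "SELECT COUNT(*) FROM product_prices WHERE price < 0",
    "duplicate_sku_per_marketplace": """
        SELECT COUNT(*) FROM (
            SELECT marketplace, sku
            FROM product_prices
            GROUP BY marketplace, sku
            HAVING COUNT(*) > 1
        ) AS dupes
    """,
}


def run_validation_checks(conn):
    """Return {check_name: number_of_violations} for every check."""
    return {
        name: conn.execute(query).fetchone()[0]
        for name, query in VALIDATION_CHECKS.items()
    }


def build_sample_db():
    """Create an in-memory table with one violation of each rule."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE product_prices (marketplace TEXT, sku TEXT, price REAL)"
    )
    conn.executemany(
        "INSERT INTO product_prices VALUES (?, ?, ?)",
        [
            ("amazon", "SKU-1", 499.0),
            ("amazon", "SKU-1", 499.0),   # duplicate key
            ("flipkart", "SKU-2", None),  # missing price
            ("nykaa", "SKU-3", -10.0),    # negative price
        ],
    )
    return conn
```

Calling `run_validation_checks(build_sample_db())` reports one violation per rule. A pipeline would typically log these counts and stop the load when any count is non-zero.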
Automation & Distributed Workflows
· Monitor pipeline execution and assist in failure analysis.
· Handle retries, logging, and basic error-handling mechanisms.
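A minimal sketch of the retry, logging, and error-handling pattern referred to above, assuming exponential backoff between attempts. The helper `run_with_retries` and its parameters are hypothetical, not part of any existing codebase.

```python
import logging
import time

logger = logging.getLogger("pipeline")


def run_with_retries(task, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call task(); on failure, log the error and retry with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as exc:
            logger.warning("Attempt %d/%d failed: %s", attempt, max_attempts, exc)
            if attempt == max_attempts:
                logger.error("Giving up after %d attempts", max_attempts)
                raise  # let the caller (or scheduler) see the original error
            # Wait 1x, 2x, 4x ... base_delay before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

The `sleep` parameter is injectable so the function can be tested without real waiting. Re-raising after the final attempt keeps the full traceback available for failure analysis.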
Debugging & Support
· Troubleshoot issues related to scraping failures, API response errors, and data mismatches.
· Escalate complex issues to senior engineers with proper logs and analysis.
Required Skills & Technical Knowledge
Must-Have Skills
· Basic to intermediate Python (scripts, functions, error handling)
· Working knowledge of SQL (SELECT, JOIN, GROUP BY, WHERE)
· Familiarity with Selenium, the Requests library, and XPath / HTML parsing
· Understanding of API concepts (authentication, pagination, rate limits)
· Exposure to ETL concepts and data pipelines
· Basic understanding of PostgreSQL, Trino SQL, and Delta Lake
· Experience with logging, exception handling, and debugging scripts
· Awareness of data quality and validation concepts
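For reference, the API concepts of pagination and rate limits listed above can be sketched as follows. This assumes a cursor-style API; `fetch_page` is a hypothetical callable that returns a page of items plus the cursor for the next page (or `None` when there are no more pages), and real marketplace APIs may paginate differently.

```python
import time


def fetch_all_items(fetch_page, min_interval=0.5, sleep=time.sleep):
    """Collect items from every page; fetch_page(cursor) -> (items, next_cursor)."""
    items = []
    cursor = None  # None requests the first page
    while True:
        page_items, cursor = fetch_page(cursor)
        items.extend(page_items)
        if cursor is None:
            break
        # Crude client-side rate limiting: pause between consecutive requests.
        sleep(min_interval)
    return items
```

In a real pipeline `fetch_page` would wrap an authenticated HTTP request, and the pause would be tuned to the limits published by the API provider.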