5. Data Structures
- Lists, Tuples, Sets, Dictionaries
- CRUD operations on each data structure
- Iterating through collections
- Common built-in functions (len, sum, sorted, zip, etc.)
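A minimal Python sketch of the collection operations listed above (variable names and values are illustrative):

```python
# CRUD and iteration over the core Python collections (illustrative values).
orders = [250, 120, 310]             # list: create
orders.append(75)                    # add
orders[0] = 260                      # update
orders.remove(120)                   # delete

point = (40.7, -74.0)                # tuple: immutable, read-only access
regions = {"us-east", "us-west"}     # set: unique members only
regions.add("eu-west")

prices = {"sku1": 9.99, "sku2": 4.5} # dict: key/value store
prices["sku3"] = 12.0                # create/update
del prices["sku2"]                   # delete

for sku, price in prices.items():    # iterating a dictionary
    print(sku, price)

# Common built-ins: len, sum, sorted, zip
print(len(orders), sum(orders), sorted(orders))
print(list(zip(prices.keys(), prices.values())))
```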
6. String and Date Handling
- String Manipulation and Formatting
- split(), join(), slicing, and regex intro (re module)
- Introduction to datetime and time modules (for partition/date-based transformations)
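A short sketch of the string and date techniques above in a typical date-partitioned transformation; the record format is made up for illustration:

```python
import re
from datetime import date, datetime, timedelta

raw = "2024-01-15|store_042|1250.00"
event_date, store, amount = raw.split("|")        # split on a delimiter
print("-".join(event_date.split("-")[:2]))        # join + slicing -> "2024-01"

# Basic regex with the re module: pull the digits out of the store code
store_id = re.search(r"\d+", store).group()

# datetime for partition/date-based transformations
run_date = datetime.strptime(event_date, "%Y-%m-%d").date()
partition = run_date.strftime("year=%Y/month=%m/day=%d")
yesterday = run_date - timedelta(days=1)
print(store_id, partition, yesterday, date.today())
```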
7. Exception Handling
- Try-Except Blocks
- Catching Specific Exceptions
- finally and else in error handling
- Importance in ETL pipeline robustness
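A small sketch of try/except/else/finally around a file-parsing step, the kind of guard an ETL task needs so it fails loudly on bad data but always releases resources (file and function names are illustrative):

```python
def parse_amount(text: str) -> float:
    try:
        value = float(text)
    except ValueError as exc:                 # catch a specific exception
        raise ValueError(f"bad amount: {text!r}") from exc
    else:                                     # runs only when no exception was raised
        return value

def load_file(path: str) -> list:
    try:
        handle = open(path)
    except FileNotFoundError:                 # expected failure: log and move on
        print(f"skipping missing file: {path}")
        return []
    try:
        return [parse_amount(line) for line in handle if line.strip()]
    finally:                                  # always runs, even if a row is bad
        handle.close()
```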
8. Intro to OOP (Optional but Useful)
- Classes and Objects
- Constructors (__init__)
- self keyword
- Simple inheritance and method overriding
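A compact sketch of the OOP concepts above using a hypothetical extractor class hierarchy:

```python
class Extractor:
    """Base class: __init__ is the constructor, self carries per-instance state."""

    def __init__(self, source: str):
        self.source = source

    def extract(self) -> list:
        return [{"source": self.source, "value": 1}]


class CsvExtractor(Extractor):                # simple inheritance
    def __init__(self, source: str, delimiter: str = ","):
        super().__init__(source)
        self.delimiter = delimiter

    def extract(self) -> list:                # method overriding
        rows = super().extract()
        for row in rows:
            row["delimiter"] = self.delimiter
        return rows


print(CsvExtractor("orders.csv").extract())
```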
Data Warehouse
1. Introduction to Data Warehousing
- What is Data Warehousing?
- OLTP vs OLAP
- Data Warehouse Architecture (Single-tier, Two-tier, Three-tier)
- Components of a Data Warehouse
- ETL vs ELT in Data Warehousing
2. Data Modeling Fundamentals
- What is Data Modeling?
- Conceptual, Logical, and Physical Data Models
- Key Data Modeling Concepts: Entities, Attributes, Relationships
- Primary Keys, Foreign Keys, and Constraints
- Normalization & Denormalization
- Choosing the Right Model for Analytical Workloads
3. Dimensional Modeling & Star Schema
- Introduction to Dimensional Modeling
- Fact Tables vs Dimension Tables
- Star Schema: Concepts & Design
- Snowflake Schema: When to Use It?
- Slowly Changing Dimensions (SCD) (Types 0, 1, 2, 3, 4, 6)
- Handling Hierarchies & Aggregations
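A toy star schema sketched with Python's built-in sqlite3 module: one fact table referencing two dimensions, with SCD Type 2 columns on the product dimension showing one common way to track history. Table and column names are illustrative only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date (
    date_key   INTEGER PRIMARY KEY,
    full_date  TEXT,
    year       INTEGER,
    month      INTEGER
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT,
    -- SCD Type 2: track history with validity dates and a current-row flag
    valid_from   TEXT,
    valid_to     TEXT,
    is_current   INTEGER
);
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    amount      REAL
);
""")
```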
4. ETL & Data Integration in Data Warehousing
- Overview of ETL & ELT Processes
- Common ETL Challenges & Solutions
- Data Quality & Data Governance in ETL
- Change Data Capture (CDC) Strategies
5. Modern Data Warehousing
- Traditional Data Warehouses vs Cloud Data Warehouses
- Introduction to Data Lakes & Data Lakehouses
- Overview of Modern DW Platforms: Snowflake, BigQuery, Redshift, Synapse
PySpark
1. Introduction to PySpark
- What is PySpark?
- PySpark vs Pandas vs Dask
- PySpark Architecture & Execution Model
- Setting up PySpark in Google Colab
- Introduction to SparkSession & DataFrames
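A minimal SparkSession and DataFrame sketch; it assumes PySpark is already installed (e.g. `pip install pyspark` in Colab) and runs in local mode:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("intro")
    .master("local[*]")   # local mode; on a cluster the master comes from the environment
    .getOrCreate()
)

df = spark.createDataFrame(
    [("alice", 34), ("bob", 29)],
    ["name", "age"],
)
df.printSchema()
df.show()
```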
2. Data Loading & Basic Transformations in PySpark
- Reading & Writing Data (CSV, JSON, Parquet, Avro)
- Understanding Schema Inference & Defining Schemas
- Basic Transformations: select(), filter(), withColumn(), drop()
- Handling Nulls & Missing Data (fillna(), dropna(), replace())
- Column Operations: cast(), alias(), when(), otherwise()
- Working with Date & Time Functions (current_date(), datediff(), date_add())
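A sketch of the loading and basic transformation steps above; the file path and column names are illustrative:

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DateType

spark = SparkSession.builder.appName("load-transform").getOrCreate()

# Define the schema explicitly instead of relying on inference
schema = StructType([
    StructField("order_id", StringType()),
    StructField("amount", StringType()),
    StructField("order_date", DateType()),
])
df = spark.read.csv("orders.csv", header=True, schema=schema)

df = (
    df.select("order_id", "amount", "order_date")
      .filter(F.col("amount").isNotNull())
      .withColumn("amount", F.col("amount").cast("double"))
      .withColumn("status", F.when(F.col("amount") > 100, "big").otherwise("small"))
      .withColumn("age_days", F.datediff(F.current_date(), F.col("order_date")))
      .fillna({"order_id": "unknown"})
      .drop("order_date")
)
df.show()
```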
3. Advanced PySpark Transformations
- Grouping & Aggregations (groupBy(), agg(), pivot())
- Joins in PySpark (inner, left, right, full)
- Window Functions (Row Number, Ranking, Lead/Lag, Running Totals)
- Exploding & Flattening Nested Data (explode(), array(), struct())
- Working with UDFs (User-Defined Functions)
- Broadcasting & Skew Handling
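A sketch covering a subset of the transformations above (aggregation, broadcast join, window functions, explode); UDFs and pivot are omitted for brevity, and all data is made up:

```python
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("advanced").getOrCreate()

sales = spark.createDataFrame(
    [("us", "a", 10.0), ("us", "b", 5.0), ("eu", "a", 7.0)],
    ["region", "product", "amount"],
)
products = spark.createDataFrame([("a", "widget"), ("b", "gadget")], ["product", "name"])

# Grouping and aggregation
totals = sales.groupBy("region").agg(F.sum("amount").alias("total"))

# Broadcast join: ships the small dimension table to every executor
joined = sales.join(F.broadcast(products), on="product", how="left")

# Window functions: ranking and a running total per region
w = Window.partitionBy("region").orderBy(F.col("amount").desc())
ranked = (
    joined.withColumn("rank", F.row_number().over(w))
          .withColumn("running_total",
                      F.sum("amount").over(w.rowsBetween(Window.unboundedPreceding,
                                                         Window.currentRow)))
)

# Exploding nested/array data into one row per element
nested = spark.createDataFrame([("o1", ["x", "y"])], ["order_id", "items"])
flat = nested.withColumn("item", F.explode("items"))

ranked.show()
flat.show()
```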
4. Performance Optimization & Debugging in PySpark
- Understanding Spark Execution Plan (explain(), cache(), persist())
- Catalyst Optimizer & Tungsten Execution
- Partitioning & Bucketing Strategies
- Repartitioning & Coalescing
- Optimizing Shuffle Operations
- Performance Tuning Parameters (spark.conf.set())
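A small tuning sketch tying these topics together; the configuration values are illustrative starting points, not recommendations:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("tuning").getOrCreate()

# Common tuning parameters set at runtime
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(10 * 1024 * 1024))

df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
df = df.repartition(8, "bucket")   # repartition by the grouping key before the shuffle-heavy stage
df.cache()                         # keep the hot dataset in memory across actions

agg = df.groupBy("bucket").count()
agg.explain()                      # inspect the physical plan produced by Catalyst

# Coalesce before writing to avoid producing many tiny output files
agg.coalesce(1).write.mode("overwrite").parquet("/tmp/bucket_counts")
```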
PySpark Assignment Problem Statement 1 – Hands-On Coding
PySpark Assignment Problem Statement 2 – Hands-On Coding
Capstone Project 1 – Complex PySpark Transformation – Hands-On Coding
Amazon Web Services (AWS)
1. AWS Setup & Fundamentals
- Setting up AWS Account and Configuring IAM Roles & Policies
- Creating S3 Buckets, Uploading Data, and Configuring Permissions
- Implementing IAM Best Practices for Secure Data Access
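A boto3 sketch of the S3 setup steps above; it assumes credentials are already configured (via `aws configure` or an IAM role), and the bucket, region, and file names are placeholders:

```python
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# Create a bucket and land a raw file under a prefix
s3.create_bucket(Bucket="my-demo-raw-bucket")
s3.upload_file("orders.csv", "my-demo-raw-bucket", "raw/orders/orders.csv")

# Verify what landed under the prefix
for obj in s3.list_objects_v2(Bucket="my-demo-raw-bucket", Prefix="raw/").get("Contents", []):
    print(obj["Key"], obj["Size"])
```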
2. AWS Glue – Data Catalog & Crawler
- Setting Up AWS Glue Crawler to Discover Metadata
- Creating and Querying AWS Glue Catalog Tables
- Schema Evolution & Handling Semi-Structured Data (JSON, Parquet)
- Integrating Glue Catalog with Athena & Redshift Spectrum
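A sketch of creating and running a Glue crawler with boto3; the role ARN, database name, and S3 path are placeholders for your own resources:

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

glue.create_crawler(
    Name="orders-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_db",
    Targets={"S3Targets": [{"Path": "s3://my-demo-raw-bucket/raw/orders/"}]},
)
glue.start_crawler(Name="orders-crawler")

# Once the crawler finishes, the discovered tables appear in the Glue Data Catalog
for table in glue.get_tables(DatabaseName="raw_db")["TableList"]:
    print(table["Name"], table["StorageDescriptor"]["Location"])
```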
3. AWS Athena – Querying Data Lake
- Writing SQL Queries on S3 Data Using Athena
- Optimizing Queries with Partitioning & Bucketing
- Using Iceberg Tables in Athena for Time-Travel Queries
- Performance Optimization: Query Federation & Compression Techniques
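A sketch of running an Athena query from Python with boto3; the database, table, partition column, and output location are placeholders:

```python
import time
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
    SELECT order_date, COUNT(*) AS orders
    FROM raw_db.orders
    WHERE year = '2024'            -- partition pruning keeps the scan small
    GROUP BY order_date
"""
run = athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "raw_db"},
    ResultConfiguration={"OutputLocation": "s3://my-demo-raw-bucket/athena-results/"},
)

# Poll until the query finishes, then fetch the result rows
while True:
    state = athena.get_query_execution(
        QueryExecutionId=run["QueryExecutionId"]
    )["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(
        QueryExecutionId=run["QueryExecutionId"]
    )["ResultSet"]["Rows"]
    print(rows[:3])
```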
4. AWS Glue PySpark – Data Transformation
- Setting Up AWS Glue Job with PySpark
- Transforming & Cleaning Raw Data Using PySpark in Glue
- Handling Schema Drift in Glue ETL Pipelines
- Writing Processed Data to S3, Redshift, and RDS
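A skeleton of a Glue PySpark job script; the awsglue library is only available inside the Glue job runtime, and the database, table, and output path are placeholders:

```python
import sys
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read through the Glue Data Catalog, then clean with plain PySpark
dyf = glue_context.create_dynamic_frame.from_catalog(database="raw_db", table_name="orders")
df = (dyf.toDF()
         .dropna(subset=["order_id"])
         .withColumn("amount", F.col("amount").cast("double")))

# Write the processed data back to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glue_context, "clean_orders"),
    connection_type="s3",
    connection_options={"path": "s3://my-demo-curated-bucket/orders/"},
    format="parquet",
)
job.commit()
```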
5. Real-Time Data Ingestion Using AWS Glue & REST API
- Configuring AWS Glue Job to Ingest Data from REST API
- Using AWS Lambda to Trigger Glue Jobs on Event Streams
- Handling Real-Time Data Streams in PySpark
- Writing Ingested Data to Iceberg Tables in Athena
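A sketch of the "pull from a REST API, land in S3, trigger Glue" half of this pattern; the endpoint URL, bucket, key, and job name are placeholders:

```python
import json
import boto3
import requests

s3 = boto3.client("s3")

# Pull a batch of events from the (hypothetical) API
response = requests.get("https://api.example.com/v1/events", timeout=30)
response.raise_for_status()
events = response.json()

# Land the raw payload in S3
s3.put_object(
    Bucket="my-demo-raw-bucket",
    Key="raw/events/batch.json",
    Body=json.dumps(events).encode("utf-8"),
)

# A Lambda handler can wrap the same logic and then kick off the Glue job downstream
def handler(event, context):
    boto3.client("glue").start_job_run(JobName="events-ingest-job")
```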
6. AWS Redshift – Data Warehousing
- Setting Up an Amazon Redshift Cluster
- Loading Data from S3 to Redshift Using COPY Command
- Performance Tuning with Sort & Distribution Keys
- Running Complex Analytical Queries in Redshift
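A sketch of loading S3 data into Redshift with COPY and querying it; it assumes the redshift_connector driver (psycopg2 works similarly), and the cluster endpoint, credentials, IAM role ARN, and table names are placeholders:

```python
import redshift_connector

conn = redshift_connector.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="********",
)
cur = conn.cursor()

# Bulk-load Parquet files from S3 with the COPY command
cur.execute("""
    COPY analytics.fact_sales
    FROM 's3://my-demo-curated-bucket/orders/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS PARQUET;
""")
conn.commit()

# A simple analytical query against the loaded table
cur.execute("SELECT date_key, SUM(amount) FROM analytics.fact_sales GROUP BY date_key LIMIT 10;")
print(cur.fetchall())
```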
7. AWS CloudFormation – Infrastructure as Code
- Creating S3, IAM Roles, Glue Jobs, and Redshift Using CloudFormation
- Automating Data Pipeline Deployment Using CloudFormation Templates
- Managing Stack Updates & Rollbacks
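A minimal sketch of deploying a stack from Python with boto3; the inlined template provisions a single S3 bucket, while a real pipeline stack would also declare IAM roles, Glue jobs, and a Redshift cluster. Stack and bucket names are placeholders.

```python
import boto3

template = """
AWSTemplateFormatVersion: '2010-09-09'
Resources:
  CuratedBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: my-demo-curated-bucket
"""

cfn = boto3.client("cloudformation", region_name="us-east-1")
cfn.create_stack(
    StackName="data-pipeline-demo",
    TemplateBody=template,
    Capabilities=["CAPABILITY_NAMED_IAM"],   # required once the stack creates IAM roles
)
# Block until the stack finishes creating
cfn.get_waiter("stack_create_complete").wait(StackName="data-pipeline-demo")
```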
Athena Assignment Problem Statement 1 – Hands-On Coding
Redshift Assignment Problem Statement 2 – Hands-On Coding
Glue PySpark Assignment Problem Statement 3 – Hands-On Coding
Final Capstone Project 2 – End-to-End Data Engineering Pipeline