Microsoft Certified: Azure Data Engineer Associate

Skills measured

  • Design and implement data storage (40–45%)
  • Design and develop data processing (25–30%)
  • Design and implement data security (10–15%)
  • Monitor and optimize data storage and data processing (10–15%)

COURSE OUTLINE

Design and implement data storage (40–45%)
Design a data storage structure

  • Design an Azure Data Lake solution
  • Recommend file types for storage
  • Recommend file types for analytical queries
  • Design for efficient querying
  • Design for data pruning
  • Design a folder structure that represents the levels of data transformation (see the sketch after this list)
  • Design a distribution strategy
  • Design a data archiving solution
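
A minimal sketch of the folder-structure item above, assuming a lakehouse-style convention (raw → enriched → curated) on a hypothetical ADLS Gen2 account named mydatalake; every name here is a placeholder, not a prescribed layout:

```python
from datetime import date

ACCOUNT = "mydatalake"                  # hypothetical storage account
ZONES = ("raw", "enriched", "curated")  # levels of data transformation

def zone_path(zone: str, dataset: str, run_date: date) -> str:
    """Build an abfss:// path partitioned by ingestion date."""
    assert zone in ZONES
    return (
        f"abfss://{zone}@{ACCOUNT}.dfs.core.windows.net/{dataset}/"
        f"year={run_date.year}/month={run_date.month:02d}/day={run_date.day:02d}"
    )

print(zone_path("raw", "sales_orders", date(2023, 6, 1)))
# abfss://raw@mydatalake.dfs.core.windows.net/sales_orders/year=2023/month=06/day=01
```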

Design a partition strategy
  • Design a partition strategy for files (sketched after this list)
  • Design a partition strategy for analytical workloads
  • Design a partition strategy for efficiency/performance
  • Design a partition strategy for Azure Synapse Analytics
  • Identify when partitioning is needed in Azure Data Lake Storage Gen2
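
One way to realize the file-partitioning items above is date-based Hive-style partitioning with PySpark; a sketch assuming an existing Spark session and placeholder ADLS Gen2 paths:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = (
    spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/events/")
    .withColumn("event_date", F.to_date("event_ts"))
    .withColumn("year", F.year("event_date"))
    .withColumn("month", F.month("event_date"))
)

# Partitioning on low-cardinality date columns lets analytical queries that
# filter on a date range prune whole folders instead of scanning every file.
(events.write
    .mode("overwrite")
    .partitionBy("year", "month")
    .parquet("abfss://enriched@mydatalake.dfs.core.windows.net/events/"))
```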

Design the serving layer
  • Design star schemas (sketched after this list)
  • Design slowly changing dimensions
  • Design a dimensional hierarchy
  • Design a solution for temporal data
  • Design for incremental loading
  • Design analytical stores
  • Design metastores in Azure Synapse Analytics and Azure Databricks
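
A minimal sketch of the star-schema item, deriving one dimension and one fact table with PySpark; the source dataset and column names are illustrative assumptions:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
orders = spark.read.parquet(
    "abfss://enriched@mydatalake.dfs.core.windows.net/orders/")

# Dimension: one row per customer, with a generated surrogate key.
dim_customer = (
    orders.select("customer_id", "customer_name", "customer_city")
    .dropDuplicates(["customer_id"])
    .withColumn("customer_sk", F.monotonically_increasing_id())
)

# Fact: measures plus a foreign key into the dimension.
fact_sales = (
    orders.join(dim_customer.select("customer_id", "customer_sk"), "customer_id")
    .select("order_id", "customer_sk", "order_date", "quantity", "amount")
)

dim_customer.write.mode("overwrite").parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/dim_customer/")
fact_sales.write.mode("overwrite").parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/fact_sales/")
```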

Implement physical data storage structures
  • Implement compression
  • Implement partitioning
  • Implement sharding
  • Implement different table geometries with Azure Synapse Analytics pools
  • Implement data redundancy
  • Implement distributions
  • Implement data archiving
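
The distribution and table-geometry items above can be made concrete with dedicated SQL pool DDL. A sketch issuing that DDL from Python via pyodbc; the server, database, and table are placeholders:

```python
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myworkspace.sql.azuresynapse.net;Database=sales;"
    "Authentication=ActiveDirectoryInteractive;"
)
cur = conn.cursor()
# HASH distribution co-locates rows that join on CustomerSk across the pool's
# 60 distributions; a clustered columnstore index gives compressed,
# scan-friendly storage for large fact tables.
cur.execute("""
CREATE TABLE dbo.FactSales
(
    OrderId    BIGINT         NOT NULL,
    CustomerSk BIGINT         NOT NULL,
    OrderDate  DATE           NOT NULL,
    Amount     DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(CustomerSk),
    CLUSTERED COLUMNSTORE INDEX
);
""")
conn.commit()
cur.close()
```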

Implement logical data structures
  • Build a temporal data solution
  • Build a slowly changing dimension (a Type 2 sketch follows this list)
  • Build a logical folder structure
  • Build external tables
  • Implement file and folder structures for efficient querying and data pruning
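
A sketch of the Type 2 slowly-changing-dimension build referenced above, in PySpark, assuming the dimension carries exactly customer_id, customer_name, customer_city, start_date, end_date, and is_current; handling of brand-new keys is omitted for brevity:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
dim = spark.read.parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/dim_customer/")
updates = spark.read.parquet(
    "abfss://enriched@mydatalake.dfs.core.windows.net/customer_updates/")

history = dim.where(~F.col("is_current"))
current = dim.where(F.col("is_current"))

# Current rows whose tracked attribute changed in the incoming feed.
changed = (
    current.alias("d")
    .join(updates.alias("s"), "customer_id")
    .where(F.col("d.customer_city") != F.col("s.customer_city"))
)
today = F.current_date()

# 1) Expire the outdated versions.
expired = (
    changed.select("d.*")
    .withColumn("is_current", F.lit(False))
    .withColumn("end_date", today)
)
# 2) Open new versions carrying the changed attribute.
opened = (
    changed.select("customer_id", "s.customer_name", "s.customer_city")
    .withColumn("start_date", today)
    .withColumn("end_date", F.lit(None).cast("date"))
    .withColumn("is_current", F.lit(True))
)
untouched = current.join(changed.select("customer_id"), "customer_id", "left_anti")

(history.unionByName(untouched).unionByName(expired).unionByName(opened)
    .write.mode("overwrite")
    .parquet("abfss://curated@mydatalake.dfs.core.windows.net/dim_customer_v2/"))
```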

Implement the serving layer
  • Deliver data in a relational star schema
  • Deliver data in Parquet files
  • Maintain metadata
  • Implement a dimensional hierarchy
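
For the metadata and serving items above, one approach is to register curated Parquet output as an external (unmanaged) table in the Spark metastore; the database, table, and path are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

spark.sql("CREATE DATABASE IF NOT EXISTS gold")
spark.sql("""
    CREATE TABLE IF NOT EXISTS gold.fact_sales
    USING PARQUET
    LOCATION 'abfss://curated@mydatalake.dfs.core.windows.net/fact_sales/'
""")

# Consumers can now query by name; only metadata lives in the metastore.
spark.sql("SELECT COUNT(*) AS row_count FROM gold.fact_sales").show()
```
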
Design and develop data processing (25–30%)
Ingest and transform data
  • Transform data by using Apache Spark
  • Transform data by using Transact-SQL
  • Transform data by using Data Factory
  • Transform data by using Azure Synapse Pipelines
  • Transform data by using Stream Analytics
  • Cleanse data
  • Split data
  • Shred JSON
  • Encode and decode data
  • Configure error handling for the transformation
  • Normalize and denormalize values
  • Transform data by using Scala
  • Perform data exploratory analysis
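
A minimal sketch of the JSON-shredding and cleansing items above with PySpark; the input schema (an items array and a nested customer struct) is an assumption:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
raw = spark.read.json("abfss://raw@mydatalake.dfs.core.windows.net/orders_json/")

flat = (
    raw.select("order_id", "customer.email", F.explode("items").alias("item"))
    .select(
        "order_id",
        F.lower(F.trim("email")).alias("email"),     # cleanse
        F.col("item.sku").alias("sku"),
        F.col("item.qty").cast("int").alias("qty"),  # enforce types
    )
    .dropDuplicates(["order_id", "sku"])             # handle duplicate data
    .na.drop(subset=["sku"])                         # handle missing data
)
flat.write.mode("overwrite").parquet(
    "abfss://enriched@mydatalake.dfs.core.windows.net/order_items/")
```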

Design and develop a batch processing solution
  • Develop batch processing solutions by using Data Factory, Data Lake, Spark, Azure Synapse Pipelines, PolyBase, and Azure Databricks
  • Create data pipelines
  • Design and implement incremental data loads
  • Design and develop slowly changing dimensions
  • Handle security and compliance requirements
  • Scale resources
  • Configure the batch size
  • Design and create tests for data pipelines
  • Integrate Jupyter/Python notebooks into a data pipeline
  • Handle duplicate data
  • Handle missing data
  • Handle late-arriving data
  • Upsert data
  • Regress to a previous state
  • Design and configure exception handling
  • Configure batch retention
  • Design a batch processing solution
  • Debug Spark jobs by using the Spark UI
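
One common incremental-load pattern from the list above is a high-watermark batch load; a PySpark sketch with placeholder paths and a hypothetical modified_ts column:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
WATERMARK = "abfss://curated@mydatalake.dfs.core.windows.net/_watermarks/orders/"

try:
    last_loaded = spark.read.parquet(WATERMARK).first()["max_ts"]
except Exception:  # first run: no watermark written yet
    last_loaded = "1900-01-01 00:00:00"

source = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/orders/")
new_rows = source.where(F.col("modified_ts") > F.lit(last_loaded))

if new_rows.take(1):  # only write when something actually changed
    new_rows.write.mode("append").parquet(
        "abfss://enriched@mydatalake.dfs.core.windows.net/orders/")
    # Persist the new high watermark for the next run.
    (new_rows.agg(F.max("modified_ts").alias("max_ts"))
        .write.mode("overwrite").parquet(WATERMARK))
```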

Design and develop a stream processing solution
  • Develop a stream processing solution by using Stream Analytics, Azure Databricks, and Azure Event Hubs
  • Process data by using Spark structured streaming
  • Monitor for performance and functional regressions
  • Design and create windowed aggregates
  • Handle schema drift
  • Process time series data
  • Process across partitions
  • Process within one partition
  • Configure checkpoints/watermarking during processing
  • Scale resources
  • Design and create tests for data pipelines
  • Optimize pipelines for analytical or transactional purposes
  • Handle interruptions
  • Design and configure exception handling
  • Upsert data
  • Replay archived stream data
  • Design a stream processing solution
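
A sketch of a windowed aggregate with watermarking and checkpointing in Spark Structured Streaming; the built-in rate source stands in for Event Hubs so the example is self-contained:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# The rate source emits (timestamp, value) rows; swap in Event Hubs or Kafka
# in a real pipeline.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

counts = (
    events
    .withWatermark("timestamp", "2 minutes")  # tolerate late-arriving data
    .groupBy(F.window("timestamp", "1 minute"),
             (F.col("value") % 10).alias("bucket"))
    .count()
)

query = (
    counts.writeStream
    .outputMode("append")  # emit each window once, after its watermark passes
    .format("parquet")
    .option("path", "abfss://curated@mydatalake.dfs.core.windows.net/bucket_counts/")
    .option("checkpointLocation",
            "abfss://curated@mydatalake.dfs.core.windows.net/_chk/bucket_counts/")
    .start()
)
query.awaitTermination()
```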

Manage batches and pipelines
  • Trigger batches (see the SDK sketch after this list)
  • Handle failed batch loads
  • Validate batch loads
  • Manage data pipelines in Data Factory/Synapse Pipelines
  • Schedule data pipelines in Data Factory/Synapse Pipelines
  • Implement version control for pipeline artifacts
  • Manage Spark jobs in a pipeline
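
Triggering and polling a pipeline run programmatically, a sketch using the azure-identity and azure-mgmt-datafactory packages; the subscription, resource group, factory, and pipeline names are placeholders:

```python
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(
    DefaultAzureCredential(), "00000000-0000-0000-0000-000000000000")

run = client.pipelines.create_run(
    "my-resource-group", "my-factory", "CopySalesPipeline",
    parameters={"run_date": "2023-06-01"})

# Poll until the run reaches a terminal state.
while True:
    status = client.pipeline_runs.get(
        "my-resource-group", "my-factory", run.run_id).status
    if status in ("Succeeded", "Failed", "Cancelled"):
        break
    time.sleep(30)
print(f"Pipeline finished with status: {status}")
```
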
Design and implement data security (10–15%)
Design security for data policies and standards
  • Design data encryption for data at rest and in transit
  • Design a data auditing strategy
  • Design a data masking strategy
  • Design for data privacy
  • Design a data retention policy
  • Design to purge data based on business requirements
  • Design Azure role-based access control (Azure RBAC) and POSIX-like Access Control List (ACL) for Data Lake Storage Gen2
  • Design row-level and column-level security
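
One way a masking strategy lands in practice is dynamic data masking in Azure SQL; a sketch applying a built-in mask via pyodbc, with the server, database, table, and column all placeholders:

```python
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=myserver.database.windows.net;Database=sales;"
    "Authentication=ActiveDirectoryInteractive;"
)
cur = conn.cursor()
# Non-privileged readers now see a masked address such as aXX@XXXX.com;
# principals holding the UNMASK permission still see the raw value.
cur.execute("""
    ALTER TABLE dbo.DimCustomer
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');
""")
conn.commit()
cur.close()
```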

Implement data security
  • Implement data masking
  • Encrypt data at rest and in motion
  • Implement row-level and column-level security
  • Implement Azure RBAC
  • Implement POSIX-like ACLs for Data Lake Storage Gen2
  • Implement a data retention policy
  • Implement a data auditing strategy
  • Manage identities, keys, and secrets across different data platform technologies
  • Implement secure endpoints (private and public)
  • Implement resource tokens in Azure Databricks
  • Load a DataFrame with sensitive information
  • Write encrypted data to tables or Parquet files
  • Manage sensitive information
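
A sketch of the POSIX-like ACL item above using azure-storage-file-datalake, granting a hypothetical service principal read/execute on one directory; account, container, and object ID are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://mydatalake.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
directory = (service.get_file_system_client("curated")
                    .get_directory_client("fact_sales"))

# r-x lets the principal list the directory and read files beneath it; the
# owner/group/other entries preserve the existing base permissions. This call
# is not recursive, so child items keep their own ACLs.
directory.set_access_control(
    acl="user::rwx,group::r-x,other::---,"
        "user:11111111-2222-3333-4444-555555555555:r-x"
)
```
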
Monitor and optimize data storage and data processing (10–15%)
Monitor data storage and data processing
  • Implement logging used by Azure Monitor
  • Configure monitoring services
  • Measure performance of data movement
  • Monitor and update statistics about data across a system
  • Monitor data pipeline performance
  • Measure query performance
  • Monitor cluster performance
  • Understand custom logging options
  • Schedule and monitor pipeline tests
  • Interpret Azure Monitor metrics and logs
  • Interpret a Spark directed acyclic graph (DAG)
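
A sketch of interpreting Azure Monitor logs programmatically with azure-monitor-query, assuming Data Factory diagnostics are routed to a Log Analytics workspace; the workspace ID is a placeholder:

```python
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# KQL over the ADFPipelineRun table populated by diagnostic settings.
kql = """
ADFPipelineRun
| where Status == 'Failed'
| summarize failures = count() by PipelineName
| order by failures desc
"""

response = client.query_workspace(
    workspace_id="00000000-0000-0000-0000-000000000000",
    query=kql,
    timespan=timedelta(days=1),
)
for table in response.tables:
    for row in table.rows:
        print(row)
```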

Optimize and troubleshoot data storage and data processing
  • Compact small files
  • Rewrite user-defined functions (UDFs)
  • Handle skew in data
  • Handle data spill
  • Tune shuffle partitions
  • Find shuffling in a pipeline
  • Optimize resource management
  • Tune queries by using indexers
  • Tune queries by using cache
  • Optimize pipelines for analytical or transactional purposes
  • Optimize pipeline for descriptive versus analytical workloads
  • Troubleshoot a failed Spark job
  • Troubleshoot a failed pipeline run
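
Two of the optimization items above, compacting small files and tuning shuffle partitions, in a PySpark sketch with placeholder paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Fewer, larger shuffle partitions for a modest dataset (the default is 200).
spark.conf.set("spark.sql.shuffle.partitions", "64")

src = "abfss://enriched@mydatalake.dfs.core.windows.net/events/"
dst = "abfss://enriched@mydatalake.dfs.core.windows.net/events_compacted/"

df = spark.read.parquet(src)

# Compact many small files into a handful of larger ones; coalesce avoids a
# full shuffle because it only reduces the partition count.
df.coalesce(16).write.mode("overwrite").parquet(dst)
```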