Lead Data Engineer - Azure Databricks/Kafka

Virtusa
Dubai, UAE
Full-time
Mid-Senior
Today
ETL, Data Warehousing, SQL, Python, Spark, Cloud Computing (AWS

Full Job Posting

Overview

Design and develop streaming ingestion pipelines using Apache Spark (Structured Streaming) and Databricks Auto Loader to consume files from cloud storage or messages from Kafka/RabbitMQ/Confluent Cloud and ingest them into Delta Lake, ensuring schema evolution and exactly-once semantics.

Implement CDC and deduplication logic by capturing change events from source databases using Debezium, the built-in CDC features of SQL Server/Oracle, or other connectors, and apply watermarking and drop-duplicates strategies based on primary keys and event timestamps.

Scale ingestion through configuration by building a config-driven framework, for example with Airflow, Databricks Jobs, or Delta Live Tables, that iterates over metadata tables to deploy and update ingestion pipelines for hundreds of tables and sources without code duplication.

Implement monitoring, observability, and security by capturing streaming query metrics and publishing them to monitoring platforms such as Prometheus and Grafana; setting up dashboards for lag, files processed, and processing duration; and enforcing role-based access control, encryption, and data masking.

Participate in DevOps processes by using CI/CD pipelines, such as Jenkins or GitHub Actions, to automate job deployment; managing infrastructure with Terraform or similar tools; and following best practices for version control and code reviews.

This role requires 5–8 years of experience designing and building data pipelines using Apache Spark, Databricks, or equivalent big data frameworks, along with hands-on expertise with streaming and messaging systems such as Apache Kafka, Confluent Cloud, RabbitMQ, or Azure Event Hubs, including creating producers, consumers, and topics and integrating them into downstream processing.
Candidates should possess a deep understanding of relational databases and CDC, with proficiency in SQL Server, Oracle, or other RDBMSs and experience capturing change events using Debezium or native CDC tools; proficiency in programming languages such as Python, Scala, or Java; solid knowledge of SQL for data manipulation and transformation; cloud platform expertise, specifically with Azure or AWS services for data storage, compute, and orchestration; and knowledge of data lakehouse architectures, Delta Lake, partitioning strategies, and performance optimization. Additionally, familiarity with Git, CI/CD pipelines, and infrastructure-as-code is essential.
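
For candidates unfamiliar with the deduplication responsibility described above (watermarking combined with dropping duplicates by primary key and event timestamp), the idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the employer's implementation and not how Spark Structured Streaming works internally: the `ChangeEvent` shape, the `deduplicate` function, and the rule of keeping the latest event per key are all hypothetical, and the watermark here advances per event rather than per micro-batch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    # Hypothetical shape of a CDC change event; field names are assumptions.
    primary_key: str
    event_time: datetime
    payload: dict


def deduplicate(events, watermark_delay):
    """Keep the latest event per primary key and drop late arrivals.

    An event counts as late when its timestamp is older than the
    maximum event time seen so far minus watermark_delay.
    """
    latest = {}
    max_event_time = None
    for event in events:
        if max_event_time is None or event.event_time > max_event_time:
            max_event_time = event.event_time
        watermark = max_event_time - watermark_delay
        if event.event_time < watermark:
            continue  # behind the watermark: treated as too late, dropped
        current = latest.get(event.primary_key)
        if current is None or event.event_time >= current.event_time:
            latest[event.primary_key] = event
    return sorted(latest.values(), key=lambda e: e.primary_key)


# Sample data (made up for illustration).
base = datetime(2024, 1, 1, 12, 0, 0)
events = [
    ChangeEvent("a", base, {"v": 1}),
    ChangeEvent("b", base + timedelta(minutes=1), {"v": 1}),
    ChangeEvent("a", base + timedelta(minutes=20), {"v": 2}),
    # Arrives after the watermark has moved to 12:10, so it is dropped.
    ChangeEvent("b", base + timedelta(minutes=2), {"v": 99}),
]
result = deduplicate(events, timedelta(minutes=10))
```

With a 10-minute delay, key `a` ends up with its newer event and the late update for key `b` is discarded, so `b` keeps its earlier value. In a real Databricks pipeline the same intent would normally be expressed with the framework's own watermark and duplicate-dropping features rather than hand-written loops.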
