We are seeking an experienced Data Lake Implementation Specialist to guide the setup and/or integration of on-premises and cloud data lakes that enable real-time analytics and AI in medium to large digital businesses. Experience with Apache Doris is an added advantage.
Core Skills & Expertise
Data Lake Architecture (Hybrid & Multi-Cloud)
Designing modern data lakehouses with raw + curated layers, unified batch + streaming ingestion
Integration with enterprise systems and support for schema-on-read
Familiarity with lakehouse tools: Delta Lake, Apache Iceberg, Hudi
Real-Time Data Processing
Expertise with streaming architectures: Apache Kafka, Flink, Spark Streaming
Experience with event-driven design, CDC, and real-time ETL tools
Delivered at least one large-scale Doris-based or comparable OLAP system in production
Tools: Debezium, StreamSets, Apache NiFi
Cloud & On-Prem Data Services
Cloud: AWS (S3, Glue, EMR, Kinesis), Azure (ADLS Gen2, Synapse), GCP (BigLake, Dataflow)
On-prem: Hadoop, Cloudera, MapR, private cloud environments
AI/ML Enablement
Data Preparation for AI/ML
Building pipelines for feature extraction and versioning datasets
Integration with feature stores and data quality enforcement
MLOps Readiness
Integration with ML pipelines (Kubeflow, MLflow, SageMaker)
Model deployment, tuning, and monitoring at scale
Analytics & BI Integration
Support for BI tools (Power BI, Tableau) and fast querying layers (Presto, Trino)
Near real-time dashboard enablement
Governance, Observability, and Security
Enterprise Data Governance
Implementing data ownership, lineage, and access policies
Use of catalogs: Collibra, Apache Atlas, AWS Glue Data Catalog
Observability & Monitoring
End-to-end pipeline visibility, logs, and metrics
Tools: Prometheus, Grafana, OpenTelemetry, Monte Carlo
Security & Compliance
Encryption, tokenization, and data masking
Adhering to regulations and standards: GDPR, HIPAA, SOC 2
Execution Experience
Large-Scale Implementations
Hands-on delivery of hybrid data lake architectures
Experience with syncing on-prem and cloud data systems
Cross-Functional Leadership
Working with data scientists, product teams, and security teams
Leading data platform teams or Centers of Excellence
Agility at Scale
Agile delivery models for data initiatives
Delivering data products and ML capabilities incrementally
Ideal candidate profile summary
A hands-on, strategic data lake architect/engineer with deep knowledge of hybrid and multi-cloud systems, proven experience with streaming data and ML enablement, and the leadership to orchestrate teams around real-time analytics and decision intelligence at digital-enterprise scale.
Bonus: Certifications & Tools
Certifications
AWS/GCP/Azure Data Engineer or ML Engineer
Databricks Lakehouse Accreditation
CDMP or DAMA certification
Tools Stack
Airflow, dbt, Spark, Flink, Kafka
Terraform, GitOps, CI/CD
MLflow, Feature Store, SageMaker, Vertex AI
Apache Ranger, Atlas, Lake Formation