Feb 2019 - Present: Senior Software Engineer, Tech Lead (Uber ATG)
Worked on compute infrastructure (Peloton, Mesos, Kubernetes) for self-driving cars. Helped to scale our compute systems for running millions of containers per day: metrics for observability, alerting, and developer tooling for launching jobs (Spark, Zeppelin and Jupyter notebook servers, simulations).
Currently working on the data infrastructure team where I am leading various projects on building systems to improve infrastructure efficiency (cost and utilization) across on-premise and cloud resources:
- Real-time efficiency chargeback: Kubernetes clusters and internal remote inference server (EKS, Kinesis, Flink, Glue Spark ETL)
- Real-time budget tracking and notification system (RESTful budget API, RDS, Flink)
- Automated cost allocation tagging and tag enforcement (AWS Lambda)
- Tools for cost and resource usage reporting (Athena, internal BI).
Besides working on infrastructure efficiency I help optimizing our PB-scale HDFS storage systems and lead several core data infrastructure projects:
- ATG data lake in AWS
- Real-time stream processing and analytics platform on AWS
Nov 2017 - Feb 2019: Staff Software Engineer (Zenreach)
Working on the design and implementation of our core data services and infrastructure. Currently, technical lead for the presence platform. Presence platform enables real-time detection of customer walk-ins, walk-bys, and walkthroughs. Majority of our product features are driven by this data. I am responsible for:
- Design, implementation, and operation of our batch and streaming services.
- Performance, uptime, and QoS guarantees.
- Observability and monitoring.
- Data replication, backfilling, and archival.
Apr 2016 - Nov 2017: Principal Software Engineer (Verizon Labs)
Working on the design and implementation of NorthStar, a serverless computing platform for large-scale event processing and data analysis. Built the platform from ground up. Currently technical lead for all backend services (APIs, storage, data processing, messaging, compute, resource management) and SDKs. Among others I am working on the following components:
- Design and implementation of a custom Apache Mesos framework for container management.
- Design and implementation of a multi-user load-balanced Spark-based batch processing engine (serves all our SQL queries).
- Design and implementation of our distributed stream processing engine.
- Design and implementation of a secure Lua-based sandbox and the associated libraries to execute arbitrary code snippets.
- Design and implementation of a distributed cron service.
- Design and implementation of a service to interface with our database (Apache Cassandra).
- Design and implementation of a service to manage topics and partitions of our messaging system (Apache Kafka).
- Design and implementation of a service to manage our object store (EMC ECS via AWS SDK).
- Tooling for deployment on our clusters (thousands of servers) and public cloud resources (Azure)
This project now powers all device visualizations in our ThingSpace IoT platform and is used by our DevOps team for service monitoring. It is deployed in a fully containerized environment running Docker containers on Mesosphere DC/OS (Apache Mesos) across multiple DCs. Some of its capabilities were recently presented at Microsoft Build 2017 (see attached presentation).
Most of the code is available in open-source at Northstar
Apr 2014 - Apr 2016: Senior Research Engineer (Ericsson Silicon Valley)
Worked on the design and implementations of analytics-driven large-scale distributed systems:
- Designed and implemented a multi-datacenter container orchestrator service along with a per-host passive monitoring service to expose host, container, and service-level metrics. Integrated with Apache Spark for model building and near real-time service anomaly detection. Worked extensively with open-source technologies such as HDFS, Tachyon, ZooKeeper, etcd, Docker, Mesos, Marathon, and Spark (Streaming, MLlib).
- Designed and implemented a Spark Streaming application to detect potholes in a near real-time fashion. This work was presented at the Mobile World Congress 2016. Implemented a scalable movie recommendation system (based on Cloudera Oryx) in a media platform at research.
- Working jointly with business unit cloud and Apcera R&D. Designed and implemented a Mesos compatibility service that enables to transparently run Mesos frameworks across Apcera resources. Contributed to the design and testing of new networking features. Studied various cluster management systems (Kubernetes, YARN, Mesos, Spark Standalone, Borg/Omega).
- Worked as part of Ericsson MediaFirst R&D on the design and implementation of their analytics platform. Contributed to Spark CI pipeline (proposed improvements), Apache Mesos platform, and Spark-based movie recommendation (added new filters, improved movie similarity computation, designed and implemented implicit feedback to scalar score computation job, used MLlib ALS to compute personalized movie recommendations).
- Collaboration with UCB AMPLab and CMU SV. Led projects involving CMU masters and PhD students working on cloud resource management and recommender systems.
- Filed six patents in areas of machine learning and cloud resource management. Published two conference papers (appeared at IEEE CLOUD 2015 and IEEE BigData 2016).
Jan 2013 - Apr 2014: Postdoctoral Researcher (Lawrence Berkeley National Lab)
I was a member of the Data Science and Technology Department where I have worked in the Integrated Data Frameworks Group on the design and implementation of a software ecosystem to facilitate seamless data analysis across desktops, HPC and cloud environments. Specifically, my work was centered around the following projects:
- Intelligent storage and data management in clouds. I have contributed to the design, implementation, and evaluation of the FRIEDA data management system: extended FRIEDA to enable application execution on Amazon EC2, developed a Command Line Interface to easily plugin applications into FRIEDA on EC2 and OpenStack clouds, run experiments using scientific applications on EC2. Finally, I have collaborated closely with a physicist to leverage FRIEDA for processing data from the ATLAS experiment at CERN.
- Designed and implemented a software to automate meta-data extraction of over 100 AmeriFlux tower sites, generate summaries, and inform the sites principal investigators.
- File systems performance analysis on Amazon EC2. Focused on LustreFS performance analysis. This work was done in collaboration with the Intel High Performance Data Division.
- Emulation of next-generation infrastructures to serve data and compute-intensive applications. Run experiments using Linux containers on the FutureGrid experimentation testbed. This work was done in collaboration with HP Labs.
Dec 2019 - Dec 2012: Doctoral Researcher (INRIA)
I have worked on autonomic and energy-efficient virtual machine (VM) management in large-scale virtualized data centers. More precisely, my contributions were two fold:
- Designed and implemented a scalable, autonomic, and energy-efficient VM management system called Snooze. For scalability and autonomy Snooze is based on self-configuring and healing hierarchical architecture. For energy-efficiency Snooze provides a unique holistic energy management solution by integrating VM resource (CPU, memory, network Tx, network Rx) utilization monitoring and estimation, server underload and overload mitigation, VM consolidation, and power management mechanisms. The system has been extensively evaluated at large-scale (thousands of system services) on the Grid’5000 experimentation testbed and shown to be scalable, autonomic, and energy-efficient. It is now available as open-source software under the GPL v2 license at Snooze. The source code (written in Java) is hosted at Github
- The second contribution is a novel VM placement algorithm based on the Ant Colony Optimization (ACO) meta-heuristic. ACO is especially attractive for VM placement due to its polynomial worst-case time complexity, close to optimal solutions and ease of parallelization. This work was evaluated by simulation. IBM ILOG CPLEX solver was used to compute the optimal solutions. To enable scalable VM consolidation, this thesis makes two further contributions: (i) an ACO-based VM consolidation algorithm; (ii) a fully decentralized VM consolidation system based on an unstructured peer-to-peer network. The proposed system was evaluated by emulation on the Grid’5000 testbed. Results show the system to be scalable as well as to achieve a data center utilization close to the one obtained by executing a centralized consolidation algorithm.
Jul 2012 - Sep 2012: Research Intern (Lawrence Berkeley National Lab)
- Analyzed the performance and energy efficiency of Hadoop deployments with collocated and separated data and compute layers sing scientific data-intensive applications on physical and virtual clusters. The experiments were conducted on 33 power-metered servers of the Grid’5000 experimentation testbed. This work was presented and published at the IEEE BigData 2013 conference (main track).
Mar 2019 - Aug 2019: Master Research Intern (INRIA)
Contributed to the XtreemOS grid operating system (developed by around 130 people). Extended the distributed checkpointing service (written in Java) in three ways:
- Designed and implemented independent checkpointing without message logging in XtreemOS.
- Designed and implemented a user-space checkpoint callback library.
- Implemented a JNI translation library to support the latest LinuxSSI/Kerrighed kernel-level checkpointer version.
Jun 2007 - Feb 2009: Student Research Assistent (University of Duesseldorf)
- Design and implementation of the first XtreemOS distributed checkpointing service prototype.
- Designed and implemented incremental checkpointing at kernel-level in the Kerrighed Single System Image operating system for clusters.