Delve into the internals of Apache Spark by examining the design decisions and implementation details behind the scenes. Through various use cases, we will understand the problems Spark solves in the Data Engineering ecosystem, how it solves them, and the tradeoffs that come with the power and flexibility it offers developers.
Attendees will walk away knowing the stages involved in executing a Spark job, which will help them write more efficient jobs, pinpoint the root cause when debugging failures, and identify configuration issues in the Spark cluster.
Agenda:
A brief history of the Data Engineering ecosystem
The problems Spark was created to solve
The fundamentals of writing a Spark job
A behind-the-scenes look at how Spark processes data
The key touchpoints and their performance implications
Identifying potential problems and optimizations
About the Speaker:
Dakshin has spent the better part of the last three years working with Spark and data pipelines in production. Along the way, he has gained insight into how these tools operate under the hood, which equips him to write efficient data pipelines and identify performance issues quickly. He is also an open-source contributor to tools in the Data Engineering space, including Apache Airflow and Marquez.