Apache Spark is a compute engine for parallel and distributed computing. Spark is resilient to machine failures because each computation records its dependencies all the way back to a collection on stable storage, so any intermediate result can be recomputed at any time. Recomputing from stable storage would be slow, though; Spark stays fast by allowing these intermediate results to be cached in cluster memory. Spark also offers a productive programming model: a single general abstraction that supports a wide range of analytical and query tasks.
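To make the lineage-and-caching idea concrete, here is a minimal Scala sketch (the input path `events.log` and the word-splitting logic are hypothetical, introduced only for illustration): each transformation records its parent, and `cache()` keeps the computed result in cluster memory for reuse.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("lineage-sketch").setMaster("local[*]"))

    // Each transformation below records its parent RDD, so the chain can be
    // recomputed from the file on stable storage if a partition is lost.
    val lines  = sc.textFile("events.log")          // hypothetical input file
    val errors = lines.filter(_.contains("ERROR"))
    val counts = errors
      .map(line => (line.split(" ")(0), 1))         // assumes space-delimited records
      .reduceByKey(_ + _)

    // cache() pins the result in cluster memory so later actions reuse it
    // instead of re-reading and re-transforming the file.
    counts.cache()

    println(counts.count())         // first action: computes and caches
    counts.take(5).foreach(println) // second action: served from memory

    sc.stop()
  }
}
```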
In this talk, I'll provide a general introduction to Spark. We'll discuss Spark's fundamental abstraction, the resilient distributed dataset (RDD), and examine its rich standard libraries for machine learning, structured queries, graph computation, and stream processing. We'll close with a case study showing how Spark made it easy for me to make sense of some real-world data.