Friday, 9 April 2021

What is Apache Spark?

 
APACHE SPARK
General-purpose engine that can combine different types of computations (SQL queries, text processing & ML)
Main factor : speed (Spark keeps intermediate data in memory, avoiding much of the disk I/O of MapReduce)
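A minimal sketch of the in-memory caching behind that speed claim, assuming PySpark is installed and run locally; the app name and dataset size are arbitrary:

from pyspark.sql import SparkSession

# Start a local Spark session (the entry point for the DataFrame/SQL APIs).
spark = SparkSession.builder.appName("speed-demo").getOrCreate()

df = spark.range(0, 10_000_000)           # a simple numeric DataFrame
df.cache()                                # keep it in executor memory
print(df.count())                         # first action materialises the cache
print(df.filter("id % 2 = 0").count())    # second action reuses the in-memory data
spark.stop()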

Spark :
  • integrates closely with other Big Data tools (can run on Hadoop clusters & access any Hadoop data source, as well as Cassandra)
  • offers simple APIs in Python, Java, Scala, R & SQL, plus built-in libraries
  • allows querying data via SQL and HiveQL (HQL)
  • supports many data sources (Hive tables, Parquet) & file formats (JSON, CSV, text, etc.) : see the sketch below
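A minimal sketch of reading a data source and querying it with SQL from PySpark; the file paths (people.json, adults.parquet) and columns (name, age) are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-demo").getOrCreate()

# Read a JSON file into a DataFrame (hypothetical path and schema).
people = spark.read.json("people.json")

# Register it as a temporary view and query it with plain SQL.
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

# The result can be written back out in another supported format, e.g. Parquet.
adults.write.mode("overwrite").parquet("adults.parquet")
spark.stop()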
Spark Architecture
  • Executor : separate JVM process running on each worker node; runs the tasks assigned to it by the driver
  • Driver node : executes the main program, creates the SparkContext and schedules tasks on the executors (see the sketch below)
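A minimal sketch of that split in PySpark: this script is the driver program, and the lambda passed to map() is shipped to the executors on the worker nodes; the data and partition count are arbitrary:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arch-demo").getOrCreate()
sc = spark.sparkContext                         # the driver owns the SparkContext

# The driver defines the computation; the work itself runs in executor tasks.
rdd = sc.parallelize(range(100), numSlices=4)   # 4 partitions -> 4 tasks
squares = rdd.map(lambda x: x * x)              # executed on the executors

# Actions bring results back to the driver.
print(squares.sum())
spark.stop()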

Hadoop vs. Spark
  • Spark : processes data both in batch and as (near) real-time streams; Hadoop MapReduce : batch mode only (see the streaming sketch below)
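A minimal sketch of Spark Structured Streaming (a continuous word count over a TCP socket), illustrating the real-time side; the host and port are assumptions, and the stream can be fed with e.g. `nc -lk 9999`:

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# Read an unbounded stream of text lines from a TCP socket (host/port assumed).
lines = (spark.readStream.format("socket")
         .option("host", "localhost")
         .option("port", 9999)
         .load())

# Split each line into words and keep a running count per word.
words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Continuously print the updated counts to the console.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()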