APACHE SPARK
General-purpose engine that can combine different types of computations (SQL queries, text processing & ML)
Main selling point : speed
Spark :
- Integrates closely with other big data tools (can run on Hadoop clusters & access any Hadoop data source, plus Cassandra)
- Offers simple APIs in Python, Java, Scala, R & SQL, plus built-in libraries
- Allows querying data via SQL and HQL (Hive QL) (see the sketch after this list)
- Supports many data sources (Hive tables, Parquet) & formats (JSON, CSV, txt, etc.)
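A minimal PySpark sketch of the SQL workflow mentioned above; the file people.csv and its name/age columns are hypothetical illustration values:

    from pyspark.sql import SparkSession

    # Entry point for the DataFrame and SQL APIs
    spark = SparkSession.builder.appName("sql-example").getOrCreate()

    # Load a CSV file (hypothetical path) into a DataFrame
    df = (spark.read
               .option("header", "true")
               .option("inferSchema", "true")
               .csv("people.csv"))

    # Register the DataFrame as a temporary view so it can be queried with SQL
    df.createOrReplaceTempView("people")

    # Standard SQL over the data source
    adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
    adults.show()

    spark.stop()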
Spark Architecture
- Executor : separate JVM processes running on the worker nodes; they execute the tasks the driver assigns to them
- Driver node : executes the main() program, creates the SparkContext/SparkSession, and schedules the job's tasks onto the executors (see the sketch after this list)
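A small PySpark sketch of the driver/executor split; the data and the partition count are arbitrary illustration values:

    from pyspark.sql import SparkSession

    # This script is the driver program: it builds the SparkSession and defines the job
    spark = SparkSession.builder.appName("driver-executor-sketch").getOrCreate()
    sc = spark.sparkContext

    # parallelize() splits the data into partitions; each partition becomes a task
    # that the driver schedules onto the executors
    nums = sc.parallelize(range(1_000_000), numSlices=8)

    # filter/map run as tasks inside the executor JVMs on the worker nodes;
    # reduce sends the partial results back to the driver for the final answer
    total = nums.filter(lambda x: x % 2 == 0).map(lambda x: x * x).reduce(lambda a, b: a + b)
    print(total)

    spark.stop()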
Hadoop vs. Spark
- Spark : handles streaming data (processed as it arrives, in near real time) as well as batch; Hadoop MapReduce : batch mode only (see the sketch below)
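A minimal Spark Structured Streaming sketch of the streaming side, assuming text lines arrive on a local socket (host and port are placeholder values):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import explode, split, col

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    # Treat lines arriving on a socket as an unbounded, continuously growing table
    lines = (spark.readStream
                  .format("socket")
                  .option("host", "localhost")
                  .option("port", 9999)
                  .load())

    # Incrementally maintain word counts as new lines arrive
    words = lines.select(explode(split(col("value"), " ")).alias("word"))
    counts = words.groupBy("word").count()

    # Print the updated counts to the console after each micro-batch
    query = (counts.writeStream
                   .outputMode("complete")
                   .format("console")
                   .start())
    query.awaitTermination()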