Final answer:
Apache Spark's key features are in-memory processing for speed, data lineage for fault recovery, and lazy evaluation for optimization. It also supports both real-time streaming and batch processing, and lets advanced users tune job performance manually.
Step-by-step explanation:
The key features of Apache Spark include:
- In-memory processing: This allows Spark to perform data processing at high speeds because it can keep data in RAM across multiple operations, reducing the need to read from and write to disk.
- Data lineage: Spark tracks the lineage of each RDD (Resilient Distributed Dataset) as a directed acyclic graph (DAG) of the transformations that produced it. If a node fails, Spark can recompute just the lost partitions by replaying that lineage, instead of replicating the data up front.
- Lazy evaluation: Transformations on RDDs (such as `map` or `filter`) are not executed immediately. Spark only builds up the plan, and runs the whole sequence when an action (such as `count` or `collect`) is called, which lets it optimize the overall data processing workflow before executing it.
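The interplay of lineage and lazy evaluation can be illustrated with a toy sketch. This is plain Python, not the real Spark API: the `TinyRDD` class and its methods are invented here purely to show how a chain of deferred transformations forms a lineage that is only replayed when an action runs.

```python
# Conceptual sketch (NOT the real Spark API): a tiny RDD-like class that
# records its lineage and defers all work until an action is called.
class TinyRDD:
    def __init__(self, data=None, parent=None, transform=None, name="source"):
        self._data = data            # only the source node holds actual data
        self._parent = parent        # parent link forms the lineage DAG
        self._transform = transform  # deferred function, applied lazily
        self.name = name

    def map(self, fn):
        # Transformation: returns a new node, computes nothing yet.
        return TinyRDD(parent=self,
                       transform=lambda xs: [fn(x) for x in xs],
                       name="map")

    def filter(self, pred):
        return TinyRDD(parent=self,
                       transform=lambda xs: [x for x in xs if pred(x)],
                       name="filter")

    def lineage(self):
        # Walk the parent links to reconstruct the chain of transformations.
        node, chain = self, []
        while node is not None:
            chain.append(node.name)
            node = node._parent
        return list(reversed(chain))

    def collect(self):
        # Action: replay the lineage from the source data. Replaying the
        # same chain is how Spark recomputes a lost partition after failure.
        if self._parent is None:
            return list(self._data)
        return self._transform(self._parent.collect())


rdd = TinyRDD(data=[1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(rdd.lineage())   # ['source', 'map', 'filter']
print(rdd.collect())   # [20, 30, 40]
```

Note that building `rdd` does no computation at all: only the final `collect()` call walks the chain and produces results, mirroring how Spark defers execution until an action.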
Other features that complement these core functionalities include:
- Real-time streaming: Spark supports near-real-time data processing through Spark Streaming (and, in newer versions, Structured Streaming), which processes live data in small micro-batches.
- Batch processing: Besides streaming, Spark can also handle batch data processing, working with large datasets in a distributed manner.
- Manual optimization: Advanced users can tune job performance through partitioning strategies (e.g., `repartition`, `partitionBy`) and by caching intermediate results (`cache`, `persist`) so they are not recomputed.
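As an illustration of why partitioning matters, here is a minimal pure-Python sketch of hash partitioning, the strategy behind Spark's `partitionBy`. The `hash_partition` helper is hypothetical (written for this example), but the invariant it demonstrates is the real one: all records with the same key land in the same partition, so key-based operations like `groupByKey` or a join can avoid reshuffling that key's data.

```python
# Conceptual sketch of hash partitioning (hash_partition is a made-up helper,
# not a Spark function): each key hashes to exactly one partition index.
def hash_partition(records, num_partitions):
    partitions = [[] for _ in range(num_partitions)]
    for key, value in records:
        idx = hash(key) % num_partitions   # same key -> same partition
        partitions[idx].append((key, value))
    return partitions


pairs = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
parts = hash_partition(pairs, 2)

# Every occurrence of a given key sits in exactly one partition.
for key in {"a", "b", "c"}:
    holders = [i for i, p in enumerate(parts) if any(k == key for k, _ in p)]
    assert len(holders) == 1
```

Caching is the complementary lever: once an expensive intermediate result is partitioned the way later stages need it, persisting it in memory means the lineage does not have to be recomputed on every subsequent action.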