secfree - Think about Implementing a Spark Streaming Application Systematically

Think about Implementing a Spark Streaming Application Systematically

Below are some of my notes about Spark Streaming. I only add answers to not all known questions.

Input
- Manage offsets by Kafka or self?
- Does “createDirectStream” support group mode?
  - “createDirectStream” in “spark-streaming-kafka-0-8” uses the old low level Kafka API, does not support “group” mode
  - “spark-streaming-kafka-0-10” use the new Kafka API, supports “group” mode
- How to control the consuming speed?
  - “maxRatePerPartition”
  - “backpressure”
Process
- How to determine the executor number and core number?
  - What’s the relationship between executors, partitions, tasks and cores?
  - Does Spark distribute Kafka partitions to executors by executor number or core number?
- How much memory does an executor need?
  - What’s Spark’s memory management mechanism?
  - Can RDD size bigger than executor memory?
  - Why Spark stream state cost so much memory?
    - “MEMORY_ONLY” costs 2x-5x of data size
    - Default cache 20 (2 * checkpoint_duration(10 * batch_duration)) state RDDs
  - What are the possible reasons of “OutOfMemoryError”?
- How to maintain the state?
  - Manage Kafka by Spark directly or outside?
  - What’s the difference between “updateStateByKey”, “mapWithState” and “mapGroupsWithState”?
  - What’s the added cost to checkpoint?
  - Why “mapWithState” changed the partition number?
- What are the possible reasons for Spark Streaming straggler?
  - Data skew
  - Checkpoint
Output
- What’s the mechanism of distributing messages to Kafka?
  - Without key, round-robin
  - With key, hash by key
- Speculation may break up data consistency
Monitor
- Which point should the monitor covered?
  - Input speed, process speed, output speed, process offset lag, event_time delay
Advance
- Think about data recovery systematically. Every points in the data flow may corrupt, we should build a easy recover mechanism for each case
- How to correlate different streamings?
- How to ensure stability?

Think about Implementing a Spark Streaming Application Systematically by secfree was published on 2018-12-09