
When we broadcast a dataframe (say, in a broadcast join), Spark sends a copy of the dataframe to all the executors, i.e. each executor (not node, but executor) holds one copy of the data. So if a node has 3 executors, it unnecessarily holds 3 copies of the same data.
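For concreteness, a minimal PySpark sketch of the kind of broadcast join I mean (the table and column names are made up):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast, col

    spark = SparkSession.builder.appName("broadcast-join-sketch").getOrCreate()

    # Hypothetical tables: a large fact table and a small dimension table.
    orders = spark.range(10_000_000).withColumn("category_id", col("id") % 100)
    categories = spark.createDataFrame(
        [(i, f"cat_{i}") for i in range(100)], ["category_id", "name"]
    )

    # broadcast() ships a full copy of `categories` to every executor,
    # so each executor on a node ends up holding its own copy.
    joined = orders.join(broadcast(categories), "category_id")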

My question is, why can't it use off-heap space? Say I give it sufficient off-heap memory: why can't it store this data there, just one copy per node, so that all executors on that node can read from this off-heap space?
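For reference, the off-heap settings I mean are the standard per-executor Spark configs (the size below is just an example):

    from pyspark.sql import SparkSession

    # Enables Spark's off-heap memory region; the size is an arbitrary example.
    spark = (
        SparkSession.builder
        .config("spark.memory.offHeap.enabled", "true")
        .config("spark.memory.offHeap.size", "4g")
        .getOrCreate()
    )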

1 Comment

Spark does not store broadcast data in off-heap memory shared across executors because of its design principles: executor isolation for fault tolerance and simplicity, avoidance of synchronization overhead and performance bottlenecks, and a distributed architecture that minimizes inter-executor dependencies. To optimize broadcast behavior, reduce the number of executors per node (a sketch follows below), or consider alternatives like custom shared-memory solutions or distributed file systems.
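As a sketch of the "fewer executors per node" suggestion (the numbers are arbitrary examples), fewer but larger executors mean fewer duplicate broadcast copies per node:

    from pyspark.sql import SparkSession

    # One large executor per node instead of, say, four small ones:
    # the node then holds one broadcast copy instead of four.
    spark = (
        SparkSession.builder
        .config("spark.executor.cores", "15")
        .config("spark.executor.memory", "50g")
        .getOrCreate()
    )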

1 Answer

That's a good question. Let me first clarify a few things:

  • Executors are independent JVM processes that run tasks in isolation so they cannot interfere with each other's tasks. Sharing data between them would require inter-process communication (IPC), which adds complexity and creates bottlenecks.
  • When Spark broadcasts data, each executor holds its own copy, so if one executor corrupts its copy, the damage is contained to that executor. With a single shared copy, corrupted data would affect every executor on the node.
  • Spark is built around fault tolerance: if a task fails, Spark re-executes it on another executor with available resources. Keeping a copy of the broadcast data on each executor simplifies that re-execution (see the sketch after this list).
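As a quick sketch of these points, a broadcast variable is fetched and deserialized once per executor, and every task on that executor reads the same local copy via .value, with no inter-process communication:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("broadcast-sketch").getOrCreate()
    sc = spark.sparkContext

    # Each executor fetches and caches its own deserialized copy of the
    # broadcast value; tasks read it locally, and a failed task can be
    # re-run on any executor because each one has (or can fetch) a copy.
    lookup = sc.broadcast({"a": 1, "b": 2})

    rdd = sc.parallelize(["a", "b", "a", "b"])
    result = rdd.map(lambda k: lookup.value[k]).collect()  # [1, 2, 1, 2]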

Why can't it use off-heap space?

  • Performance: accessing a shared off-heap region could be slower than reading a local copy, because of IPC and synchronization overhead.
  • Lack of built-in mechanisms: Spark's core design treats the executor as the primary unit of memory management; sharing data directly between executors on the same node would require significant architectural changes to Spark's memory-management system.
  • Compatibility issues with different cluster managers: introducing node-level shared memory would require modifications and support from each of them (YARN, Kubernetes, standalone, etc.).

Alternatives:

  • Distributed file systems, like HDFS.
  • Distributed caching, like Redis or Memcached.
  • Custom shared memory, e.g. node-local files that every executor reads (see the sketch after this list).
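As one sketch of the last alternative (the path, file format, and helper names here are hypothetical), a node-level lookup file can be loaded lazily and cached at module level, so each Python worker process reads it at most once instead of receiving a broadcast copy:

    import json

    _lookup_cache = None  # cached once per worker process

    def _get_lookup(path):
        # Hypothetical loader: reads a node-local (or network-mounted)
        # file the first time it is needed, then reuses the cached copy.
        global _lookup_cache
        if _lookup_cache is None:
            with open(path) as f:
                _lookup_cache = json.load(f)
        return _lookup_cache

    def enrich_partition(keys):
        lookup = _get_lookup("/mnt/shared/lookup.json")  # hypothetical path
        for key in keys:
            yield key, lookup.get(key)

    # Usage sketch: rdd.mapPartitions(enrich_partition)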

Resources used:

1. https://blog.devgenius.io/spark-on-heap-and-off-heap-memory-27b625af778b
2. https://spark.apache.org/docs/latest/tuning.html


1 Comment

Thanks so much for such a detailed explanation. One quick question, though: it's mentioned that Spark uses off-heap memory when we cache a DataFrame. Won't that cause all the problems you've described above as well? Why is caching designed to use off-heap memory while broadcasting is not?
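(For reference, a minimal sketch of off-heap caching. StorageLevel.OFF_HEAP requires the spark.memory.offHeap.* configs to be enabled, and the off-heap region is still allocated inside each executor process, so cached blocks remain per-executor rather than node-shared.)

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.memory.offHeap.enabled", "true")
        .config("spark.memory.offHeap.size", "2g")
        .getOrCreate()
    )

    df = spark.range(1_000_000)
    df.persist(StorageLevel.OFF_HEAP)  # stored in each executor's off-heap region
    df.count()  # materialize the cache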
