
This is an excerpt from Manning's book Spark in Action, Second Edition.
The cluster manager allocates resources across applications. To run on a cluster, the SparkSession can connect to several types of cluster managers. The choice might be dictated by your infrastructure, your enterprise architects, or a know-it-all guru; you may not have a say here. Chapter 18 discusses more cluster manager options, including YARN, Mesos, and Kubernetes.
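As a rough illustration of how that choice shows up in practice: the cluster manager is normally selected through the master URL given to Spark. The sketch below is plain Python, not part of Spark or the book; the helper name `cluster_manager_for` and the example hosts are hypothetical. It only maps the documented master URL forms to the cluster manager they select.

```python
def cluster_manager_for(master: str) -> str:
    """Name the cluster manager selected by a Spark master URL.

    Hypothetical helper for illustration only; Spark does its own parsing.
    """
    if master == "yarn":
        # YARN has no host in the URL: the cluster location comes from
        # the Hadoop configuration files, not from the master string.
        return "YARN"
    if master.startswith("spark://"):
        return "Spark standalone"
    if master.startswith("mesos://"):
        return "Mesos"
    if master.startswith("k8s://"):
        return "Kubernetes"
    if master == "local" or master.startswith("local["):
        return "local (no cluster manager)"
    raise ValueError(f"Unrecognized master URL: {master!r}")


if __name__ == "__main__":
    # Example hosts and ports are placeholders, not real endpoints.
    for url in ("local[*]", "spark://master.example.com:7077", "yarn",
                "mesos://mesos.example.com:5050",
                "k8s://https://k8s.example.com:6443"):
        print(f"{url} -> {cluster_manager_for(url)}")
```

Note that the YARN case is the only clustered one without a host and port, which reflects how tightly it is tied to the surrounding Hadoop configuration.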
Apache Hadoop YARN (Yet Another Resource Negotiator) is a resource manager that has been fully integrated with Apache Hadoop since Hadoop version 2. YARN is a key component of a Hadoop deployment; it is not independent. If your organization already operates Hadoop clusters, it will most likely run Apache Spark on the same (or adjacent) cluster, through YARN.
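To make "running Spark through YARN" a little more concrete, here is a minimal sketch that assembles a `spark-submit` command line targeting YARN. The function `yarn_submit_command`, the JAR name, the class name, and the queue name are all assumptions made for illustration; the flags themselves (`--master`, `--deploy-mode`, `--class`, `--queue`, `--num-executors`, `--executor-memory`) are standard `spark-submit` options. The sketch only builds the command; it does not run it.

```python
import shlex
from typing import List, Optional, Sequence


def yarn_submit_command(
    app_jar: str,
    main_class: str,
    deploy_mode: str = "cluster",
    queue: Optional[str] = None,
    num_executors: Optional[int] = None,
    executor_memory: Optional[str] = None,
    app_args: Sequence[str] = (),
) -> List[str]:
    """Build (but do not execute) a spark-submit command for YARN.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")

    cmd = ["spark-submit", "--master", "yarn",
           "--deploy-mode", deploy_mode, "--class", main_class]
    if queue:
        cmd += ["--queue", queue]              # YARN queue to submit to
    if num_executors is not None:
        cmd += ["--num-executors", str(num_executors)]
    if executor_memory:
        cmd += ["--executor-memory", executor_memory]

    # Options must come before the application JAR; anything after the
    # JAR is passed to the application itself.
    cmd.append(app_jar)
    cmd.extend(app_args)
    return cmd


if __name__ == "__main__":
    command = yarn_submit_command(
        "my-app.jar", "com.example.MyApp",     # placeholder names
        queue="analytics", num_executors=4, executor_memory="2g",
    )
    print(shlex.join(command))
```

On a real cluster, the submitting machine would also need `HADOOP_CONF_DIR` or `YARN_CONF_DIR` pointing at the Hadoop client configuration, since the `yarn` master value carries no address of its own.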
Alibaba Cloud Elastic MapReduce (or E-MapReduce), Amazon EMR, Google Cloud Platform’s Dataproc, IBM Analytics Engine, Microsoft Azure HDInsight, and OVH Data Analytics Platform are managed offerings from the big cloud players. They are all based on Hadoop and include YARN as part of their cluster-to-go offering, making deployment easier. Figure 18.3 illustrates a combined Spark and YARN architecture.
In September 2019, Google announced that Dataproc will also use Kubernetes with Spark. Google is the first to go to production with a cloud-hosted Spark offering that has no strong dependency on YARN. Others will most certainly follow.
Compared with running Spark in standalone mode, YARN offers more features for process isolation and prioritization, which can result in better security and performance.
Figure 18.3 Architecture combining Hadoop YARN and Spark. The YARN resource manager works with the YARN node manager to manage the executors. Every YARN-based architecture shares a similar pattern.
You can find more on using YARN with Spark at http://mng.bz/BY6g and http://mng.bz/dxRX.