Fig 1.
The Apache Spark layered architecture.
The colors are used only to distinguish the elements. From bottom to top, the first layer shows some of the most common storage options used by Apache Spark applications to store and retrieve external data: the local file system, the Apache Hadoop HDFS distributed file system, the Amazon S3 object store, the Ceph storage system, and Google Cloud Storage (GCS). The second layer shows the scheduling engines that allow Apache Spark computations to run across the nodes of a distributed system: Apache Hadoop YARN, Apache Mesos, Kubernetes, and Spark's built-in standalone cluster manager. The Kubernetes option is included despite lacking some relevant features, such as fine-grained resource management and job queues, because it is widely used in practice. The third layer shows the core of the Apache Spark framework. The fourth layer shows the standard libraries integrated with Apache Spark: Spark SQL, for querying very large datasets using a dialect of the SQL language; MLlib, a library of ready-to-use machine learning algorithms and methods; GraphX, a library for representing and processing very large graphs in a distributed fashion; and Spark Streaming, a library for distributed processing of streaming data. The top layer lists the programming languages that can be used to write Apache Spark applications.
Fig 2.
The Java source code of an Apache Spark–based distributed alignment counter implemented using the Disq [44] library.