
Apache Spark Scala Interview Questions- Shyam Mallesh May 2026

    val rdd = sc.textFile("data.txt")                        // nothing read yet
    val words = rdd.flatMap(_.split(" "))                    // transformation
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)   // transformation
    counts.saveAsTextFile("output")                          // 🔥 Action triggers job

| Operation | Shuffle Behavior | Performance |
|----------------|------------------|--------------|
| `groupByKey` | Sends all values for a key across the network → high shuffle I/O | Slower, risks OOM |
| `reduceByKey` | Combines values locally (map-side reduce) before shuffle → reduces data transfer | Faster, memory efficient |
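To make the comparison in the table concrete, here is a minimal sketch that computes the same word counts both ways. The sample words, the `local[2]` master, and the app name are assumptions chosen for illustration; in `spark-shell`, `spark` and `sc` already exist and the session setup can be skipped.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `spark` and `sc`.
val spark = SparkSession.builder().master("local[2]").appName("shuffle-demo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical sample data: (word, 1) pairs spread over two partitions.
val pairs = sc.parallelize(Seq("a", "b", "a", "c", "b", "a"), numSlices = 2).map(w => (w, 1))

// groupByKey: every (word, 1) record crosses the shuffle, and the sum happens afterwards.
val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().toMap

// reduceByKey: partial sums are built inside each partition before the shuffle.
val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

// Both approaches agree on the result; they differ in how much data is shuffled.
println(viaGroup == viaReduce)

spark.stop() // omit when running inside spark-shell
```

On a dataset this small the difference is invisible, but the shuffle read/write sizes reported in the Spark UI should diverge as the number of values per key grows.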

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType),
      StructField("address", StructType(Seq(
        StructField("city", StringType),
        StructField("zip", LongType)
      )))
    ))
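A schema like the one above is typically passed to the reader so Spark can skip inference. The sketch below shows one way to do that; the in-memory JSON record is a hypothetical stand-in for a line of `data.json`, and the session settings are assumptions for a local run.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

// Local session for illustration only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[2]").appName("schema-demo").getOrCreate()
import spark.implicits._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("age", IntegerType),
  StructField("address", StructType(Seq(
    StructField("city", StringType),
    StructField("zip", LongType)
  )))
))

// Hypothetical record, used here instead of reading a file from disk.
val json = Seq("""{"name":"Asha","age":30,"address":{"city":"Pune","zip":411001}}""").toDS()

// With an explicit schema, Spark does not need to scan the data to guess types.
val df = spark.read.schema(schema).json(json)

// Nested fields are addressed with dot notation.
val row = df.select($"name", $"address.city", $"address.zip").head()
val ageType = df.schema("age").dataType

spark.stop() // omit when running inside spark-shell
```

For a real file, the same reader call would take a path, e.g. `spark.read.schema(schema).json("data.json")`.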

    val df = spark.read.option("inferSchema", "true").json("data.json")
    val rdd = sc

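Reading with schema inference, as in the snippet above, is automatic but slow, because Spark has to scan the input to work out the column types.

⚠️ `coalesce(1)` avoids a shuffle, but it funnels all the data into a single partition, so one task ends up doing all the work. It is only safe when the current partitions are small.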
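The `coalesce(1)` behavior can be checked by comparing its lineage with that of `repartition(1)`. This is a small sketch under assumed settings: the 100-element range, the 8 input partitions, and the local session are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `spark` and `sc`.
val spark = SparkSession.builder().master("local[2]").appName("coalesce-demo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical dataset split across 8 partitions.
val rdd = sc.parallelize(1 to 100, numSlices = 8)

// coalesce(1): merges partitions through a narrow dependency, so no shuffle stage is added.
val coalesced = rdd.coalesce(1)

// repartition(1): same target partition count, but it always goes through a shuffle.
val repartitioned = rdd.repartition(1)

val coalescedParts = coalesced.getNumPartitions
val repartitionedParts = repartitioned.getNumPartitions

// The lineage strings reveal whether a shuffle was introduced.
val coalescedLineage = coalesced.toDebugString
val repartitionedLineage = repartitioned.toDebugString
println(coalescedLineage)
println(repartitionedLineage)

spark.stop() // omit when running inside spark-shell
```

Both RDDs end up with one partition. The trade-off is that `coalesce(1)` can pull the upstream computation into that single task, while `repartition(1)` keeps upstream parallelism at the cost of a shuffle.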

Here's a curated set of Apache Spark Scala interview questions, structured in the style of Shyam Mallesh (known for clear, practical, and depth-driven technical content). These range from beginner to advanced, covering RDDs, DataFrames, Spark SQL, optimizations, and internals.

🚀 Apache Spark Scala Interview Questions – By Shyam Mallesh

✅ 1. What are the differences between `map`, `flatMap`, and `mapPartitions` in Spark?

| Transformation | Description |
|----------------|-------------|
| `map` | Applies a function to each element of an RDD/DataFrame and returns a new collection of the same size. |
| `flatMap` | Applies a function that returns a sequence (or `Option`) for each element and flattens the results. Useful for one-to-many transformations. |
| `mapPartitions` | Applies a function once per partition, passing the partition's elements as an iterator. Avoids per-element function call overhead. Good for initialization that should run once per partition (e.g., DB connections). |
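The three transformations in the table can be compared side by side on a tiny RDD. This is a minimal sketch: the two sample sentences, the partition count, and the local session are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; spark-shell already provides `spark` and `sc`.
val spark = SparkSession.builder().master("local[2]").appName("map-demo").getOrCreate()
val sc = spark.sparkContext

// Hypothetical input: two lines of text, one per partition.
val lines = sc.parallelize(Seq("spark is fast", "scala is fun"), numSlices = 2)

// map: exactly one output element per input element.
val lengths = lines.map(_.length).collect().toSeq

// flatMap: each input can produce zero or more outputs, which are flattened.
val words = lines.flatMap(_.split(" ")).collect().toSeq

// mapPartitions: the function runs once per partition and receives an iterator.
val perPartitionCounts = lines.mapPartitions { iter =>
  // Expensive setup (e.g. opening a connection) would go here, once per partition.
  Iterator.single(iter.size)
}.collect().toSeq

println(lengths)
println(words)
println(perPartitionCounts)

spark.stop() // omit when running inside spark-shell
```

Note that the function passed to `mapPartitions` must return an iterator, and consuming the input iterator lazily (rather than materializing it into a collection) keeps memory use low on large partitions.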