Notes on Spark Core Concepts and the DAG Execution Model
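This document is based on handwritten notes and study materials. It uses Mermaid diagrams to summarize Spark's core concepts, the DAG execution model, and the stage division mechanism, as an aid for review and understanding.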
1. Overview of Spark Core Concepts
Spark core concepts (mind map)
- RDD (Resilient Distributed Dataset)
  - Five key properties: immutability, partitioning, dependencies, lazy evaluation, persistence
  - Operation types: transformations, actions
- DAG (Directed Acyclic Graph)
  - Logical execution plan
  - Dependencies: narrow dependencies, wide dependencies
- Shared variables
  - Broadcast variables (Broadcast)
  - Accumulators (Accumulator)
- Execution flow
  - Driver program
  - Executor
  - Task
  - Stage

2. DAG Construction and Stage Division Flow
1. User code applies RDD transformations, which build up the DAG.
2. The DAGScheduler analyzes the dependencies and checks the type of each one:
   - Narrow dependency: the operation runs within the same stage.
   - Wide dependency: a stage boundary is drawn and a new stage is created.
3. Tasks are generated for each stage.
4. The TaskScheduler schedules the tasks.
5. Executors run the tasks and return the results.

3. RDD Dependencies in Detail
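The narrow and wide dependencies described in this section differ in how parent partitions map to child partitions. The sketch below is a toy model in plain Python, not Spark code: `classify_dependency` and the three sample mappings are hypothetical names invented for illustration, and the rule it applies (narrow when every parent partition feeds at most one child partition) is the usual textbook definition.

```python
from collections import Counter


def classify_dependency(child_to_parents):
    """Classify a partition mapping as 'narrow' or 'wide'.

    child_to_parents maps each child partition id to the set of parent
    partition ids it reads from. The mapping is narrow when every parent
    partition is read by at most one child partition.
    """
    fan_out = Counter()
    for parents in child_to_parents.values():
        for parent in set(parents):
            fan_out[parent] += 1
    return "narrow" if all(n <= 1 for n in fan_out.values()) else "wide"


# map/filter style: each child reads exactly one parent (one-to-one).
one_to_one = {0: {0}, 1: {1}, 2: {2}}

# coalesce() without shuffle: several parents merge into one child (many-to-one).
many_to_one = {0: {0, 1}, 1: {2}}

# groupByKey/reduceByKey style: every child reads from every parent,
# so each parent partition feeds several children (one-to-many).
all_to_all = {0: {0, 1, 2}, 1: {0, 1, 2}}

for name, mapping in [("one_to_one", one_to_one),
                      ("many_to_one", many_to_one),
                      ("all_to_all", all_to_all)]:
    print(name, "->", classify_dependency(mapping))
```

Only the last mapping needs data from every parent partition to be redistributed, which is why it implies a shuffle and a stage boundary.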
Narrow dependencies (Narrow Dependencies)
- Partition mapping: one-to-one or many-to-one, so each parent RDD partition feeds at most one child RDD partition (in the diagram, parent partitions 1 to 3 feed child partitions 1 to 3).
- Example operations: map, filter, union
- No shuffle is needed, so the operations can be pipelined.

Wide dependencies (Wide Dependencies)
- Partition mapping: one-to-many, so a single parent RDD partition feeds several child RDD partitions (in the diagram, parent partitions 1 to 3 feed child partitions 1 and 2).
- Example operations: groupByKey, reduceByKey
- A shuffle is required, which marks a stage boundary.

4. Spark Job Execution Architecture
Participants: Driver Program, DAGScheduler, TaskScheduler, Cluster Manager, Executor

1. Submit the job
2. Build the DAG
3. Divide the DAG into stages
4. Submit the TaskSet
5. Request resources
6. Launch executors
7. Dispatch tasks
8. Execute tasks
9. Return results
10. Notify that the stage has completed
11. Job completes

5. Stage Division Diagram
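The stage division rule in this section (narrow operations share a stage, wide operations start a new one) can be sketched over a linear chain of operator names. This is a toy model under stated assumptions: `WIDE_OPS` and `split_into_stages` are hypothetical names, the operator list is the one named in this section's diagram, and the real DAGScheduler works on the RDD dependency graph (walking back from the final RDD) rather than on a flat list.

```python
# Assumed set of shuffle-producing (wide) operators for this toy model.
WIDE_OPS = {"groupByKey", "reduceByKey", "sortByKey"}


def split_into_stages(ops):
    """Cut a linear chain of operator names into stages at each wide operator."""
    stages = [[]]
    for op in ops:
        # A wide operator reads shuffled data, so it opens a new stage
        # (unless the current stage is still empty).
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


pipeline = ["textFile", "flatMap", "filter", "mapToPair",
            "reduceByKey", "sortByKey", "collect"]

stages = split_into_stages(pipeline)
for index, stage in enumerate(stages):
    print(f"Stage {index}: {' -> '.join(stage)}")
```

With two wide operators in the chain, the model yields three stages, and the final action `collect` stays in the last stage because it does not itself introduce a shuffle.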
Stage 0: textFile, flatMap, filter, mapToPair
  (Shuffle Write)
Stage 1: reduceByKey
  (Shuffle Write)
Stage 2: sortByKey, collect

- Narrow-dependency operations can run within the same stage.
- Wide-dependency operations produce a stage boundary.
- An action triggers job execution.

6. Relationship Between Task Count and Partitions
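Since each partition corresponds to one task, the task count of a stage can be estimated from its partition count. The arithmetic below is a simplified sketch: `input_partitions` and `waves` are hypothetical helpers, the 128 MB block size is an assumed HDFS setting, and real input splitting has extra rules (minimum partition hints, split slack) that are ignored here. Note that `spark.sql.shuffle.partitions` governs DataFrame/SQL shuffles.

```python
import math

MB = 1024 * 1024


def input_partitions(file_sizes_bytes, block_size_bytes=128 * MB):
    """Approximate input partitions: one per block, at least one per file."""
    return sum(max(1, math.ceil(size / block_size_bytes))
               for size in file_sizes_bytes)


def waves(num_tasks, total_cores):
    """Number of scheduling rounds needed when each core runs one task at a time."""
    return math.ceil(num_tasks / total_cores)


files = [300 * MB, 50 * MB]            # illustrative file sizes
first_stage_tasks = input_partitions(files)
shuffle_stage_tasks = 200              # default of spark.sql.shuffle.partitions
total_cores = 16                       # illustrative cluster size

print("first stage tasks:", first_stage_tasks)
print("shuffle stage tasks:", shuffle_stage_tasks)
print("waves for shuffle stage:", waves(shuffle_stage_tasks, total_cores))
```

A stage with far fewer tasks than cores leaves cores idle, while a stage with many tiny tasks pays scheduling overhead, which is why the partition count is the main lever for parallelism.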
- The RDD partition count determines the task count: each partition corresponds to one task, so the available parallelism equals the number of partitions.
- Factors that affect the partition count:
  - Data source partitioning: number of HDFS blocks, number of files
  - Shuffle partition setting: spark.sql.shuffle.partitions (default: 200 partitions)
  - Manual partitioning: repartition(), coalesce()

7. Shared Variable Use Cases
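The two shared-variable patterns in this section, a read-only lookup table shared with every task and a counter that tasks add to and the driver reads, can be mimicked without a cluster. The code below is a plain-Python stand-in, not Spark code: `ToyLongAccumulator` and `run_task` are hypothetical, and the accumulator only loosely mirrors the add/merge/value shape of Spark's accumulators.

```python
class ToyLongAccumulator:
    """Minimal stand-in for a long accumulator: tasks add, the driver merges."""

    def __init__(self):
        self._sum = 0

    def add(self, amount):
        self._sum += amount

    def merge(self, other):
        self._sum += other._sum

    @property
    def value(self):
        return self._sum


def run_task(partition, lookup):
    """Process one partition using a read-only lookup table.

    Unknown codes are counted in a task-local accumulator, which the
    driver merges after the task finishes.
    """
    local_errors = ToyLongAccumulator()
    names = []
    for code in partition:
        name = lookup.get(code)
        if name is None:
            local_errors.add(1)
        else:
            names.append(name)
    return names, local_errors


# Stand-in for a broadcast value: small, read-only, shared by all tasks.
lookup = {"CN": "China", "US": "United States"}
partitions = [["CN", "XX"], ["US", "CN", "??"]]

driver_errors = ToyLongAccumulator()
results = []
for partition in partitions:
    names, local_errors = run_task(partition, lookup)
    results.extend(names)
    driver_errors.merge(local_errors)

print(results)
print("unknown codes:", driver_errors.value)
```

The design point carried over from Spark is the direction of data flow: the lookup table is only read inside tasks, and the counter is only read on the driver side.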
Accumulators (Accumulators)
- Typical uses: counters, sums, error statistics, debugging and monitoring
- Create with sc.longAccumulator(), update with accumulator.add(value), read with accumulator.value()

Broadcast variables (Broadcast Variables)
- Purpose: share large read-only data and avoid sending the same data repeatedly
- Typical uses: lookup tables and dictionaries, configuration data
- Create with sc.broadcast(data), read with broadcastVar.value()

8. Overview of New Features in Spark 4.0.0
Spark 4.0.0 (mind map)
- Core upgrades
  - JDK 17 by default
  - Scala 2.13 by default
  - Support for JDK 8/11 dropped
- Spark Connect
  - Lightweight Python client
  - ML on Spark Connect
  - Swift client support
- Spark SQL
  - VARIANT data type
  - SQL UDFs
  - Session variables
  - Pipe syntax
  - String collation
- PySpark enhancements
  - Plotting API
  - Python data source API
  - Python UDTFs
  - Unified profiling
- Structured Streaming
  - Arbitrary state API v2
  - State data source
  - Improved fault tolerance

9. Summary of Key Learning Points
Key areas for learning Spark
- Understand the nature of RDDs
  - Immutable distributed datasets
  - Lineage and fault tolerance
  - Lazy evaluation
- Master the DAG model
  - Dependency analysis
  - Execution plan optimization
  - Task scheduling
- Become familiar with stage division
  - Narrow vs. wide dependencies
  - Recognizing shuffle operations
  - Controlling parallelism
- Performance tuning
  - Partitioning strategy
  - Caching strategy
  - Resource configuration

10. Practical Recommendations
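10.1 Code Optimization Recommendations
- Prefer the DataFrame/Dataset API over raw RDDs.
- Use caching (cache/persist) judiciously, mainly for data that is reused across several actions.
- Avoid unnecessary shuffle operations.
- Choose a suitable partitioning strategy.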
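One common instance of avoiding unnecessary shuffle is aggregating within each partition before data crosses the network, which is the difference between a groupByKey-style and a reduceByKey-style aggregation. The following is a toy model in plain Python with hypothetical helper names. It counts how many records would cross the shuffle in each style. It is not a measurement of Spark itself.

```python
def records_shuffled_without_combine(partitions):
    """groupByKey style: every (key, value) record crosses the shuffle."""
    return sum(len(partition) for partition in partitions)


def map_side_combine(partition, func):
    """Pre-aggregate one partition so it holds one record per distinct key."""
    combined = {}
    for key, value in partition:
        combined[key] = func(combined[key], value) if key in combined else value
    return combined


def reduce_by_key(partitions, func):
    """reduceByKey style: combine per partition, then merge the partial results.

    Returns the final result and the number of records that crossed the shuffle.
    """
    partials = [map_side_combine(partition, func) for partition in partitions]
    shuffled = sum(len(partial) for partial in partials)
    result = {}
    for partial in partials:
        for key, value in partial.items():
            result[key] = func(result[key], value) if key in result else value
    return result, shuffled


partitions = [
    [("a", 1), ("b", 1), ("a", 1)],
    [("a", 1), ("a", 1), ("b", 1), ("b", 1)],
]

totals, shuffled_with_combine = reduce_by_key(partitions, lambda x, y: x + y)
print("without combine:", records_shuffled_without_combine(partitions))
print("with combine:", shuffled_with_combine)
print("totals:", totals)
```

The gap between the two counts grows with the number of repeated keys per partition, so the benefit is largest for low-cardinality keys.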
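10.2 Key Points for Performance Tuning
- Adjust the parallelism (number of partitions).
- Optimize the memory configuration.
- Choose a suitable serialization method.
- Monitor and analyze the Spark UI.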
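The tuning points above map onto a handful of Spark configuration properties. The helper below is a small sketch that turns a settings dictionary into `spark-submit --conf` arguments. The property names are standard Spark settings, but every value shown is an illustrative assumption to be adjusted per workload, and `to_submit_args` and `app.py` are hypothetical names.

```python
def to_submit_args(conf):
    """Turn a dict of Spark properties into spark-submit '--conf key=value' pairs."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


# Illustrative values only; suitable numbers depend on the data and the cluster.
tuning = {
    "spark.default.parallelism": 200,        # parallelism for RDD shuffles
    "spark.sql.shuffle.partitions": 200,     # partitions for DataFrame/SQL shuffles
    "spark.executor.memory": "4g",           # memory per executor
    "spark.executor.cores": 4,               # cores per executor
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

args = to_submit_args(tuning)
print(" ".join(["spark-submit"] + args + ["app.py"]))
```

Keeping the settings in one dictionary makes it easy to compare runs in the Spark UI after changing a single value at a time.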
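10.3 Troubleshooting Approach
- Check the DAG visualization in the Spark UI.
- Analyze stage execution times and data skew.
- Check the causes of task failures and the retry history.
- Monitor resource usage (CPU, memory, network).

Note: These notes combine the core concepts from the handwritten notes (DAG, stage division, task scheduling) with the new features in Spark 4.0.0, forming a complete knowledge map for systematic review and for understanding how Spark works.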
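For the data-skew check listed under 10.3, one simple heuristic is to compare the slowest task in a stage with the median task, using durations read off the Spark UI. The sketch below is an assumption-laden example: `skew_report` is a hypothetical helper, the threshold of 3.0 is arbitrary, and the durations are made-up sample values.

```python
import statistics


def skew_report(task_durations_s, ratio_threshold=3.0):
    """Flag a stage as skewed when its slowest task far exceeds the median task."""
    median = statistics.median(task_durations_s)
    longest = max(task_durations_s)
    if median > 0:
        ratio = longest / median
    else:
        # All-zero durations are treated as balanced; otherwise the ratio is unbounded.
        ratio = 1.0 if longest == 0 else float("inf")
    return {
        "median_s": median,
        "max_s": longest,
        "ratio": ratio,
        "skewed": ratio >= ratio_threshold,
    }


balanced = skew_report([10, 11, 9, 12])       # sample durations in seconds
skewed = skew_report([10, 11, 9, 12, 95])     # one straggler task

print(balanced)
print(skewed)
```

A stage flagged this way usually deserves a look at the key distribution of the preceding shuffle, since one oversized partition is the typical cause of a single straggler.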