Building Spark Sub-projects with SBT

2018-04-07 21:44:10

# Preface

Recently, while fixing bugs in Spark 2.1, I made quite a few changes to the Spark source code, and every change has to be compiled and tested. Building the whole Spark project takes around half an hour even when it goes quickly, so in practice I rebuild and repackage only the sub-project I changed.

The official documentation already explains how to build individual sub-modules with mvn: <http://spark.apache.org/docs/latest/building-spark.html#building-submodules-individually>

Building a single sub-project with mvn does save quite a bit of time, but when the code changes frequently, running mvn for every iteration is still slow.

The documentation also mentions that, for developers, sbt is recommended for faster iterative builds, so I looked through the developer docs: <http://spark.apache.org/developer-tools.html>

There I spotted a section called "Running Build Targets For Individual Projects", which reads:

```
$ # sbt
$ build/sbt package
$ # Maven
$ build/mvn package -DskipTests -pl assembly
```

This is misleading. I have not built Spark with sbt much, but I have used sbt before, and `build/sbt package` builds the entire project; it is not how you build a single sub-project.

I went through all the official material on building and found nothing more. In the end I studied Spark's sbt build definition, the `project/SparkBuild.scala` file, and found a way to build individual sub-projects with sbt.

# Building a sub-project with sbt

Below is how to recompile and repackage spark-core. We work in sbt's interactive shell (REPL mode); the overall flow looks like this:

```
➜  spark git:(branch-2.1.0) ✗ ./build/sbt -Pyarn -Phadoop-2.6 -Phive
...
[info] Set current project to spark-parent (in build file:/Users/stan/Projects/spark/)
> project core
[info] Set current project to spark-core (in build file:/Users/stan/Projects/spark/)
> package
[info] Updating {file:/Users/stan/Projects/spark/}tags...
[info] Resolving jline#jline;2.12.1 ...
...
[info] Packaging /Users/stan/Projects/spark/core/target/scala-2.11/spark-core_2.11-2.1.0.jar ...
[info] Done packaging.
[success] Total time: 213 s, completed 2017-2-15 16:58:15
```

Finally, replace the old jar under `jars` (or `assembly/target/scala-2.11/jars`) with the freshly built `spark-core_2.11-2.1.0.jar`, and you are done.
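If you would rather not stay in the interactive shell, the same steps can be chained into a single invocation, since sbt treats its command-line arguments as commands. This is a minimal sketch, assuming the same profile flags as above and the `core` project id from `SparkBuild.scala`:

```
$ # Select the core sub-project and package it in one shot
$ ./build/sbt -Pyarn -Phadoop-2.6 -Phive "project core" package
```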
The key to selecting a sub-project is the `project` command. How do you know which sub-projects are defined? For that, refer to the definition of BuildCommons in `project/SparkBuild.scala`:

```
object BuildCommons {

  private val buildLocation = file(".").getAbsoluteFile.getParentFile

  val sqlProjects@Seq(catalyst, sql, hive, hiveThriftServer, sqlKafka010) = Seq(
    "catalyst", "sql", "hive", "hive-thriftserver", "sql-kafka-0-10"
  ).map(ProjectRef(buildLocation, _))

  val streamingProjects@Seq(
    streaming, streamingFlumeSink, streamingFlume, streamingKafka, streamingKafka010
  ) = Seq(
    "streaming", "streaming-flume-sink", "streaming-flume",
    "streaming-kafka-0-8", "streaming-kafka-0-10"
  ).map(ProjectRef(buildLocation, _))

  val allProjects@Seq(
    core, graphx, mllib, mllibLocal, repl, networkCommon, networkShuffle, launcher, unsafe, tags, sketch, _*
  ) = Seq(
    "core", "graphx", "mllib", "mllib-local", "repl", "network-common", "network-shuffle",
    "launcher", "unsafe", "tags", "sketch"
  ).map(ProjectRef(buildLocation, _)) ++ sqlProjects ++ streamingProjects

  val optionallyEnabledProjects@Seq(mesos, yarn, java8Tests, sparkGangliaLgpl,
    streamingKinesisAsl, dockerIntegrationTests) =
    Seq("mesos", "yarn", "java8-tests", "ganglia-lgpl", "streaming-kinesis-asl",
      "docker-integration-tests").map(ProjectRef(buildLocation, _))

  val assemblyProjects@Seq(networkYarn, streamingFlumeAssembly, streamingKafkaAssembly,
    streamingKafka010Assembly, streamingKinesisAslAssembly) =
    Seq("network-yarn", "streaming-flume-assembly", "streaming-kafka-0-8-assembly",
      "streaming-kafka-0-10-assembly", "streaming-kinesis-asl-assembly")
      .map(ProjectRef(buildLocation, _))

  val copyJarsProjects@Seq(assembly, examples) = Seq("assembly", "examples")
    .map(ProjectRef(buildLocation, _))

  val tools = ProjectRef(buildLocation, "tools")
  // Root project.
  val spark = ProjectRef(buildLocation, "spark")

  val sparkHome = buildLocation

  val testTempDir = s"$sparkHome/target/tmp"

  val javacJVMVersion = settingKey[String]("source and target JVM version for javac")
  val scalacJVMVersion = settingKey[String]("source and target JVM version for scalac")
}
```

Take this part as an example:

```
val sqlProjects@Seq(catalyst, sql, hive, hiveThriftServer, sqlKafka010) = Seq(
  "catalyst", "sql", "hive", "hive-thriftserver", "sql-kafka-0-10"
).map(ProjectRef(buildLocation, _))
```

These are the sub-projects defined for the SQL module: `catalyst, sql, hive, hiveThriftServer, sqlKafka010`. If you need to build catalyst, just enter the sbt shell and run `project catalyst` to select it; the `compile`, `package`, and other commands you run afterwards all apply to that project.

## Conclusion

Now Spark can be built quickly and painlessly. There are more useful build tips; see <http://spark.apache.org/developer-tools.html>.
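One tip from that page that fits this workflow well is running a single test suite from the sbt shell instead of the whole test tree. A minimal sketch, assuming the `core` project id and using `org.apache.spark.rdd.RDDSuite` purely as an illustrative suite name:

```
> project core
> testOnly org.apache.spark.rdd.RDDSuite
```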