First, an ordinary array:
scala> var arr = Array(1.0, 2, 3, 4)
arr: Array[Double] = Array(1.0, 2.0, 3.0, 4.0)
It can be converted into a Vector:
scala> import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.linalg._

scala> var vec = Vectors.dense(arr)
vec: org.apache.spark.mllib.linalg.Vector = [1.0,2.0,3.0,4.0]
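As a side note, the `Vectors` factory can also build sparse vectors, which store only the non-zero entries. A minimal sketch (the `sparse` overload takes the vector size plus parallel arrays of indices and values):

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A dense vector stores every element explicitly
val dense: Vector = Vectors.dense(1.0, 0.0, 3.0, 0.0)

// The same vector in sparse form: size 4, non-zeros at indices 0 and 2
val sparse: Vector = Vectors.sparse(4, Array(0, 2), Array(1.0, 3.0))

// Element-wise, the two representations are identical
assert(dense.toArray.sameElements(sparse.toArray))
```

Sparse vectors matter once feature dimensions get large, since most MLlib algorithms accept either representation interchangeably.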
Next, build an RDD[Vector]:
scala> val rdd = sc.makeRDD(Seq(Vectors.dense(arr), Vectors.dense(arr.map(_*10)), Vectors.dense(arr.map(_*100))))
rdd: org.apache.spark.rdd.RDD[org.apache.spark.mllib.linalg.Vector] = ParallelCollectionRDD[6] at makeRDD at <console>:26
From this RDD we can construct a distributed matrix:
scala> import org.apache.spark.mllib.linalg.distributed._
import org.apache.spark.mllib.linalg.distributed._

scala> val mat: RowMatrix = new RowMatrix(rdd)
mat: org.apache.spark.mllib.linalg.distributed.RowMatrix = org.apache.spark.mllib.linalg.distributed.RowMatrix@3133b850

scala> val m = mat.numRows()
m: Long = 3

scala> val n = mat.numCols()
n: Long = 4
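A RowMatrix supports more than size queries. As one example, it can be multiplied by a local matrix to project the rows; a sketch, assuming `sc` is an existing SparkContext (e.g. the one the spark-shell provides):

```scala
import org.apache.spark.mllib.linalg.{Matrices, Vectors}
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Rows of the distributed matrix live in an RDD[Vector]
val rows = sc.makeRDD(Seq(
  Vectors.dense(1.0, 2.0),
  Vectors.dense(3.0, 4.0)))
val mat = new RowMatrix(rows)

// Multiply by a local 2x2 matrix. Matrices.dense is column-major,
// so this array encodes the identity matrix [[1, 0], [0, 1]].
val identity = Matrices.dense(2, 2, Array(1.0, 0.0, 0.0, 1.0))
val product: RowMatrix = mat.multiply(identity)
```

The result is itself a RowMatrix, so the product stays distributed; only the right-hand operand must fit on the driver.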
Now try the statistics utilities and compute the column means:
scala> import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.mllib.stat.Statistics

scala> var sum = Statistics.colStats(rdd)

scala> sum.mean
res7: org.apache.spark.mllib.linalg.Vector = [37.0,74.0,111.0,148.0]
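The summary returned by `Statistics.colStats` carries more than the mean. A sketch of the other column-wise statistics exposed by `MultivariateStatisticalSummary`:

```scala
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}

val summary: MultivariateStatisticalSummary = Statistics.colStats(rdd)

summary.mean        // mean of each column
summary.variance    // variance of each column
summary.max         // column-wise maxima
summary.min         // column-wise minima
summary.numNonzeros // count of non-zero entries per column
summary.count       // number of rows (here: 3)
```

Each accessor except `count` returns a Vector with one entry per column, so they compose naturally with the rest of the linalg API.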