I have a clojure function that uses the flambo v0.60 functions api to do some analysis on a sample data set. I noticed that when I use a (get rdd 2)
instead of getting the second element in the rdd collection, its getting the second character of the first element of the rdd collection. My assumption is clojure is treating each row of the rdd collection as a whole string and not a vector for me to be able to get the second element in the collection. I'm thinking of using the map-values function to convert the mapped values into a vector for which I can get the second element, I tried this:
(defn split-on-tab-transformation [xctx input]
(assoc xctx :rdd (-> (:rdd xctx)
(spark/map (spark/fn [row] (s/split row #"\t")))
(spark/map-values vec))))
Unfortunately I got an error:
java.lang.IllegalArgumentException: No matching method found: mapValues for class org.apache.spark.api.java.JavaRDD...
This is code returns the first collection in the rdd:
(assuming I removed the (spark/map-values vec)
in the above function
(defn get-distinct-column-val
"input = {:col val}"
[ xctx input ]
(let [rdds (-> (:rdd xctx)
(f/map (f/fn [row] row))
f/first)]
(clojure.pprint/pprint rdds)))
Output:
[2.00000 770127 200939.000000 \t6094\tBENTONVILLE, AR DPS\t22.500000\t5.000000\t2.500000\t5.000000\t0.000000\t0.000000\t0.000000\t0.000000\t0.000000\t1\tStore Tab\t0.000000\t4.50\t3.83\t5.00\t0.000000\t0.000000\t0.000000\t0.000000\t19.150000]
if I try to get the second element 770127
(defn get-distinct-column-val
"input = {:col val}"
[ xctx input ]
(let [rdds (-> (:rdd xctx)
(f/map (f/fn [row] row))
f/first)]
(clojure.pprint/pprint (get rdds 1)))
I get :
[\.]
Flambo documentation for map-values
I'm new to clojure and I'd appreciate any help. Thanks
First of all
map-values
(ormapValues
in Spark API) is a valid transformation only on a PairRDD (for example something like this[:foo [1 2 3]]
. RDDs with values like this can be interpreted as some some sort of maps where the first element is a key and the second is a value.If you have RDD like this
mapValues
transforms the values without changing the key. In this case you should use a second map, although it seem obsolete sinceclojure.string/split
already returns a vector.A simple example of using
map-values
:From your description it looks like you're using an input RDD instead of the one returned from
split-on-tab-transformation
. If I had to guess you're trying to use originalxctx
, not the one returned fromsplit-on-tab-transformation
. Since Clojuremaps
are immutableassoc
doesn't change a passed argument andget-distinct-column-val
receivesRDD[String]
notRDD[Array[String]]
Based on a naming convention I assume you want to get distinct values for a single position in a array. I removed unused parts of your code for clarity. First lets create dummy data:
add rewritten versions of your functions
and result