I have a function that is deduplicating with preference, I thought of implementing the solution in clojure using flambo function thus:
From the data set, using the
group-by
, to group duplicates (i.e based on a specified:key
)Given a
:val
as input, using afilter
to check if the some of values for each row are equal to this:val
Use a map to
untuple
the duplicates to return single vectors (Not very sure if that is the right way though, I tried using aflat-map
without any luck)
For a sample data-set
(def rdd
(f/parallelize sc [ ["Coke" "16" ""] ["Pepsi" "" "5"] ["Coke" "2" "3"] ["Coke" "" "36"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]))
I tried this:
(defn dedup-rows
[rows input]
(let [{:keys [key-col col val]} input
result (-> rows
(f/group-by (f/fn [row]
(get row key-col)))
(f/values)
(f/map (f/fn [rows]
(if (= (count rows) 1)
rows
(filter (fn [row]
(let [col-val (get row col)
equal? (= col-val val)]
(if (not equal?)
true
false))) rows)))))]
result))
if I run this function thus:
(dedup-rows rdd {:key-col 0 :col 1 :val ""})
it produces
;=> [(["Pepsi" 25 34]), (["Coke" 16 ] ["Coke" 2 3])]]
I don't know what else to do to handle the result to produce a result of
;=> [["Pepsi" 25 34],["Coke" 16 ],["Coke" 2 3]]
I tried f/map f/untuple
as the last form in the ->
macro with no luck.
Any suggestions? I will really appreciate if there's another way to go about this. Thanks.
PS: when grouped
;=> [[["Pepsi" "" 5], ["Pepsi" "" 34], ["Pepsi" 25 34]], [["Coke" 16 ""], ["Coke" 2 3], ["Coke" "" 36]]]
For each group, rows that have""
are considered duplicates and hence removed from the group.
Looking at the flambo readme, there is a
flat-map
function. This is slightly unfortunate naming because the Clojure equivalent is calledmapcat
. These functions take each map result - which must be a sequence - and concatenates them together. Another way to think about it is that it flattens the final sequence by one level.I can't test this but I think you should replace your
f/map
withf/flat-map
.