Untuple a Clojure sequence

I have a function that deduplicates rows with a preference. I thought of implementing it in Clojure using flambo functions, as follows:

  1. From the data set, use group-by to group duplicates (i.e. based on a specified :key)

  2. Given a :val as input, use a filter to check whether some of the values in each row are equal to this :val

  3. Use a map to untuple the duplicates and return single vectors (I'm not sure this is the right way, though; I tried using flat-map without any luck)

For a sample data set:

;; assumes (require '[flambo.api :as f]) and an existing Spark context bound to sc
(def rdd
  (f/parallelize sc [["Coke" "16" ""] ["Pepsi" "" "5"] ["Coke" "2" "3"]
                     ["Coke" "" "36"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]))

I tried this:

(defn dedup-rows
  [rows input]
  (let [{:keys [key-col col val]} input]
    (-> rows
        ;; group rows that share the same key column
        (f/group-by (f/fn [row] (get row key-col)))
        (f/values)
        ;; within each group of duplicates, drop rows whose col equals val
        (f/map (f/fn [rows]
                 (if (= (count rows) 1)
                   rows
                   (filter (fn [row] (not= (get row col) val))
                           rows)))))))

If I run this function like this:

(dedup-rows rdd {:key-col 0 :col 1 :val ""})

it produces

;=> [(["Pepsi" 25 34]), (["Coke" 16 ] ["Coke" 2 3])]

I don't know how to process this result further so that I get

;=> [["Pepsi" 25 34],["Coke" 16 ],["Coke" 2 3]]

I tried f/map f/untuple as the last form in the -> macro with no luck.

Any suggestions? I would really appreciate it if there is another way to go about this. Thanks.

PS: when grouped, the data looks like this:

;=> [[["Pepsi" "" 5], ["Pepsi" "" 34], ["Pepsi" 25 34]], [["Coke" 16 ""], ["Coke" 2 3], ["Coke" "" 36]]]

For each group, rows that have "" in the checked column are considered duplicates and are removed from the group.
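The group-then-filter behaviour described above can be reproduced without Spark. The following is only a rough sketch using plain clojure.core functions on an in-memory vector; the names sample-rows and dedup-groups are made up for illustration and are not part of flambo. It shows why each group comes back still wrapped in its own sequence.

```clojure
;; Plain Clojure sketch (no Spark, no flambo).
;; sample-rows and dedup-groups are hypothetical names used for illustration.
(def sample-rows
  [["Coke" "16" ""] ["Pepsi" "" "5"] ["Coke" "2" "3"]
   ["Coke" "" "36"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]])

(defn dedup-groups
  "Groups rows by key-col, then drops rows whose col equals val
  (unless the group has only one row). Returns one seq of rows per group."
  [rows {:keys [key-col col val]}]
  (->> rows
       (group-by #(get % key-col))
       vals
       (map (fn [group]
              (if (= (count group) 1)
                group
                (filter #(not= (get % col) val) group))))))

;; Each element of the result is a whole group, not a single row.
(dedup-groups sample-rows {:key-col 0 :col 1 :val ""})
```

Because map returns exactly one result per group, the surviving rows stay nested one level deeper than the desired output.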

There are 2 best solutions below

Looking at the flambo README, there is a flat-map function. The naming is slightly unfortunate, because the Clojure equivalent is called mapcat. These functions take each map result (which must be a sequence) and concatenate them together. Another way to think about it is that the final sequence is flattened by one level.

I can't test this, but I think you should replace your f/map with f/flat-map.
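To illustrate the difference, here is a small sketch using clojure.core's map and mapcat on an in-memory version of the grouped data from the question. It does not touch Spark, and keep-filled is a made-up helper name; the assumption is that f/flat-map behaves on an RDD the way mapcat does here.

```clojure
;; Grouped data as in the question, with every field written as a string.
(def grouped
  [[["Pepsi" "" "5"] ["Pepsi" "" "34"] ["Pepsi" "25" "34"]]
   [["Coke" "16" ""] ["Coke" "2" "3"] ["Coke" "" "36"]]])

;; Hypothetical helper: keep the rows whose second field is not "".
(defn keep-filled [group]
  (remove #(= (get % 1) "") group))

;; map yields one sequence per group, so the rows stay nested.
(def nested (map keep-filled grouped))
;=> ((["Pepsi" "25" "34"]) (["Coke" "16" ""] ["Coke" "2" "3"]))

;; mapcat concatenates those sequences, removing one level of nesting.
(def flat (mapcat keep-filled grouped))
;=> (["Pepsi" "25" "34"] ["Coke" "16" ""] ["Coke" "2" "3"])
```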

Going by @TheQuickBrownFox's suggestion, I tried the following:

(defn dedup-rows
  [rows input]
  (let [{:keys [key-col col val]} input]
    (-> rows
        (f/group-by (f/fn [row] (get row key-col)))
        (f/values)
        (f/map (f/fn [rows]
                 (if (= (count rows) 1)
                   rows
                   (filter (fn [row] (not= (get row col) val))
                           rows))))
        ;; flatten the per-group sequences into a single RDD of rows
        (f/flat-map (f/fn [rows]
                      (mapcat vector rows))))))

and it seems to work.