How can I modify a column in an Incanter dataset?


I'd like to be able to transform an individual column in an incanter data set, and save the resulting data set to a new (csv) file. What is the simplest way to do that?

Essentially, I'd like to be able to map a function over a column in the data set, and replace the original column with this result.

There are 4 best solutions below

0
On

You can define something like:

(defn map-data [dataset column f]
  (conj-cols (sel dataset :except-cols column)
             ($map f column dataset)))

and use it as:

(def data (get-dataset :cars))
(map-data data :speed #(* % 2))

The only problem is that the column names change. I'll try to fix that when I have some free time.

2
On

Again: maybe you can use the internal structure of the dataset.

user=> (defn update-column
         [dataset column f & args]
         (->> (map #(apply update-in % [column] f args) (:rows dataset))
           vec
           (assoc dataset :rows)))
#'user/update-column
user=> d
[:col-0 :col-1]
[1 2]
[3 4]
[5 6]

user=> (update-column d :col-1 str "d")
[:col-0 :col-1]
[1 "2d"]
[3 "4d"]
[5 "6d"]

Again, you should check to what extent this is public API.

0
On

NOTE: this solution requires Incanter 1.5.3 or greater

For those who can use recent versions of Incanter...

add-column & add-derived-column were added to Incanter in 1.5.3 (pull request)

From the docs:

add-column

"Adds a column, with given values, to a dataset."

(add-column column-name values)

or

(add-column column-name values data)

Or you can use:

add-derived-column

"This function adds a column to a dataset that is a function of existing columns. If no dataset is provided, $data (bound by the with-data macro) will be used. f should be a function of the from-columns, with arguments in that order."

(add-derived-column column-name from-columns f)

or

(add-derived-column column-name from-columns f data)

A more complete example:

(use '(incanter core datasets))
(def cars (get-dataset :cars))

(add-derived-column :dist-over-speed [:dist :speed] (fn [d s] (/ d s)) cars)

(with-data (get-dataset :cars)
  (view (add-derived-column :speed**-1 [:speed] #(/ 1.0 %))))
0
On

Here are two similar functions, both of which preserve column names and column order.

(defn transform-column [col-name f data] 
  (let [new-col-names (sort-by #(= % col-name) (col-names data))
        new-dataset (conj-cols
                      (sel data :except-cols col-name)
                      (f ($ col-name data)))]

    ($ (col-names data) (col-names new-dataset new-col-names) )))

(defn transform-rows [col-name f data]
  (let [new-col-names (sort-by #(= % col-name) (col-names data))
        new-dataset (conj-cols
                      (sel data :except-cols col-name)
                      ($map f col-name data))]

    ($ (col-names data) (col-names new-dataset new-col-names))))

And here is an example illustrating the difference:

=> (def test-data (to-dataset [{:a 1 :b 2} {:a 3 :b 4}])) 
=> (transform-column :a (fn [x] (map #(* % 2) x)) test-data)
[:a :b]
[2 2]
[6 4]

=> (transform-rows   :a #(* % 2) test-data)
[:a :b]
[2 2]
[6 4]

transform-rows is best for simple transformations, whereas transform-column is for cases where the transformation of one row depends on other rows (such as when normalizing a column).
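
As a rough sketch of the whole-column case, a normalizing function might look like the following. Note that normalize is a hypothetical helper written for this illustration, not an Incanter function, and it assumes a non-empty numeric column whose sum is not zero.

```clojure
;; Hypothetical helper (not part of Incanter): scales a column so that
;; its values sum to 1. Assumes a non-empty numeric seq with a non-zero sum.
(defn normalize [xs]
  (let [total (double (reduce + xs))]
    (map #(/ % total) xs)))

;; Intended use with transform-column as defined above:
;; (transform-column :a normalize test-data)
```

Because each output value depends on the sum of the whole column, this kind of function cannot be passed to transform-rows, which only ever sees one value at a time.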

Saving and loading CSV can be done with the standard Incanter functions, so a full example looks like:

(use '(incanter core io))

(def data (col-names (read-dataset "data.csv") [:a :b]))

(save (transform-rows :a #(* % 2) data) "transformed-data.csv")