Can Cascalog process multi-line JSON files?

I have a directory of JSON files that I want to process using Cascalog. The solution I have right now requires me to remove all newline characters from my JSON files using a bash script. I am looking for a better solution because I sync these files using rsync.

My question is: can I read the contents of a file in Cascalog and return them as one tuple? At present, the function 'lfs-textline' returns one tuple for each line in the file, which is why I have to remove the newline characters. I would prefer to get one tuple per file.

(defn textline-parsed [dir]
    (let [source (lfs-textline dir)]
        (<- [?line]
            (source ?line))))

Best answer

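Use hfs-wholefile from cascalog.more-taps to do this. It emits one tuple per file, containing the filename and the file's contents as a Hadoop BytesWritable.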

(:require [cascalog.more-taps :as taps])

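(defn- byte-writable-to-str
  "Convert a Hadoop BytesWritable to a string (assumes UTF-8 content)."
  [bw]
  ;; getBytes returns the backing buffer, which can be longer than the
  ;; data, so decode only the first getLength bytes.
  [(String. (.getBytes bw) 0 (.getLength bw) "UTF-8")])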

Then use it in a query:

(??<- [?str]
    ((taps/hfs-wholefile path) ?filename ?file-content)
    (byte-writable-to-str ?file-content :> ?str))
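
Once each file arrives as a single string, the embedded newlines no longer matter and parsing is an ordinary function call. Here is a minimal sketch, assuming clojure.data.json is on the classpath (the answer does not name a JSON library, so that choice is mine). The namespace and the helper name parse-json-str are hypothetical:

```clojure
(ns example.whole-file-json
  ;; Assumption: org.clojure/data.json is a project dependency.
  (:require [clojure.data.json :as json]))

(defn parse-json-str
  "Parse the full contents of one JSON file, given as a string.
  Hypothetical helper, not part of Cascalog."
  [s]
  (json/read-str s :key-fn keyword))

;; A multi-line document, as it would look when read as a whole file.
(def sample
  "{\n  \"name\": \"example\",\n  \"tags\": [\"a\", \"b\"]\n}")

(def parsed (parse-json-str sample))
```

Something like this could probably be applied to ?str inside the query as well, but I have not checked how the parsed map serializes between Hadoop steps, so treat that part as untested.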