How do I convert the text below to a sequence file, which will then be converted to vectors for Mahout k-means?


Good afternoon to you all,

My data is in the following format:

ID: VALUE (tags assigned by users)

0001: "PC, THINKPAD, T500"

0002: "PHONE, CELLPHONE, IPHONE, APPLE, IPHONE5"

... and so on.

How can I write code to:

1) first, convert these records into a sequence file in key:value format;

2) then, convert that sequence file into vectors that can be used for k-means clustering?

I have been looking at SequenceFilesFromDirectory and SparseVectorsFromSequenceFiles, but they seem a little complicated and hard to read right now.

So I wonder if anyone here could give me some simple sample code showing how to do the two conversions above.

Thank you very much!


There is 1 best solution below


Those two tools do exactly what you want. After that, it is just a matter of making the output human-readable, because sequence files are a binary format; for that you would use the seqdumper utility.

If you need a clearer picture, have a look here; it is a very nice intro.
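
To make the two steps less opaque, here is a minimal sketch in plain Java of what they do conceptually for data shaped like the question's. It does not call Mahout or Hadoop at all: the class `TagVectorSketch` and its methods are hypothetical names made up for this illustration, and it assumes one `ID: "tag, tag, ..."` record per line. Step 1 turns each line into a key/value pair (what a sequence file would store), and step 2 builds a tag dictionary and a sparse term-count vector per record. Raw counts are a simplification here; the real vectorizer can apply other weightings such as TF-IDF.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only: mimics the idea of the two Mahout steps
// with plain JDK collections instead of sequence files and Mahout vectors.
class TagVectorSketch {

    // Step 1: split 'ID: "tags"' into {key, value}, dropping the quotes.
    static String[] parseLine(String line) {
        int colon = line.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Expected 'ID: \"tags\"' but got: " + line);
        }
        String key = line.substring(0, colon).trim();
        String value = line.substring(colon + 1).trim().replace("\"", "");
        return new String[] {key, value};
    }

    // Split a comma-separated tag list and normalise each tag to lower case.
    static List<String> tokenize(String value) {
        List<String> tags = new ArrayList<>();
        for (String raw : value.split(",")) {
            String tag = raw.trim().toLowerCase(Locale.ROOT);
            if (!tag.isEmpty()) {
                tags.add(tag);
            }
        }
        return tags;
    }

    // Step 2a: give every distinct tag an index, in order of first appearance.
    static Map<String, Integer> buildDictionary(List<String[]> records) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (String[] record : records) {
            for (String tag : tokenize(record[1])) {
                dictionary.putIfAbsent(tag, dictionary.size());
            }
        }
        return dictionary;
    }

    // Step 2b: sparse vector as (tag index -> count); absent indices are zero.
    static Map<Integer, Double> toSparseVector(String value, Map<String, Integer> dictionary) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (String tag : tokenize(value)) {
            Integer index = dictionary.get(tag);
            if (index != null) {
                vector.merge(index, 1.0, Double::sum);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        String[] lines = {
            "0001: \"PC, THINKPAD, T500\"",
            "0002: \"PHONE, CELLPHONE, IPHONE, APPLE, IPHONE5\""
        };
        List<String[]> records = new ArrayList<>();
        for (String line : lines) {
            records.add(parseLine(line));
        }
        Map<String, Integer> dictionary = buildDictionary(records);
        for (String[] record : records) {
            System.out.println(record[0] + " -> " + toSparseVector(record[1], dictionary));
        }
    }
}
```

Each printed line pairs a record ID with its sparse vector, which is roughly the kind of key/vector output seqdumper would let you inspect after the real conversion. The dictionary is kept separately so that the same tag always maps to the same vector dimension across records, which is what k-means needs to compare them.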