I am trying to make something similar to MapReduce, but without Hadoop.
I plan to use several PHP processes, each calling $cf->get_range($begin, $end) and iterating over every row.
But because of the RandomPartitioner, the rows do not come back sorted by key. This means I cannot really pick good $begin and $end values, so it will be difficult to start 30-40 processes in parallel.
Cassandra supports get_range by token, but this is not exposed in phpcassa.
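If token ranges were usable from PHP, splitting the work would be simple, roughly like the sketch below. It only computes the ranges and calls nothing in phpcassa. It assumes the bcmath extension is available and that RandomPartitioner tokens are integers between 0 and 2^127. split_token_ring() is a name I made up for illustration.

```php
<?php
// Sketch only: split the RandomPartitioner token space into equal slices,
// one per worker process. Assumes the bcmath extension is installed.
// split_token_ring() is a hypothetical helper, not part of phpcassa.
function split_token_ring($workers)
{
    // Assumed upper bound of the RandomPartitioner token space: 2^127.
    $max = bcpow('2', '127');
    $ranges = array();
    for ($i = 0; $i < $workers; $i++) {
        // Tokens do not fit in a PHP integer, so keep them as decimal strings.
        $start = bcdiv(bcmul($max, (string) $i), (string) $workers, 0);
        $end = ($i == $workers - 1)
            ? $max
            : bcdiv(bcmul($max, (string) ($i + 1)), (string) $workers, 0);
        // As far as I understand the Thrift KeyRange semantics, a token
        // range is start-exclusive and end-inclusive, so adjacent slices
        // can share a boundary without reading any row twice.
        $ranges[] = array($start, $end);
    }
    return $ranges;
}
```

Each of the 30-40 processes would then take one ($start, $end) pair from split_token_ring(40) and scan only that slice of the ring.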
I see several possible workarounds, but I do not like them because they seem unprofessional:
- put all keys in a single row and use ColumnSlice() + multiget() after that.
- put all keys in a single row, but indexed by their MD5 values; then look up a key by its MD5 value and use it for get_range().
- do something similar with a secondary index.
- import all keys into Redis.