Pyspark: using filter for feature selection


I have an array of dimensions 500 x 26. Using the filter operation in PySpark, I'd like to pick out the columns listed in another array at row i. For example, if

a[i] = [1 2 3]

then pick out columns 1, 2 and 3 and all rows. Can this be done with the filter command? If yes, can someone show an example or the syntax?

There are 2 answers below.

BEST ANSWER

Sounds like you need to select columns, not filter out records. For that you want Spark's map function, which transforms every row of your array represented as an RDD. See my example:

# generate a 13 x 10 array as an RDD with 13 records, each record a list of 10 elements
rdd = sc.parallelize([range(10) for i in range(13)])

def make_selector(cols):
    """use closure to configure select_col function
    :param cols: list - contains columns' indexes to select from every record
    """
    def select_cols(record):
        return [record[c] for c in cols]
    return select_cols

s = make_selector([1, 2])
s([0, 1, 2])
# returns [1, 2]

rdd.map(make_selector([0, 3, 9])).take(5)

results in

[[0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9]]
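
Assuming the 500 x 26 array from the question is a NumPy array (just an assumption; any list of rows works the same way), the same selector can be applied to it like this:

import numpy as np

# stand-in for the 500 x 26 array from the question
a = np.arange(500 * 26).reshape(500, 26)

rdd = sc.parallelize(a.tolist())           # one record per row
rdd.map(make_selector([1, 2, 3])).take(2)  # keep columns 1, 2 and 3 of every row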

This is essentially the same answer as @vvladymyrov's, but without closures:

rdd = sc.parallelize([range(10) for i in range(13)])   # 13 records of 10 elements each
columns = [0, 3, 9]                                    # indexes of the columns to keep
rdd.map(lambda record: [record[c] for c in columns]).take(5)

results in

[[0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9], [0, 3, 9]]
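
If, as the a[i] in the question suggests, the column indexes can differ from row to row, a single fixed columns list is not enough. A minimal sketch (the rows and per-row index lists below are made up for illustration) is to pair each row with its own index list on the driver and do the selection inside map:

rows = [list(range(10)) for _ in range(13)]     # hypothetical data rows
cols_per_row = [[1, 2, 3] for _ in range(13)]   # a[i]: column indexes to keep for row i

rdd = sc.parallelize(list(zip(rows, cols_per_row)))
rdd.map(lambda pair: [pair[0][c] for c in pair[1]]).take(5)

rdd1.zip(rdd2) achieves the same pairing when the data already lives in two RDDs, provided both have identical partitioning.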