I have a dataset (a CSV file) of link sequences, each with an order-placed status. Using the prefixSpan algorithm (as described here), I have extracted the frequent subsequences along with their counts.
I also want to find the probability that each subsequence leads to order placed = 1. Suppose the links are a, b, c and d, and their sequences and order statuses in the data frame are as follows:
Link sequence    Order status
a,b,c,a,c,c      0
a,c,b,c          1
a,b,d,c,b,c      1
a,c,b,c          0
With minimum support = 4, the prefixSpan algorithm gives me these subsequences:
Subsequence    Support
[a]            4
[a,b]          4
[a,b,c]        4
[a,c]          4
[a,c,c]        4
[b]            4
[b,c]          4
[c]            4
[c,c]          4
What changes should I make to the prefixSpan code from the link above so that it also returns the probability, as follows:
Subsequence    Support    Prob
[a]            4          0.5
[a,b]          4          0.5
[a,b,c]        4          0.5
[a,c]          4          0.5
[a,c,c]        4          0.5
[b]            4          0.5
[b,c]          4          0.5
[c]            4          0.5
[c,c]          4          0.5
The probability of a subsequence is calculated as follows:
Sum the order-placed status of all sequences that contain the subsequence, then divide by the number of sequences that contain it. For example:
P(subsequence [a,c,c]) = (0 + 1 + 1 + 0) / 4 = 0.5
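
To make the calculation concrete, here is a minimal Python sketch of it, applied as a post-processing step to patterns that have already been mined. It is not a modification of the linked prefixSpan code, which is not shown here. The helper names `contains_subsequence` and `pattern_stats` are hypothetical, and the data is hard-coded from the example above rather than read from the CSV.

```python
def contains_subsequence(sequence, pattern):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so each pattern item must be
    # found after the position where the previous one was matched.
    return all(item in it for item in pattern)


def pattern_stats(sequences, statuses, pattern):
    """Return (support, probability of order placed = 1) for one pattern."""
    matched = [
        status
        for seq, status in zip(sequences, statuses)
        if contains_subsequence(seq, pattern)
    ]
    support = len(matched)
    prob = sum(matched) / support if support else None
    return support, prob


# Example data from the question (assumed already parsed from the CSV).
sequences = [s.split(",") for s in
             ["a,b,c,a,c,c", "a,c,b,c", "a,b,d,c,b,c", "a,c,b,c"]]
statuses = [0, 1, 1, 0]

# Patterns as they would come out of prefixSpan with minimum support = 4.
patterns = [["a"], ["a", "b"], ["a", "b", "c"], ["a", "c"], ["a", "c", "c"],
            ["b"], ["b", "c"], ["c"], ["c", "c"]]

results = {tuple(p): pattern_stats(sequences, statuses, p) for p in patterns}
for pattern, (support, prob) in results.items():
    print(list(pattern), support, prob)
```

The same bookkeeping could in principle be done inside the mining loop by carrying the order status along with each projected sequence, but that depends on how the linked implementation stores its projected database.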