I want to calculate the mutual information between two numpy vectors:
>>> from sklearn.metrics.cluster import mutual_info_score
>>> import numpy as np
>>> a, b = np.random.rand(10), np.random.rand(10)
>>> mutual_info_score(a, b)
1.6094379124341005
>>> a, b = np.random.rand(10), np.random.rand(10)
>>> mutual_info_score(a, b)
1.6094379124341005
As you can see, although I updated a and b, it returned the same value. Then I tried another example:
>>> a = np.array([167.52523295, 73.2904335 , 98.61953303, 152.17297007,
...               211.01341451, 327.72296346, 356.60500081, 43.9371432 ,
...               119.09474284, 125.20180842])
>>> b = np.array([280.9287028 , 131.76304983, 176.0277832 , 188.56630096,
...               229.09811401, 228.47200012, 617.67000122, 52.7211511 ,
...               125.95361582, 148.55247447])
>>> mutual_info_score(a, b)
2.302585092994046
>>> a = np.array([ 6.71381009, 1.43607653, 3.78729242, -4.75706796, -3.81281173,
...               3.23440092, 10.84495625, -0.19646145, 4.09724507, -0.13858104])
>>> b = np.array([ 4.25330873, 3.02197642, -3.2833848 , 0.41855662, -3.74693531,
...               0.7674982 , 11.36459148, 0.64636462, 0.51817262, 1.65318943])
>>> mutual_info_score(a, b)
2.302585092994046
Why? The two pairs of vectors contain very different numbers, so why does it return the same value? More importantly, how do I calculate the MI between two vectors?
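In that case you would expect to obtain different numbers each time you run the cell. You don't, because you're using a function that is meant for measuring the quality of clustering results: mutual_info_score treats every value as a discrete cluster label. With continuous data every value is its own label, so the score depends only on the number of distinct values (for 10 distinct values in each vector it is ln 10 ≈ 2.3026), not on the values themselves.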
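To make the point about clustering labels concrete, here is a small sketch (not part of the original answer) that checks the score against ln(n) for a few sample sizes. The sample sizes and the seed are arbitrary choices for illustration, and the warning filter is only there because recent scikit-learn versions warn when continuous values are passed as labels.

```python
import warnings

import numpy as np
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
results = {}

for n in (5, 10, 100):
    # Two unrelated vectors of n continuous values (all distinct in practice).
    a, b = rng.random(n), rng.random(n)
    with warnings.catch_warnings():
        # Newer scikit-learn versions warn about continuous "labels"; silence that here.
        warnings.simplefilter("ignore")
        results[n] = mutual_info_score(a, b)
    # Every value is its own "cluster", so the score collapses to ln(n).
    print(n, results[n], np.log(n))
```

The printed score and ln(n) should agree up to floating-point error for each n, whatever the actual values in a and b are.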
Now to the main point. To estimate the mutual information (MI) between two vectors (or even several vectors), you can use the mutual_info_regression function (as described here). Given a feature matrix a and a target vector, it returns the MI between each feature of a and the target. E.g., in my run the MI between the first feature and the target was ~0.184.
There are various ways to calculate MI between variables, e.g.:
- Estimate mutual information (MI) with histograms. The challenge is finding a suitable value for the number of bins here. [1]
- Estimate it from entropy estimates based on k-nearest neighbors' distances (mutual_info_regression is based on this approach).
- etc.
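The two approaches listed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the original answer: the synthetic vectors x, y, z, the helper mi_histogram, and the choices bins=10 and n_neighbors=3 are all made up for the example. The histogram variant bins both vectors with np.histogram2d and passes the joint counts to mutual_info_score as a contingency table.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression
from sklearn.metrics import mutual_info_score

rng = np.random.default_rng(42)  # arbitrary seed, for illustration only
x = rng.normal(size=1000)
y = x + 0.5 * rng.normal(size=1000)  # depends on x
z = rng.normal(size=1000)            # independent of x


def mi_knn(u, v, n_neighbors=3):
    """k-nearest-neighbors MI estimate (in nats) between two 1-D vectors."""
    # mutual_info_regression expects a 2-D feature matrix, hence the reshape.
    return mutual_info_regression(
        u.reshape(-1, 1), v, n_neighbors=n_neighbors, random_state=0
    )[0]


def mi_histogram(u, v, bins=10):
    """Histogram MI estimate (in nats); the result depends on `bins`."""
    counts, _, _ = np.histogram2d(u, v, bins=bins)
    # Treat the joint bin counts as a contingency table.
    return mutual_info_score(None, None, contingency=counts)


mi_knn_xy, mi_knn_xz = mi_knn(x, y), mi_knn(x, z)
mi_hist_xy, mi_hist_xz = mi_histogram(x, y), mi_histogram(x, z)

print("k-NN      dependent:", mi_knn_xy, " independent:", mi_knn_xz)
print("histogram dependent:", mi_hist_xy, " independent:", mi_hist_xz)
```

Both estimators should report a clearly larger value for the dependent pair than for the independent one. The histogram estimate shifts as bins changes, which is the tuning difficulty mentioned above.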
P.S. Reading this document is worthwhile.