Does this middle variable have any information gain?


Let's say that I have 2 variables: A (the input) and C (the output).
So the model is A -> C.
There's also another variable B, and
corr(A, B) > corr(A, C)
corr(C, B) > corr(A, C)

Would A -> B -> C get better performance than the existing A -> C model?
In other words, does this B have any information gain?


There is 1 best solution below


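The information gained about C from A is the mutual information I(C; A) = H(C) - H(C | A), i.e. how much knowing A reduces the uncertainty about C. The information gained about C from both A and B is I(C; A, B) = H(C) - H(C | A, B). By the chain rule, I(C; A, B) = I(C; A) + I(C; B | A), so adding B can never lose information, and there will be more information gained by including B as long as I(C; B | A) > 0, that is, as long as C still depends on B once A is known.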

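Now, whether or not that's the case can't be read off the two inequalities alone. It depends on what you're using for the corr metric (Pearson correlation, for instance, only measures pairwise linear dependence) and on how much of B's relationship with C is already explained by A. But if C still depends on B after accounting for A, there will be at least some values of B which give information gain. In general, I'd include any variable the output depends on in a model, unless it is so strongly dependent on the other inputs that it is nearly redundant, in which case some models may not work as well (e.g. neural networks).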
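To make the "does B add anything beyond A" question concrete, here is a minimal sketch that estimates I(C; A), I(C; A, B) and I(C; B | A) from discrete samples using empirical entropies. The helper names and both toy datasets are hypothetical, made up purely for illustration. With real (especially continuous) data you would need a proper estimator and far more samples.

```python
from collections import Counter
from math import log2


def entropy(samples):
    """Empirical Shannon entropy (in bits) of a list of hashable outcomes."""
    n = len(samples)
    return -sum((k / n) * log2(k / n) for k in Counter(samples).values())


def mutual_information(xs, ys):
    """I(X; Y) = H(X) + H(Y) - H(X, Y), estimated from paired samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def conditional_mutual_information(xs, ys, zs):
    """I(X; Y | Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)."""
    return (entropy(list(zip(xs, zs))) + entropy(list(zip(ys, zs)))
            - entropy(list(zip(xs, ys, zs))) - entropy(zs))


def report(name, rows):
    """Print and return (I(C;A), I(C;A,B), I(C;B|A)) for rows of (A, B, C)."""
    a, b, c = (list(col) for col in zip(*rows))
    gain_a = mutual_information(c, a)
    gain_ab = mutual_information(c, list(zip(a, b)))
    extra_b = conditional_mutual_information(c, b, a)
    print(f"{name}: I(C;A)={gain_a:.3f}  "
          f"I(C;A,B)={gain_ab:.3f}  I(C;B|A)={extra_b:.3f}")
    return gain_a, gain_ab, extra_b


# Hypothetical toy data; every row is (A, B, C).
# Case 1: C = A XOR B, so A alone says nothing about C but A and B together do.
xor_rows = [(a, b, a ^ b) for a in (0, 1) for b in (0, 1)]
# Case 2: B is an exact copy of A, and C is a noisy copy of A.
copy_rows = [(0, 0, 0), (0, 0, 0), (0, 0, 1), (1, 1, 1), (1, 1, 1), (1, 1, 0)]

xor_result = report("B complements A", xor_rows)
copy_result = report("B duplicates A", copy_rows)
```

In the first case B contributes a full bit that A lacks. In the second, B is related to both A and C yet contributes nothing once A is known, which is why the pairwise relationships alone don't settle the question.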