I am implementing a GRU on an iOS device using the cblas library. I used the GRU formula from Wikipedia, which is the same one I learned on Coursera. With the same weights, my implementation and tf.Keras give different results. After debugging, I found that the GRU in Keras and Torch uses a different formula for calculating h_t:
The Wikipedia formula is:
h_t = (1 - z) * h_t_previous + z * h_tilda
whereas Keras and Torch use:
h_t = (1 - z) * h_tilda + z * h_t_previous
Can someone explain why they are different? Also, isn't it more logical for the update gate to multiply the new value (the part I want to take from the candidate state)? Fun fact: MPSGRUDescriptor has a flipOutputGates property to work around the difference between these two formulas.
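
To make the difference concrete, here is a minimal sketch in plain Python that applies both blending formulas to the same gate values. The function names blend_wiki and blend_keras and the sample numbers are made up for illustration; they are not part of any library. It only covers the final h_t step, assuming z, h_t_previous and h_tilda have already been computed. As far as I can tell, the two variants differ only in which of the two states z weights, so feeding 1 - z into one formula reproduces the other.

```python
def blend_wiki(z, h_prev, h_tilda):
    # Wikipedia convention: z weights the candidate state.
    return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilda)]


def blend_keras(z, h_prev, h_tilda):
    # Keras/Torch convention: z weights the previous state.
    return [(1 - zi) * ht + zi * hp for zi, hp, ht in zip(z, h_prev, h_tilda)]


# Illustrative values only (chosen so the arithmetic is exact in floating point).
z = [0.25, 0.5, 1.0]
h_prev = [1.0, -2.0, 4.0]
h_tilda = [3.0, 2.0, -4.0]

wiki = blend_wiki(z, h_prev, h_tilda)
keras = blend_keras(z, h_prev, h_tilda)

# Flipping the gate (z -> 1 - z) turns one convention into the other.
flipped = blend_keras([1 - zi for zi in z], h_prev, h_tilda)

print("wiki:   ", wiki)
print("keras:  ", keras)
print("flipped:", flipped)
```

Since z is the output of a sigmoid, and 1 - sigmoid(a) = sigmoid(-a), the flip can presumably also be done once on the weights, by negating the update gate's weights and bias, instead of at every time step.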