I have a regression dataset where approximately 95% of the target values are zeros (the remaining 5% are between 1 and 30), and I am trying to design a TensorFlow model for it. My idea is to combine a classifier and a regressor: compare the output of the classifier submodel to a threshold and, depending on the result, either predict zero or pass the input to the regression submodel. My intuition is that this should be built with the functional API, but I couldn't find helpful resources on that. Any ideas? (A rough sketch of the kind of model I have in mind is at the end of this question.)
Here is the code that generates the data I am using to reproduce the problem:
import numpy as np

n = 10000
zero_percentage = 0.95

# Regression target: ~95% zeros, ~5% integers in [1, 30)
zeros = np.zeros(round(n * zero_percentage))
non_zeros = np.random.randint(1, 30, size=round(n * (1 - zero_percentage)))
y = np.concatenate((zeros, non_zeros))
np.random.shuffle(y)

# Input feature: a separate value range where the target is zero,
# otherwise a linear transform of the target
a = 50
b = 10
x = np.array([np.random.randint(31, 60) if element == 0 else (element - b) / a for element in y])

# Binary target for the classifier submodel: 1 where y is non-zero
y_classification = np.array([0 if element == 0 else 1 for element in y])
Note: I experimented with probabilistic models (Poisson regression and regression with a discretized logistic mixture distribution); they gave good results, but training was unstable (the loss diverged very often).
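For completeness, here is a rough sketch of the combined model I am imagining, written with the functional API. The layer sizes, the 0.5 threshold, and the equal loss weights are arbitrary placeholders, and it reuses x, y, and y_classification from the snippet above; I am not sure this is the right way to wire the two heads together.

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Shared trunk over the single input feature
inputs = keras.Input(shape=(1,), name="x")
hidden = layers.Dense(32, activation="relu")(inputs)

# Classifier head: probability that the target is non-zero
clf_out = layers.Dense(1, activation="sigmoid", name="is_nonzero")(hidden)

# Regressor head: predicted value, only meaningful when the classifier fires
reg_out = layers.Dense(1, activation="relu", name="value")(hidden)

model = keras.Model(inputs=inputs, outputs=[clf_out, reg_out])
model.compile(
    optimizer="adam",
    loss={"is_nonzero": "binary_crossentropy", "value": "mse"},
    loss_weights={"is_nonzero": 1.0, "value": 1.0},
)
model.fit(x.reshape(-1, 1), {"is_nonzero": y_classification, "value": y},
          epochs=10, batch_size=32)

# At inference time, gate the regression output with the classifier output
p_nonzero, value = model.predict(x.reshape(-1, 1))
y_pred = np.where(p_nonzero > 0.5, value, 0.0).squeeze()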
Instead of trying to find a heuristic to balance training between the zero values and the others, you might want to try an input preprocessing method that handles imbalanced training sets better (usually by mapping to another space before running the model, then applying the inverse transform to the results); for example, an embedding layer. Alternatively, normalize the values to a small range (like [-1, 1]) and apply a matching activation function before evaluating the model on the data.
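As an illustration, here is a minimal sketch of the second suggestion, assuming you keep the x and y arrays from the question; the scaling range, network size, and tanh output are just illustrative choices, not a fix tuned to your data:

from tensorflow import keras
from tensorflow.keras import layers

# Map the targets from [0, 30] into [-1, 1] so the model trains on a small, symmetric range
y_min, y_max = 0.0, 30.0
y_scaled = 2.0 * (y - y_min) / (y_max - y_min) - 1.0

model = keras.Sequential([
    keras.Input(shape=(1,)),
    layers.Dense(32, activation="relu"),
    layers.Dense(1, activation="tanh"),  # output range matches the scaled targets
])
model.compile(optimizer="adam", loss="mse")
model.fit(x.reshape(-1, 1), y_scaled, epochs=10, batch_size=32)

# Undo the scaling on the predictions
y_pred = (model.predict(x.reshape(-1, 1)).squeeze() + 1.0) / 2.0 * (y_max - y_min) + y_min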