I am trying to perform a fit of an ODE model using cmdstanr. Such model consists of four variables, of which two are observables. I have time course data of these two variables, however, they differ in number of timepoints. Namely, I have 6 timepoints for the first observable and 43 for the second. Also, the time span is different (72h for the first and 67.5h for the second).
What I have done until now is to perform polynomial interpolation so that I can match the timepoints of the two observables. This solution has worked pretty well for the current model, however, I would like to understand whether and how it is possible to work with missing data on cmdstanr, so that I don't have to perform polynomial interpolation and I can just use the original data.
Here you have a simplified version of the model that I run using the interpolated data:
functions {
real[] ode(real t, real[] y, real[] theta, real[] x_r, int[] x_i) {
// Define the state variables
real y1 = y[1];
real y2 = y[2];
real y3 = y[3]; //first observable
real y4 = y[4]; //second observable
//I omit the parameter definition and I won't specify the equations
real dy1 = eq1;
real dy2 = eq2;
real dy3 = eq3;
real dy4 = eq4;//
return {dy1, dy2, dy3, dy4};
}
}
data {
int<lower=1> N; // Number of time points for all except NEC
real t0;
real ts[N];
real y[N,2];
real<lower=0> sigma[N,2];
}
transformed data {
int x_i[0];
real x_r[0];
}
parameters {
//here I omit the parameters and initial conditions to be estimated
}
transformed parameters {
real z_hat[N, 4];
{real theta[8] = {//list of parameters};
real y0[4] = {//list of initial conditions};
z_hat = integrate_ode_rk45(ode, y0, t0, ts, theta, x_r, x_i);
}
}
model {
// Priors for the parameters (omitted)
// Likelihood
for (n in 1:N) {
y[n,1] ~ normal(z_hat[n,3], sigma[n,1]);
y[n,2] ~ normal(z_hat[n,4], sigma[n,2]);
}
}
I tried to extend the dataset, so that the two observables would take values in the union of the two different sets of timepoints. All the missing values (i.e. mean and sd of the observables at the respective extra time points) were set to NaN. I tried to modify the code accordigly, setting N to be the total number of time points, but apparently cmdstanr cannot accept NaN values. It was something that I still wanted to try, considering that other calibration methods can deal with NaN values (they are just ignored). I then tried to start from the same dataframe (with Nan values included) and add another boolean column which would take value 1 in case of missing data and 0 otherwise. I then modified the model block in the stan code in the following way:
for (n in 1:N) {
if (is_missing[n] == 0) {
y[n,1] ~ normal(z_hat[n,3], sigma[n,1]);
y[n,2] ~ normal(z_hat[n,4], sigma[n,2]);
} else {
y[n, 1] ~ normal(mu_y1, sigma_y1); // Latent variable for missing mean
y[n, 2] ~ normal(mu_y2, sigma_y2); // Latent variable for missing mean
}
}
}
where mu_yi and sigma_yi are arbitrarily chosen, but I still get the NaN error since y[n, i] is not defined in the dataset. How would you tackle the problem?