I am new to Oozie and trying to understand dataset.xml. I have the following dataset and want to understand what exactly Oozie is validating here. What is the meaning of initial-instance, and what is uri-template doing here? (The Oozie documentation is not clear on this.)
<dataset name="sample" frequency="${coord:hours(1)}" initial-instance="2022-01-10T00:00Z" timezone="UTC">
<uri-template>${hdfsdir}/filepath/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
<done-flag>_SUCCESS</done-flag>
</dataset>
Similarly, in the coordinator I have the following input and output datasets. What is the significance of current(-5) and the start parameter here?
<coordinator-app name="test" frequency="${freq}" start="2022-01-10T00:00Z" end="2023-04-11T00:00Z" timezone="UTC" xmlns="uri:oozie:coordinator:0.4" xmlns:sla="uri:oozie:sla:0.2">
<data-in name="raw" dataset="raw_data">
<instance>${coord:current(-5)}</instance>
</data-in>
<data-out name="processed" dataset="raw_out">
<instance>${coord:current(-5)}</instance>
</data-out>
Can someone explain what Oozie expects from these datasets?
Thanks, bab
Without looking at the documentation, here's what I can guess.

initial-instance - When is the dataset first available? If you reference an instance before this timestamp in a workflow or coordinator, it is out of bounds, and I'd expect an error or for that instance to be ignored.

frequency will "count up" from that timestamp, so with coord:hours(1) there is one dataset instance per hour starting at 2022-01-10T00:00Z.

uri-template uses built-in Oozie variables (YEAR, MONTH, DAY, HOUR) to determine the path pattern under which each instance's files exist in the filesystem.

coord:current(-5) will multiply 5 by the dataset frequency and return the 5th previous instance, giving you the dataset instance 5 hours before the nominal time of the coordinator action that is being run (not 5 hours before the moment the coordinator job was submitted).

So, for your example, you have dataset name="sample" defined, but your data-in and data-out tags do not reference it (they point to raw_data and raw_out), so I don't think anything will run...

Here are the docs for coord:current (they might say something different from my answer): https://oozie.apache.org/docs/5.2.1/CoordinatorFunctionalSpec.html#a6.6.1._coord:currentint_n_EL_Function_for_Synchronous_Datasets

Section 5.1 seems to mostly answer your question.
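
To make that concrete, here is a rough sketch of how the pieces might fit together if the coordinator pointed at the sample dataset. This is not your actual job: the hourly coordinator frequency, the workflowAppPath property, and the inputDir property name are assumptions made up for illustration, and only the input side is shown.

```xml
<coordinator-app name="test" frequency="${coord:hours(1)}"
                 start="2022-01-10T00:00Z" end="2023-04-11T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <!-- One instance per hour, counted from initial-instance. -->
    <dataset name="sample" frequency="${coord:hours(1)}"
             initial-instance="2022-01-10T00:00Z" timezone="UTC">
      <uri-template>${hdfsdir}/filepath/${YEAR}${MONTH}${DAY}${HOUR}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- dataset="sample" must match the dataset name declared above. -->
    <data-in name="raw" dataset="sample">
      <!-- For the action with nominal time 2022-01-10T06:00Z, current(0) is
           the 06:00 instance and current(-5) is the 01:00 instance, so the
           action waits for ${hdfsdir}/filepath/2022011001/_SUCCESS. -->
      <instance>${coord:current(-5)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <!-- workflowAppPath is a hypothetical job property. -->
      <app-path>${workflowAppPath}</app-path>
      <configuration>
        <property>
          <!-- inputDir is a hypothetical workflow parameter name. -->
          <name>inputDir</name>
          <value>${coord:dataIn('raw')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
```

One thing to watch with start equal to initial-instance: for the first five hourly actions (00:00 through 04:00), current(-5) points to a time before initial-instance, so those instances are out of bounds. How Oozie handles that case is worth checking in the spec linked above rather than taking my word for it.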