How do you manage static data in Palantir Foundry?


By static data I mean config files containing rarely updated data that downstream transforms depend on.

When updating data, it should be possible to create the change on a branch, and then ideally merge that change to the master branch when ready.

What I have tried:

  • Uploading a raw file as a dataset. It's not possible to upload a file on a new branch, and there is no way to merge or delete branches from a dataset.
  • Fusion sheets. These don't support branching, but you can copy-paste the static data and create a new sync to a different branch. Fusion also auto-syncs changes, which is not ideal.
  • Pipeline builder. There is some support for a manually edited dataset, but this prevents you from uploading a file. It doesn't seem to be possible to edit data on a branch - the GUI falls back to master branch data. This means that, on a branch, uploading a new file will actually overwrite the master branch.

There is 1 answer below.

ZettaP answered:

Two potential solutions, which may not exactly meet your need.

You could upload the files to a dataset on master and use a code repository to read from that (new) dataset into your branch of interest, essentially piping the source dataset to the branch you need. In principle that's the same behavior as the branch fallback mechanism, but it may be useful if you need more control there.

# Assuming this code lives on a branch
from transforms.api import transform, Input, Output


@transform(
    processed=Output("ri.foundry.main.dataset.abc"),
    my_input=Input("ri.foundry.main.dataset.xyz", branch="master"),
)
def create_data(ctx, processed, my_input):
    ...

https://www.palantir.com/docs/foundry/transforms-python/transforms-python-api-classes/#input

As an alternative (it might be an anti-pattern if you have big files), you could commit the file directly to your code repository, read it in a transform, and save it to an output dataset. I don't have a code example at hand for reading a file from a code repository, however; the skeleton below shows the shape of the transform.

from pyspark.sql import functions as F, types as T

from transforms.api import transform, Output


def get_empty_df(ctx):
    # Placeholder: a one-row dataframe until the file-reading logic exists
    schema = T.StructType([T.StructField("key", T.StringType(), True)])
    df = ctx.spark_session.createDataFrame([("dummy_key",)], schema)
    ## TODO: some logic to read from the file in the code repository
    df = df.withColumn('when', F.current_timestamp())
    return df


# This transform always saves a dataframe of N rows
@transform(
    processed=Output("ri.foundry.main.dataset.abc")
)
def create_data(ctx, processed):
    # We generate a dataframe
    df = get_empty_df(ctx)

    # We save this dataframe to our output
    processed.write_dataframe(df)
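
To fill in the TODO above, here is a minimal sketch of one way to read a file committed alongside the transform code, assuming the config is a non-empty CSV bundled as package data at myproject/config/static.csv (the package name, file path, and dataset RID are all placeholders, not anything from Foundry itself):

import csv
import importlib.resources

from transforms.api import transform, Output


@transform(
    processed=Output("ri.foundry.main.dataset.abc")
)
def create_data(ctx, processed):
    # Read the CSV committed in the repository. importlib.resources
    # resolves files bundled inside the installed Python package, so the
    # file must ship as package data of the myproject.config package.
    raw = importlib.resources.read_text("myproject.config", "static.csv")
    rows = list(csv.DictReader(raw.splitlines()))

    # Build a Spark dataframe from the parsed rows and write it out.
    # All columns come through as strings; cast as needed downstream.
    df = ctx.spark_session.createDataFrame(rows)
    processed.write_dataframe(df)

Since the file travels with the code, editing it on a branch and merging to master behaves like any other code change, which addresses the branching requirement from the question.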