Writing Parquet files using Parquet.NET works with local file, but results in empty file in blob storage

We are using Parquet.NET to write Parquet files. I've set up a simple schema containing 3 columns and 2 rows:

        // Set up the file structure
        var UserKey = new Parquet.Data.DataColumn(
            new DataField<Int32>("UserKey"),
            new Int32[] { 1234, 12345}
        );

        var AADID = new Parquet.Data.DataColumn(
            new DataField<string>("AADID"),
            new string[] { Guid.NewGuid().ToString(), Guid.NewGuid().ToString() }
        );

        var UserLocale = new Parquet.Data.DataColumn(
            new DataField<string>("UserLocale"),
            new string[] { "en-US", "en-US" }
        );

        var schema = new Schema(UserKey.Field, AADID.Field, UserLocale.Field
        );

When using a FileStream to write to a local file, the file is created, and once the code finishes I can see two rows in it (the file is 1 KB afterwards):

            using (Stream fileStream = System.IO.File.OpenWrite("C:\\Temp\\Users.parquet")) {
                using (var parquetWriter = new ParquetWriter(schema, fileStream)) {
                    // Create a new row group in the file
                    using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup()) {
                        groupWriter.WriteColumn(UserKey);
                        groupWriter.WriteColumn(AADID);
                        groupWriter.WriteColumn(UserLocale);
                    }
                }
            }

Yet, when I attempt to use the same to write to our blob storage, that only generates an empty file, and the data is missing:

// Open reference to Blob Container
CloudAppendBlob blob = OpenBlobFile(blobEndPoint, fileName);

using (MemoryStream stream = new MemoryStream()) {

    blob.CreateOrReplaceAsync();

    using (var parquetWriter = new ParquetWriter(schema, stream)) {
        // Create a new row group in the file
        using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup()) {
            groupWriter.WriteColumn(UserKey);
            groupWriter.WriteColumn(AADID);
            groupWriter.WriteColumn(UserLocale);
        }
    
    // Set stream position to 0
    stream.Position = 0;
    blob.AppendBlockAsync(stream);
    return true;
}

...

public static CloudAppendBlob OpenBlobFile (string blobEndPoint, string fileName) {
    CloudBlobContainer container = new CloudBlobContainer(new System.Uri(blobEndPoint));
    CloudAppendBlob blob = container.GetAppendBlobReference(fileName);

    return blob;
}

Reading the documentation, I would think my call to blob.AppendBlockAsync should do the trick, yet I end up with an empty file. Does anyone have suggestions as to why this happens and how I can resolve it, so that I actually end up with data in the file?

Thanks in advance.

There is 1 best solution below

SchmitzIT On BEST ANSWER

The explanation for the file ending up empty is the line:

blob.AppendBlockAsync(stream);

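Note that the method name ends in Async. It returns a Task instead of blocking until the upload is done, so the caller has to await it. Without the await, the calling method carries on, returns, and the using block disposes the MemoryStream before the data has been sent. I made the enclosing method async, and Visual Studio suggested the following change to the line: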

_ = await blob.AppendBlockAsync(stream);

The _ is a C# discard. AppendBlockAsync returns a long (the offset at which the block was appended), and assigning it to _ states explicitly that the value is not needed. With the await in place, the code now works as intended.
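
To see the effect in isolation, here is a minimal sketch that does not need Azure or Parquet.NET. FakeAppendBlockAsync is a hypothetical stand-in for blob.AppendBlockAsync, and the class and method names are made up for illustration. It simulates a slow upload by delaying and then copying the stream into an in-memory "blob". The blob.CreateOrReplaceAsync() call in the question is not awaited either, so it presumably deserves the same treatment.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

public static class AwaitSketch
{
    // Hypothetical stand-in for blob.AppendBlockAsync(stream): after a simulated
    // network delay it copies the source into an in-memory "blob" and returns
    // the number of bytes stored there.
    public static async Task<long> FakeAppendBlockAsync(Stream source, MemoryStream fakeBlob)
    {
        await Task.Delay(50);
        await source.CopyToAsync(fakeBlob);
        return fakeBlob.Length;
    }

    // Mirrors the question: the task is started but never awaited, so the using
    // block disposes the stream and the method returns before the copy has run.
    public static long UploadWithoutAwait(MemoryStream fakeBlob)
    {
        using (var stream = new MemoryStream(new byte[] { 1, 2, 3, 4 }))
        {
            Task<long> pending = FakeAppendBlockAsync(stream, fakeBlob); // never awaited
        }
        return fakeBlob.Length;
    }

    // Mirrors the fix: awaiting keeps the stream alive until the copy completes.
    public static async Task<long> UploadWithAwaitAsync(MemoryStream fakeBlob)
    {
        using (var stream = new MemoryStream(new byte[] { 1, 2, 3, 4 }))
        {
            // "_ =" is a discard: the returned long is deliberately ignored.
            _ = await FakeAppendBlockAsync(stream, fakeBlob);
        }
        return fakeBlob.Length;
    }

    public static async Task Main()
    {
        Console.WriteLine(UploadWithoutAwait(new MemoryStream()));         // prints 0
        Console.WriteLine(await UploadWithAwaitAsync(new MemoryStream())); // prints 4
    }
}
```

In the unawaited version the background task later fails when it tries to read the disposed stream, and nothing observes that exception, which is why the real code produced an empty blob without any visible error.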