Azure Data Lake Analytics IOutputter get output file name

I'm using a custom IOutputter to write the results of my U-SQL script to a local database:

OUTPUT @dataset
TO "/path/somefilename_{*}.file"
USING new CustomOutputter();

public class CustomOutputter : IOutputter
{
    public CustomOutputter()
    {
        myCustomDatabase.Open("databasefile.database");
    }

    public override void Output(IRow input, IUnstructuredWriter output)
    {

    }
}

Is there any way to replace "databasefile.database" with the specified output file path "/path/somefilename_{*}.file"?

Since I'm not able to pass output.BaseStream to the database, I can't find a way to write to the correct file name.

UPDATE: This is how I copy the local DB file to the output stream provided by ADLA:

        public override void Close()
        {
            // Copy the local database file into the ADLA output stream in 64 KB chunks.
            using (var fs = File.Open("databasefile.database", FileMode.Open))
            {
                byte[] buffer = new byte[65536];
                int read;
                while ((read = fs.Read(buffer, 0, buffer.Length)) > 0)
                {
                    this.output.BaseStream.Write(buffer, 0, read);
                }
                // Flush once after the whole file has been written.
                this.output.BaseStream.Flush();
            }
        }

Best answer

I am not sure what you are trying to achieve.

  1. Outputters (and UDOs in general) cannot leave their containers (VMs) when executed in ADLA (local execution has no such limit at this point). So connecting to a database outside the container will be blocked, and I am not sure how it helps to write data into a database inside a transient VM/container.

  2. The UDO model has a well-defined way to write to files that live in either ADLS or WASB: you write the data from the input row(set) into the output's stream. You can write into local files, but again, these files will cease to exist after the vertex finishes execution.

Given this information, could you please rephrase?

Update based on clarifying comment

You have two options to generate a database from a rowset:

  1. Use ADF to do the data movement. This is the most commonly used approach and probably the easiest.
  2. If you use a custom outputter, you could try the following:
    1. Write the output rowset into a database that is local to your vertex, using the database interface. You have to deploy the database as a resource, so you probably need a small-footprint version that fits into the resource size limit.
    2. Then read the database file from the vertex-local directory into the output stream, so that the file is copied into ADLS.
    3. Note that you need atomic file processing on the outputter (AtomicFileProcessing = true on the SqlUserDefinedOutputter attribute) to avoid writing many database files that then get stitched together.
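
A rough sketch of option 2, under stated assumptions rather than a tested implementation: the class name LocalFileCopyOutputter and the scratch file name are made up for illustration, and the File.AppendAllText call is only a stand-in for whatever insert call your embedded database offers. It needs a reference to the U-SQL interfaces assembly (Microsoft.Analytics.Interfaces) and only runs inside a U-SQL job.

```csharp
using System.IO;
using System.Text;
using Microsoft.Analytics.Interfaces;

// AtomicFileProcessing = true asks U-SQL to treat each output file as one unit,
// so the copied file is not assembled from pieces written by several instances.
[SqlUserDefinedOutputter(AtomicFileProcessing = true)]
public class LocalFileCopyOutputter : IOutputter
{
    // Hypothetical scratch file in the vertex working directory.
    private const string LocalPath = "databasefile.database";

    // Output stream provided by ADLA, captured when the first row arrives.
    private Stream target;

    public override void Output(IRow row, IUnstructuredWriter output)
    {
        if (target == null)
        {
            target = output.BaseStream;
        }

        // Stand-in for the real database insert: append the row as one tab-separated line.
        var line = new StringBuilder();
        foreach (var col in row.Schema)
        {
            line.Append(row.Get<object>(col.Name)).Append('\t');
        }
        File.AppendAllText(LocalPath, line.ToString().TrimEnd('\t') + "\n");
    }

    public override void Close()
    {
        if (target == null)
        {
            return; // no rows were received, nothing to copy
        }

        // Copy the finished local file into the output stream so it lands in ADLS.
        using (var fs = File.OpenRead(LocalPath))
        {
            fs.CopyTo(target);
        }
        target.Flush();
    }
}
```

In a real outputter you would close the database connection at the start of Close(), before reading the file, so that all pending writes are on disk when the copy starts.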