Dataset Filter working in an unexpected way

96 Views Asked by At

Scenario: I have read two XML files via specifying a schema on load.

In the schema, One of the tags is mandatory. One XML is missing that mandatory tag.

Now, when I do the following, I am expecting the XML with the missing mandatory tag to be filtered out.

dataset = dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull());

In the code when I try to count the Rows of the dataset, I am getting the count as 2 (2 Input XMLS) but when I try to print the dataset via show() method, I am getting a NPE.

When I debugged the above line and do the following, I get 0 as the count.

dataset.filter(functions.col("mandatoryColumnNameInSchema").isNotNull()).count();

Question:

Can anyone please answer the questions/affirm my understanding below

  1. Why is spark Dataset not filtering the Row which does not have a mandatory column?
  2. Why there are no exception in the count but in show method?

For 2, I believe the count is just counting no of Rows without looking into the contents. For show, the iterator actually goes through the Struct Fields to print their values and when it does not find the mandatory column, it errors out.

P.S. If I make the mandatory column optional, all is working fine.

Edit:

Providing Reading options as requested

For loading the data I am executing the following

Dataset<Row> dataset = sparkSession.read().format("com.databricks.spark.xml")
                .option("header", "true")
                .option("inferSchema", "false")
                .option("rowTag", rowTag)//rowTag is "body" tag in the XML
                .option("failFast", "true")
                .option("mode", "FAILFAST")
                .schema(schema)
                .load(XMLfilePath);

Providing samples as requested

Schema:

root
 |-- old: struct (nullable = true)
 |    |-- _beyond: string (nullable = true)
 |    |-- lot: struct (nullable = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _chose: string (nullable = true)
 |    |-- real: struct (nullable = true)
 |    |    |-- _eat: string (nullable = true)
 |    |    |-- kill: struct (nullable = true)
 |    |    |    |-- _VALUE: double (nullable = true)
 |    |    |    |-- _top: string (nullable = true)
 |    |    |-- tool: struct (nullable = true)
 |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |-- _affect: string (nullable = true)
 |-- porch: struct (nullable = true)
 |    |-- _account: string (nullable = true)
 |    |-- cast: string (nullable = true)
 |    |-- vegetable: struct (nullable = true)
 |    |    |-- leg: struct (nullable = true)
 |    |    |    |-- _VALUE: double (nullable = true)
 |    |    |    |-- _nose: string (nullable = true)
 |    |    |-- now: struct (nullable = true)
 |    |    |    |-- _gravity: string (nullable = true)
 |    |    |    |-- chief: struct (nullable = true)
 |    |    |    |    |-- _VALUE: long (nullable = true)
 |    |    |    |    |-- _further: string (nullable = true)
 |    |    |    |-- field: string (nullable = true)

Sample XML:

<?xml version="1.0" encoding="UTF-8" ?>
<root>
    <body>
    <porch account="something">
        <vegetable>
            <now gravity="wide">
                <field>box</field>
                <chief further="satisfied">-1889614487</chief>
            </now>
            <leg nose="angle">912658017.229279</leg>
        </vegetable>
        <cast>clear</cast>
    </porch>
    <old beyond="continent">
        <real eat="term">
            <kill top="plates">-1623084908.8669372</kill>
            <tool affect="pond">today</tool>
        </real>
        <lot chose="swung">promised</lot>
    </old>
    </body>
</root>

Schema in JSON format:

{"type":"struct","fields":[{"name":"old","type":{"type":"struct","fields":[{"name":"_beyond","type":"string","nullable":true,"metadata":{}},{"name":"lot","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_chose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"real","type":{"type":"struct","fields":[{"name":"_eat","type":"string","nullable":true,"metadata":{}},{"name":"kill","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_top","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"tool","type":{"type":"struct","fields":[{"name":"_VALUE","type":"string","nullable":true,"metadata":{}},{"name":"_affect","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"porch","type":{"type":"struct","fields":[{"name":"_account","type":"string","nullable":true,"metadata":{}},{"name":"cast","type":"string","nullable":true,"metadata":{}},{"name":"vegetable","type":{"type":"struct","fields":[{"name":"leg","type":{"type":"struct","fields":[{"name":"_VALUE","type":"double","nullable":true,"metadata":{}},{"name":"_nose","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"now","type":{"type":"struct","fields":[{"name":"_gravity","type":"string","nullable":true,"metadata":{}},{"name":"chief","type":{"type":"struct","fields":[{"name":"_VALUE","type":"long","nullable":true,"metadata":{}},{"name":"_further","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}},{"name":"field","type":"string","nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]},"nullable":true,"metadata":{}}]}

My scenario can be reproduced by making the element "old" as nullable = false, and removing the tag from the XML

0

There are 0 best solutions below