Decompressing a varbinary column containing XML data and inserting it into a Hive table


We have SQL Server 2016 with a varbinary column that contains compressed XML. We now want to load this data into a CDP Hive (Hive 3.1.3000) table, decompressing it on the way.

Initially we used a Java utility to inflate the data, but we are now looking for an alternative approach such as PySpark.

We were using the Java code below to inflate the data:

if (colType == java.sql.Types.VARBINARY) {
    msg = "Processing VARBINARY " + colLabel;
    // logger.info("Checking VARBINARY column: " + colLabel);
    if (inflateColumnList.contains(colLabel)) {
        ByteArrayInputStream bais = new ByteArrayInputStream(rs.getBytes(colIndex));
        // nowrap = true: the input is raw DEFLATE data without a zlib header
        Inflater inflater = new Inflater(true);
        InflaterInputStream iis = new InflaterInputStream(bais, inflater);
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        // Read until end of stream; read() returns -1 at EOF
        int b;
        while ((b = iis.read()) != -1) {
            buffer.write(b);
        }
        iis.close();
        result = new String(buffer.toByteArray(), "UTF-8");
    }
    else {
        logger.info(" VARBINARY column: " + colLabel + " is NOT in the unzip list");
        result = Base64.getEncoder().encodeToString(rs.getBytes(colIndex));
    }
}

I am at the point where I can fetch the bytearray from the dataframe, as below:


bytearrayobj = df.select(F.collect_list('itemdetailsdata')).first()[0][0]
print(zlib.decompress(bytes.decode(bytearrayobj,'utf-8')))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: descriptor 'decode' requires a 'str' object but received a 'bytearray'
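
The TypeError comes from calling the unbound `bytes.decode` on a `bytearray`; the wording of the message suggests a Python 2 interpreter, where `bytes` is an alias for `str`. Separately, the data would need to be decompressed first and decoded as UTF-8 afterwards, not the other way round. Since the Java code uses `new Inflater(true)` (nowrap mode), the column presumably holds raw DEFLATE data without a zlib header, which `zlib.decompress` accepts when given a negative `wbits`. A minimal sketch under that assumption; `inflate_raw`, `deflate_raw`, and the sample XML are hypothetical names and data used only for illustration:

```python
import zlib


def inflate_raw(data):
    """Inflate a raw DEFLATE payload (no zlib/gzip header) and decode it as UTF-8."""
    # A negative wbits value tells zlib to expect a headerless stream,
    # which is what java.util.zip.Inflater(true) reads on the Java side.
    raw = zlib.decompress(bytes(data), -zlib.MAX_WBITS)
    return raw.decode("utf-8")


def deflate_raw(text):
    """Produce a headerless DEFLATE payload, only to build test input here."""
    compressor = zlib.compressobj(
        zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -zlib.MAX_WBITS
    )
    return compressor.compress(text.encode("utf-8")) + compressor.flush()


# Made-up sample standing in for one value of the varbinary column.
sample = bytearray(deflate_raw(u"<item><id>1</id></item>"))
xml_text = inflate_raw(sample)
```

If the real column turns out to carry a zlib or gzip header instead, the negative `wbits` would fail with `zlib.error`, and a different `wbits` value would be needed.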

Please advise how I should proceed to generate the decompressed XML from this bytearray.
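
One possible way to do this for every row, rather than for a single collected value, is to wrap the inflate step in a PySpark UDF. The following is only a sketch, assuming the column holds raw DEFLATE data (as the Java `new Inflater(true)` suggests) and that PySpark is available; `inflate_or_none`, `add_xml_column`, and the target column name `itemdetailsxml` are hypothetical names, while `itemdetailsdata` is the column from the question:

```python
import zlib


def inflate_or_none(data):
    """Return the decompressed XML text, or None for NULL or unreadable input."""
    if data is None:
        return None
    try:
        # Negative wbits: headerless (raw) DEFLATE stream, an assumption
        # based on the Java code using Inflater(true).
        return zlib.decompress(bytes(data), -zlib.MAX_WBITS).decode("utf-8")
    except (zlib.error, UnicodeDecodeError):
        return None


def add_xml_column(df, source_col="itemdetailsdata", target_col="itemdetailsxml"):
    """Add a string column holding the inflated XML of a binary column."""
    # Imported lazily so inflate_or_none stays usable without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    inflate_udf = F.udf(inflate_or_none, StringType())
    return df.withColumn(target_col, inflate_udf(F.col(source_col)))
```

Returning `None` on failure keeps a single bad row from failing the whole job, at the cost of hiding corrupt values; raising instead may be preferable while validating the data. Writing the resulting dataframe to Hive is left out here, since on CDP the appropriate write path may depend on the table type.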
