Completeness (and other data profiling) check(s) in snowpark python worksheet

270 Views Asked by At

I am trying to a do some data profiling/quality checks on data in Snowflake. I've already tried to implement some using SQL but saw there is also an option for a Python worksheet.

My current code:

import snowflake.snowpark as snowpark
from snowflake.snowpark.functions import col, count, countDistinct, min, max, length, lit, avg, sum

def main(session: snowpark.Session): 
    # Your code goes here, inside the "main" handler.
    df_campaign = session.table('liveperson_campaign')

    # Only check on the top 1000 rows 
    df_campaign_top = df_campaign.limit(1000)

    # Initialize empty list to store results 
    data = []

    # Iterate over columns to compute completeness 
    for column_name in df_campaign_top.columns:
        completeness = df_campaign_top.select((count(column_name)/df_campaign_top.count()).alias('completeness')).collect()[0]['completeness']
        data.append((column_name, completeness))

    # Create new dataframe for completeness 
    df_completeness = session.createDataFrame(data, schema=['column', 'completeness'])
        
    # Print a sample of the dataframe to standard output.
    df_completeness.show()

    # Return value will appear in the Results tab.
    return df_completeness

I currently get the following error:

Traceback (most recent call last):
  Worksheet, line 19, in main
  File "snowflake/snowpark/row.py", line 151, in __getitem__
    return self._named_values[item]
KeyError: 'completeness'

Where line 19 is

completeness = df_campaign_top.select((count(column_name)/df_campaign_top.count()).alias('completeness')).collect()[0]['completeness']

I would like to achieve the following: check the completeness for every column in the table, put this into a new dataframe and show the results. Eventually, I would also add checks like uniqueness, min/max values etc. to the same results dataframe. For now, I would just be happy if this works.

Is there a way to achieve this, I don't really understand the error I'm getting. In addition, I'm wondering if snowpark is the right way to tackle this. Would it be better to just keep using the SQL worksheets? I've used pandas before and that worked great, is there a way to embed Pandas into snowpark in some way?

Thanks for your help in advance!

0

There are 0 best solutions below