data profiling on bigquery table covering min,max,unique, null count statistics

2.8k Views Asked by At

I am looking for solution to perform data profiling on bigquery table covering below statistics for each column in table. Some of the columns are ARRAY and STRUCT as given below.

I tried multiple ways to generate dynamic query to cover below scenarios but no luck.. I will greatly appreciate your help/inputs.

Metrics I want to calculate part of this solution are:

  • MIN VALUE
  • MAX VALUE
  • MIN LENGTH OF THE FIELD
  • MAX LENGTH OF THE FIELD
  • NO OF UNIQUE RECORDS FOR EACH FIELD
  • NO OF NULLS IN FIELD
  • NO OF NON NULL VALUES IN FIELD.
  • MIN DATE IN CASE OF DATE OR DATETIME FIELD
  • MAX DATE IN CASE OF DATE OR DATETIME FIELD

Sample Table Data:

Blockquote

Desired output enter image description here

3

There are 3 best solutions below

1
On

I dont understand what you mean with Min Length and Max Length, but considering the provided data, you could do something like below.

This query have basically two steps:

  1. Create a temporary table with flat data using the WITH clause
  2. Calculate the metrics by running one query for each column and use UNION ALL to compose everything in a single table.

Query:

WITH
  t AS(
  SELECT
    first_name,
    dob,
    last_name,
    a.zip addresses_zip,
    a.state addresses_state,
    a.city addresses_city,
    a.numberOfYears addresses_numberOfYears,
    a.status addresses_status,
    a.phone.primarynumber addresses_phone_primarynumber,
    a.phone.secondary addresses_phone_secondary
  FROM
    <your-table> t,
    t.addresses a 
)

SELECT
  "first_name" AS column,
  COUNT(first_name) total_count,
  COUNT(DISTINCT first_name) total_distinct,
  SUM(
  IF
    (first_name IS NULL,
      1,
      0)) total_null,
  CAST(MIN(first_name) AS string) min_value,
  CAST(MAX(first_name) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "dob" AS column,
  COUNT(dob) total_count,
  COUNT(DISTINCT dob) total_distinct,
  SUM(
  IF
    (dob IS NULL,
      1,
      0)) total_null,
  CAST(MIN(dob) AS string) min_value,
  CAST(MAX(dob) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "last_name" AS column,
  COUNT(last_name) total_count,
  COUNT(DISTINCT last_name) total_distinct,
  SUM(
  IF
    (last_name IS NULL,
      1,
      0)) total_null,
  CAST(MIN(last_name) AS string) min_value,
  CAST(MAX(last_name) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.zip" AS column,
  COUNT(addresses_zip) total_count,
  COUNT(DISTINCT addresses_zip) total_distinct,
  SUM(
  IF
    (addresses_zip IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_zip) AS string) min_value,
  CAST(MAX(addresses_zip) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.state" AS column,
  COUNT(addresses_state) total_count,
  COUNT(DISTINCT addresses_state) total_distinct,
  SUM(
  IF
    (addresses_state IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_state) AS string) min_value,
  CAST(MAX(addresses_state) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.city" AS column,
  COUNT(addresses_city) total_count,
  COUNT(DISTINCT addresses_city) total_distinct,
  SUM(
  IF
    (addresses_city IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_city) AS string) min_value,
  CAST(MAX(addresses_city) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.numberOfYears" AS column,
  COUNT(addresses_numberOfYears) total_count,
  COUNT(DISTINCT addresses_numberOfYears) total_distinct,
  SUM(
  IF
    (addresses_numberOfYears IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_numberOfYears) AS string) min_value,
  CAST(MAX(addresses_numberOfYears) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.status" AS column,
  COUNT(addresses_status) total_count,
  COUNT(DISTINCT addresses_status) total_distinct,
  SUM(
  IF
    (addresses_status IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_status) AS string) min_value,
  CAST(MAX(addresses_status) AS string) max_value
FROM
  t

UNION ALL

SELECT
  "addresses.phone.primarynumber" AS column,
  COUNT(addresses_phone_primarynumber) total_count,
  COUNT(DISTINCT addresses_phone_primarynumber) total_distinct,
  SUM(
  IF
    (addresses_phone_primarynumber IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_phone_primarynumber) AS string) min_value,
  CAST(MAX(addresses_phone_primarynumber) AS string) max_value
FROM
  t 

UNION ALL

SELECT
  "addresses.phone.secondary" AS column,
  COUNT(addresses_phone_secondary) total_count,
  COUNT(DISTINCT addresses_phone_secondary) total_distinct,
  SUM(
  IF
    (addresses_phone_secondary IS NULL,
      1,
      0)) total_null,
  CAST(MIN(addresses_phone_secondary) AS string) min_value,
  CAST(MAX(addresses_phone_secondary) AS string) max_value
FROM
  t
6
On

This query returns all the columns from a table in a dataset. I excluded STRUCTS, since you only need value columns.

SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
WHERE table_name = 'table_name'
  AND data_type NOT LIKE 'STRUCT%'

Using the columns table, we'll generate a SQL query to get all these columns. Here, I only added MIN, MAX and COUNT DISTINCT columns. However, you can add more of them by adding new lines to SELECT part.

SELECT 
  STRING_AGG(
    CONCAT(
      'SELECT "', field_path, '" as field_path, ',
        'CAST(MIN(', field_path, ') as string) as max, ',
        'CAST(MAX(', field_path, ') as string) as min ',
        'COUNT(DISTINCT ', field_path, ') as count_distinct ',
      'FROM ', table_name) ,
    ' UNION ALL \n'
  ) as query
FROM columns

At the end, we'll run this query using EXECUTE IMMEDIATE, since it's a string:

EXECUTE IMMEDIATE (
  query
)

To bring all these queries together, it looks like that:

EXECUTE IMMEDIATE (
  SELECT 
    STRING_AGG(
      CONCAT(
        'SELECT "', field_path, '" as field_path, ',
          'CAST(MIN(', field_path, ') as string) as max, ',
          'CAST(MAX(', field_path, ') as string) as min ',
          'COUNT(DISTINCT ', field_path, ') as count_distinct ',
        'FROM ', table_name) ,
      ' UNION ALL \n'
    ) as query
  FROM (
    SELECT CONCAT('`', table_catalog, '.', table_schema, '.', table_name, '`') as table_name, field_path, data_type
    FROM project.dataset.INFORMATION_SCHEMA.COLUMN_FIELD_PATHS
    WHERE table_name = 'table_name'
      AND data_type NOT LIKE 'STRUCT%'
  )
)

PS: It only solves structs for now. Can you show me an example of your ARRAY columns?

0
On

I highly recommend using a low-code no-code tool like Telmai for profiling entire tables in BigQuery within minutes.

All of the metrics you have mentioned are OOTB, so dont need to write any code and you and there is free tier so almost no cost to you.

https://telm-ai.webflow.io/data-profiling-and-observability