BigQuery join and UDF

835 Views Asked by At

How can I join two tables in a select statement in which I also use a UDF? I stored the SQL query and UDF function in two files that I call via the bq command line. However, when I run it, I get the following error:

BigQuery error in query operation: Error processing job '[projectID]:bqjob_[error_number]': Table name cannot be resolved: dataset name is missing.

Note that I'm logged in the correct project via the gcloud auth method. My SQL statement:

SELECT
  substr(date,1,6) as date,
  device,
  channelGroup,
  COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
  COUNT(DISTINCT fullVisitorId) AS users,
FROM
  defaultChannelGroup(
    SELECT
      a.date,
      a.device.deviceCategory AS device,
      b.hits.page.pagePath AS page,
      a.fullVisitorId,
      a.visitId,
      a.trafficSource.source AS trafficSourceSource,
      a.trafficSource.medium AS trafficSourceMedium,
      a.trafficSource.campaign AS trafficSourceCampaign
    FROM FLATTEN(
      SELECT date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
    ,hits) as a
    LEFT JOIN FLATTEN(
      SELECT hits.page.pagePath,hits.time,visitID,fullVisitorId
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
      WHERE
        hits.time = 0
        and trafficSource.medium = 'organic'
    ,hits) as b
    ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
  )
GROUP BY
  date,
  device,
  channelGroup
ORDER BY sessions DESC

where I replaced my datasetname with the correct name of course; and some of the UDF (which works with another query):

function defaultChannelGroup(row, emit)
{
  function output(channelGroup) {
    emit({channelGroup:channelGroup,
      fullVisitorId: row.fullVisitorId, 
      visitId: row.visitId,
      device: row.device,
      date: row.date
      });
  }
  computeDefaultChannelGroup(row, output);
}

bigquery.defineFunction(
  'defaultChannelGroup',
  ['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
  //['device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId'],
  [{'name': 'channelGroup', 'type': 'string'},
  {'name': 'fullVisitorId', 'type': 'string'},
  {'name': 'visitId', 'type': 'integer'},
  {'name': 'device', 'type': 'string'},
  {'name': 'date', 'type': 'string'}
],
  defaultChannelGroup
);
1

There are 1 best solutions below

0
On

The select statements within the FLATTEN function needed to be in brackets.

Ran the bq command in the shell: bq query --udf_resource=udf.js "$(cat query.sql)"

query.sql contains the following scripts:

SELECT
  substr(date,1,6) as date,
  device,
  channelGroup,
  COUNT(DISTINCT CONCAT(fullVisitorId,cast(visitId as string))) AS sessions,
  COUNT(DISTINCT fullVisitorId) AS users,
  COUNT(DISTINCT transactionId) as orders,
  CAST(SUM(transactionRevenue)/1000000 AS INTEGER) as sales
FROM
  defaultChannelGroup(
    SELECT
      a.date as date,
      a.device.deviceCategory AS device,
      b.hits.page.pagePath AS page,
      a.fullVisitorId as fullVisitorId,
      a.visitId as visitId,
      a.trafficSource.source AS trafficSourceSource,
      a.trafficSource.medium AS trafficSourceMedium,
      a.trafficSource.campaign AS trafficSourceCampaign,
      a.hits.transaction.transactionRevenue as transactionRevenue,
      a.hits.transaction.transactionID as transactionId
    FROM FLATTEN((
      SELECT  date,device.deviceCategory,trafficSource.source,trafficSource.medium,trafficSource.campaign,fullVisitorId,visitID,
              hits.transaction.transactionID, hits.transaction.transactionRevenue
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
    ),hits) as a
    LEFT JOIN FLATTEN((
      SELECT hits.page.pagePath,hits.time,trafficSource.medium,visitID,fullVisitorId
      FROM
        TABLE_DATE_RANGE([datasetname.ga_sessions_],TIMESTAMP('2016-10-01'),TIMESTAMP('2016-10-31'))
      WHERE
        hits.time = 0
        and trafficSource.medium = 'organic'
    ),hits) as b
    ON a.fullVisitorId = b.fullVisitorId AND a.visitID = b.visitID
  )
GROUP BY
  date,
  device,
  channelGroup
ORDER BY sessions DESC

and udf.js contains the following function (the 'computeDefaultChannelGroup' function is not included):

function defaultChannelGroup(row, emit)
{
  function output(channelGroup) {
    emit({channelGroup:channelGroup,
      date: row.date,
      fullVisitorId: row.fullVisitorId, 
      visitId: row.visitId,
      device: row.device,
      transactionId: row.transactionId,
      transactionRevenue: row.transactionRevenue,
      });
  }
  computeDefaultChannelGroup(row, output);
}

bigquery.defineFunction(
  'defaultChannelGroup',
  ['date', 'device', 'page', 'trafficSourceMedium', 'trafficSourceSource', 'trafficSourceCampaign', 'fullVisitorId', 'visitId', 'transactionId', 'transactionRevenue'],
  [{'name': 'channelGroup', 'type': 'string'},
  {'name': 'date', 'type': 'string'},
  {'name': 'fullVisitorId', 'type': 'string'},
  {'name': 'visitId', 'type': 'integer'},
  {'name': 'device', 'type': 'string'},
  {'name': 'transactionId', 'type': 'string'},
  {'name': 'transactionRevenue', 'type': 'integer'}
],
  defaultChannelGroup
);

Ran without error and matched the data in Google Analytics.