Google Public Patent Data SQL (BigQuery)

634 Views Asked by At

I am trying to retrieve specific cpc codes AND assignees via SQL in the Google public patent data. I am trying to search for the term "VOLKSWAGEN" and cpc.code "H01M8".

But I got the error:

No matching signature for operator = for argument types: ARRAY <STRUCT<name STRING, country_code STRING>>, STRING. Supported signature: ANY = ANY at [15:3]

code:

SELECT
  publication_number application_number,
  family_id,
  publication_date,
  filing_date,
  priority_date,
  priority_claim,
  ipc,
  cpc.code,
  inventor,
  assignee_harmonized,
FROM
  `patents-public-data.patents.publications`
WHERE
  assignee_harmonized = "VOLKSWAGEN" AND cpc.code = "H01M8"
LIMIT
  1000

I'm also interested in searching multiple assignees such as:

in ("VOLKSWAGEN", "PORSCHE", "AUDI", "SCANIA", "SKODA", "MAZDA", "TOYOTA", "HONDA", "BOSCH", "KYOCERA", "PANASONIC", "TOTO", "NISSAN", "LG FUEL CELL SYSTEMS", "SONY", "HYUNDAI", "SUZUKI", "PLUG POWER", "SFC ENERGY", "BALLARD", "KIA MOTORS", "SIEMENS", "KAWASAKI", "BAYERISCHE MOTORENWERKE", "HYDROGENICS", "POWERCELL SWEDEN", "ELRINGKLINGER", "PROTON MOTOR")

I have recently started to work with SQL and do not see the mistake :/

Many thanks for your help!

2

There are 2 best solutions below

0
On

Many thanks, now I created this code to screen multiple companies. Is it possible to get the query of requests out of "cpc__u.code" in one row cell each? with a ", " to seperate the codes between the output string?. Same I like to consider for the assignee_harmonized__u.name here as well !

Do you think the companies will be screened with this precedure and the "IN" operator?

SELECT
  publication_number application_number,
  family_id,
  publication_date,
  filing_date,
  priority_date,
  priority_claim,
  cpc__u.code,
  inventor,
  assignee_harmonized,
  assignee
FROM
  `patents-public-data.patents.publications`,
  UNNEST(assignee_harmonized) AS assignee_harmonized__u,
  UNNEST(cpc) AS cpc__u
WHERE
  assignee_harmonized__u.name in ("VOLKSWAGEN", "PORSCHE", "AUDI", "SCANIA", "SKODA", "MAZDA", "TOYOTA", "HONDA", "BOSCH", "KYOCERA", "PANASONIC", "TOTO", "NISSAN", "LG FUEL CELL SYSTEMS", "SONY", "HYUNDAI", "SUZUKI", "PLUG POWER", "SFC ENERGY", "BALLARD", "KIA MOTORS", "SIEMENS", "KAWASAKI", "BAYERISCHE MOTORENWERKE", "HYDROGENICS", "POWERCELL SWEDEN", "ELRINGKLINGER", "PROTON MOTOR")
  AND cpc__u.code LIKE "H01M8%"
LIMIT
  100000
3
On

In Google BigQuery UNNEST is needed to access ARRAY elements. This is described here:

https://cloud.google.com/bigquery/docs/reference/standard-sql/arrays

The following query works for me.

SELECT
  publication_number application_number,
  family_id,
  publication_date,
  filing_date,
  priority_date,
  priority_claim,
  ipc,
  cpc__u.code,
  inventor,
  assignee_harmonized,
FROM
  `patents-public-data.patents.publications`,
  UNNEST(assignee_harmonized) AS assignee_harmonized__u,
  UNNEST(cpc) AS cpc__u
WHERE
  assignee_harmonized__u.name = "VOLKSWAGEN AG"
  AND cpc__u.code LIKE "H01M8%"
LIMIT
  1000

The following are changes I made to generate results:

  1. UNNEST(assignee_harmonized) as assignee_harmonized__u to access assignee_harmonized__u.name.
  2. UNNEST(cpc) as cpc__u to access cpc__u.code.
  3. assignee_harmonized__u.name = "VOLKSWAGEN AG" as "VOLKSWAGEN" returns no results.
  4. cpc__u.code LIKE "H01M8%" as "H01M8" returns no results. An example value is H01M8/10.

This returns the following:

Query complete (2.3 sec elapsed, 29.2 GB processed)

If you want to screen multiple assignee names, IN will work like the following, however, you need to have an exact match like VOLKSWAGEN AG or AUDI AG.

assignee_harmonized__u.name IN ("VOLKSWAGEN", "PORSCHE", "AUDI", "SCANIA", "SKODA", "MAZDA", "TOYOTA", "HONDA", "BOSCH", "KYOCERA", "PANASONIC", "TOTO", "NISSAN", "LG FUEL CELL SYSTEMS", "SONY", "HYUNDAI", "SUZUKI", "PLUG POWER", "SFC ENERGY", "BALLARD", "KIA MOTORS", "SIEMENS", "KAWASAKI", "BAYERISCHE MOTORENWERKE", "HYDROGENICS", "POWERCELL SWEDEN", "ELRINGKLINGER", "PROTON MOTOR")

If you want to do a LIKE style match with multiple strings, you can try REGEXP_CONTAINS:

https://cloud.google.com/bigquery/docs/reference/standard-sql/string_functions#regexp_contains