Separating tags with BigQuery public dataset for stackoverflow

887 Views Asked by At

Google makes available a public dataset for the content of Stackoverflow. We can read about this here. When I login to the GCP Cloud Console and visit the BigQuery page and submit the following query:

select id, tags from `bigquery-public-data.stackoverflow.posts_questions` limit 10

The resulting table that is shown to me shows the tags field as concatenated items.

enter image description here

If I look at the JSON, I seem to see the same:

enter image description here

My supposition was that the tags would be delimited by the '|' character but the data seems to show otherwise. I'm hoping to understand this better. My end goal is to perform queries to find all questions that contain a given tag.

2

There are 2 best solutions below

0
On BEST ANSWER

It was discovered that an error was at play in how the source data from Stackoverflows was being transformed into BigQuery tables. Google created an issue to resolve and eventually posted that it had been fixed. As such, this story/posting was transient and likely will not be replicable nor of value in the future.

1
On

To find all the Stackoverflow questions that contain a given tag name you can employ the SQL operator LIKE. This operator is very useful while looking for a particular pattern within a database field. It is specified after the WHERE statement with the following syntax:

SELECT field1, field2, ...
FROM table
WHERE fieldN LIKE pattern;

The patterns are usually written with the following wildcards that make possible to generate regular-like expressions:

  • The percent symbol "%" looks for any string of zero or more characters.

    Example: SELECT * WHERE fieldN LIKE "%foo%" will return all the entries that contain the string "foo" in the fieldN column.

  • The underscore symbol "_" looks for any single character.

    Example: SELECT * WHERE fieldN LIKE "A_" will return all the entries that start with the character "A".

In order to find all the Stackoverflow questions that contain a given tag TAGNAME, you can employ the pattern "%TAGNAME%". In this example, the query will return all the entries where the tag field contains the word "javascript":

SELECT 
  *
FROM 
  `bigquery-public-data.stackoverflow.posts_questions` 
WHERE 
  tags LIKE "%javascript%"