create new records in db with sql


I have a bunch of device_ids (roughly 300 million), and I want to create the full video resolution list ['360p', '480p', '540p', '720p', '1080p'] for each device_id, like

device_id
   1
   2

device_id    video_resolution
  1              360p
  1              480p
  1              540p 
  1              720p
  1              1080p
  2              360p
  2              480p
  2              540p 
  2              720p
  2              1080p

I wonder if it's possible to create these records in SQL?
I've tried using Spark, but I don't have that many resources and often run into OOM issues.

There are 3 best solutions below

tinazmu

Try this: the cross join will give you all the combinations. I am not familiar with the internals of PostgreSQL, but CREATE TABLE ... AS SELECT statements are generally logged less, therefore it may not run out of memory (you may run out of target disk space, of course).

-- sample source table with a few device ids
create table MyTbl (DeviceId integer);
insert into MyTbl values (1),(2),(3);

-- cross join every device id with the list of resolutions
create table MyBiggerTable as
  select mt.*, vr.res
  from MyTbl mt cross join (values
    ('360p'),
    ('480p'),
    ('540p'),
    ('720p'),
    ('1080p')) as vr(res);

You don't need the first two statements, of course, and you will need to adjust the table names.
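
If your device IDs are already sitting in a table, the same cross join can populate a target table directly. The following is only a sketch; the table names devices and device_resolutions and the column names are assumptions you would adapt to your own schema:

-- Assumed source table: devices(device_id integer)
-- One output row per (device_id, resolution) pair
create table device_resolutions as
  select d.device_id, vr.video_resolution
  from devices d cross join (values
    ('360p'),
    ('480p'),
    ('540p'),
    ('720p'),
    ('1080p')) as vr(video_resolution);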

CRAFTY DBA

In Apache Spark, there is the idea of a Hive table, but it is really schema-on-read, meaning that the data is actually stored in files. Before the Delta format, these formats were read-only: you had to rebuild the table to pull in new files.

In 2018 Databricks introduced the Delta format. Yes, it supports ACID features at the cost of additional Parquet files and log files. This is important since you can execute INSERT, UPDATE and DELETE; however, at the volume you have above, I would caution you about the time needed to do the 300 M inserts.

It is important to understand the difference between managed (internal) and unmanaged (external) Hive tables. In addition, storage can be accessed via a URL or mounted. The examples below show managed vs. unmanaged tables.

External tables in Azure are stored in Azure Data Lake Storage, whose capacity is large. On the other hand, internal tables are stored within the managed storage of the service, and this space is limited.

First, we need to create a database (schema) in Spark.

%sql    
--
--  Create a new database
--
 
-- Drop existing
DROP DATABASE IF EXISTS sparktips CASCADE;

-- Create new
CREATE DATABASE sparktips;

The next section of code creates an external table using mounted ADLS storage.

%sql    
--
--  Unmanaged csv hive table
--
 
-- Drop existing
DROP TABLE IF EXISTS sparktips.ext_devices;

-- Create new
CREATE EXTERNAL TABLE sparktips.ext_devices
(
  device_id int,
  video_resolution string
)
USING CSV
LOCATION '/lake2022/bronze/devices';
 

The next section of code creates an internal table using storage from the service.

%sql    
--
--  Managed csv hive table
--
 
-- Drop existing
DROP TABLE IF EXISTS sparktips.int_devices;

-- Create new
CREATE TABLE sparktips.int_devices
(
  device_id int,
  video_resolution string
);
 

However, you now have to insert the 300 M rows into the managed table, versus having 300 M rows in several CSV files that are represented by an unmanaged table.
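
As a rough sketch of that insert in Spark SQL, the cross join from the first answer carries over almost unchanged. The source table sparktips.device_ids holding the raw device IDs is an assumption; substitute your own table:

%sql
--
-- Sketch only: populate the managed table
-- Assumes a source table sparktips.device_ids(device_id int)
--
INSERT INTO sparktips.int_devices
SELECT d.device_id, r.video_resolution
FROM sparktips.device_ids AS d
CROSS JOIN (
  VALUES ('360p'), ('480p'), ('540p'), ('720p'), ('1080p')
) AS r(video_resolution);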

In short, I have never had issues with 300 M rows. Just size your cluster accordingly.

This post was tagged PostgreSQL. Maybe that was incorrect? Yes, as in any RDBMS, you can create the table and load the data with your favorite ETL tool.

Please see my blog for more Spark articles.

Ariel Zhao

It turns out I found a function in Spark SQL: LATERAL VIEW EXPLODE() helped me generate all video_resolution rows for each device_id!
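
For anyone who lands here later, a minimal sketch of that approach in Spark SQL might look like the following; the source table device_ids holding one row per device is an assumption, so adjust the name:

-- Sketch: expand each device_id into one row per resolution
-- Assumes a source table device_ids(device_id int)
SELECT d.device_id, r.video_resolution
FROM device_ids d
LATERAL VIEW EXPLODE(ARRAY('360p', '480p', '540p', '720p', '1080p')) r AS video_resolution;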