BigQuery streaming insert takes minutes for a write-heavy application


I have a write-heavy Spring Boot application that integrates with BigQuery under heavy load, and some entries take up to 10 minutes to insert. Here is my configuration:

Number of entries stored: 1 million/min
Number of pods: 100
Insertion type: streaming (using JsonStreamWriter)
Deployment cloud: Azure
Average time taken per insert: 650 ms
Max time taken: 22 min (for a single insert)
Number of threads per pod: 15

Each pod has a BigQuery connection and inserts into BigQuery. Since about 10% of the inserts take minutes, we are facing a lot of timeouts and performance issues. Is there an efficient way to write data to BigQuery under such a large load?

We use the following Google client libraries:

    <dependency>
        <groupId>com.google.cloud</groupId>
        <artifactId>google-cloud-storage</artifactId>
    </dependency>
    <dependency>
        <groupId>com.google.cloud</groupId>
        <artifactId>google-cloud-bigquerystorage</artifactId>
    </dependency>
    <dependency>
        <groupId>com.google.cloud</groupId>
        <artifactId>google-cloud-bigquery</artifactId>
        <exclusions>
            <exclusion>
                <groupId>org.slf4j</groupId>
                <artifactId>slf4j-api</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependencyManagement>
        <dependencies>
            <dependency>
                <groupId>com.google.cloud</groupId>
                <artifactId>libraries-bom</artifactId>
                <version>25.4.0</version>
                <type>pom</type>
                <scope>import</scope>
            </dependency>
        </dependencies>
    </dependencyManagement>



private void updateRequestMetadataOperations(JSONArray requestMetaDataArr) {

    JSONArray firstObjArr = new JSONArray();

    JSONObject firstTableJsonObj = new JSONObject();
    firstTableJsonObj.put("firstColumn", firstColumnVal);
    firstTableJsonObj.put("secondColumn", secondColumnVal);
    firstTableJsonObj.put("thirdColumn", thirdColumnVal);
    firstTableJsonObj.put("fourthColumn", fourthColumnVal);
    firstTableJsonObj.put("fifthColumn", fifthColumnVal);
    firstTableJsonObj.put("sixthColumn", sixthColumnVal);
    // ...
    firstTableJsonObj.put("twentyColumn", twentyColumnVal);

    firstObjArr.put(firstTableJsonObj);
}

public void insertIntoBigQuery(String tableName, JSONArray jsonArr) throws Exception {
    if (jsonArr.length() == 0) {
        return;
    }
    // Look up the cached JsonStreamWriter for this table and append the rows.
    JsonStreamWriter jsonStreamWriter = JsonStreamWriterUtil.getWriteStreamMap(tableName);
    if (jsonStreamWriter != null) {
        // append() returns an ApiFuture<AppendRowsResponse>; it is not awaited here.
        jsonStreamWriter.append(jsonArr);
    }
}



public JsonStreamWriter createWriteStream(String table) throws IOException, Descriptors.DescriptorValidationException, InterruptedException {
      BigQueryWriteClient bqClient = BigQueryWriteClient.create();
      WriteStream stream = WriteStream.newBuilder().setType(WriteStream.Type.COMMITTED).build();
      TableName tableName = TableName.of("ProjectId", "DataSet", table);
      CreateWriteStreamRequest createWriteStreamRequest =
              CreateWriteStreamRequest.newBuilder()
                      .setParent(tableName.toString())
                      .setWriteStream(stream)
                      .build();
      WriteStream writeStream = bqClient.createWriteStream(createWriteStreamRequest);

      JsonStreamWriter jsonStreamWriter = JsonStreamWriter
              .newBuilder(writeStream.getName(), writeStream.getTableSchema())
              .build();
      return jsonStreamWriter;
  }

1 Answer


In general, BigQuery streaming insert is meant for small, real-time data updates; it is light and fast. Batch loading, on the other hand, accepts file uploads and is meant for larger and heavier updates. The BigQuery Storage Write API is a unified data-ingestion API for BigQuery: it combines streaming ingestion and batch loading into a single high-performance API.
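For the batch/load path, a minimal sketch could look like the following, assuming your rows are already staged as newline-delimited JSON files in Cloud Storage and using the google-cloud-bigquery client from the BOM above. The dataset, table, and gs:// URI are placeholders.

    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.FormatOptions;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.LoadJobConfiguration;
    import com.google.cloud.bigquery.TableId;

    // Sketch: ingest staged newline-delimited JSON files in one bulk load job
    // instead of streaming individual rows.
    public void loadStagedFiles() throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId tableId = TableId.of("DataSet", "my_table");          // placeholder
        String sourceUri = "gs://my-bucket/staged/requests-*.json";   // placeholder

        LoadJobConfiguration config =
                LoadJobConfiguration.newBuilder(tableId, sourceUri, FormatOptions.json())
                        .build();

        Job job = bigquery.create(JobInfo.of(config));
        job = job.waitFor(); // blocks until the load job finishes
        if (job.getStatus().getError() != null) {
            throw new RuntimeException("Load failed: " + job.getStatus().getError());
        }
    }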

Is there an efficient way to write data in BigQuery with such large loads?

Load jobs are optimized for bulk operations, resulting in much higher throughput and lower loading times. Also note that a large payload can lead to a slow insert, especially if it comes from an outside network with additional latency. If you need to gain speed, you can opt for an asynchronous approach, and you can always consider using a message bus like Pub/Sub with Dataflow to write into BigQuery.

The Storage Write API is a gRPC API that uses bidirectional connections. The AppendRows method creates a connection to a stream. Generally, a single connection supports at least 1 MB/s of throughput. The upper bound depends on several factors, such as network bandwidth, the schema of the data, and server load, but can exceed 10 MB/s. If you require more throughput, create more connections. You can also refer to the BigQuery quotas and limits documentation.
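As a minimal sketch of the asynchronous approach, assuming your existing JsonStreamWriter and an org.json JSONArray of rows, you can attach a callback to the ApiFuture returned by append() instead of blocking the inserting thread:

    import com.google.api.core.ApiFuture;
    import com.google.api.core.ApiFutureCallback;
    import com.google.api.core.ApiFutures;
    import com.google.cloud.bigquery.storage.v1.AppendRowsResponse;
    import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;
    import com.google.common.util.concurrent.MoreExecutors;
    import org.json.JSONArray;

    // Sketch: append rows without blocking; the callback handles success/failure.
    public void appendAsync(JsonStreamWriter writer, JSONArray rows) throws Exception {
        ApiFuture<AppendRowsResponse> future = writer.append(rows);
        ApiFutures.addCallback(
                future,
                new ApiFutureCallback<AppendRowsResponse>() {
                    @Override
                    public void onSuccess(AppendRowsResponse response) {
                        // Rows were appended; log or update metrics here.
                    }

                    @Override
                    public void onFailure(Throwable t) {
                        // Retry the batch or route it to a dead-letter store.
                    }
                },
                MoreExecutors.directExecutor());
    }

This keeps the 15 worker threads per pod free to build the next batch while appends are in flight, instead of tying each thread up for the full round trip.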

Your createWriteStream method seems fine as per the GCP code example.
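If a single writer per pod becomes the bottleneck, one way to "create more connections" is to keep a small pool of writers per table, since each JsonStreamWriter generally maintains its own connection. The pool size is an arbitrary choice, and initPool/nextWriter below are hypothetical helpers built on the createWriteStream method shown above:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.atomic.AtomicLong;
    import com.google.cloud.bigquery.storage.v1.JsonStreamWriter;

    // Sketch: a small pool of writers spreads appends across several connections.
    private final List<JsonStreamWriter> writerPool = new ArrayList<>();
    private final AtomicLong counter = new AtomicLong();

    public void initPool(String table, int poolSize) throws Exception {
        for (int i = 0; i < poolSize; i++) {
            writerPool.add(createWriteStream(table)); // reuses the method shown above
        }
    }

    public JsonStreamWriter nextWriter() {
        // Round-robin selection over the pool.
        int idx = (int) (counter.getAndIncrement() % writerPool.size());
        return writerPool.get(idx);
    }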