Unable to insert into database using org.postgresql.copy.CopyManager when the data has japanese characters

1.8k Views Asked by At

I have been breaking my head over the past few hours at trying to figure out what's wrong with my code. This piece of code was all working fine until i received a file which had japanese characters in it. Notepad++ and even some online utility tools say that the encoding of the file is UTF-8. Notepad says its UTF-8-BOM. I have read my data from the file and I have processed it and finally want to write it out to the database.

I get the error org.postgresql.util.PSQLException: ERROR: invalid byte sequence for encoding "UTF8": 0xee My database encoding is UTF8 only..

package citynet.dta.pump;

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

import org.postgresql.copy.CopyManager;
import org.postgresql.core.BaseConnection;

import citynet.common.ServerException;

public class TestEncoding {

public static void main(String[] args) {
    byte[] bytes = null;
    try {
        //use the below sql to create table 'testtable'
        // create table testtable (text1 character varying, text2 character varying,text3 character varying)
        try (ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            DataOutputStream out = new DataOutputStream(baos);

            out.writeBytes("INR,字仮名交じり文,3255104BTK1");

            bytes = baos.toByteArray();
        }
        Class.forName("org.postgresql.Driver");
        Connection c = DriverManager.getConnection("jdbc:postgresql://server:5432/dbname", "username", "password");
        if (bytes != null) {
            try (ByteArrayInputStream input = new ByteArrayInputStream(bytes)) {
                String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8' ";
                BaseConnection pgcon = (BaseConnection) c;
                CopyManager mgr = new CopyManager(pgcon);
                try {
                    mgr.copyIn(sql, input);

                } catch (SQLException ex) {
                    throw new ServerException("Error while copying data in Postgres DB:" + ex);

                }
            }
        }
    } catch (Exception e) {
        System.out.println("Error:" + e);
    }
  }
}
1

There are 1 best solutions below

5
Vladimir Sitnikov On BEST ANSWER

The issue is DataOutputStream#writeBytes("INR,字仮名交じり文,3255104BTK1") is not doing what you expect.

  1. You should refrain from using BaseConnection as it is an internal class. Application code should use PGConnection

  2. Here's how you get CopyManager:

    Connection con = ...;
    PGConnection pgcon = con.unwrap(org.postgresql.PGConnection.class);
    CopyManager mgr = pgcon.getCopyAPI();
    
  3. The source of your data might be different so there are multiple ways to execute copyAPI.

    If you want to convert String to UTF-8 bytes via your own code, then you need getBytes.

    String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8' ";
    byte[] bytes = "INR,字仮名交じり文,3255104BTK1".getBytes(StandardCharsets.UTF_8);
    mgr.copyIn(sql, new ByteArrayInputStream(bytes));
    

    Note: there's no need to close ByteArrayInputStream (see its Javadoc).

  4. If you need to stream a CSV file to the database, you might use FileInputStream:

    try (InputStream fis = new FileInputStream("file.csv")) {
      mgr.copyIn(sql, fis);
    }
    
  5. If you want to build the contents incrementally, then you might use ByteArrayOutputStream + OutputStreamWriter

    Note: all the rows would need to fit in the memory otherwise you get OutOfMemoryError.

    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (OutputStreamWriter wr = new OutputStreamWriter(baos, StandardCharsets.UTF_8)) {
      // Write 10 rows
      for (int i = 0; i < 10; i++) {
        wr.write("INR,字仮名交じり文,3255104BTK1\n");
      }
    }
    
    String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null' ENCODING 'UTF8'";
    mgr.copyIn(sql, new ByteArrayInputStream(baos.toByteArray()));
    
  6. An alternative option is to use Reader

    Note: encoding is not specified, and it is using connection-default encoding (which is utf-8 in 99.42% of the cases since the driver defaults to utf-8 connection encoding).

    String sql = "COPY testtable  FROM stdin delimiter ','  NULL AS 'null'";
    mgr.copyIn(sql, new StringReader("INR,字仮名交じり文,3255104BTK1"));
    
  7. Yet another alternative is to use copyIn(String sql, ByteStreamWriter from) API which might be more efficient for certain use-cases (e.g. all the data is in-memory, and you know the number of bytes you are going to write)