Using univocity to parse two different csv files and write into new csv file

1.2k Views Asked by At

I always use univocity parser in my java program to compare csv files. It works excellent and is much faster.

But the problem is, this time am trying to parse two different large volume csv files with complex values and print the difference in new csv file,

Looking into one of the authors examples I tried to use processFile after reading file1 into list then converting to map, still I get error while parsing.

Below are my sample input and expected output file.

INPUT - file1

"h1","h2","h3","h4","h5"
"00000","US","9503.00.0089","USA","9503.0089"
"","EU","9503.00.7000","EUROPEAN UNION","9503.00.7000"
"#1200","US","5601.22.0010","USA","5601.22.0010"
"0180691","US","9503.00.0073","USA","9503.00.0073"
“DRTY01”,”CA”,”9603.01.0088”,”CAN”,”9603.01.0088”

INPUT - file2

"h1","h2","h3","h6","h7","h8","h9","h10",h11 
"018890","US","","2015","101","1","1","All",””
"00000","US","9503.00.0090","1986","101","1","1","All","9503.00.0090"
"0180691","US","9503.00.0073","2019","101","1","1","All","9503.00.0073”
“DRTY01”,”CA”,”9603.01.0087”,”2002”,”102”,”1”,”2”,”CA”, “9603.01.0087”

Selecting h1, h2 common values in file1 and file2 then comparing h3 of file1 to h3 of file2 , if both files h3 are not equal then I want to print “h1”,”h4”,” h10”,”h5”, ”h11”,”h6”,”h7”,”h8”,”h9” to file3

OUTPUT - file3

“h1”,”h4”,” h10”,”h5”, ”h11”,”h6”,”h7”,”h8”,”h9”
"00000","USA”,”All”,”9503.00.0089”,”9503.00.0090”, "1986","101","1","1"   
"DRTY01”,“CAN”,”CA”,”9603.01.0088”,“9603.01.0087”,”2002”,”102”,”1”,”2”
1

There are 1 best solutions below

4
On BEST ANSWER

I have a solution for your problem but please do regress testing. So what I'm assuming is that h1 and h2 combined would be a unique value. I'm creating a HashMap with a map of as key and the entire row of the csv file as value. We will override the hashcode and equals method of the created class like:

  • The hashcode will only use h1 and h2 to generate the code(because they are surely unique)
  • The equals we will use h3 also as a comparison condition which will return false when the two h3 are same.

The Logic in equals will be - if both h1 and h2 are same in map1 and map2 while h3 is different give me the row from map1 and map2. This logic uses additional space in maps but the overall computation logic is reduced to O(N). The below code will give you the rows you want from the maps.I have not performed IO and exception handling properly, please take care of them accordingly.

The Test class

public class UnivocityTest
{

    public static void main(String[] args) throws FileNotFoundException
    {
        // Get data from csv file1
        List<String[]> f1 = getData("example.csv");
        // Get data from csv file2


       List<String[]> f2 = getData("example1.csv");

        // Convert data to a Map with HeaderList class and entire row.
        Map<HeaderList, String[]> map1 = convertAndReturn(f1);
        Map<HeaderList, String[]> map2 = convertAndReturn(f2);

        //Currently prints the required rows.
        compareData(map1, map2);
    }

    // Convert csv to List<String[]>
    private static List<String[]> getData(String file) throws FileNotFoundException
    {
        CsvParserSettings parserSettings = new CsvParserSettings();
        parserSettings.setLineSeparatorDetectionEnabled(true);
        RowListProcessor rowProcessor = new RowListProcessor();
        parserSettings.setProcessor(rowProcessor);
        parserSettings.setHeaderExtractionEnabled(true);

        CsvParser parser = new CsvParser(parserSettings);
        parser.parse(getReader(file));
        // String[] headers = rowProcessor.getHeaders();
        List<String[]> rows = rowProcessor.getRows();

        return rows;
    }

    // get reader object
    private static Reader getReader(String string) throws FileNotFoundException
    {
        // TODO Add proper file handling and exception handling
        return new FileReader(new File(string));
    }

    // Return HashMap
    private static Map<HeaderList, String[]> convertAndReturn(List<String[]> f1)
    {
        Map<HeaderList, String[]> map = new java.util.HashMap<>();

        for (String[] each : f1)
        {
            // For each row in csv create a corresponding HeaderList object with h1,h2 and h3 as key
            // and row as value.
            HeaderList header = new HeaderList(each[0], each[1], each[2]);
            map.put(header, each);
        }

        return map;
    }

    private static void compareData(Map<HeaderList, String[]> map1, Map<HeaderList, String[]> map2)
    {
        // Iterates over the map1 keys one by one. For each key we check if there is a matching key
        // in map2. The matching condition will be h1 and h2 should be same while h3 should be
        // different. Once a key like that is found currently I'm printing both the rows, here you
        // can get the rows you want from the map and return them.

        for (HeaderList each : map1.keySet())
        {
            if (map2.containsKey(each))
            {
//TODO Assume you want columns h3,h4 from file1 and h6  h7 from file2.
                //We know map1 represents file1 with columns h3 and h4 at positions 2 and 3 inside the String[]
                //We know map2 represents file1 with columns h6 and h7 at positions 3 and 4 inside the String[]
                String h3FromFile1 = map1.get(each)[2];
                String h4FromFile1 = map1.get(each)[3];
                String h6FromFile2 = map2.get(each)[3];
                String h7FromFile2 = map2.get(each)[4];
                System.out.println("Required Columns: ");
                System.out.println("h3 file1: "+ h3FromFile1);
                System.out.println("h4 file1: "+ h4FromFile1);
                System.out.println("h6 file2: "+ h6FromFile2);
                System.out.println("h7 file2: " + h7FromFile2);
                System.out.println(Arrays.toString(map1.get(each)));
                System.out.println(Arrays.toString(map2.get(each)));
                System.out.println("-------------------------------");
            }
        }
    }

}

The bean class which will have the three column h1,h2,h3:

class HeaderList
        {

            private String h1;

            private String h2;

            private String h3;

            public HeaderList(String h1, String h2, String h3)
            {
                super();
                this.h1 = h1;
                this.h2 = h2;
                this.h3 = h3;
            }

            /**
             * The hash code method which generate same hashkey for h1 and h2.
             * 
             * @inheritDoc
             */
            @Override
            public int hashCode()
            {
                final int prime = 31;
                int result = 1;
                result = prime * result + ((h1 == null) ? 0 : h1.hashCode());
                result = prime * result + ((h2 == null) ? 0 : h2.hashCode());
                return result;
            }

            /**
             * The equals method assumes each csv file row will be uniquely identified my h1 and h2
             * combined. Please see if h1 and h2 cannot be uniquely identified then it may lead to data
             * loss. For h3 we return true only for same values.
             * 
             * @inheritDoc
             */
            @Override
            public boolean equals(Object obj)
            {
                if (this == obj)
                    return true;
                if (obj == null)
                    return false;
                if (getClass() != obj.getClass())
                    return false;
                HeaderList other = (HeaderList) obj;
                if (h1 == null)
                {
                    if (other.h1 != null)
                        return false;
                }
                else if (!h1.equals(other.h1))
                    return false;
                if (h2 == null)
                {
                    if (other.h2 != null)
                        return false;
                }
                else if (!h2.equals(other.h2))
                    return false;
                if (h3 == null)
                {
                    if (other.h3 == null)
                        return false;
                }
                else if (h3.equals(other.h3))
                    return false;
                return true;
            }

            /**
             * @inheritDoc
             */
            @Override
            public String toString()
            {
                return "HeaderList [h1=" + h1 + ", h2=" + h2 + ", h3=" + h3 + "]";
            }

        }

Output for the given input csv files:

Required Columns: 
h3 file1: 9603.01.0088
h4 file1: CAN
h6 file2: 2002
h7 file2: 102
[DRTY01, CA, 9603.01.0088, CAN, 9603.01.0088]
[DRTY01, CA, 9603.01.0087, 2002, 102, 1, 2, CA, 9603.01.0087]
-------------------------------
Required Columns: 
h3 file1: 9503.00.0089
h4 file1: USA
h6 file2: 1986
h7 file2: 101
[00000, US, 9503.00.0089, USA, 9503.0089]
[00000, US, 9503.00.0090, 1986, 101, 1, 1, All, 9503.00.0090]
-------------------------------