How do I load a big CSV file into WSO2 ML?


I was trying to upload a 10 GB CSV file into WSO2 ML, but the upload failed with errors. I followed this link to increase the dataset size limit in WSO2 ML: https://docs.wso2.com/display/ML100/FAQ#FAQ-Isthereafilesizelimittomydataset?

I am running WSO2 ML on a PC with the following specifications:

  - 50 GB RAM
  - 8 cores

Thanks




When it comes to uploading datasets into WSO2 Machine Learner, we provide three options.

  1. Uploading files from your local file system. As you have mentioned, the maximum upload size is limited to 100 MB by default, and you can increase that limit by setting the -Dorg.apache.cxf.io.CachedOutputStream.Threshold option in your wso2server.dat file (a sketch follows below). We have tested this feature with a 1 GB file; however, for large files we don't recommend this option. The main use case of this functionality is to let users quickly try out machine learning algorithms with small datasets.
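As a rough sketch, the option is a JVM system property; the exact placement inside wso2server.dat and the value below (in bytes, roughly 100 MB) are assumptions, so check against the FAQ linked in the question:

    -Dorg.apache.cxf.io.CachedOutputStream.Threshold=104857600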

Since you are working with a large dataset, we recommend the following two approaches for uploading it into the WSO2 ML server.

  1. Upload the data using the Hadoop Distributed File System (HDFS). We have given a detailed description of how to use HDFS files in WSO2 ML in our documentation [1]. (A quick sanity check for the upload is sketched after this list.)

  2. If you have a WSO2 DAS instance up and running, you can integrate WSO2 ML with WSO2 DAS and simply point to a DAS table as the source type in WSO2 ML's "Create Dataset" wizard. For more details on integrating WSO2 ML with WSO2 DAS, please refer to [2].
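For the HDFS route, once your file has been copied up, you can confirm it is visible from the command line before pointing WSO2 ML at it; the hostname, port, and path here are placeholders matching the example in the second answer:

    hadoop fs -ls hdfs://hostname:8020/samples/data/wdbcSample.csv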

If you need more help regarding this issue, please let me know.

[1]. https://docs.wso2.com/display/ML100/HDFS+Support

[2]. https://docs.wso2.com/display/ML110/Integration+with+WSO2+Data+Analytics+Server


For those who want to use HDP (Hortonworks Data Platform) as the HDFS backend for loading a large dataset into WSO2 ML through the NameNode IPC port 8020, e.g. hdfs://hostname:8020/samples/data/wdbcSample.csv, you may first need to ingest the data file into HDFS using a Java client such as the following:

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploader {

    public static void main(String[] args) throws Exception {

        Configuration configuration = new Configuration();

        // Connect to the HDFS NameNode over IPC
        FileSystem hdfs = FileSystem.get(new URI("hdfs://hostname:8020"), configuration);
        Path dstPath = new Path("hdfs://hostname:8020/samples/data/wdbcSample.csv");

        // Remove any previous copy at the destination so the upload starts clean
        if (hdfs.exists(dstPath)) {
            hdfs.delete(dstPath, true);
        } else {
            System.out.println("No such destination ...");
        }

        // A local file path on the client side
        Path srcPath = new Path("wdbcSample.csv");

        try {
            hdfs.copyFromLocalFile(srcPath, dstPath);
            System.out.println("Done successfully ...");
        } catch (Exception ex) {
            ex.printStackTrace();
        } finally {
            hdfs.close();
        }
    }
}
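To compile and run a client like this, the Hadoop client libraries need to be on the classpath. One way, assuming the hadoop CLI is installed on the client machine (HdfsUploader is simply the wrapper class name used above):

    javac -cp "$(hadoop classpath)" HdfsUploader.java
    java -cp ".:$(hadoop classpath)" HdfsUploader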