I am a data scientist tasked with anomaly detection on system logs, and I want to experiment with the DARPA TC Engagement 5 dataset. I have downloaded the included scripts to import the data, and the data is currently being parsed and stored in a single huge ndjson file.

What's my end goal?

Use PySpark for distributed computation and storage of the huge ndjson file's contents on the cluster (so that I can delete the ndjson file and work only with the distributed data).

What's my issue?

Due to my lack of experience with deploying fresh data like this, I don't know the right way to get it onto the cluster. My plan was simply to read the huge file with PySpark, repartition it, and save it as Parquet, but that seems very basic and inefficient.
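
For concreteness, here is a minimal sketch of that plan; the app name, paths, partition count, and compression codec are placeholders, not my real setup:

```python
from pyspark.sql import SparkSession

# Placeholder app name and paths, not my real configuration.
spark = SparkSession.builder.appName("tc-e5-ingest").getOrCreate()

# Read the single large ndjson file (one JSON record per line).
raw = spark.read.json("hdfs:///data/tc_e5/theia.ndjson")

# Repartition so downstream jobs get reasonably sized tasks,
# then persist as Parquet on the cluster.
(raw.repartition(200)
    .write.mode("overwrite")
    .option("compression", "snappy")
    .parquet("hdfs:///data/tc_e5/theia_parquet"))
```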

What else do I need? (nice to have; specific to the TC Engagement 5 dataset)

While optional, it would be nice to know of a better way to decompress and parse the DARPA TC Engagement 5 dataset (Theia in this case). The included JAR parses the data and streams it into an Elasticsearch service; since I don't have one, I have to modify the Java class responsible for the streaming so it writes to a local file (on the cluster) instead. There must be a better and faster way to do that.
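
For reference, once the modified consumer writes to files on the cluster, I would pick them up with Spark roughly like this. The directory layout and the schema fields below are guesses on my part (the real CDM records have many more fields); I only include an explicit schema because it lets Spark skip the schema-inference pass:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("tc-e5-ingest").getOrCreate()

# Hypothetical minimal schema; the actual records contain many more fields.
schema = StructType([
    StructField("uuid", StringType()),
    StructField("type", StringType()),
    StructField("timestampNanos", LongType()),
])

# If the modified consumer writes many smaller ndjson files into one
# directory, Spark can read them all in parallel.
events = spark.read.schema(schema).json("hdfs:///data/tc_e5/theia_parsed/*.ndjson")
events.write.mode("overwrite").parquet("hdfs:///data/tc_e5/theia_parquet")
```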

Thanks in advance for the help, much appreciated.
