I am a data scientist working on anomaly detection over system logs, and I want to experiment with the DARPA TC Engagement 5 dataset. I have downloaded the included scripts to import the data, and right now they are parsing it and storing everything in a single huge ndjson file.
What's my end goal?
Use PySpark for distributed computation and storage of the huge ndjson file on the cluster (so I can delete the ndjson and then only use the distributed data).
What's my issue?
Due to my lack of experience with deploying data like this, I don't know how to get it onto the cluster. I was thinking of just reading the huge file with PySpark, repartitioning it, and saving it as Parquet, but that seems very basic and inefficient.
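For reference, this is roughly what I had in mind; the app name, paths, and partition count below are just placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tc-e5-ingest").getOrCreate()

# Read the single huge ndjson file (one JSON record per line).
# The path is a placeholder for wherever the parsed output ends up.
df = spark.read.json("file:///data/tc_e5/theia.ndjson")

# Spread the records across the cluster; 200 is an arbitrary guess,
# a sensible value depends on the file size and executor count.
df = df.repartition(200)

# Persist on the cluster as Parquet so the local ndjson can be deleted afterwards.
df.write.mode("overwrite").parquet("hdfs:///data/tc_e5/theia_parquet")
```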
What else do I need? (nice to have)
While optional, it would be nice to know of a better way to decompress and parse the DARPA TC Engagement 5 dataset (Theia in this case). The included JAR parses the data and streams it into an Elasticsearch service; since I don't have one, I have to modify the Java class responsible for the streaming so that it writes to a local file instead (on the cluster). There must be a better and faster way to do that.
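For completeness, once the Parquet copy is written I was planning a quick sanity check before deleting the ndjson, something along these lines (same placeholder paths as above):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tc-e5-verify").getOrCreate()

# Same placeholder paths as in the ingest snippet above.
raw = spark.read.json("file:///data/tc_e5/theia.ndjson")
parquet = spark.read.parquet("hdfs:///data/tc_e5/theia_parquet")

# Simple consistency check: row counts should match before the
# original ndjson file is removed.
assert raw.count() == parquet.count()
```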
Thanks in advance for the help, much appreciated.