I have a standalone Spark cluster running on a virtual machine on my computer. Spark Streaming gets data from Kafka, saves it to an HBase table, then processes it and saves the result to another table.
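For context, the streaming layer is shaped roughly like this (a heavily simplified sketch; the topic name, table name, and column family below are placeholders rather than my real ones):

import java.util.*;

import kafka.serializer.StringDecoder;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import scala.Tuple2;

public class Stream {
    public static void main(String[] args) throws Exception {
        SparkConf conf = new SparkConf().setAppName("Streaming");
        JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // Direct Kafka stream over the raw-data topic
        Map<String, String> kafkaParams = new HashMap<>();
        kafkaParams.put("metadata.broker.list", "localhost:9092");
        Set<String> topics = Collections.singleton("raw");
        JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
                jssc, String.class, String.class,
                StringDecoder.class, StringDecoder.class, kafkaParams, topics);

        // Write every received record into the raw-data HBase table,
        // using the Kafka message key as the row key
        messages.foreachRDD(new VoidFunction<JavaPairRDD<String, String>>() {
            @Override
            public void call(JavaPairRDD<String, String> rdd) throws Exception {
                rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, String>>>() {
                    @Override
                    public void call(Iterator<Tuple2<String, String>> records) throws Exception {
                        Connection conn =
                                ConnectionFactory.createConnection(HBaseConfiguration.create());
                        Table table = conn.getTable(TableName.valueOf("raw_data"));
                        while (records.hasNext()) {
                            Tuple2<String, String> record = records.next();
                            Put put = new Put(Bytes.toBytes(record._1()));
                            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("value"),
                                    Bytes.toBytes(record._2()));
                            table.put(put);
                        }
                        table.close();
                        conn.close();
                    }
                });
            }
        });

        jssc.start();
        jssc.awaitTermination();
    }
}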
A Spark batch job queries the table of processed results for the latest entry and uses that to determine which data to query from the unprocessed-data table. The batch job sits in an infinite while loop, so it restarts as soon as it finishes. Both it and the streaming job have the scheduler set to FAIR.
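Fair scheduling is set through the Spark configuration in both drivers, and the batch driver loops roughly like this (a simplified sketch; the two helper methods are placeholders standing in for the real HBase scan and processing code):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class Batch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("Batch")
                .set("spark.scheduler.mode", "FAIR"); // same setting in the streaming driver
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Restart the batch as soon as it finishes
        while (true) {
            // Find the latest processed entry, then process everything newer than it
            String latestProcessed = readLatestResult(sc);
            processNewData(sc, latestProcessed);
        }
    }

    // Placeholders for the real scan/processing code
    private static String readLatestResult(JavaSparkContext sc) { return "0"; }
    private static void processNewData(JavaSparkContext sc, String lowBoundary) { }
}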
I have a client app that runs all of this in the proper order: it first streams the generated data into Kafka, then launches the streaming layer on a separate thread, and after a certain delay launches the batch layer on another thread.
My issue is that the streaming job runs without complaint, using 2 of the 3 provided cores, but once the batch job starts, the stream still reports that it is running, yet the HBase tables clearly show that while the batch jobs write to their table, the streaming jobs stop writing anything. The streaming logs also pause while this is happening.
This is how I set up the threads to be run:
Runnable batch = new Runnable() {
    @Override
    public void run() {
        try {
            Lambda.startBatch(lowBoundary, highBoundary);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
};
Thread batchThread = new Thread(batch);
batchThread.start();
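The streaming layer is launched from an identical runnable, and the batch thread is started only after the delay, roughly like this (the streamingThread variable and the sleep value here are just illustrative):

// Streaming goes first; the batch thread starts after a delay
streamingThread.start();
Thread.sleep(30000); // illustrative delay before launching the batch layer
batchThread.start();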
Both the batch and the streaming layers are started through ProcessBuilder, like this:
public static void startBatch(String low, String high) throws Exception {
    // Specify executable path
    String sparkSubmit = "/home/lambda/Spark/bin/spark-submit";
    // Describe the process to be run
    ProcessBuilder batch = new ProcessBuilder(sparkSubmit,
            "--class", "batch.Batch", "--master",
            "spark://dissertation:7077",
            "/home/lambda/Downloads/Lambda/target/lambda-1.0-jar-with-dependencies.jar",
            low, high);
    // Start the batch layer
    batch.start();
}
Does anyone have an idea why this is happening? I suspect Spark simply isn't scheduling the tasks the way I want it to, but I have no idea what to do about it.