I am totally new to Amazon Elastic MapReduce. I want to use my custom scheduler, which is based on the Hadoop capacity scheduler, to schedule my jobs on Amazon Elastic MapReduce.
As I understand it, I can define a single stage in the job flow and submit my custom jar file over an SSH connection to the master node. However, I cannot find out how to edit the XML configuration files, such as capacity-scheduler.xml, on the master node. Does anyone know how to do that?
Moreover, if I want to add dynamic sizing, can I change the number of task nodes in the cluster while a job is running? Or must the cluster size stay the same within each stage? Thank you so much.
You should use a bootstrap action to change the Hadoop configuration.
The following AWS doc describes the Hadoop configuration bootstrap action:
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/emr-plan-bootstrap.html#PredefinedbootstrapActions_ConfigureHadoop
This blog article that I bookmarked also has some info. http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/
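To make the bootstrap-action idea concrete, here is a minimal sketch in Python (boto3 style, not the Java SDK linked further down). It builds one bootstrap-action entry for the predefined configure-hadoop script, using its `-m` option for mapred-site.xml key/value pairs. The helper name `configure_hadoop_action` and the scheduler class `com.example.MyCapacityScheduler` are made up for illustration. As far as I know, the predefined script does not cover capacity-scheduler.xml itself, so that file would probably need a custom bootstrap script of your own, plus a step that puts your scheduler jar on the classpath.

```python
def configure_hadoop_action(mapred_settings):
    """Build one bootstrap-action entry for the predefined configure-hadoop script.

    mapred_settings maps mapred-site.xml property names to values.
    """
    args = []
    for key, value in sorted(mapred_settings.items()):
        # Each "-m key=value" pair is passed to configure-hadoop,
        # which applies it to mapred-site.xml.
        args += ["-m", f"{key}={value}"]
    return {
        "Name": "Configure Hadoop",
        "ScriptBootstrapAction": {
            "Path": "s3://elasticmapreduce/bootstrap-actions/configure-hadoop",
            "Args": args,
        },
    }


# Hypothetical scheduler class name; replace with your own implementation.
action = configure_hadoop_action(
    {"mapred.jobtracker.taskScheduler": "com.example.MyCapacityScheduler"}
)
print(action["ScriptBootstrapAction"]["Args"])
```

The resulting dictionary has the shape that boto3's `run_job_flow` expects in its `BootstrapActions` list, so it could be passed as `BootstrapActions=[action]` when the cluster is created.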
For changing the cluster size dynamically, one option is to use the AWS SDK.
http://docs.aws.amazon.com/ElasticMapReduce/latest/DeveloperGuide/calling-emr-with-java-sdk.html
Using the modifyInstanceGroups method of the following interface, you can change the instance count of an instance group. http://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/elasticmapreduce/AmazonElasticMapReduce.html
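As a rough sketch of the resize call, here is the same operation through boto3 (the Python SDK) rather than the Java SDK linked above. The helper names `resize_request` and `resize_task_group`, the default region, and the example IDs are assumptions for illustration. The call can be made while the cluster is running, which is what makes dynamic sizing of the task group possible.

```python
def resize_request(cluster_id, instance_group_id, new_count):
    """Build the keyword arguments for EMR's ModifyInstanceGroups call."""
    if new_count < 0:
        raise ValueError("instance count cannot be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ],
    }


def resize_task_group(cluster_id, instance_group_id, new_count, region="us-east-1"):
    """Ask EMR to change the size of one instance group (needs AWS credentials)."""
    import boto3  # imported here so the request builder works without boto3 installed

    emr = boto3.client("emr", region_name=region)
    emr.modify_instance_groups(
        **resize_request(cluster_id, instance_group_id, new_count)
    )


# Placeholder IDs; real ones look like "j-..." for clusters and "ig-..." for groups.
request = resize_request("j-EXAMPLE", "ig-EXAMPLE", 5)
print(request)
```

The instance group ID of the task group can be looked up with `list_instance_groups` on the same client before calling `resize_task_group`.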