I am relatively new to distributed computing, so forgive me if I misunderstand some of the basic concepts here. I am looking for a (preferably) Python-based alternative to Hadoop for processing large data sets via MapReduce on a cluster using an SGE-based grid engine (e.g. OpenGrid or Sun Grid Engine). I have had good luck running basic distributed jobs with PythonGrid, but I'd really like a more feature-rich framework for running my jobs.
I have read up on tools like Disco and MinceMeatPy, both of which seem to offer true Map-Sort-Reduce job processing, but there does not seem to be any obvious support for SGE. This makes me wonder whether it is possible to achieve true MapReduce functionality using a grid scheduler, or whether these tools just don't support grid schedulers out of the box because they are not frequently used for this.
Can you perform Map-Sort-Reduce tasks on a grid engine? Are there Python tools that support this? How difficult would it be to rig existing MapReduce tools to use SGE job schedulers?
Python MapReduce on Sun Grid Engine
1.1k Views Asked by woemler At
I've heard that Jug works. It uses the filesystem to coordinate the parallel tasks. With that kind of framework, you write your code, run "jug status primes.py" on the machine you're on, and then start a grid array job with as many workers as you like, all running "jug execute primes.py".
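As a rough sketch of that last step, this is how I would build the array-job submission from Python. The function name, the job name, and the worker count of 20 are made up for illustration. The qsub options are the standard SGE ones, but check them against your site's setup. It also assumes "jug" is on the PATH of the compute nodes and that the working directory is on a shared filesystem.

```python
import shlex


def jug_array_job_command(jugfile, num_workers, job_name="jug_workers"):
    """Build a qsub command that starts num_workers identical jug workers.

    Hypothetical helper: it only builds the argument list, it does not submit.
    """
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [
        "qsub",
        "-N", job_name,            # job name shown in qstat
        "-t", f"1-{num_workers}",  # array job: one task per worker
        "-cwd",                    # run each task in the current directory
        "-b", "y",                 # the command is a binary, not a job script
        "jug", "execute", jugfile,
    ]


# Assumed example: 20 workers all executing the same jugfile.
cmd = jug_array_job_command("primes.py", 20)
print(shlex.join(cmd))
```

On a submit host you could pass the list to subprocess.run(cmd, check=True). Every worker runs the same command, so nothing here depends on the array task number.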
mincemeat.py should be able to work in the same way, but it appears to use the network for coordination, so whether it works may depend on whether your nodes can talk to the server running the overall script.
There are several release notes about running actual Hadoop MapReduce and HDFS on SGE, but I haven't been able to find good documentation.
If you're used to Hadoop streaming with Python, it's not too bad to replicate on SGE. I've had some success with this at work: I run an array job that does map + shuffle for each input file, then another array job that does sort + reduce for each reducer number. The shuffle part just writes files to a network directory with names like mapper00000_reducer00000, mapper00000_reducer00001, and so on (one file for every pair of mapper and reducer numbers). Reducer 00001 then sorts all files labeled reducer00001 together and pipes the result to the reducer code.
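Here is a minimal sketch of that scheme in plain Python. It is not my actual work code: the function names, the CRC32 partitioner, the tab-separated record format, and the word-count example are all assumptions for illustration. It also sorts in memory, whereas for large reducer inputs you would want an external sort. Keys must not contain tabs or newlines.

```python
import os
import zlib
from itertools import groupby
from pathlib import Path


def partition(key, num_reducers):
    # Use a stable hash so every mapper task sends a given key to the same
    # reducer (the built-in hash() of str is randomized per process).
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def map_and_shuffle(mapper_id, lines, map_fn, num_reducers, shuffle_dir):
    """Map the input lines and write one shuffle file per reducer."""
    shuffle_dir = Path(shuffle_dir)
    shuffle_dir.mkdir(parents=True, exist_ok=True)
    outputs = [
        open(shuffle_dir / f"mapper{mapper_id:05d}_reducer{r:05d}", "w")
        for r in range(num_reducers)
    ]
    try:
        for line in lines:
            for key, value in map_fn(line):
                outputs[partition(key, num_reducers)].write(f"{key}\t{value}\n")
    finally:
        for f in outputs:
            f.close()


def sort_and_reduce(reducer_id, reduce_fn, shuffle_dir):
    """Collect every mapper's file for this reducer, sort by key, reduce."""
    pairs = []
    pattern = f"mapper*_reducer{reducer_id:05d}"
    for path in sorted(Path(shuffle_dir).glob(pattern)):
        with open(path) as f:
            for line in f:
                key, value = line.rstrip("\n").split("\t", 1)
                pairs.append((key, value))
    pairs.sort(key=lambda kv: kv[0])
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(pairs, key=lambda kv: kv[0])
    ]


def word_count_map(line):
    # Example mapper: emit (word, 1) for every word on the line.
    for word in line.split():
        yield word, 1


def word_count_reduce(key, values):
    # Example reducer: values arrive as strings read back from the files.
    return key, sum(int(v) for v in values)


def task_index():
    # SGE sets SGE_TASK_ID (1-based) for each task of an array job;
    # default to the first task when run outside the scheduler.
    return int(os.environ.get("SGE_TASK_ID", "1")) - 1
```

In the first array job, each task would call map_and_shuffle(task_index(), ...) on its own input file. In the second, submitted so that it starts only after the first finishes (for example with -hold_jid), each task would call sort_and_reduce(task_index(), ...).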
Unfortunately, Hadoop streaming isn't very full-featured.