I am relatively new to distributed computing, so forgive me if I misunderstand some of the basic concepts here. I am looking for a (preferably) Python-based alternative to Hadoop for processing large data sets via MapReduce on a cluster using an SGE-based grid engine (e.g. OpenGrid or Sun Grid Engine). I have had good luck running basic distributed jobs with PythonGrid, but I'd really like a more feature-rich framework for running my jobs. I have read up on tools like Disco and MinceMeatPy, both of which seem to offer true Map-Sort-Reduce job processing, but there does not seem to be any obvious support for SGE in either. This makes me wonder whether it is possible to achieve true MapReduce functionality with a grid scheduler, or whether existing tools simply don't support it out of the box because grid schedulers are not frequently used for this. Can you perform Map-Sort-Reduce tasks on a grid engine? Are there Python tools that support this? How difficult would it be to rig existing MapReduce tools to use SGE job schedulers?
Python MapReduce on Sun Grid Engine
1.1k Views, asked by woemler

There is 1 best solution below
I've heard that Jug works. It uses the filesystem to coordinate among the parallel tasks. In that kind of framework, you write your code, run "jug status primes.py" on your login machine to check the task list, then start a grid array job with as many workers as you like, all running "jug execute primes.py".
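As a sketch, the submission might look like this (the script name primes.py and the worker count are placeholders, and the qsub flags assume a fairly standard SGE setup with a shared filesystem):

```shell
# Submit 50 jug workers as an SGE array job; every task runs the same
# command, and jug uses lock files on the shared filesystem to make sure
# each task in the script is executed exactly once.
qsub -t 1-50 -b y -cwd -j y jug execute primes.py

# From the login node, watch tasks move from waiting to finished.
jug status primes.py
```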
mincemeat.py should be able to work in the same way, but it appears to use the network for coordination instead, so it may depend on whether your compute nodes can reach a server running the overall script.
There are several release notes that mention running actual Hadoop MapReduce and HDFS on SGE, but I haven't been able to find good documentation for it.
If you're used to Hadoop Streaming with Python, it's not too hard to replicate on SGE. I've had some success with this at work: I run one array job that does map + shuffle for each input file, then a second array job that does sort + reduce for each reducer number. The shuffle step just writes files to a network directory with names like mapper00000_reducer00000, mapper00000_reducer00001, and so on (one file for each pair of mapper and reducer numbers). Reducer 00001 then sorts all the files labeled reducer00001 together and pipes the result into the reducer code.
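The mapper-side partitioning described above can be sketched in a few lines; the function names here are illustrative, not from any particular library. The key point is that the partitioning must be deterministic across mapper processes on different hosts, so a stable hash (not Python's per-process-salted built-in hash()) decides which reducer gets each key:

```python
import hashlib

def reducer_for(key, num_reducers):
    # Stable hash partitioning: every mapper sends a given key to the
    # same reducer number, regardless of which host the mapper runs on.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_reducers

def shuffle_filename(mapper_id, key, num_reducers):
    # Matches the mapperNNNNN_reducerNNNNN naming scheme described above.
    return "mapper%05d_reducer%05d" % (mapper_id, reducer_for(key, num_reducers))
```

The sort + reduce array task for reducer 00001 then amounts to something like `sort mapper*_reducer00001 | python reducer.py`, mirroring Hadoop Streaming's sorted-input contract.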
Unfortunately, Hadoop streaming isn't very full-featured.