I have a big sequence file storing the tfidf values for documents. Each line represents line and the columns are the value of tfidfs for each term (the row is a sparse vector). I'd like to pick the top-k words for each document using Hadoop. The naive solution is to loop through all the columns for each row in the mapper and pick the top-k but as the file becomes bigger and bigger I don't think this is a good solution. Is there a better way to do that in Hadoop?
How to efficiently find top-k elements?
829 Views Asked by HHH At
1
There are 1 best solutions below
Related Questions in HADOOP
- pcap to Avro on Hadoop
- schedule and automate sqoop import/export tasks
- How to diagnose Kafka topics failing globally to be found
- Only 32 bit available in Oracle VM - Hadoop Installation
- Using HDFS with Apache Spark on Amazon EC2
- How to get raw hadoop metrics
- How to output multiple values with the same key in reducer?
- Loading chararray from embedded JSON using Pig
- Oozie Pig action stuck in PREP state and job is in RUNNING state
- InstanceProfile is required for creating cluster - create python function to install module
- mapreduce job not setting compression codec correctly
- What does namespace and block pool mean in MapReduce 2.0 YARN?
- Hadoop distributed mode
- Building apache hadoop 2.6.0 throwing maven error
- I am using Hbase 1.0.0 and Apache phoenix 4.3.0 on CDH5.4. When I restart Hbase regionserver is down
Related Questions in MAPREDUCE
- pcap to Avro on Hadoop
- CouchDB sum by date range and type
- How to output multiple values with the same key in reducer?
- mapreduce job not setting compression codec correctly
- Split S3 files into multiple output files
- groupByKey not properly working in spark
- MapReduce job fails with ExitCodeException exitCode=255
- What is better way to send associative array through map/reduce at MongoDB?
- How to efficiently join two files using Hadoop?
- null pointer exception in getstrings method hadoop
- can you explain word count mapreduce program step by step
- How to efficiently find top-k elements?
- how to ignore key-value pair in Map-Reduce if values are blank?
- akka: pattern for combining messages from multiple children
- Map a table of a cassandra database using spark and RDD
Related Questions in TF-IDF
- How to efficiently find top-k elements?
- Do I need to transform unseen documents before projecting them onto model topics?
- How do we ignore the order of letters in calculating Levenshtein distance?
- LDA with tm package in R using bigrams
- Incorporating new articles in tfidf vector for online clustering
- Find the tf-idf score of specific words in documents using sklearn
- Can I check the frequencies of predetermined words or phrases in document clustering using R?
- How can I group words based on how often they are used in the same sentence?
- Algorithm to group parts of documents that belong together
- Why the following tfidf vectorization is failing?
- tf:idf text analysis in r
- how to get the most representative features in the following tfidf model?
- Calculate SVD on a TF-IDF matrix
- IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices using skfeature
- Alternatives to TF-IDF and Cosine Similarity (comparing documents with different formats)
Trending Questions
- UIImageView Frame Doesn't Reflect Constraints
- Is it possible to use adb commands to click on a view by finding its ID?
- How to create a new web character symbol recognizable by html/javascript?
- Why isn't my CSS3 animation smooth in Google Chrome (but very smooth on other browsers)?
- Heap Gives Page Fault
- Connect ffmpeg to Visual Studio 2008
- Both Object- and ValueAnimator jumps when Duration is set above API LvL 24
- How to avoid default initialization of objects in std::vector?
- second argument of the command line arguments in a format other than char** argv or char* argv[]
- How to improve efficiency of algorithm which generates next lexicographic permutation?
- Navigating to the another actvity app getting crash in android
- How to read the particular message format in android and store in sqlite database?
- Resetting inventory status after order is cancelled
- Efficiently compute powers of X in SSE/AVX
- Insert into an external database using ajax and php : POST 500 (Internal Server Error)
Popular Questions
- How do I undo the most recent local commits in Git?
- How can I remove a specific item from an array in JavaScript?
- How do I delete a Git branch locally and remotely?
- Find all files containing a specific text (string) on Linux?
- How do I revert a Git repository to a previous commit?
- How do I create an HTML button that acts like a link?
- How do I check out a remote Git branch?
- How do I force "git pull" to overwrite local files?
- How do I list all files of a directory?
- How to check whether a string contains a substring in JavaScript?
- How do I redirect to another webpage?
- How can I iterate over rows in a Pandas DataFrame?
- How do I convert a String to an int in Java?
- Does Python have a string 'contains' substring method?
- How do I check if a string contains a specific word?
Think of the problem as