I have a file in the following fixed format:
<unique record identifier> <white_space> <numeric value>
e.g.
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
.
.
.
I want to write a program that reads the contents of a file from stdin, and optionally accepts the absolute path of a file on the command line. The file/stdin stream is expected to be in the above format. The output should be a list of the unique ids associated with the X largest values in the rightmost column, where X is specified by an input parameter.
For example, given the input data above and X=3, the following would be valid output:
1426828028
1426828066
1426828056
Note that the output does not need to be in any particular order. Multiple instances of the same numeric value count as distinct records toward the total of X. So if we have 4 records with the values 200, 200, 115, 110 and X=2, then the result must consist of exactly the two IDs that point to 200 and 200, and no more.
Notice: take into account extremely large files.
My idea and brief implementation: sorting to get the k largest values
1st way: read the file content into a multimap, then iterate over k elements to produce the output.
2nd way: read the file data into a vector<pair<int, int>>, then use heap sort (a priority queue).
I'm wondering which way has the better time complexity and higher performance. The time complexity of the 2nd way should be O(n log n). Is the time complexity of the 1st way O(log n)? Please tell me both the time and space complexity of the above methods and suggest any other, better methods.
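To make the two ways concrete, here is roughly what I mean (the function names are placeholders, the id is kept as a string as in my code below, and the example data is hard-coded instead of being parsed from the input):

#include <iostream>
#include <map>
#include <queue>
#include <string>
#include <utility>
#include <vector>

typedef std::pair<std::string, long long> IdValue;   // (id, value)

// 1st way: a multimap keyed by value keeps the entries sorted;
// walk k elements from the back to get the k largest.
void printTopK_multimap(const std::vector<IdValue> &data, std::size_t k) {
    std::multimap<long long, std::string> byValue;
    for (const IdValue &p : data)
        byValue.emplace(p.second, p.first);           // O(log n) per insert
    std::size_t printed = 0;
    for (auto it = byValue.rbegin(); it != byValue.rend() && printed < k; ++it, ++printed)
        std::cout << it->second << '\n';
}

// 2nd way: a max-heap (priority_queue) ordered by value; pop k times.
void printTopK_heap(const std::vector<IdValue> &data, std::size_t k) {
    auto cmp = [](const IdValue &a, const IdValue &b) { return a.second < b.second; };
    std::priority_queue<IdValue, std::vector<IdValue>, decltype(cmp)> heap(cmp);
    for (const IdValue &p : data)
        heap.push(p);                                 // O(log n) per push
    for (std::size_t i = 0; i < k && !heap.empty(); ++i) {
        std::cout << heap.top().first << '\n';
        heap.pop();
    }
}

int main() {
    // Example data from the question; the real program would read a file or stdin.
    std::vector<IdValue> data{
        {"1426828011", 9},   {"1426828028", 350}, {"1426828037", 25},
        {"1426828056", 231}, {"1426828058", 109}, {"1426828066", 111}};
    printTopK_multimap(data, 3);
    printTopK_heap(data, 3);
}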
Besides, the input file is huge, so I'm thinking of using an external sort, but I haven't done that before. I'd appreciate it if someone could explain it or write sample code so I can understand it better.
Anyway, the output is not required to be sorted; we only need to print the X largest values in any order. So I'm wondering whether I need any sorting algorithm at all. The requirement to print the X largest values "in any order" seems odd to me, because we still have to sort by value in descending order before printing, so I don't see how "in any order" makes the problem easier.
My brief code:
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>
#include <algorithm>
#include <cassert>
//#include "stdafx.h"

typedef std::pair<std::string, int> mypair;

// Order pairs by their numeric value, largest first, so that
// partial_sort moves the k largest entries to the front.
struct IntCmp {
    bool operator()(const mypair &lhs, const mypair &rhs) const {
        return lhs.second > rhs.second;
    }
};

void printK(const std::map<std::string, int> &mymap, std::size_t k) {
    std::vector<mypair> myvec(mymap.begin(), mymap.end());
    assert(myvec.size() >= k);
    std::partial_sort(myvec.begin(), myvec.begin() + k, myvec.end(), IntCmp());
    for (std::size_t i = 0; i < k; ++i) {
        std::cout << i << ": " << myvec[i].first
                  << " -> " << myvec[i].second << "\n";
    }
}

// Read "<id> <value>" pairs from the stream into the map.
void readinfo(std::istream &in, std::map<std::string, int> &mymap)
{
    std::string ID;
    int Value;
    while (in >> ID >> Value)
        mymap[ID] = Value;
}

int main(int argc, char **argv) {
    std::map<std::string, int> mymap;
    if (argc > 1) { /* read from file if given as argument */
        std::ifstream fin(argv[1]);
        if (fin.is_open()) {
            readinfo(fin, mymap);
        } else {
            std::cerr << "error: Unable to open file " << argv[1] << '\n';
            return 1;
        }
    } else {
        // No input file has been passed on the command line.
        // Read the data from stdin (std::cin).
        readinfo(std::cin, mymap);
    }
    printK(mymap, 3); // X = 3 for the example; it should really come from an input parameter
    return 0;
}
But I don't know how to split the huge file, sort the chunks, and merge them back together. Please tell me how to fix my code for this problem.
I think we can come up with a better approach, with lower space and time complexity.
One requirement is to get the x largest values, so we only need to store x values and no more; the rest are of no interest. All values will be read, and any value that is not larger than the values already collected is discarded. With that, we save a lot of memory.
Next, how to store?
If we have an already sorted container, then the smallest element is always at the beginning. So, when we read a new value, we just need to compare it with the first element in the container: if the new value is smaller than the smallest existing value, we can discard it; but if it is bigger, we need to add it to our container and eliminate the so-far smallest element.
If we use a function like std::lower_bound, it will give us the exact position at which the new element needs to be inserted, without any resorting: the element goes directly into the correct place, and afterwards we again have a new smallest value at the front.

To select the type of container, we think of the operations that we need to do: erase at the front and insert at an arbitrary position, with no need for random access. This leads us to a std::list, which fulfils these criteria in an optimal way.

So, how will we implement the solution? We will use:
- a small record struct with extraction >> and insertion << operators for easier IO
- std::list::sort for the initial ordering and std::lower_bound to find the insertion position for each new value
- a std::istream bound either to a given source file or to std::cin
- a read loop that continues as long as the std::istream still contains data

After the initial sorting, all further operations on the list are bounded by x and are therefore constant with respect to the number of input values; the size of the input adds no complexity beyond the sequential read itself. Anyway, most of the time will be spent reading the data from disk if the file is huge. For small files with 100,000 values or so, any other algorithm would also be fine.
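A minimal sketch of this idea might look like the following; the record type, the hard-coded x and the in-memory example input are placeholders for illustration only:

#include <algorithm>
#include <iostream>
#include <list>
#include <sstream>
#include <string>

// Illustrative record type for one "<id> <value>" line.
struct Record {
    std::string id;
    long long value;
};

int main() {
    const std::size_t x = 3;   // would normally come from an input parameter

    // Example data from the question; a real program would read a file or std::cin.
    std::istringstream in{
        "1426828011 9\n1426828028 350\n1426828037 25\n"
        "1426828056 231\n1426828058 109\n1426828066 111\n"};

    // Holds at most x records, kept sorted ascending by value,
    // so the smallest of the collected values is always at the front.
    std::list<Record> topX;
    auto byValue = [](const Record &a, const Record &b) { return a.value < b.value; };

    Record r{};
    while (in >> r.id >> r.value) {
        if (topX.size() < x) {
            // Still filling up: insert at the sorted position.
            topX.insert(std::lower_bound(topX.begin(), topX.end(), r, byValue), r);
        } else if (r.value > topX.front().value) {
            // Beats the current smallest of the x largest: insert in order, drop the old smallest.
            topX.insert(std::lower_bound(topX.begin(), topX.end(), r, byValue), r);
            topX.pop_front();
        }
        // Otherwise the value is too small and is simply discarded.
    }

    for (const Record &rec : topX)
        std::cout << rec.id << '\n';   // order of the output does not matter
}

With the example data and x = 3, this prints the ids 1426828066, 1426828056 and 1426828028, and memory use stays bounded by x no matter how large the input is.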
Please see one of many potential solutions below.