Find the X largest values in a large file, with optional input-file command-line parsing, in C++


I have a file in the following fixed format:

<unique record identifier> <white_space> <numeric value>

e.g.

1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
.
.
.

I want to write a program that reads the contents of a file from stdin, and optionally accepts the absolute path of a file on the command line. The file/stdin stream is expected to be in the above format. The output should be a list of the unique ids associated with the X-largest values in the rightmost column, where X is specified by an input parameter.

For example, given the input data above and X=3, the following would be valid output:

1426828028
1426828066
1426828056

Note that the output does not need to be in any particular order. Multiple instances of the same numeric value count as distinct records toward the total X. So if we have 4 records with values 200, 200, 115, 110 and X=2, then the result must consist of the two IDs that point to 200 and 200, and no more.

Note: the program must take extremely large files into account.

My idea and brief implementation: selecting the k largest values

1st way: read the file contents into a multimap, then iterate over its k largest elements for output.
2nd way: read the file data into a vector<pair<int, int>>, then use a heap (std::priority_queue).

I'm wondering which way has better time complexity and higher performance. The time complexity of the 2nd way should be O(n log n). Is the time complexity of the 1st way O(log n) per insertion, i.e. O(n log n) in total? Please tell me both the time and space complexity of the above methods, and suggest any better ones.
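
For scale, my understanding is that building a full multimap of all n records costs O(n log n) time and O(n) memory, whereas capping the container at X entries would cost only O(n log X) time and O(X) memory. Here is a rough, untested sketch of that capped variant of the 1st way (the function name topX is mine):

#include <cstddef>
#include <iostream>
#include <map>
#include <string>

// Keep a multimap keyed by value, but never let it grow beyond X entries.
void topX(std::istream &in, std::size_t X) {
    std::multimap<int, std::string> best;       // value -> id, smallest value first
    std::string id;
    int value;
    while (in >> id >> value) {
        if (best.size() < X) {
            best.emplace(value, id);            // still filling up
        } else if (value > best.begin()->first) {
            best.erase(best.begin());           // drop the current smallest
            best.emplace(value, id);
        }
    }
    for (const auto &entry : best)              // any order is acceptable
        std::cout << entry.second << '\n';
}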

Besides, the input file is huge, so I'm thinking of using an external sort. But I haven't done one before. I'd appreciate it if someone could instruct me or write sample code of it for my better understanding.

Anyway, it's not required to sort the output; we only need to print the X largest values in any order. So I'm wondering whether I need a sorting algorithm at all. The requirement to print the X-largest values in any order seems strange to me, because it seems we must sort in descending order before printing anyway; I don't see how "in any order" makes the problem easier.
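
Though maybe it does make it easier: as far as I understand, std::nth_element partitions a vector so that the X largest elements end up in the first X slots in O(n) average time, without fully sorting anything. A rough, untested sketch of what I mean, using the sample data above:

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::vector<std::pair<std::string, int>> v{
        {"1426828011", 9},   {"1426828028", 350}, {"1426828037", 25},
        {"1426828056", 231}, {"1426828058", 109}, {"1426828066", 111}};
    std::size_t X = 3;

    // After this call the X largest values occupy v[0..X-1], in no particular order.
    std::nth_element(v.begin(), v.begin() + X - 1, v.end(),
                     [](const auto &a, const auto &b) { return a.second > b.second; });

    for (std::size_t i = 0; i < X; ++i)
        std::cout << v[i].first << '\n';
}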

My brief code:

#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <map>
#include <algorithm>
#include <cassert>

typedef std::pair<std::string, int> mypair;

struct IntCmp {
    // Largest values first, so the first k entries are the k largest.
    bool operator()(const mypair &lhs, const mypair &rhs) const {
        return lhs.second > rhs.second;
    }
};

void printK(const std::map<std::string, int> &mymap, std::size_t k) {
    std::vector<mypair> myvec(mymap.begin(), mymap.end());
    assert(myvec.size() >= k);
    std::partial_sort(myvec.begin(), myvec.begin() + k, myvec.end(), IntCmp());

    for (std::size_t i = 0; i < k; ++i) {
        std::cout << i << ": " << myvec[i].first
            << " -> " << myvec[i].second << "\n";
    }
}

void readinfo(std::istream &in, std::map<std::string, int> &mymap)
{
    std::string ID;
    int Value;
    while (in >> ID >> Value)
        mymap[ID] = Value;
}

int main (int argc, char **argv) {

    std::map<std::string, int> mymap;

    if (argc > 1) {     /* read from file if given as argument */
        std::ifstream fin (argv[1]);
        if (fin.is_open())
            readinfo(fin, mymap);
        else 
        {
            std::cerr << "error: Unable to open file " << argv[1] << '\n';
            return 1;
        }
    }
    else 
    {
        // No input file has been passed on the command line.
        // Read the data from stdin (std::cin).
        readinfo(std::cin, mymap);
    }

    printK(mymap, 3);   // e.g. X = 3

    return 0;
}

But I don't know how to split the huge file, sort the pieces, and combine them back together. Please tell me how to fix my code for this problem.
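
From what I've read so far, a full external sort might even be unnecessary, because only the top X of each chunk can survive into the final answer: take the top X of every chunk, then reduce the collected winners once more. Below is an untested sketch of what I have in mind (the names Record, topXOfChunk and mergeTopX are mine):

#include <algorithm>
#include <cstddef>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

using Record = std::pair<int, std::string>;   // value, id

// Top X of one in-memory chunk, largest values first.
std::vector<Record> topXOfChunk(std::vector<Record> chunk, std::size_t X) {
    if (chunk.size() > X) {
        std::partial_sort(chunk.begin(), chunk.begin() + X, chunk.end(),
            [](const Record &a, const Record &b) { return a.first > b.first; });
        chunk.resize(X);
    }
    return chunk;
}

// Merge the per-chunk winners and reduce them to the global top X.
std::vector<Record> mergeTopX(const std::vector<std::vector<Record>> &perChunk,
                              std::size_t X) {
    std::vector<Record> all;
    for (const auto &c : perChunk)
        all.insert(all.end(), c.begin(), c.end());
    return topXOfChunk(std::move(all), X);
}

int main() {
    std::vector<std::vector<Record>> chunks{
        {{9, "1426828011"}, {350, "1426828028"}, {25, "1426828037"}},
        {{231, "1426828056"}, {109, "1426828058"}, {111, "1426828066"}},
    };
    std::vector<std::vector<Record>> winners;
    for (auto &c : chunks)
        winners.push_back(topXOfChunk(std::move(c), 3));
    for (const auto &r : mergeTopX(winners, 3))
        std::cout << r.second << '\n';   // prints the ids of 350, 231, 111
}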

There are 3 best solutions below


I think we can come up with a better approach that has a lower space and time complexity.

One requirement is to get the X largest values, so we only need to store X values, no more. The others are of no interest. All values will be read and, if not larger than the already collected values, discarded. That saves tons of memory.

Next, how to store?

If we keep the container sorted, the smallest element is always at the front. So when we read a new value, we only need to compare it with the first element in the container: if the new value is smaller than the smallest stored value, we can discard it; if it is bigger, we add it to the container and eliminate the hitherto smallest element.

If we use a function like std::lower_bound, it gives us the exact position at which to insert the new element, without the need for any resorting: the element is inserted at exactly the correct position, and we have a new smallest value.
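
To illustrate just this step in isolation, here is a tiny standalone example (not part of the full program below). One caveat: on a std::list, std::lower_bound performs only O(log n) comparisons, but it still has to advance the list iterators one step at a time, so locating the position is a linear walk.

#include <algorithm>
#include <iostream>
#include <list>

int main() {
    std::list<int> sorted{ 3, 7, 12, 20 };        // kept sorted, smallest first

    int newValue = 9;
    // First position whose element is not less than newValue
    auto pos = std::lower_bound(sorted.begin(), sorted.end(), newValue);
    sorted.insert(pos, newValue);                 // list stays sorted: 3 7 9 12 20
    sorted.pop_front();                           // drop the smallest: 7 9 12 20

    for (int v : sorted) std::cout << v << ' ';   // prints: 7 9 12 20
}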

To select the type of container, we think of the operations that we need to do.

  • We want to eliminate the first element, without shifting all other data.
  • We want to add an element at a given position, without shifting all following values to the right.

This leads us to a std::list, which will fulfil our criteria in an optimal way.

So, how will we implement the solution?

  • Define a struct to hold the data, the unique id and the associated value
  • Add extraction >> and insertion << operators for easier IO
  • Add 2 sort operator overloads for std::list::sort and std::lower_bound
  • In main, get a std::istream, either to a given source file or to std::cin
  • Read the first X values and store them in the list as is. If there are only these X values or fewer, then we already have the solution
  • Sort the values in the list. The smallest value is now at the front of the list
  • If the std::istream still contains data, then continue to read values
  • If the new value is greater than the smallest value in the list, then add it to the list with an insertion step
  • Delete the smallest value at the front.

After the initial sort, comparing a new value against the current smallest is constant time. An actual insertion costs a walk through the list (std::lower_bound performs O(log X) comparisons, but advancing list iterators is linear in X); for typical data this happens only rarely, since most values are rejected after the first comparison. Anyway, most time will be burned reading data from the disk if the file is huge. For small files with 100'000 values or so, any other algorithm would be also fine.

Please see one of many potential solutions below.

#include <iostream>
#include <fstream>
#include <iterator>
#include <memory>
#include <random>
#include <string>
#include <list>
#include <limits>
#include <algorithm>

const std::string fileName{ "r:\\big.txt" };

// ----------------------------------------------------------------------------
// Create a big file for Test purposes
void createBigFile() {

    if (std::ofstream ofs(fileName); ofs) {

        constexpr size_t uniqueIDStart = 1'426'828'028;
        constexpr size_t numberOfRecords = 10'000'000;
        constexpr size_t endRecords = uniqueIDStart + numberOfRecords;

        std::random_device randomDevice;
        std::mt19937 randomEngine(randomDevice());
        std::uniform_int_distribution<int> uniformDistribution(1, 10'000'000);

        for (size_t k{ uniqueIDStart }; k < endRecords; ++k) {
            ofs << k << ' ' << uniformDistribution(randomEngine) << '\n';
        }
    }
}
// ----------------------------------------------------------------------------


// Here we will store our data
struct Data {
    unsigned int ID{};
    int value{ std::numeric_limits<int>::max() };

    // Sorting operators
    bool operator < (const int& i) const { return value < i; }                  // For lower_bound
    bool operator < (const Data& other) const { return value < other.value; }   // For sort

    // Simple extractor and inserter
    friend std::istream& operator >> (std::istream& is, Data& d) { return is >> d.ID >> d.value; }
    friend std::ostream& operator << (std::ostream& os, const Data& d) { return os << d.ID << ' ' << d.value; }
};


// Whatever number of X you need for the X-largest values
constexpr size_t Rank = 50;
// We will use a list to store the X-largest data
using DList = std::list<Data>;   
using ListIter = DList::iterator;

// For faster reading we will increase the input buffer size
constexpr size_t ifStreamBufferSize = 500'000u;
static char buffer[ifStreamBufferSize];

int main(int argc, char* argv[]) {

    // If you want to create test data, then uncomment the following line
    //createBigFile();

    //We will either read from std::cin or from a file
    std::shared_ptr<std::istream> input{};
    if (argc == 2) {
        // Try to open the source file, given by command line arguments
        input.reset(new std::ifstream(argv[1]));
        input->rdbuf()->pubsetbuf(buffer, ifStreamBufferSize);
    }
    else {
        // Use std::cin for input. Hand over a no-op custom deleter; we must not delete std::cin
        input.reset(&std::cin, [](...) {});
    }

    // If the file could be opened, or std::cin is OK
    if (input && *input) {

        // Here we will store all data
        DList dList;

        // Read the first X values as is
        size_t numberOfElementsInArray{};
        Data data;
        
        while (*input >> data) {
            if (numberOfElementsInArray < Rank) {
                dList.push_front(std::move(data));
                ++numberOfElementsInArray;
            }
            if (numberOfElementsInArray >= Rank) break;
        }
        // And do a first (and only) sort
        dList.sort();

        // Nothing read at all? Then there is nothing to print
        if (dList.empty()) return 0;

        // For comparison
        int smallest{ dList.front().value };
        
        // Read the rest of the data from the stream
        while (*input >> data) {

            // If the latest value read is bigger than the smallest in the list, then we need to add it now
            if (data.value > smallest) {

                // Find the position where to insert the new element.
                // Note: std::list::insert inserts before the iterator, so inserting
                // at end() correctly appends a new largest value.
                ListIter dIter = std::lower_bound(dList.begin(), dList.end(), data.value);

                // Insert the new value where it should be. This keeps the list sorted
                dList.insert(dIter, std::move(data));

                // We now have one value more than needed. Get rid of the smallest one and remember the new smallest value
                dList.pop_front();
                smallest = dList.front().value;
            }
        }
        std::copy(dList.rbegin(), dList.rend(), std::ostream_iterator<Data>(std::cout, "\n"));
    }
    else std::cerr << "*** Error with the input source (file or std::cin)\n\n";
}

Maybe you could use a min-heap via std::priority_queue:

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <queue>
#include <string>
#include <vector>

struct IdAndValue {
  std::string id;
  int value;
};

struct ValueCmp {
  bool operator()(const IdAndValue &lhs, const IdAndValue &rhs) {
    return lhs.value > rhs.value;
  }
};

void PrintTopK(std::istream &in, long k) {
  std::priority_queue<IdAndValue, std::vector<IdAndValue>, ValueCmp> largest_k;
  std::string id;
  int value;
  while (in >> id >> value) {
    if (largest_k.size() < static_cast<std::size_t>(k)) {
      largest_k.push({.id = id, .value = value});
    } else {
      if (value > largest_k.top().value) {
        largest_k.pop();
        largest_k.push({.id = id, .value = value});
      }
    }
  }
  std::cout << "Top " << k << " IDs with largest values:\n";
  while (!largest_k.empty()) {
    IdAndValue id_and_value = largest_k.top();
    largest_k.pop();
    std::cout << id_and_value.id << '\n';
  }
}

int main(int argc, char **argv) {
  if (argc > 2) {  // Read from file if given as argument.
    std::ifstream fin(argv[1]);
    if (fin.is_open())
      PrintTopK(fin, std::strtol(argv[2], nullptr, 10));
    else {
      std::cerr << "Error: Unable to open file " << argv[1] << '\n';
      return 1;
    }
  } else if (argc == 2) {  // Read the data from stdin (std::cin).
    PrintTopK(std::cin, std::strtol(argv[1], nullptr, 10));
  } else {  // argv[1] would be null here, so bail out instead of passing it to strtol.
    std::cerr << "Usage: " << argv[0] << " [file] k\n";
    return 1;
  }
  return 0;
}

Usage from stdin (Ctrl + D to send EOF on unix):

./PrintTopK 3
1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111
Top 3 IDs with largest values:
1426828066
1426828056
1426828028

Usage when passed in a file:

$ ./PrintTopK /Users/shash/CLionProjects/PrintTopK/data.txt 3
Top 3 IDs with largest values:
1426828066
1426828056
1426828028

With data.txt:

1426828011 9
1426828028 350
1426828037 25
1426828056 231
1426828058 109
1426828066 111

----Solution.java----

import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Collections;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.HashMap;
import java.util.Arrays;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.ExecutionException;
import java.io.IOException;
import java.io.FileNotFoundException;
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.FileInputStream;
import java.io.File;
import java.io.FileWriter;
import java.io.RandomAccessFile;


/**
 * Function Description: This program reads the contents of a file from 'stdin', and can also accept the absolute path of a file from the command line.
 * The file or stdin stream should contain records in the "<unique record identifier><white_space><numeric value>" format.
 * The output should be a list of unique ids associated with the K-largest values in the rightmost column. The K is specified by an input parameter.
 * Process Description: To begin with, the data is read from an external file or from stdin. If the amount of data is large, it is
 * divided into chunks with an average size of 1M. Then, a thread pool is used to run a core algorithm (based on quickselect) to
 * find the largest k for each chunk. Finally, the per-chunk results are reduced once more and then output.
 * JDK version used: JDK 1.8
 */
class Solution {

    /**
     * FunName: main
     * Description: Program main entrance.
     * @param args: args[0] = k, the k means the largest k elements. If args.length == 2, args[1] = the absolute path of a file.
     * if args.length == 1, the program reads from 'stdin' the contents of a file.
     * @throws Exception: If the input is empty, throws an exception.
     */
    public static void main(String[] args) throws Exception {
        int argsLength = args.length;
        if (argsLength < 1 || argsLength > 2) {
            System.out.println("Parameters error, please re-enter.");
            return;
        }

        // The largest k, k > 0.
        int k = Integer.parseInt(args[0]);
        if (k <= 0) {
            System.out.println("The first parameter k should > 0, please re-enter.");
            return;
        }

        String testFileName = getTestFileNameFromArgOrStdin(args);

        // Split the large file into smaller logic chunks. Do not split if size < 1M.
        Map<Long, Long> chunkMap = splitFile(testFileName);

        // Distributed processing of multiple file logic chunks.
        List<Long> resIdentifiers = parallelHandleMultipleChunks(chunkMap, testFileName, k);

        createFileForOutput(resIdentifiers, "output_" + new SimpleDateFormat("yyyyMMddHHmmss").format(new Date()));
    }

    /**
     * FunName: getTestFileNameFromArgOrStdin
     * Description: Optionally accepts the absolute path of a file from the command line, or generates a new file from 'stdin'.
     * @param args: args of the main function; if args.length == 2, args[1] = the absolute path of a file.
     * @return: testFileName.
     */
    private static String getTestFileNameFromArgOrStdin(String[] args) {
        String testFileName = "";
        int argsLength = args.length;

        if (argsLength == 2) {
            // The absolute path of a file;
            testFileName = args[1];

            if (testFileName.isEmpty()) {
                System.out.println("The file name is illegal, please re-enter.");
            }

            // If you want to automatically create a custom-sized test file, uncomment the following line.
            // testFileName = createFileForTest(1000000, "myTestInput");
            
        } else if (argsLength == 1) {
            InputStreamReader inputStreamReader = null;
            BufferedReader in = null;

            try {
                inputStreamReader = new InputStreamReader(System.in);
                in = new BufferedReader(inputStreamReader);

                String lineStr;
                List<long[]> input = new ArrayList<>();
                while ((lineStr = in.readLine()) != null && !lineStr.isEmpty()) {
                    String[] strArray = lineStr.split(" ");
                    long identifier = Long.parseLong(strArray[0].trim());
                    long value = Long.parseLong(strArray[1].trim());
                    input.add(new long[]{identifier, value});
                }

                testFileName = createFileAndWriteData(input, "inputTest_" + new SimpleDateFormat("yyyyMMddHHmmss").format(new Date()));
            } catch (IOException e) {
                System.out.println("There are errors when reading the input: " + e.getMessage());
            } finally {
                try {
                    if (in != null) {
                        in.close();
                    }
                } catch (IOException e) {
                    System.out.println("There are errors when closing the reader: " + e.getMessage());
                }
            }
        }

        return testFileName;
    }

    // ----------------------------------------------------------------------------
    // Here is the core algorithm.

    /**
     * FunName: largestKValue
     * Description: Retrieve the largest k elements without the need for the results to be in a specific order.
     * @param input: The list to be processed.
     * @param k: The largest k, k > 0.
     * @return: The largest k elements.
     */
    private static List<long[]> largestKValue(List<long[]> input, int k) {
        int inputSize = input.size();
        if (k >= inputSize) {
            return input;
        }

        threeWayQuickSelect(input, k, 0, inputSize - 1);

        // Retrieve the k largest elements from the list, starting from index 0 to k-1.
        return input.subList(0, k);
    }

    /**
     * FunName: threeWayQuickSelect
     * Description: Quickselect can find the largest k elements without any specific output order.
     * Using three-way partitioning is more reasonable, considering the large amount of duplicate data that files may contain.
     * The average time complexity is O(N): the algorithm only needs to recurse into one partition at a time.
     * The average space complexity is O(log N): the average recursion depth of the function is O(log N).
     * @param input: The list to be processed.
     * @param k: The largest k, k > 0.
     * @param leftBound: Left boundary of each recursion.
     * @param rightBound: Right boundary of each recursion.
     */
    private static void threeWayQuickSelect(List<long[]> input, int k, int leftBound, int rightBound) {
        // >= also guards against empty ranges produced by the recursive calls.
        if (leftBound >= rightBound) {
            return;
        }

        // In order to improve the robustness of the algorithm, the function adopts a random selection method to pick the pivot element.
        int randomIndex = leftBound + (int)(Math.random() * (rightBound - leftBound + 1));
        Collections.swap(input, leftBound, randomIndex);

        long pivotValue = input.get(leftBound)[1];

        int x = leftBound + 1;
        int j = rightBound;
        int i = leftBound;

        while (x <= j) {
            if (input.get(x)[1] > pivotValue) {
                i++;
                Collections.swap(input, x, i);
                x++;
            } else if (input.get(x)[1] == pivotValue) {
                x++;
            } else {
                Collections.swap(input, x, j);
                j--;
            }
        }

        Collections.swap(input, leftBound, i);

        if (i >= k) {
            // The largest k element is on the left side of i.
            threeWayQuickSelect(input, k, leftBound, i - 1);
        } else if (j + 1 < k) {
            // The largest k element is on the right side of j.
            threeWayQuickSelect(input, k, j + 1, rightBound);
        } else {
            // The largest k element is between [i, j].
            return;
        }
    }

    // ----------------------------------------------------------------------------
    // Here is the distributed processing of multiple file logic chunks.

    /**
     * FunName: parallelHandleMultipleChunks
     * Description: The function processes chunks in parallel using multiple threads. Each thread finds the largest k elements in its chunk.
     * Once a thread is done, its results can be merged with those of other threads to determine the final largest k elements.
     * @param chunkMap: Store the start position and size of the chunk in the file, key = start position, value = chunk size(bytes).
     * @param fileName: The name of the input file.
     * @param k: The largest k, k > 0.
     * @return: A list of the unique ids associated with the X-largest values.
     */
    private static List<Long> parallelHandleMultipleChunks(Map<Long, Long> chunkMap, String fileName, int k) {
        long startTime = System.currentTimeMillis();

        File file = new File(fileName);

        // Create the optimal number of threads based on the available number of cores.
        int threadNum = Math.min(chunkMap.size(), Runtime.getRuntime().availableProcessors());
        ExecutorService executor = Executors.newFixedThreadPool(threadNum);
        CompletionService<List<long[]>> cs = new ExecutorCompletionService<>(executor);

        for (Map.Entry<Long, Long> entry : chunkMap.entrySet()) {
            cs.submit(new Callable<List<long[]>>() {
                @Override
                public List<long[]> call() {
                    return handleFileChunk(entry.getKey(), entry.getValue(), file, k);
                }
            });
        }

        // Do not have to wait for all the other threads to finish their work before combining their results.
        List<long[]> res = new ArrayList<>();
        try {
            for (int i = 0; i < chunkMap.size(); i++) {
                res.addAll(cs.take().get());
                if (res.size() > k) {
                    res = largestKValue(res, k);
                }
            }
        } catch (InterruptedException | ExecutionException e) {
            e.printStackTrace();
        }
        
        try {
            // Close the thread pool
            executor.shutdown();
            executor.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            System.out.println("shut down error " + e.getMessage());
        }

        long endTime = System.currentTimeMillis();
        long elapsedTime = endTime - startTime;
        System.out.println("paralle execute time:" + elapsedTime + "ms");  

        List<Long> resIdentifiers = new ArrayList<>();
        for (long[] ele : res) {
            long id = ele[0];
            System.out.println(id);
            resIdentifiers.add(id);
        }

        return resIdentifiers;
    }

    private static List<long[]> handleFileChunk(long start, long chunkSize, File file, int k) {
        List<long[]> chunkInput = readFileChunkWithBuffer(start, chunkSize, file);
        return largestKValue(chunkInput, k);
    }

    /**
     * FunName: readFileChunkWithBuffer
     * Description: Improve reading efficiency by using the buffer method to read a chunk at a specified location in the file.
     * @param start: The start position of the chunk.
     * @param chunkSize: The size of the chunk.
     * @param file: The name of the input file.
     * @return: The list of records, array[0] = identifier, array[1] = value.
     */
    private static List<long[]> readFileChunkWithBuffer(long start, long chunkSize, File file) {
        BufferedReader reader = null;
        FileInputStream fileInputStream = null;
        InputStreamReader inputStreamReader = null;

        try {
            fileInputStream = new FileInputStream(file);
            fileInputStream.skip(start);            
            inputStreamReader = new InputStreamReader(fileInputStream);
            reader = new BufferedReader(inputStreamReader);

            List<long[]> lines = new ArrayList<>();
            long currentCount = 0;
            String lineStr = null;
            while(currentCount < chunkSize && (lineStr = reader.readLine()) != null) {
                if (lineStr.equals("")) {
                    continue;
                }
                String[] strArray = lineStr.split(" ");
                long identifier = Long.parseLong(strArray[0].trim());
                long value = Long.parseLong(strArray[1].trim());
                lines.add(new long[]{identifier, value});
                currentCount += lineStr.getBytes().length + System.lineSeparator().getBytes().length;
            }

            reader.close();
            return lines;
        } catch (FileNotFoundException e) {
            System.out.println("Could not find the file (" + file.getName() + "): " + e.getMessage());
        } catch (IOException e) {
            System.out.println("There are errors when reading the file (" + file.getName() + "): " + e.getMessage());
        } finally {
            try {
                if (reader != null) {
                    reader.close();
                }
            } catch (IOException e) {
                System.out.println("There are errors when closing the reader: " + e.getMessage());
            }
        }

        return null;
    }

    // ----------------------------------------------------------------------------
    // Here is the splitting large files into multiple small logic chunks.

    /**
     * FunName: splitFile
     * Description: Split the input file into multiple 1M chunks; no splitting if file size < 1M.
     * @param fileName: The name of the input file.
     * @return: Using map to store splitting results, key = chunk start position, value = chunk size.
     * @throws Exception: If the input is empty, throws an exception.
     */
    private static Map<Long, Long> splitFile(String fileName) throws Exception {
        long startTime = System.currentTimeMillis();

        File file = new File(fileName);
        long total = file.length();
        if (total == 0) {
            throw new Exception("Input is empty, please retry.");
        }

        long targetChunkSize = total < 1 * 1024 * 1024 ? total : 1 * 1024 * 1024;
        int chunkNum = (int) (total / targetChunkSize);
        long chunkSize = total/chunkNum;
        long start = 0;

        // key = start, value = chunkSize
        Map<Long, Long> chunkMap = new HashMap<>();

        for (int i = 0; i < chunkNum; i++) {
            chunkSize = adjustChunkSize(start, chunkSize, file);
            chunkMap.put(start, chunkSize);
            start += chunkSize;
        }

        long endTime = System.currentTimeMillis();
        long elapsedTime = endTime - startTime;
        System.out.println("split file " + fileName + " time:" + elapsedTime + "ms");

        return chunkMap;
    }

    /**
     * FunName: adjustChunkSize
     * Description: Adjust size of chunk to ensure that each chunk ends at the end of a line and return its size.
     * @param start: The start position of the chunk.
     * @param chunkSize: Chunk size for rough splitting.
     * @param file: The name of the input file.
     */
    private static long adjustChunkSize(long start, long chunkSize, File file) {
        try (RandomAccessFile randomAccessFile = new RandomAccessFile(file, "r")) {
            randomAccessFile.seek(start + chunkSize);
            boolean needAdjust = true;
            
            // Check if position needs adjustment. No data or end of line = no adjustment.
            int read = randomAccessFile.read();
            switch (read) {
                case -1:
                case '\n':
                case '\r':
                    needAdjust = false;
                    break;
                default:
                    break;
            }

            // If it meets the adjustment criteria, adjust the data by reading a row, calculating the remaining length,
            // adding it to the previously read byte, and then adding the total to the chunk size.
            if (needAdjust){
                chunkSize += 1;
                String readLine = randomAccessFile.readLine();
                if (readLine != null) {
                    chunkSize += readLine.getBytes().length;
                }                
            }
        } catch (Exception e) {
            System.out.println("adjustChunkSize error " + e.getMessage());
        }
        return chunkSize;
    }

    // ----------------------------------------------------------------------------
    // Here is the Creation of test files and result output files

    /**
     * FunName: createFileForTest
     * Description: Create a file containing a specified number of records to test program correctness and performance.
     * @param recordsNum: The number of records in the file.
     * @param fileName: The name of the file for test.
     * @return: fileName
     */
    private static String createFileForTest(long recordsNum, String fileName) {
        long startTime = System.currentTimeMillis();

        File file = createFile(fileName);

        long startId = 1426828011;
        long endId = startId + recordsNum - 1;

        try (FileWriter writer = new FileWriter(file)) {
            for (long i = startId; i <= endId; i++) {
                long value = (long)(Math.random() * recordsNum);
                //long value = (long)100;
                writer.write(String.valueOf(i) + " " + String.valueOf(value));

                // The last record should not be followed by a line break.
                if (i < endId) {
                    writer.write("\n");
                }
            }
            writer.flush();
            System.out.println(fileName + " writes success");
        } catch (IOException e) {
            System.out.println("There are errors when writing the file(" + fileName + "): " + e.getMessage());
        }

        long endTime = System.currentTimeMillis();
        long elapsedTime = endTime - startTime;
        System.out.println("create file " + fileName + " time:" + elapsedTime + "ms");
        System.out.println(fileName + " file size: " + file.length() + " bytes");
        return fileName;
    }

    /**
     * FunName: createFileAndWriteData
     * Description: Create a file and write data to the file.
     * @param data: The list to write.
     * @param fileName: The name of the output file.
     * @return: The name of the output file.
     */
    private static String createFileAndWriteData(List<long[]> data, String fileName) {
        File file = createFile(fileName);

        try (FileWriter writer = new FileWriter(file)) {
            for (int i = 0; i < data.size(); i++) {
                long[] ele = data.get(i);
                writer.write(String.valueOf(ele[0]) + " " + String.valueOf(ele[1]));

                // The last record should not be followed by a line break.
                if (i < data.size() - 1) {
                    writer.write("\n");
                }
            }
            writer.flush();
        } catch (IOException e) {
            System.out.println("There are errors when writing the file(" + fileName + "): " + e.getMessage());
        }

        return fileName;
    }

    /**
     * FunName: createFileForOutput
     * Description: Write the largest k results to a file.
     * @param data: The list which contains ids of the largest k records.
     * @param fileName: The name of the output file.
     * @return: The name of the output file.
     */
    private static String createFileForOutput(List<Long> data, String fileName) {
        File file = createFile(fileName);

        try (FileWriter writer = new FileWriter(file)) {
            for (int i = 0; i < data.size(); i++) {
                writer.write(String.valueOf(data.get(i)));

                // The last record should not be followed by a line break.
                if (i < data.size() - 1) {
                    writer.write("\n");
                }
            }
            writer.flush();
        } catch (IOException e) {
            System.out.println("There are errors when writing the file(" + fileName + "): " + e.getMessage());
        }

        System.out.println("The result has been writen to the file: " + fileName);
        return fileName;
    }

    /**
     * FunName: createFile
     * Description: Create an empty file based on the file name. If the name already exists, delete the file first.
     * @param fileName: The name of the file.
     * @return: The file object.
     */ 
    private static File createFile(String fileName) {
        String currentPath = System.getProperty("user.dir");

        File file = new File(currentPath, fileName);

        if (file.exists()) {
            System.out.println("The file " + fileName + " already exists, deleting it first...");
            if (file.delete()) {
                System.out.println(fileName + " was deleted successfully");
            } else {
                System.out.println(fileName + " could not be deleted");
            }
        }

        try {
            if (file.createNewFile()) {
                System.out.println(fileName + " was created successfully");
            }
        } catch (IOException e) {
            System.out.println("There are errors when ceating the file(" + fileName + "): " + e.getMessage());
        }

        return file;
    }
}

---------Read Me-------------

Function Description: This program reads the contents of a file from 'stdin', and can also accept the absolute path of a file from the command line. The file or stdin stream should contain records in the "<unique record identifier> <white_space> <numeric value>" format. The output should be a list of unique ids associated with the K-largest values in the rightmost column. The K is specified by an input parameter.

Technical Description:

  1. Core algorithm. The task requires returning the largest K elements without any specific order. To achieve this, an algorithm based on quicksort partitioning (quickselect) is used. Furthermore, since there may be a large amount of equal-valued data, three-way partitioning is used to handle this extreme situation.
  2. Splitting. Divide the large file into smaller logic chunks and use multiple threads to read and process each part in parallel. This can significantly reduce the time required for processing the file.
  3. Using buffers. Using buffers can enhance the efficiency of reading large files. When reading files, the content can be stored in a buffer, and while processing data in the buffer, the next block of data can be read in the background. This allows for the full utilization of the benefits of multithreading.
  4. Using thread pools. A thread pool manages multiple threads and provides thread reuse, avoiding the frequent creation and destruction of threads, thereby reducing resource consumption and improving efficiency.
  5. Merge while executing. When running tasks in a thread pool, it's common for the number of tasks to exceed the number of available threads. This can result in varying return times for each task. The program merges the results that return earlier rather than waiting for all results to return before merging.

Challenges and Unfulfilled Features

  1. Input validation. The data input by the user and the external file provided may be invalid, e.g. incorrect format, duplicate IDs, etc. Due to time constraints, this program does not perform extensive validity checks on the input.
  2. Appropriate splitting. When splitting files, the chunk size should be dynamically adjusted based on the file size and available memory, but this hasn't been implemented. When multiple machines process large files together, they need to be allocated using a hash mode.
  3. Appropriate buffer size. When using buffers to read files, there is still room for tuning the size of the buffer. However, from the perspective of the time required for each stage of processing large files, the optimization set in this section is not a performance bottleneck.

My Assumptions

  1. In cases where the file is extremely large, k is much smaller than the number of records.

Getting Started

Prerequisites: Hardware Requirements: CPU 1GHz and above, memory 500M and above, storage 1GB and above. Operating System Requirements: mainstream operating systems such as MacOS, Linux, Windows. JAVA Version Requirements: JDK 1.8. Example for installing JDK8 on Linux:

  1. Go to http://java.com and click on the Download button.
  2. Cd directory_path_name, such as "cd /usr/java/".
  3. Move the .tar.gz archive binary to the current directory.
  4. Unpack the tarball and install Java, such as "tar zxvf jre-8u73-linux-x64.tar.gz".
  5. Delete the .tar.gz file if you want to save disk space.
  6. After installing the JDK, you may need to set JAVA_HOME to your profile.
  7. Verify the version of the JDK, "java -version".

Usage Example

  1. Compile .java files, "javac Solution.java".
  2. When accepting the absolute path of a file, param1 = k, param2 = the absolute path, "java Solution 5 /Users/darain/Documents/myTestInput".
  3. When reading data from stdin, param1 = k, "java Solution 5".