Median in 500GB file java

232 Views Asked by At

Find median of all numbers in the given 500GB file at the command prompt.

File format eg:

12 
4
98
3

with one number in each line(numbers can be repeated).Can anyone please help on how to approach on this in JAVA? if we have to split the file and then how can median be calculated? I have come across several posts on median but couldn't find best approach on such huge file .

2

There are 2 best solutions below

0
On

This doesn't cover the calculation itself, but here is how you read the file in small parts, so that you don't run out of memory.

try (
    InputStream fis = Files.newInputStream(Paths.get(fileName), StandardOpenOption.READ);
    BufferedReader book = new BufferedReader(new InputStreamReader(fis, StandardCharsets.UTF_8));
) {
    String line = null;
    long cnt = 0;
    while ((line = book.readLine()) != null) {
        cnt++;
        BigInteger data = new BigInteger(line);
        ... handle the data
        if (cnt % 500 == 0) System.gc(); // invoke garbage collector
    }
}

I recently needed to import a 50mb file that gave me out-of-memory errors with a 2GB memory limit, just because of all the extra metadata that it keeps for each object, and this method helped me get through it.

0
On

500GB file with [not necessarily unique numbers represented as strings of decimal digits,] one number in each line
- that's 250_000_000_000L numbers, at most, each with no more than twice that many digits, occurrence of signs not specified.

Assuming you can allocate 1 GB of long counters, you can count the number of numbers with any given length below 25 million digits, and the total number of numbers in a first pass.
Determine the (sign and) length of the digit string to represent your median.
In subsequent passes, narrow down the range for your median, starting with number representations of same (sign and) length.