What is the fastest way to find newline characters in Rust?


I'm new to Rust and I'm trying to write a version of the Linux wc utility in Rust. Specifically, I want it to count the lines in extremely large files (GiBs or larger) quickly; I'm benchmarking it against wc.

Here are the functions in question:

use std::fs::File;
use std::io::{self, Read, Seek, SeekFrom, BufReader};
use std::path::PathBuf;
use std::sync::{Arc, Mutex};
use std::thread::{self, available_parallelism};

pub fn par_get_number_of_lines(path: &PathBuf) -> Result<(), io::Error> {
    let input: File = File::open(path)?;
    let file_size: usize = get_file_size(path) as usize;
    let cores: u64 = available_parallelism().unwrap().get() as u64;
    let buffered: BufReader<File> = BufReader::new(input);
    let buff_capacity: usize = buffered.capacity();
    let buffer_segments: u64 = file_size.div_ceil(buff_capacity) as u64;
    let segments_per_core: u64 = buffer_segments / cores;
    // Leftover *segments* (not bytes) after an even split across cores.
    let mut remaining_segments: u64 = buffer_segments % cores;
    let mut assignments: Vec<(u64, u64)> = vec![];
    let mut cursor_location: u64 = 0;
    for core in 0..cores {
        let mut end = cursor_location + segments_per_core * buff_capacity as u64;
        if remaining_segments > 0 {
            end = cursor_location + (segments_per_core + 1) * buff_capacity as u64;
            remaining_segments -= 1;
        }
        // The loop variable runs 0..cores, so the last iteration is cores - 1;
        // comparing against `cores` would never fire. Clamp the final segment
        // to the true file size.
        if core == cores - 1 {
            end = file_size as u64;
        }
        assignments.push((cursor_location, end));
        // Ranges are half-open [start, end), so the next segment starts at
        // `end`; starting at `end + 1` would silently skip one byte per boundary.
        cursor_location = end;
    }

    let total_lines: Arc<Mutex<u64>> = Arc::new(Mutex::new(0));
    let mut handle_vec: Vec<thread::JoinHandle<()>> = vec![];
    for (start, end) in assignments {
        let total_lines_clone = Arc::clone(&total_lines);
        let path_clone = path.clone();
        let handle = thread::spawn(move || {
            *total_lines_clone.lock().unwrap() += read_part_of_file(path_clone, start, end);
        });
        handle_vec.push(handle);
    }
    handle_vec.into_iter().for_each(|handle| handle.join().unwrap());
    println!("Total lines {}", total_lines.lock().unwrap());
    Ok(())
}

fn read_part_of_file(path: PathBuf, start: u64, end: u64) -> u64 {
    let input: File = File::open(&path).unwrap();
    let mut buffered: BufReader<File> = BufReader::new(input);
    buffered.seek(SeekFrom::Start(start)).unwrap();
    let reader = buffered.take(end - start);
    // Count raw b'\n' bytes; unlike lines(), this does not treat an
    // unterminated final fragment as a line.
    reader
        .bytes()
        .filter(|b| matches!(b, Ok(b'\n')))
        .count() as u64
}

I was able to get pretty close to wc's performance by using Rayon, but ran into a correctness issue. My approach was to build an evenly distributed list of start and end byte offsets, one range per core, and have each thread seek to its start (Seek::seek with io::SeekFrom::Start) and read its range through Read::take. The problem was that lines() treats end-of-input as terminating a line. So when a segment ended mid-sentence, that fragment was counted as a line, and then the thread handling the next segment counted the actual newline again. The result was a fast but inaccurate program.
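The double-counting can be seen on a small in-memory example (a sketch, not from the original code): `lines()` yields an unterminated trailing fragment as a line, while counting raw `b'\n'` bytes does not, so per-segment newline counts add up correctly across segment boundaries.

```rust
fn main() {
    // A segment that ends mid-sentence: no trailing newline.
    let segment = "one\ntwo\nthree-but-cut";

    // lines() yields 3 items: the unterminated tail still counts as a line.
    assert_eq!(segment.lines().count(), 3);

    // Counting raw b'\n' bytes yields 2; the "three-but-cut" fragment is
    // completed (and its newline counted) by whichever segment comes next.
    let newlines = segment.bytes().filter(|&b| b == b'\n').count();
    assert_eq!(newlines, 2);

    println!("lines() = {}, newlines = {}", segment.lines().count(), newlines);
}
```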

To fix the issue, I switched to searching for '\n' bytes directly. The result was accurate but much slower. It seems like there should be a solution that is as fast, or nearly as fast, as wc. I have spent hours looking for one. The closest thing I found was an article about writing wc in Rust, but its approach is significantly different and a bit over my head as a newbie. Any suggestions that can improve this program?
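Independent of build flags, one common way to speed up a byte search in std-only Rust is to scan whole buffers instead of pulling bytes one at a time through an `io::Result` iterator. A sketch (the helper name and demo file are mine, not from the question), using `BufRead::fill_buf`/`consume` so each internal buffer is counted as a single slice:

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Read, Seek, SeekFrom};
use std::path::Path;

// Hypothetical helper: count b'\n' in the half-open byte range [start, end).
fn count_newlines_in_range(path: &Path, start: u64, end: u64) -> io::Result<u64> {
    let mut reader = BufReader::new(File::open(path)?);
    reader.seek(SeekFrom::Start(start))?;
    let mut reader = reader.take(end - start);
    let mut count: u64 = 0;
    loop {
        // fill_buf() exposes the internal buffer as one contiguous slice.
        let buf = reader.fill_buf()?;
        if buf.is_empty() {
            break; // reached the end of the Take window
        }
        count += buf.iter().filter(|&&b| b == b'\n').count() as u64;
        let len = buf.len();
        reader.consume(len);
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    // Tiny demonstration file (a stand-in for the multi-GiB input).
    let path = std::env::temp_dir().join("wc_demo.txt");
    std::fs::write(&path, "one\ntwo\nthree\n")?;
    let n = count_newlines_in_range(&path, 0, 14)?;
    println!("newlines = {}", n);
    Ok(())
}
```

The per-byte `bytes()` iterator wraps every byte in an `io::Result`, which the optimizer does not always eliminate; scanning slices keeps the hot loop over plain `u8`s.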

1 Answer

nik88:

Per @ShadowMaster: the fix was simply adding the --release flag to the build.
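For anyone benchmarking Rust against C utilities like wc: debug builds (the cargo default) skip optimizations, so timing them is misleading. The binary name below is hypothetical:

```shell
# Build with optimizations, then run the release binary.
cargo build --release
./target/release/mywc big_file.txt

# Or build and run in one step:
cargo run --release -- big_file.txt
```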