Now that the Read::chars iterator has been officially deprecated, what is the proper way to obtain an iterator over the chars coming from a reader like stdin without reading the entire stream into memory?
How can I create an efficient iterator of chars from stdin with Rust?
Asked by maxcountryman. There are 2 answers below.
The corresponding issue for deprecation nicely sums up the problems with Read::chars and offers suggestions:
Code that does not care about processing data incrementally can use Read::read_to_string instead. Code that does care presumably also wants to control its buffering strategy and work with &[u8] and &str slices that are as large as possible, rather than one char at a time. It should be based on the str::from_utf8 function as well as the valid_up_to and error_len methods of the Utf8Error type. One tricky aspect is dealing with cases where a single char is represented in UTF-8 by multiple bytes where those bytes happen to be split across separate read calls / buffer chunks. (Utf8Error::error_len returning None indicates that this may be the case.) The utf-8 crate solves this, but in order to be flexible provides an API that probably has too much surface to be included in the standard library.
Of course the above is for data that is always UTF-8. If other character encoding need to be supported, consider using the encoding_rs or encoding crate.
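The chunked strategy described in that quote can be sketched roughly as follows. This is only a minimal sketch under my own assumptions, not the utf-8 crate's API: the function name for_each_str and its chunk_size parameter are hypothetical, and only Read::read, str::from_utf8, Utf8Error::valid_up_to and Utf8Error::error_len come from the standard library. It hands out the largest valid &str it can after each read and carries an incomplete trailing character over to the next read.

```rust
use std::io::{self, ErrorKind, Read};
use std::str;

// Hypothetical helper (not a std or utf-8 crate API): reads `rdr` in chunks
// and passes each run of complete, valid UTF-8 to `f` as a `&str`.
fn for_each_str<R: Read>(
    mut rdr: R,
    chunk_size: usize,
    mut f: impl FnMut(&str),
) -> io::Result<()> {
    // At least 4 bytes, so a carried-over partial char (at most 3 bytes)
    // always leaves room for the next read.
    let mut buf = vec![0u8; chunk_size.max(4)];
    let mut len = 0; // bytes currently held at the front of `buf`
    loop {
        // Simplification: an ErrorKind::Interrupted error is returned to the
        // caller here instead of being retried.
        let n = rdr.read(&mut buf[len..])?;
        if n == 0 {
            // End of input: leftover bytes are a truncated character.
            return if len == 0 {
                Ok(())
            } else {
                Err(io::Error::new(ErrorKind::InvalidData, "input ended mid-character"))
            };
        }
        len += n;
        let valid = match str::from_utf8(&buf[..len]) {
            Ok(_) => len,
            // `error_len()` is None: the tail may be a char split across reads.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            // Otherwise the bytes are definitely not UTF-8.
            Err(e) => return Err(io::Error::new(ErrorKind::InvalidData, e)),
        };
        if valid > 0 {
            // Re-validates the prefix for simplicity; it is known to be valid.
            f(str::from_utf8(&buf[..valid]).expect("prefix was validated"));
        }
        // Move the incomplete tail to the front and keep reading.
        buf.copy_within(valid..len, 0);
        len -= valid;
    }
}
```

A caller could pass io::stdin().lock() as the reader and iterate s.chars() inside the closure. The closure-based shape is a design shortcut: it avoids returning borrowed slices from an iterator, which is the part that makes a general-purpose API awkward.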
Your own iterator
The most efficient solution in terms of number of I/O calls is to read everything into a giant buffer String and iterate over that:
use std::io::{self, Read};

fn main() {
    let stdin = io::stdin();
    let mut s = String::new();
    stdin.lock().read_to_string(&mut s).expect("Couldn't read");
    for c in s.chars() {
        println!(">{}<", c);
    }
}
You can combine this with an answer from "Is there an owned version of String::chars?":
use std::io::{self, Read};

fn reader_chars<R: Read>(mut rdr: R) -> io::Result<impl Iterator<Item = char>> {
    let mut s = String::new();
    rdr.read_to_string(&mut s)?;
    Ok(s.into_chars()) // from https://stackoverflow.com/q/47193584/155423
}

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    for c in reader_chars(stdin.lock())? {
        println!(">{}<", c);
    }
    Ok(())
}
We now have a function that returns an iterator of chars for any type that implements Read.
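The into_chars call above relies on a helper from the linked question, which is not reproduced here. As a rough illustration of what such an owned iterator could look like, here is a minimal sketch; the names OwnedChars and owned_chars are hypothetical and this is not necessarily how the linked answer does it. It keeps the String and a byte offset, and decodes one char per call.

```rust
// Hypothetical owned counterpart to `str::chars`: it owns the String, so it
// can be returned from a function without borrowing a local.
struct OwnedChars {
    s: String,
    pos: usize, // byte offset of the next char; always on a char boundary
}

impl Iterator for OwnedChars {
    type Item = char;

    fn next(&mut self) -> Option<char> {
        // Decode the next char from the remaining slice, if any.
        let c = self.s[self.pos..].chars().next()?;
        self.pos += c.len_utf8();
        Some(c)
    }
}

// Hypothetical constructor used in place of `s.into_chars()`.
fn owned_chars(s: String) -> OwnedChars {
    OwnedChars { s, pos: 0 }
}
```

With this in scope, Ok(owned_chars(s)) would play the role of Ok(s.into_chars()) in reader_chars.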
Once you have this pattern, it's just a matter of deciding where to make the tradeoff of memory allocation vs I/O requests. Here's a similar idea that uses line-sized buffers:
use std::io::{BufRead, BufReader, Read};

fn reader_chars<R: Read>(rdr: R) -> impl Iterator<Item = char> {
    // We use 6 bytes here to force emoji to be segmented for demo purposes
    // Pick more appropriate size for your case
    let reader = BufReader::with_capacity(6, rdr);
    reader
        .lines()
        .flat_map(|l| l) // Ignoring any errors
        .flat_map(|s| s.into_chars()) // from https://stackoverflow.com/q/47193584/155423
}

fn main() {
    // emoji are 4 bytes each
    let data = "";
    let data = data.as_bytes();
    for c in reader_chars(data) {
        println!(">{}<", c);
    }
}
The far extreme would be to perform one I/O request for every character. This wouldn't take much memory, but would have a lot of I/O overhead.
A pragmatic answer
Copy and paste the implementation of Read::chars into your own code. It will work as well as it used to.
As a couple of others have mentioned, it is possible to copy the deprecated implementation of Read::chars for use in your own code. Whether this is truly ideal will depend on your use case. For me, this proved to be good enough for now, although it is likely that my application will outgrow this approach in the near future.
To illustrate how this can be done, let's look at a concrete example:
The above code also requires read_one_byte and utf8_char_width to be implemented. Those should look something like:
Now we can use the MyReader implementation to produce an iterator of chars over some reader, like io::Stdin:
The limitations of this approach are discussed at length in the original issue thread. One particular concern worth pointing out, however, is that this iterator will not handle non-UTF-8 encoded streams correctly.
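To make the approach concrete, here is a minimal sketch of what such a copy might look like. It is my own reconstruction modeled on the general shape of the deprecated Read::chars, not the exact std source: the names MyReader, read_one_byte and utf8_char_width follow the text above, while the not_utf8 helper, the use of io::Error for both failure cases and the use of read_exact for continuation bytes are assumptions made to keep it short.

```rust
use std::io::{self, ErrorKind, Read};
use std::str;

// Wrapper whose Iterator impl yields one char per call (name from the text above).
struct MyReader<R> {
    inner: R,
}

// Reads a single byte, retrying on interruption; None means end of input.
fn read_one_byte<R: Read>(rdr: &mut R) -> Option<io::Result<u8>> {
    let mut buf = [0u8];
    loop {
        return match rdr.read(&mut buf) {
            Ok(0) => None,
            Ok(_) => Some(Ok(buf[0])),
            Err(ref e) if e.kind() == ErrorKind::Interrupted => continue,
            Err(e) => Some(Err(e)),
        };
    }
}

// Length in bytes of the UTF-8 sequence introduced by `first`,
// or 0 if `first` cannot start a sequence.
fn utf8_char_width(first: u8) -> usize {
    match first {
        0x00..=0x7F => 1,
        0xC2..=0xDF => 2,
        0xE0..=0xEF => 3,
        0xF0..=0xF4 => 4,
        _ => 0,
    }
}

// Hypothetical helper: the error reported for malformed input.
fn not_utf8() -> io::Error {
    io::Error::new(ErrorKind::InvalidData, "stream did not contain valid UTF-8")
}

impl<R: Read> Iterator for MyReader<R> {
    type Item = io::Result<char>;

    fn next(&mut self) -> Option<Self::Item> {
        // `?` ends the iteration cleanly at end of input.
        let first = match read_one_byte(&mut self.inner)? {
            Ok(b) => b,
            Err(e) => return Some(Err(e)),
        };
        let width = utf8_char_width(first);
        if width == 0 {
            return Some(Err(not_utf8()));
        }
        let mut buf = [first, 0, 0, 0];
        // Fetch the continuation bytes; a truncated stream yields UnexpectedEof.
        if let Err(e) = self.inner.read_exact(&mut buf[1..width]) {
            return Some(Err(e));
        }
        Some(match str::from_utf8(&buf[..width]) {
            Ok(s) => Ok(s.chars().next().expect("width is at least 1")),
            Err(_) => Err(not_utf8()),
        })
    }
}
```

Usage would be along the lines of iterating over MyReader { inner: io::stdin().lock() } and handling each io::Result<char>. Because every char costs one or two read calls, the inner reader should be buffered (the stdin lock already is; a file would want a BufReader).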