How can I read characters until a specific one in Java?


I want to read a few words from a file. I didn't find any method to do this, so I decided to read char by char, but I need to stop at spaces so I can store the word I just read in my array and move on to the next one.

I'm writing an external sorting application, so I have a memory limit. That means I can't just use readLine() and then split(); I need control over how much I read.

The read() method returns an int, and I don't know how to get a char out of read() and stop reading after a space.

This is my code so far:

protected static String [] readWords(String arqName, int amountOfWords) throws IOException {
    FileReader arq = new FileReader(arqName);
    BufferedReader lerArq = new BufferedReader(arq);

    String[] words = new String[amountOfWords];

    for (int i = 0; i < amountOfWords; i++){
        //words[i] = lerArq.read();
    }

    return words;
}

Edit 1: I used a Scanner and its next() method, and it worked. The Scanner is initialized in main.

static String [] readWords(int amountOfWords, Scanner leitor) throws IOException {
    String[] words= new String[amountOfWords];

    for (int i = 0; i < amountOfWords; i++){
        words[i] = leitor.next();
    }

    return words;
}
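
For reference, here is a self-contained sketch of the same Scanner approach. It assumes the input might run out before amountOfWords words have been read, so it checks hasNext() and shortens the array instead of letting next() throw. The class name ScannerWords and the sample text are made up for illustration, and a String stands in for the file.

```java
import java.util.Arrays;
import java.util.Scanner;

class ScannerWords {
    // Reads up to amountOfWords whitespace-separated tokens from the scanner.
    // If the input ends early, the returned array is shortened accordingly.
    static String[] readWords(int amountOfWords, Scanner scanner) {
        String[] words = new String[amountOfWords];
        int count = 0;
        while (count < amountOfWords && scanner.hasNext()) {
            words[count++] = scanner.next();
        }
        return Arrays.copyOf(words, count);
    }

    public static void main(String[] args) {
        // A String source stands in for the file here (assumption for the demo).
        Scanner scanner = new Scanner("alpha beta\ngamma delta");
        System.out.println(Arrays.toString(readWords(3, scanner))); // first batch of words
        System.out.println(Arrays.toString(readWords(3, scanner))); // whatever is left
    }
}
```

Because the same Scanner is passed to each call, every batch continues where the previous one stopped, which is what a memory-limited external sort needs.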

There are 2 best solutions below

Best answer

If you want to read it char by char (so you have more control over what you want to store and what you don't), you could try something like this:

import java.io.BufferedReader;
import java.io.IOException;

[...]

public static String readNextWord(BufferedReader reader) throws IOException {
    StringBuilder builder = new StringBuilder();

    int currentData;

    do {
        currentData = reader.read();

        if(currentData < 0) {
            if(builder.length() == 0) {
                return null;
            }
            else {
                return builder.toString();
            }
        }
        else if(currentData != ' ') {
            /* Since you're talking about words, here you can apply
             * a filter to ignore chars like ',', '.', '\n', etc. */

            builder.append((char) currentData);
        }

    } while (currentData != ' ' || builder.length() == 0);

    return builder.toString();
}

And then call it like this:

String[] words = new String[amountOfWordsToRead];

for (int i = 0; i < amountOfWordsToRead; i++){
    words[i] = readNextWord(yourBufferedReader);
}
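
The method above treats only the space character as a delimiter, so a newline or tab ends up inside a word. If that is not what you want, one possible variant is sketched below. It assumes any whitespace should separate words and uses Character.isWhitespace for the check. The class name WordReader and the sample input are only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class WordReader {
    // Returns the next whitespace-delimited word, or null at end of input.
    static String readNextWord(Reader reader) throws IOException {
        StringBuilder builder = new StringBuilder();
        int current;
        while ((current = reader.read()) != -1) {
            if (Character.isWhitespace(current)) {
                if (builder.length() > 0) {
                    break; // whitespace after a word: the word is complete
                }
                // leading whitespace: skip it and keep reading
            } else {
                builder.append((char) current);
            }
        }
        return builder.length() == 0 ? null : builder.toString();
    }

    public static void main(String[] args) throws IOException {
        // A StringReader stands in for the file (assumption for the demo).
        try (BufferedReader reader =
                 new BufferedReader(new StringReader("  one two\nthree\t four "))) {
            String word;
            while ((word = readNextWord(reader)) != null) {
                System.out.println(word);
            }
        }
    }
}
```

Only one word is held in memory at a time, so the caller decides how many words to keep before writing a sorted chunk out.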

Second answer

Maybe this will be helpful.

It's not a problem to use read(). Just cast the result to a character:

...
for (int i = 0; i < memTam; i++) {
      // this should work. you will get the actual character
      int current = lerArq.read();
      if (current != -1) {
          char c = (char) current;
          // then you can do what you need with this character
      }
}
...

The method returns the character read, as an integer in the range 0 to 65535, or -1 if the end of the stream has been reached.
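
To tie this back to the question in the title, here is a rough sketch of reading until a specific character. It keeps the value as an int until it has been compared with -1, because casting -1 to char would produce '\uffff' and hide the end of the stream. The helper readUntil and the sample input are hypothetical names and data chosen for this example.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class ReadUntil {
    // Reads characters up to, but not including, the delimiter.
    // The delimiter itself is consumed. Returns null if the stream is already at its end.
    static String readUntil(Reader reader, char delimiter) throws IOException {
        int current = reader.read();
        if (current == -1) {
            return null;
        }
        StringBuilder builder = new StringBuilder();
        while (current != -1 && current != delimiter) {
            builder.append((char) current); // safe: current is known to be 0..65535 here
            current = reader.read();
        }
        return builder.toString();
    }

    public static void main(String[] args) throws IOException {
        Reader reader = new StringReader("key=value;rest");
        System.out.println(readUntil(reader, '=')); // prints key
        System.out.println(readUntil(reader, ';')); // prints value
        System.out.println(readUntil(reader, ';')); // prints rest
        System.out.println(readUntil(reader, ';')); // prints null
    }
}
```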

I won't add a lot of theory about encodings, how it's done in Java, etc. because I am not aware of some very low-level details. I have a basic high-level understanding of how it works.

Every character you can type has a number associated with it. For example, A is the number 65. This mapping is standardized and recognized worldwide.

At this point, I hope you agree it's not that weird that the read() method returns a number and not the actual character :)
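
A small sketch of that relationship in Java: a char widens to an int without a cast, while going from int back to char needs an explicit cast. The class and method names are made up for this example.

```java
class CharCodes {
    // A char converts to its numeric code implicitly (widening conversion).
    static int codeOf(char c) {
        return c;
    }

    // Going from a number back to a char needs an explicit (narrowing) cast.
    static char charOf(int code) {
        return (char) code;
    }

    public static void main(String[] args) {
        System.out.println(codeOf('A'));      // prints 65
        System.out.println(codeOf(' '));      // prints 32
        System.out.println(charOf(97));       // prints a
        System.out.println(charOf('A' + 1));  // prints B
    }
}
```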

There is something called the ASCII table, which lists the codes (numbers) for the characters on a standard English keyboard, plus a set of control characters.

Here it is, just to show how it looks:

Dec  Char                           Dec  Char     Dec  Char     Dec  Char
---------                           ---------     ---------     ----------
  0  NUL (null)                      32  SPACE     64  @         96  `
  1  SOH (start of heading)          33  !         65  A         97  a
  2  STX (start of text)             34  "         66  B         98  b
  3  ETX (end of text)               35  #         67  C         99  c
  4  EOT (end of transmission)       36  $         68  D        100  d
  5  ENQ (enquiry)                   37  %         69  E        101  e
  6  ACK (acknowledge)               38  &         70  F        102  f
  7  BEL (bell)                      39  '         71  G        103  g
  8  BS  (backspace)                 40  (         72  H        104  h
  9  TAB (horizontal tab)            41  )         73  I        105  i
 10  LF  (NL line feed, new line)    42  *         74  J        106  j
 11  VT  (vertical tab)              43  +         75  K        107  k
 12  FF  (NP form feed, new page)    44  ,         76  L        108  l
 13  CR  (carriage return)           45  -         77  M        109  m
 14  SO  (shift out)                 46  .         78  N        110  n
 15  SI  (shift in)                  47  /         79  O        111  o
 16  DLE (data link escape)          48  0         80  P        112  p
 17  DC1 (device control 1)          49  1         81  Q        113  q
 18  DC2 (device control 2)          50  2         82  R        114  r
 19  DC3 (device control 3)          51  3         83  S        115  s
 20  DC4 (device control 4)          52  4         84  T        116  t
 21  NAK (negative acknowledge)      53  5         85  U        117  u
 22  SYN (synchronous idle)          54  6         86  V        118  v
 23  ETB (end of trans. block)       55  7         87  W        119  w
 24  CAN (cancel)                    56  8         88  X        120  x
 25  EM  (end of medium)             57  9         89  Y        121  y
 26  SUB (substitute)                58  :         90  Z        122  z
 27  ESC (escape)                    59  ;         91  [        123  {
 28  FS  (file separator)            60  <         92  \        124  |
 29  GS  (group separator)           61  =         93  ]        125  }
 30  RS  (record separator)          62  >         94  ^        126  ~
 31  US  (unit separator)            63  ?         95  _        127  DEL

So, imagine you have a .txt file with some text - all the letters have corresponding numbers.

The problem with ASCII is that ASCII defines 128 characters, which map to the numbers 0–127 (all of the upper-case letters, lower-case letters, 0-9 digits and a few more symbols).

But there are many more different characters/symbols in the world (different alphabets, emoji, etc.), so there has to be another encoding system to represent them all.

It is called Unicode. For the characters with codes 0-127, Unicode is identical to ASCII, but overall it can represent a far wider range of symbols.

In Java, the char data type (and therefore the value that a Character object encapsulates) are based on the original Unicode specification, which defined characters as fixed-width 16-bit entities. You can check more details in this javadoc. In other words, Strings in Java are represented in UTF-16: most characters take one char, and characters beyond the 16-bit range take two chars (a surrogate pair).
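
As a quick illustration of the 16-bit point, the sketch below compares the number of chars in a String with the number of Unicode code points. The emoji U+1F600 is written as two escaped chars so the source stays ASCII. The class and method names are invented for this example.

```java
class Utf16Demo {
    // Number of 16-bit char values (UTF-16 code units) in the string.
    static int charCount(String s) {
        return s.length();
    }

    // Number of Unicode code points, counting each surrogate pair once.
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String letter = "A";
        String emoji = "\uD83D\uDE00"; // U+1F600 stored as a surrogate pair

        System.out.println(charCount(letter));      // prints 1
        System.out.println(codePointCount(letter)); // prints 1
        System.out.println(charCount(emoji));       // prints 2
        System.out.println(codePointCount(emoji));  // prints 1
        System.out.println(Character.isHighSurrogate(emoji.charAt(0))); // prints true
    }
}
```

Since read() returns one 16-bit value per call, a character like this arrives as two consecutive reads. For splitting on spaces that makes no difference, because a space is always a single char.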

I hope that, after this long story, it makes sense why you get numbers from read() and why you can cast them to char. Again, this is only a high-level overview. Happy coding :)