How to Read a File Character by Character in C

Task

Read a file character by character/UTF8
You are encouraged to solve this task according to the task description, using any language you may know.

Task

Read a file one character at a time, as opposed to reading the entire file at once.

The solution may be implemented as a procedure, which returns the next character in the file on each consecutive call (returning EOF when the end of the file is reached).

The procedure should support the reading of files containing UTF8 encoded wide characters, returning whole characters for each consecutive read.
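
As a rough illustration of the procedure described above, here is a minimal Python sketch. The names `make_char_reader` and `next_char`, and the temporary demo file, are assumptions made for this example only; `None` stands in for EOF.

```python
import os
import tempfile

def make_char_reader(path):
    """Return a function that gives the next character of a UTF-8 file
    on each call, and None once the end of the file is reached."""
    f = open(path, "r", encoding="utf-8", newline="")

    def next_char():
        if f.closed:            # already hit end of file on an earlier call
            return None
        ch = f.read(1)          # text mode: read(1) returns one decoded character
        if ch == "":            # an empty string signals end of file
            f.close()
            return None
        return ch

    return next_char

# Hypothetical demo file, created only for this example.
with tempfile.NamedTemporaryFile("w", encoding="utf-8", newline="",
                                 suffix=".txt", delete=False) as tmp:
    tmp.write("€50\n")
    demo_path = tmp.name

next_char = make_char_reader(demo_path)
chars = []
while (c := next_char()) is not None:
    chars.append(c)
os.remove(demo_path)
print(chars)
```

Opening the file in text mode with an explicit encoding lets the runtime reassemble multi-byte sequences, so each call returns a whole character rather than a single byte.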

Related task

Read a file line by line

1 Action!
2 AutoHotkey
3 BASIC256
4 C
5 C++
6 C#
7 Common Lisp
8 Crystal
9 Delphi
10 Déjà Vu
11 Factor
12 FreeBASIC
13 FunL
14 Go
15 Haskell
16 J
17 Java
18 jq
19 Julia
20 Kotlin
21 Lua
22 M2000 Interpreter
23 Mathematica/Wolfram Language
24 NetRexx
25 Nim
26 Pascal
27 Perl
28 Phix
29 PicoLisp
30 Python
31 Racket
32 Raku
33 REXX
- 33.1 version 1
- 33.2 version 2
34 Ring
35 Ruby
36 Run BASIC
37 Rust
38 Seed7
39 Sidef
40 Smalltalk
41 Tcl
42 Wren
43 zkl

Action!

byte X

Proc Main()
 Open (1,"D:FILENAME.TXT",4,0)
 Do
  X=GetD(1)
  Put(X)
  Until EOF(1)
 Od
 Close(1)
Return

AutoHotkey

File := FileOpen("input.txt", "r")
while !File.AtEOF
    MsgBox, % File.Read(1)

BASIC256

f = freefile
filename$ = "file.txt"

open f, filename$

while not eof(f)
    print chr(readbyte(f));
end while
close f
end

C

#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>

int main(void)
{
    /* If your native locale doesn't use UTF-8 encoding
     * you need to replace the empty string with a
     * locale like "en_US.utf8"
     */
    char *locale = setlocale(LC_ALL, "");
    FILE *in = fopen("input.txt", "r");

    /* Bail out early if the locale or the file is unavailable. */
    if (locale == NULL || in == NULL)
        return EXIT_FAILURE;

    wint_t c;
    while ((c = fgetwc(in)) != WEOF)
        putwchar(c);
    fclose(in);

    return EXIT_SUCCESS;
}

C++

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <locale>

int main()
{
    /* If your native locale doesn't use UTF-8 encoding
     * you need to replace the empty string with a
     * locale like "en_US.utf8"
     */
    std::locale::global(std::locale(""));  // for C++
    std::wcout.imbue(std::locale());
    // A wide stream created after the call above picks up the global locale.
    std::wifstream in("input.txt");

    wchar_t c;
    while (in.get(c))   // stops at end of file or on a read error
        std::wcout << c;
    in.close();

    return EXIT_SUCCESS;
}

C#

using System;
using System.IO;
using System.Text;

namespace RosettaFileByChar
{
    class Program
    {
        static char GetNextCharacter(StreamReader streamReader) => (char)streamReader.Read();

        static void Main(string[] args)
        {
            Console.OutputEncoding = Encoding.UTF8;
            char c;
            using (FileStream fs = File.OpenRead("input.txt"))
            {
                using (StreamReader streamReader = new StreamReader(fs, Encoding.UTF8))
                {
                    while (!streamReader.EndOfStream)
                    {
                        c = GetNextCharacter(streamReader);
                        Console.Write(c);
                    }
                }
            }
        }
    }
}

Common Lisp

;; CLISP puts the external formats into a separate package
#+clisp (import 'charset:utf-8 'keyword)

(with-open-file (s "input.txt" :external-format :utf-8)
  (loop for c = (read-char s nil)
        while c
        do (format t "~a" c)))

Crystal

Translation of: Ruby

The encoding is UTF-8 by default, but it can be explicitly specified.

File.open("input.txt") do |file|
  file.each_char { |c| p c }
end

File.open("input.txt") do |file|
  while c = file.read_char
    p c
  end
end

Delphi

Translation of: C#

program Read_a_file_character_by_character_UTF8;

{$APPTYPE CONSOLE}

uses
  System.SysUtils,
  System.Classes;

function GetNextCharacter(StreamReader: TStreamReader): char;
begin
  Result := chr(StreamReader.Read);
end;

const
  FileName: TFileName = 'input.txt';

begin
  if not FileExists(FileName) then
    raise Exception.Create('Error: File not exist.');

  var F := TStreamReader.Create(FileName, TEncoding.UTF8);

  while not F.EndOfStream do
  begin
    var c := GetNextCharacter(F);
    write(c);
  end;
  readln;
end.

Déjà Vu

#helper function that deals with non-ASCII code points
local (read-utf8-char) file tmp:
	!read-byte file
	if = :eof dup:
		drop
		raise :unicode-error
	resize-blob tmp ++ dup len tmp
	set-to tmp
	try:
		return !decode!utf-8 tmp
	catch unicode-error:
		if < 3 len tmp:
			raise :unicode-error
		(read-utf8-char) file tmp

#reader function
read-utf8-char file:
	!read-byte file
	if = :eof dup:
		return
	local :tmp make-blob 1
	set-to tmp 0
	try:
		return !decode!utf-8 tmp
	catch unicode-error:
		(read-utf8-char) file tmp

#if the module is used as a script, read from the file "input.txt",
#showing each code point separately
if = (name) :(main):
	local :file !open :read "input.txt"

	while true:
		read-utf8-char file
		if = :eof dup:
			drop
			!close file
			return
		!.

Factor

USING: kernel io io.encodings.utf8 io.files strings ;
IN: rosetta-code.read-one

"input.txt" utf8 [
    [ read1 dup ] [ 1string write ] while drop
] with-file-reader

FreeBASIC

Dim As Long f
f = Freefile

Dim As String filename = "file.txt"
Dim As String*1 txt

Open filename For Binary As #f
While Not Eof(f)
    txt = String(Lof(f), 0)
    Get #f, , txt
    Print txt;
Wend
Close #f
Sleep

FunL

import io.{InputStreamReader, FileInputStream}

r = InputStreamReader( FileInputStream('input.txt'), 'UTF-8' )

while (ch = r.read()) != -1
  print( chr(ch) )

r.close()

Go

package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// Runer wraps a RuneReader in a function that returns the next rune on each call.
func Runer(r io.RuneReader) func() (rune, error) {
	return func() (c rune, err error) {
		c, _, err = r.ReadRune()
		return
	}
}

func main() {
	runes := Runer(bufio.NewReader(os.Stdin))
	// Keep going while reads succeed; io.EOF (or any other error) ends the loop.
	for r, err := runes(); err == nil; r, err = runes() {
		fmt.Printf("%c", r)
	}
}

Haskell

Works with: GHC version 7.8.3

#!/usr/bin/env runhaskell

{- The procedure to read a UTF-8 character is just:

     hGetChar :: Handle -> IO Char

   assuming that the encoding for the handle has been set to utf8.
-}

import System.Environment (getArgs)
import System.IO (
    Handle, IOMode(..),
    hGetChar, hIsEOF, hSetEncoding, stdin, utf8, withFile
  )
import Control.Monad (forM_, unless)
import Text.Printf (printf)
import Data.Char (ord)

processCharacters :: Handle -> IO ()
processCharacters h = do
    done <- hIsEOF h
    unless done $ do
        c <- hGetChar h
        putStrLn $ printf "U+%04X" (ord c)
        processCharacters h

processOneFile :: Handle -> IO ()
processOneFile h = do
    hSetEncoding h utf8
    processCharacters h

{- You can specify one or more files on the command line, or if no
   files are specified, it reads from standard input.
-}
main :: IO ()
main = do
    args <- getArgs
    case args of
        [] -> processOneFile stdin
        xs -> forM_ xs $ \name -> do
            putStrLn name
            withFile name ReadMode processOneFile

bash$ echo €50 | ./read-char-utf8.hs
U+20AC
U+0035
U+0030
U+000A

J

Reading a file a character at a time is antithetical not only to the architecture of J, but to the architecture and design of most computers and most file systems. Nevertheless, this can be a useful concept if you're building your own hardware. So let's model it...

First, we know that the first 8-bit value in a utf-8 sequence tells us the length of the sequence needed to represent that character. Specifically: we can convert that value to binary, and count the number of leading 1s to find the length of the character (except the length is always at least one character long).

u8len=: 1 >. 0 i.~ (8#2)#:a.&i.
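
For readers unfamiliar with J, the same leading-ones rule can be sketched in Python. This is an illustrative rendering, not a translation endorsed by the J entry; the function name `u8len` is reused only for readability.

```python
def u8len(first_byte: int) -> int:
    """Length in bytes of the UTF-8 sequence that starts with first_byte.

    Counts the leading 1 bits of the byte, with a minimum of 1 so that
    plain ASCII bytes (leading bit 0) report a length of one.
    """
    bits = format(first_byte, "08b")            # e.g. 0xE2 -> "11100010"
    leading_ones = bits.index("0") if "0" in bits else 8
    return max(1, leading_ones)

print(u8len(0x41), u8len(0xC3), u8len(0xE2), u8len(0xF0))
```

The four sample bytes are the first octets of "A", "ö", "€" and "𝄞" respectively.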

So now, we can use indexed file read to read a utf-8 character starting at a specific file index. What we do is read the first octet and then read as many additional characters as we need based on whatever we started with. If that's not possible, we will return EOF:

indexedread1u8=:4 :0
  try.
    octet0=. 1!:11 y;x,1
    octet0,1!:11 y;(x+1),<:u8len octet0
  catch.
    'EOF'
  end.
)
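
The indexed read can be approximated in Python as follows. This is a sketch under stated assumptions: `indexed_read1_u8` and the demo file are hypothetical names for this example, the function returns the raw bytes of one character, and it returns `None` at end of file rather than the string 'EOF'.

```python
import os
import tempfile

def u8len(first_byte):
    """Leading-ones rule: byte length of the sequence starting with first_byte."""
    bits = format(first_byte, "08b")
    leading_ones = bits.index("0") if "0" in bits else 8
    return max(1, leading_ones)

def indexed_read1_u8(path, index):
    """Return the bytes of the UTF-8 character at byte offset index,
    or None if the file ends before a whole character can be read."""
    with open(path, "rb") as f:
        f.seek(index)
        octet0 = f.read(1)
        if not octet0:
            return None
        needed = u8len(octet0[0]) - 1
        rest = f.read(needed)
        if len(rest) < needed:      # truncated sequence at end of file
            return None
        return octet0 + rest

# Hypothetical demo file, created only for this example.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write("€5".encode("utf-8"))
    demo_path = tmp.name

index, decoded = 0, []
while (raw := indexed_read1_u8(demo_path, index)) is not None:
    decoded.append(raw.decode("utf-8"))
    index += len(raw)               # length of the result advances the index
os.remove(demo_path)
print(decoded, index)
```

Reopening the file on every call mirrors the stateless indexed read and is just as inefficient.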

The length of the result tells us what to add to the file index to find the next available file index for reading.

Of course, this is massively inefficient. So if someone ever asks you to do this, make sure you ask them "Why?" Because the answer to that question is going to be important (and might suggest a completely different implementation).

Note also that it would make more sense to return an empty string, instead of the string 'EOF', when we reach the end of the file. But that is out of scope for this task.

Java

import java.io.FileReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class Main {

    public static void main(String[] args) throws IOException {
        var reader = new FileReader("input.txt", StandardCharsets.UTF_8);
        while (true) {
            int c = reader.read();
            if (c == -1) break;
            System.out.print(Character.toChars(c));
        }
    }
}

jq

jq being stream-oriented, it makes sense to define `readc` so that it emits a stream of the UTF-8 characters in the input:

def readc:
  inputs + "\n" | explode[] | [.] | implode;

Example:

echo '过活' | jq -Rn 'include "readc"; readc'
"过"
"活"
"\n"

Julia

The built-in read(stream, Char) function reads a single UTF8-encoded character from a given stream.

open("myfilename") do f
    while !eof(f)
        c = read(f, Char)
        println(c)
    end
end

Kotlin

// version 1.1.2

import java.io.File

const val EOF = -1

fun main(args: Array<String>) {
    val reader = File("input.txt").reader()  // uses UTF-8 by default
    reader.use {
        while (true) {
            val c = reader.read()
            if (c == EOF) break
            print(c.toChar())  // echo to console
        }
    }
}

Lua

Works with: Lua version 5.3

-- Return whether the given string is a single ASCII character.
function is_ascii (str)
  return string.match(str, "[\0-\x7F]")
end

-- Return whether the given string is an initial byte in a multibyte sequence.
function is_init (str)
  return string.match(str, "[\xC2-\xF4]")
end

-- Return whether the given string is a continuation byte in a multibyte sequence.
function is_cont (str)
  return string.match(str, "[\x80-\xBF]")
end

-- Accept a filestream.
-- Return the next UTF8 character in the file.
function read_char (file)
  local multibyte -- build a valid multibyte Unicode character

  for c in file:lines(1) do
    if is_ascii(c) then
      if multibyte then
        -- We've finished reading a Unicode character; unread the next byte,
        -- and return the Unicode character.
        file:seek("cur", -1)
        return multibyte
      else
        return c
      end
    elseif is_init(c) then
      if multibyte then
        file:seek("cur", -1)
        return multibyte
      else
        multibyte = c
      end
    elseif is_cont(c) then
      multibyte = multibyte .. c
    else
      assert(false)
    end
  end
end

-- Test.
function read_all ()
  testfile = io.open("tmp.txt", "w")
  testfile:write("𝄞AöЖ€𝄞Ελληνικάyä®€成长汉\n")
  testfile:close()
  testfile = io.open("tmp.txt", "r")

  while true do
    local c = read_char(testfile)
    if not c then return else io.write(" ", c) end
  end
end

𝄞 A ö Ж € 𝄞 Ε λ λ η ν ι κ ά y ä ® € 成 长 汉

M2000 Interpreter

From revision 27, version 9.3, of M2000 Environment, Chinese 长 letter displayed in console (as displayed in editor)

Module checkit {
      \\ prepare a file
      \\ Save.Doc and Append.Doc  to file, Load.Doc and Merge.Doc from file
      document a$
      a$={First Line
            Second line
            Third Line
            Ελληνικά Greek Letters
            yä®€
            成长汉
            }
      Save.Doc a$, "checkthis.txt", 2  ' 2 for UTF-8
      b$="*"
      final$=""
      buffer Clear bytes as byte*16
      Buffer One as byte
      Buffer Two as byte*2
      Buffer Three as byte*3
      Locale 1033
      open "checkthis.txt" for input as #f
      seek#f, 4 ' skip BOM
      While b$<>"" {
            GetOneUtf8Char(&b$)
            final$+=b$
      }
      close #f
      Report final$
      Sub GetOneUtf8Char(&ch$)
            ch$=""
            if Eof(#f) then Exit Sub
            Get #f, One
            Return Bytes, 0:=Eval(one, 0)
            local mrk=Eval(one, 0)
            Try ok {
                  If Binary.And(mrk, 0xE0)=0xC0 then {
                        Get #f,one
                        Return Bytes, 1:=Eval$(one, 0,1)
                        ch$=Eval$(Bytes, 0, 2)
                  } Else.if Binary.And(mrk, 0xF0)=0xE0 then {
                        Get #f,two
                        Return Bytes, 1:=Eval$(two,0,2)
                        ch$=Eval$(Bytes, 0, 3)
                  } Else.if Binary.And(mrk, 0xF8)=0xF0 then {
                        Get #f,three
                        Return Bytes, 1:=Eval$(three, 0, 3)
                        ch$=Eval$(Bytes, 0, 4)
                  } Else ch$=Eval$(Bytes, 0, 1)
            }
            if Error or not ok then ch$="" : exit sub
            ch$=left$(string$(ch$ as Utf8dec),1)
      End Sub
}
checkit

Using a document as final$

Module checkit {
      \\ prepare a file
      \\ Save.Doc and Append.Doc  to file, Load.Doc and Merge.Doc from file
      document a$
      a$={First Line
            Second line
            Third Line
            Ελληνικά Greek Letters
            yä®€
            成长汉
            }
      Save.Doc a$, "checkthis.txt", 2  ' 2 for UTF-8
      b$="*"
      document final$
      buffer Clear bytes as byte*16
      Buffer One as byte
      Buffer Two as byte*2
      Buffer Three as byte*3
      Locale 1033
      open "checkthis.txt" for input as #f
      seek#f, 4 ' skip BOM
      oldb$=""
      While b$<>"" {
            GetOneUtf8Char(&b$)
            \\ if final$ is document then 10 and 13 if comes alone are new line
            \\ so we need to throw 10 after the 13, so we have to use oldb$
            if b$=chr$(10)  then if oldb$=chr$(13)  then  oldb$="": continue
            oldb$=b$
            final$=b$  ' we use = for append to document
      }
      close #f
      Report final$
      Sub GetOneUtf8Char(&ch$)
            ch$=""
            if Eof(#f) then Exit Sub
            Get #f, One
            Return Bytes, 0:=Eval(one, 0)
            local mrk=Eval(one, 0)
            Try ok {
                  If Binary.And(mrk, 0xE0)=0xC0 then {
                        Get #f,one
                        Return Bytes, 1:=Eval$(one, 0,1)
                        ch$=Eval$(Bytes, 0, 2)
                  } Else.if Binary.And(mrk, 0xF0)=0xE0 then {
                        Get #f,two
                        Return Bytes, 1:=Eval$(two,0,2)
                        ch$=Eval$(Bytes, 0, 3)
                  } Else.if Binary.And(mrk, 0xF8)=0xF0 then {
                        Get #f,three
                        Return Bytes, 1:=Eval$(three, 0, 3)
                        ch$=Eval$(Bytes, 0, 4)
                  } Else ch$=Eval$(Bytes, 0, 1)
            }
            if Error or not ok then ch$="" : exit sub
            ch$=left$(string$(ch$ as Utf8dec),1)
      End Sub
}
checkit

Mathematica/Wolfram Language

str = OpenRead["file.txt"];
ToString[Read[str, "Character"], CharacterEncoding -> "UTF-8"]

NetRexx

This example is incorrect. Please fix the code and remove this message.

Details: Maybe overengineered?

Works with: Java version 1.7

Java, and by extension NetRexx, provides I/O functions that read UTF-8 encoded character data directly from an attached input stream. The Reader.read() method reads a single character as an integer value in the range 0 – 65535 [0x00 – 0xffff]; reading from a file encoded in UTF-8 will read each codepoint into an int. In the sample below the readCharacters method reads the file character by character into a String and returns the result to the caller. The rest of this sample examines the result and formats the details.

The file data/utf8-001.txt is a UTF-8 encoded text file containing the following: yä®€𝄞𝄢12.
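
Because the range 0 – 65535 is a single 16-bit UTF-16 unit, characters outside the Basic Multilingual Plane, such as 𝄞 and 𝄢, need two such units (a surrogate pair), which is why the sample distinguishes string length from code point count. The following Python sketch only illustrates that arithmetic for the sample text; it is not part of the NetRexx program, and the variable names are chosen for this example.

```python
text = "yä®€𝄞𝄢12"

code_points = len(text)                            # Python strings count code points
utf16_units = len(text.encode("utf-16-be")) // 2   # 2 bytes per UTF-16 code unit
utf8_bytes = len(text.encode("utf-8"))             # size of the text on disk as UTF-8

print(code_points, utf16_units, utf8_bytes)
```

The two musical symbols each contribute one code point, two UTF-16 units and four UTF-8 bytes.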

          /* NetRexx */          
options supersede format comments coffee crossref symbols nobinary
          numeric          digits          20           runSample(arg)            
            render          
          -- ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~            
method readCharacters(fName)            public            static            binary            returns            String
            slurped = String(            ''            )            
            slrp = StringBuilder(            )            
            fr = Reader null
            fFile = File(fName)            
            EOF = int -ane            -- End Of File indicator            
            practise            
            fr = BufferedReader(FileReader(fFile)            )            
            ic = int
            cc = char
            -- read the contents of the file one character at a time            
            loop            label            rdr forever
            -- Reader.read reads a single grapheme every bit an integer value in the range 0 - 65535 [0x00 - 0xffff]            
            -- or -ane on cease of stream i.east. End Of File            
            ic = fr.read            (            )            
            if            ic == EOF            and then            go out            rdr
            cc = Rexx(ic).d2c            
            slrp.append            (cc)            
            end            rdr
            -- load the results of the read into a variable            
            slurped = slrp.toString            (            )            
            grab            fex = FileNotFoundException
            fex.printStackTrace            (            )            
            take hold of            iex = IOException
            iex.printStackTrace            (            )            
            finally            
            if            fr            \= naught            then            do            
            fr.close            (            )            
            catch            iex = IOException
            iex.printStackTrace            (            )            
            finish            
            end            
            render            slurped
          -- ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~            
method encodingDetails(str = String)            public            static            
            stlen = str.length            (            )            
            cplen = Character.codePointCount            (str,            0, stlen)            
            say            'Unicode: length="'stlen'" code_point_count="'cplen'" cord="'str'"'            
            loop            ix =            0            to stlen -            1            
            cp = Rexx(Character.codePointAt            (str, nine)            )            
            cc = Rexx(Character.charCount            (cp)            )            
            say            '  'formatCodePoint(ix, cc, cp)            
            if            cc            >            1            then            practice            
            surrogates =            [Rexx(Character.highSurrogate            (cp)            ).c2d            (            ), Rexx(Grapheme.lowSurrogate            (cp)            ).c2d            (            )            ]            
            loop            sx =            0            to cc -            i            
            ix = ix + sx
            cp = surrogates[sx]            
            say            '  'formatCodePoint(ix,            1, cp)            
            end            sx
            cease            
            cease            nine
            say            
            return          
          -- ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~            
            -- @see http://docs.oracle.com/javase/6/docs/technotes/guides/intl/encoding.doc.html            
            -- @since Java 1.7            
method formatCodePoint(ix, cc, cp) private static
  scp = Rexx(Character.toChars(cp))
  icp = cp.d2x(8).x2d(9) -- signed to unsigned conversion
  ocp = Rexx(Integer.toOctalString(icp))
  x_utf16 = ''
  x_utf8  = ''
  do
    b_utf16 = String(scp).getBytes('UTF-16BE')
    b_utf8  = String(scp).getBytes('UTF-8')
    loop bv = 0 to b_utf16.length - 1 by 2
      x_utf16 = x_utf16 Rexx(b_utf16[bv]).d2x(2) || Rexx(b_utf16[bv + 1]).d2x(2)
      end bv
    loop bv = 0 to b_utf8.length - 1
      x_utf8 = x_utf8 Rexx(b_utf8[bv]).d2x(2)
      end bv
    x_utf16 = x_utf16.space(1, ',')
    x_utf8  = x_utf8.space(1, ',')
  catch ex = UnsupportedEncodingException
    ex.printStackTrace()
  end
  cpName = Character.getName(cp)
  fmt = -
    'CodePoint:' -
    'index="'ix.right(3, 0)'"' -
    'character_count="'cc'"' -
    'id="U+'cp.d2x(5)'"' -
    'hex="0x'cp.d2x(6)'"' -
    'dec="'icp.right(7, 0)'"' -
    'oct="'ocp.right(7, 0)'"' -
    'char="'scp'"' -
    'utf-16="'x_utf16'"' -
    'utf-8="'x_utf8'"' -
    'name="'cpName'"'
  return fmt
          -- ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~            
method runSample(arg) public static
  parse arg fileNames
  if fileNames = '' then fileNames = 'data/utf8-001.txt'
  loop while fileNames \= ''
    parse fileNames fileName fileNames
    slurped = readCharacters(fileName)
    say "Input:" slurped
    encodingDetails(slurped)
    end
  say
  return

Input: yä®€𝄞𝄢12
Unicode: length="10" code_point_count="8" string="yä®€𝄞𝄢12"
  CodePoint: index="000" character_count="1" id="U+00079" hex="0x000079" dec="0000121" oct="0000171" char="y" utf-16="0079" utf-8="79" name="LATIN SMALL LETTER Y"
  CodePoint: index="001" character_count="1" id="U+000E4" hex="0x0000E4" dec="0000228" oct="0000344" char="ä" utf-16="00E4" utf-8="C3,A4" name="LATIN SMALL LETTER A WITH DIAERESIS"
  CodePoint: index="002" character_count="1" id="U+000AE" hex="0x0000AE" dec="0000174" oct="0000256" char="®" utf-16="00AE" utf-8="C2,AE" name="REGISTERED SIGN"
  CodePoint: index="003" character_count="1" id="U+020AC" hex="0x0020AC" dec="0008364" oct="0020254" char="€" utf-16="20AC" utf-8="E2,82,AC" name="EURO SIGN"
  CodePoint: index="004" character_count="2" id="U+1D11E" hex="0x01D11E" dec="0119070" oct="0350436" char="𝄞" utf-16="D834,DD1E" utf-8="F0,9D,84,9E" name="MUSICAL SYMBOL G CLEF"
  CodePoint: index="004" character_count="1" id="U+0D834" hex="0x00D834" dec="0055348" oct="0154064" char="?" utf-16="FFFD" utf-8="3F" name="HIGH SURROGATES D834"
  CodePoint: index="005" character_count="1" id="U+0DD1E" hex="0x00DD1E" dec="0056606" oct="0156436" char="?" utf-16="FFFD" utf-8="3F" name="LOW SURROGATES DD1E"
  CodePoint: index="006" character_count="2" id="U+1D122" hex="0x01D122" dec="0119074" oct="0350442" char="𝄢" utf-16="D834,DD22" utf-8="F0,9D,84,A2" name="MUSICAL SYMBOL F CLEF"
  CodePoint: index="006" character_count="1" id="U+0D834" hex="0x00D834" dec="0055348" oct="0154064" char="?" utf-16="FFFD" utf-8="3F" name="HIGH SURROGATES D834"
  CodePoint: index="007" character_count="1" id="U+0DD22" hex="0x00DD22" dec="0056610" oct="0156442" char="?" utf-16="FFFD" utf-8="3F" name="LOW SURROGATES DD22"
  CodePoint: index="008" character_count="1" id="U+00031" hex="0x000031" dec="0000049" oct="0000061" char="1" utf-16="0031" utf-8="31" name="DIGIT ONE"
  CodePoint: index="009" character_count="1" id="U+00032" hex="0x000032" dec="0000050" oct="0000062" char="2" utf-16="0032" utf-8="32" name="DIGIT TWO"
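
The surrogate values in this output (D834,DD1E and D834,DD22) follow from the standard UTF-16 arithmetic for code points above U+FFFF. The following is a minimal Python sketch of that arithmetic, not part of the NetRexx solution; the function name utf16_surrogates is made up for illustration.

```python
def utf16_surrogates(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    v = cp - 0x10000              # 20-bit offset into the supplementary range
    high = 0xD800 + (v >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


for cp in (0x1D11E, 0x1D122):     # MUSICAL SYMBOL G CLEF and F CLEF
    high, low = utf16_surrogates(cp)
    print("U+%05X -> %04X,%04X" % (cp, high, low))
# prints:
# U+1D11E -> D834,DD1E
# U+1D122 -> D834,DD22
```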

Nim

Like most system languages, Nim reads bytes and provides functions to decode bytes into Unicode runes. The normal way to read a stream of UTF-8 characters would be to read the file line by line and decode each line using the "utf-8" iterator, which yields UTF-8 characters as strings (one by one), or using the "runes" iterator, which yields the UTF-8 characters as Runes (one by one).

Since the file would in fact be read line by line, even if the characters are yielded one by one, this may be considered cheating. So we provide a function and an iterator which read bytes one by one.

import unicode

proc readUtf8(f: File): string =
  ## Return next UTF-8 character as a string.
  while true:
    result.add f.readChar()
    if result.validateUtf8() == -1: break

iterator readUtf8(f: File): string =
  ## Yield successive UTF-8 characters from file "f".
  var res: string
  while not f.endOfFile:
    res.setLen(0)
    while true:
      res.add f.readChar()
      if res.validateUtf8() == -1: break
    yield res

Pascal

(* Read a file char by char *)
program ReadFileByChar;
var
  InputFile, OutputFile: file of char;
  InputChar: char;
begin
  Assign(InputFile, 'testin.txt');
  Reset(InputFile);
  Assign(OutputFile, 'testout.txt');
  Rewrite(OutputFile);
  while not Eof(InputFile) do
  begin
    Read(InputFile, InputChar);
    Write(OutputFile, InputChar)
  end;
  Close(InputFile);
  Close(OutputFile)
end.

Perl

binmode STDOUT, ':utf8';  # so we can print wide chars without warning

open my $fh, "<:encoding(UTF-8)", "input.txt" or die "$!\n";

while (read $fh, my $char, 1) {
    printf "got character $char [U+%04x]\n", ord $char;
}

close $fh;

If the contents of the input.txt file are aă€⼥ then the output would be:

got character a [U+0061]
got character ă [U+0103]
got character € [U+20ac]
got character ⼥ [U+2f25]

Phix

Generally I use utf8_to_utf32() on whole lines when I want to do character-counting.

You can find that routine in builtins/utfconv.e, and here is a modified copy that reads precisely one unicode character from a file. If there is a genuine demand for it, I could easily add this to that file permanently, and document/autoinclude it properly.

constant INVALID_UTF8 = #FFFD

function get_one_utf8_char(integer fn)
-- returns INVALID_UTF8 on error, else a string of 1..4 bytes representing 1 character
object res
integer headb, bytes, c

    -- headb = first byte of utf-8 character:
    headb = getc(fn)
    if headb=-1 then return -1 end if
    res = ""&headb

    -- calculate length of utf-8 character in bytes (1..4):
    if    headb<0           then bytes = 0  -- (utf-8 starts at #0)
    elsif headb<=0b01111111 then bytes = 1  -- 0b_0xxx_xxxx
    elsif headb<=0b10111111 then bytes = 0  -- (it's a tail byte)
    elsif headb<=0b11011111 then bytes = 2  -- 0b_110x_xxxx
    elsif headb<=0b11101111 then bytes = 3  -- 0b_1110_xxxx
    elsif headb<=0b11110100 then bytes = 4  -- 0b_1111_0xzz
    else                         bytes = 0  -- (utf-8 ends at #10FFFF)
    end if

    -- 2..4 bytes encoding (tail range: 0b_1000_0000..0b_1011_1111);
    for j=1 to bytes-1 do                   -- tail bytes are valid?
        c = getc(fn)
        if c<#80 or c>#BF then
            bytes = 0                       -- invalid tail byte or eof
            exit
        end if
        res &= c
    end for

    -- 1 byte encoding (head range: 0b_0000_0000..0b_0111_1111):
    if bytes=1 then
        c = headb                               -- UTF-8 = ASCII

    -- 2 bytes encoding (head range: 0b_1100_0000..0b_1101_1111):
    elsif bytes=2 then
        c = and_bits(headb, #1F)*#40 +          -- 0b110[7..11] headb
            and_bits(res[2], #3F)               -- 0b10[1..6] tail
        if c>#7FF then ?9/0 end if              -- sanity check
        if c<#80 then                           -- long form?
            res = INVALID_UTF8
        end if

    -- 3 bytes encoding (head range: 0b_1110_0000..0b_1110_1111):
    elsif bytes=3 then
        c = and_bits(headb, #0F)*#1000 +        -- 0b1110[13..16] head
            and_bits(res[2], #3F)*#40 +         -- 0b10[7..12] tail
            and_bits(res[3], #3F)               -- 0b10[1..6] tail
        if c>#FFFF then ?9/0 end if             -- sanity check
        if c<#800                               -- long form?
        or (c>=#D800 and c<=#DFFF) then         -- utf-16 incompatible
            res = INVALID_UTF8
        end if

    -- 4 bytes encoding (head range: 0b_1111_0000..0b_1111_0111):
    elsif bytes=4 then
        c = and_bits(headb, #07)*#040000 +      -- 0b11110[19..21] head
            and_bits(res[2], #3F)*#1000 +       -- 0b10[13..18] tail
            and_bits(res[3], #3F)*#0040 +       -- 0b10[7..12] tail
            and_bits(res[4], #3F)               -- 0b10[1..6] tail
        if c<#10000                             -- long form?
        or c>#10FFFF then
            res = INVALID_UTF8                  -- utf-8 ends at #10FFFF
        end if

    -- bytes = 0; current byte is not encoded correctly:
    else
        res = INVALID_UTF8
    end if

    return res
end function

Test code:

--string utf8 = "aă€⼥"  -- (same results as next)
string utf8 = utf32_to_utf8({#0061,#0103,#20ac,#2f25})
printf(1,"length of utf8 is %d bytes\n",length(utf8))
integer fn = open("test.txt","wb")
puts(fn,utf8)
close(fn)
fn = open("test.txt","r")
for i=1 to 5 do
    object res = get_one_utf8_char(fn)
    if string(res) then
        if platform()=LINUX then
            printf(1,"char %d (%s) is %d bytes\n",{i,res,length(res)})
        else
            -- unicode and consoles tricky on windows, so I'm
            -- just avoiding that issue altogether (t)here.
            printf(1,"char %d is %d bytes\n",{i,length(res)})
        end if
    elsif res=-1 then
        printf(1,"char %d - EOF\n",i)
        exit
    else
        printf(1,"char %d - INVALID_UTF8\n",i)
        exit
    end if
end for
close(fn)

length of utf8 is 9 bytes
char 1 is 1 bytes
char 2 is 2 bytes
char 3 is 3 bytes
char 4 is 3 bytes
char 5 - EOF
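
The head-byte classification used by the Phix routine can be sketched in Python as follows. This is an illustrative sketch rather than a port: the names utf8_sequence_length and read_one_utf8_char are made up here, and instead of hand-checking overlong forms and surrogates it relies on Python's strict UTF-8 decoder, which rejects both as well as truncated sequences.

```python
import io


def utf8_sequence_length(head):
    """Return the expected byte count (1..4) for a UTF-8 head byte, or 0 if it cannot start a character."""
    if head <= 0b01111111:
        return 1    # 0xxx_xxxx: ASCII
    if head <= 0b10111111:
        return 0    # 10xx_xxxx: a tail byte, not a head byte
    if head <= 0b11011111:
        return 2    # 110x_xxxx
    if head <= 0b11101111:
        return 3    # 1110_xxxx
    if head <= 0b11110100:
        return 4    # 1111_0xxx, capped because UTF-8 ends at U+10FFFF
    return 0


def read_one_utf8_char(f):
    """Read one character from a binary stream; return None at EOF, raise ValueError on bad input."""
    head = f.read(1)
    if not head:
        return None
    n = utf8_sequence_length(head[0])
    if n == 0:
        raise ValueError("invalid head byte 0x%02X" % head[0])
    data = head + f.read(n - 1)
    # Strict decoding raises UnicodeDecodeError (a ValueError subclass) for
    # bad tail bytes, overlong forms, surrogates and truncated sequences.
    return data.decode("utf-8")


stream = io.BytesIO("aă€⼥".encode("utf-8"))
for ch in iter(lambda: read_one_utf8_char(stream), None):
    print(ch, len(ch.encode("utf-8")))
```

With the same four test characters as above, the byte counts printed are 1, 2, 3 and 3.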

PicoLisp

PicoLisp uses UTF-8 until told otherwise.

(in "wordlist"
   (while (char)
      (process @) ) )

Python

Works with: Python version 2.7

def get_next_character(f):
    # note: assumes valid utf-8
    c = f.read(1)
    while c:
        while True:
            try:
                yield c.decode('utf-8')
            except UnicodeDecodeError:
                # we've encountered a multibyte character
                # read another byte and try again
                c += f.read(1)
            else:
                # c was a valid char, and was yielded, continue
                c = f.read(1)
                break

# Usage:
with open("input.txt", "rb") as f:
    for c in get_next_character(f):
        print(c)

Works with: Python version 3

Python 3 simplifies the handling of text files since you can specify an encoding.

def get_next_character(f):
    """Reads one character from the given textfile"""
    c = f.read(1)
    while c:
        yield c
        c = f.read(1)

# Usage:
with open("input.txt", encoding="utf-8") as f:
    for c in get_next_character(f):
        print(c, sep="", end="")

Racket

Don't we all love self reference?

#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
  (display c))

Output:

#lang racket
; This file contains utf-8 characters: λ, α, γ ...
(for ([c (in-port read-char (open-input-file "read-file.rkt"))])
  (display c))

Raku

(formerly Perl 6)

Raku has a built-in method .getc to get a single character from an open file handle. File handles default to UTF-8, so they will handle multi-byte characters correctly.

To read a single character at a time from the Standard Input terminal, $*IN in Raku:

.say while defined $_ = $*IN.getc;

Or, from a file:

my $filename = 'whatever';

my $in = open( $filename, :r ) orelse .die;

print $_ while defined $_ = $in.getc;

REXX

version 1

REXX doesn't support UTF8 encoded wide characters, just bytes.
The task's requirement stated that EOF was to be returned upon reaching the end-of-file, so this programming example was written as a subroutine (procedure).
Note that the display of characters that may modify screen behavior (such as tabs, backspaces, line feeds, carriage returns, "bells" and others) is suppressed, but their hexadecimal equivalents are displayed.

/*REXX program  reads and displays  a file char by char, returning   'EOF'   when done. */
parse arg iFID .                                 /*iFID:     is the fileID to be read.  */
                                                 /* [↓]  show the file's contents.      */
if iFID\==''  then do j=1  until  x=='EOF'       /*J  counts the file's characters.     */
                   x=getchar(iFID);    y=        /*get a character  or  an 'EOF'.       */
                   if x>>' '   then y=x          /*display   X   if presentable.        */
                   say  right(j, 12)     'character,  (hex,char)'      c2x(x)      y
                   end   /*j*/                   /* [↑]  only display  X  if not low hex*/
exit                                             /*stick a fork in it,  we're all done. */
/*──────────────────────────────────────────────────────────────────────────────────────*/
getchar: procedure;  parse arg z;  if chars(z)==0  then return 'EOF';     return charin(z)

input file: ABC
and was created by the DOS command (under Windows/XP): echo 123 [¬ a prime]> ABC

123 [¬ a prime]

output (for the above [ABC] input file):

           1 character,  (hex,char) 31 1
           2 character,  (hex,char) 32 2
           3 character,  (hex,char) 33 3
           4 character,  (hex,char) 20
           5 character,  (hex,char) 5B [
           6 character,  (hex,char) AA ¬
           7 character,  (hex,char) 20
           8 character,  (hex,char) 61 a
           9 character,  (hex,char) 20
          10 character,  (hex,char) 70 p
          11 character,  (hex,char) 72 r
          12 character,  (hex,char) 69 i
          13 character,  (hex,char) 6D m
          14 character,  (hex,char) 65 e
          15 character,  (hex,char) 5D ]
          16 character,  (hex,char) 0D
          17 character,  (hex,char) 0A
          18 character,  (hex,char) 454F46 EOF End-Of-File.

version 2

/* REXX ---------------------------------------------------------------
* 29.12.2013 Walter Pachl
* read one utf8 character at a time
* see http://de.wikipedia.org/wiki/UTF-8#Kodierung
*--------------------------------------------------------------------*/
oid='utf8.txt';'erase' oid /* first create file containing utf8 chars*/
Call charout oid,'79'x
Call charout oid,'C3A4'x
Call charout oid,'C2AE'x
Call charout oid,'E282AC'x
Call charout oid,'F09D849E'x
Call lineout oid
fid='utf8.txt'             /* then read it and show the contents     */
Do Until c8='EOF'
  c8=get_utf8char(fid)
  Say left(c8,4) c2x(c8)
  End
Exit

get_utf8char: Procedure
  Parse Arg f
  If chars(f)=0 Then
    Return 'EOF'
  c=charin(f)
  b=c2b(c)
  If left(b,1)=0 Then
    Nop
  Else Do
    p=pos('0',b)
    Do i=1 To p-2
      If chars(f)=0 Then Do
        Say 'illegal contents in file' f
        Leave
        End
      c=c||charin(f)
      End
    End
  Return c

c2b: Return x2b(c2x(arg(1)))

output:

y    79
Ã¤   C3A4
Â®   C2AE
â‚¬  E282AC
ð�„ž F09D849E
EOF  454F46
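
The odd-looking characters in this output are what the raw UTF-8 bytes look like when a console shows each byte as a separate character. The Python sketch below reproduces the effect under the assumption that the console used the Windows-1252 code page (the source does not say which code page was active); the helper name as_cp1252 is made up for illustration.

```python
def as_cp1252(text):
    """Encode text as UTF-8, then misread the bytes as Windows-1252 (undefined bytes become U+FFFD)."""
    return text.encode("utf-8").decode("cp1252", errors="replace")


for text in ("y", "ä", "®", "€", "𝄞"):
    raw = text.encode("utf-8")
    # Show the misread characters next to the hex dump of the real bytes.
    print(as_cp1252(text), raw.hex().upper())
```

Each multi-byte character comes out as two to four unrelated characters, while the hex column still shows the correct UTF-8 byte sequence.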

Ring

fp = fopen("C:\Ring\ReadMe.txt","r")
r = fgetc(fp)
while isstring(r)
      r = fgetc(fp)
      see r
end
fclose(fp)

Output:

==================================================
The Ring Programming Language
http://ring-lang.net/
Version 1.0
Release Date : Jan 25, 2016
Update Date : March 27, 2016
===================================================
Binary release for Microsoft Windows
===================================================

Run Start.bat to open Ring Notepad then start learning from the documentation

Join Ring Group for questions
https://groups.google.com/forum/#!forum/ring-lang

Greetings,
Mahmoud Fayed
[email protected]
http://www.facebook.com/mahmoudfayed1986

Ruby

Works with: Ruby version 1.9

File.open('input.txt', 'r:utf-8') do |f|
  f.each_char { |c| p c }
end

File.open('input.txt', 'r:utf-8') do |f|
  while c = f.getc
    p c
  end
end

Run BASIC

open "file.txt" for binary as #f
numChars = 1              ' specify number of characters to read
a$ = input$(#f,numChars)  ' read number of characters specified
b$ = input$(#f,1)         ' read one character
close #f

Rust

The Rust standard library provides hardly any straightforward way to read single UTF-8 characters from a file. The following code implements an iterator that consumes a byte stream, taking only as many bytes as necessary to decode the next UTF-8 character. It provides quite a complete error report, so that the client code can leverage it to deal with corrupted input.

The decoding code was originally based on the utf8-decode crate.

use std::{
    convert::TryFrom,
    fmt::{Debug, Display, Formatter},
    io::Read,
};

pub struct ReadUtf8<I: Iterator> {
    source: std::iter::Peekable<I>,
}

impl<R: Read> From<R> for ReadUtf8<std::io::Bytes<R>> {
    fn from(source: R) -> Self {
        ReadUtf8 {
            source: source.bytes().peekable(),
        }
    }
}

impl<I, E> Iterator for ReadUtf8<I>
where
    I: Iterator<Item = Result<u8, E>>,
{
    type Item = Result<char, Error<E>>;

    fn next(&mut self) -> Option<Self::Item> {
        self.source.next().map(|next| match next {
            Ok(lead) => self.complete_char(lead),
            Err(e) => Err(Error::SourceError(e)),
        })
    }
}

impl<I, E> ReadUtf8<I>
where
    I: Iterator<Item = Result<u8, E>>,
{
    fn continuation(&mut self) -> Result<u32, Error<E>> {
        if let Some(Ok(byte)) = self.source.peek() {
            let byte = *byte;

            return if byte & 0b1100_0000 == 0b1000_0000 {
                self.source.next();
                Ok((byte & 0b0011_1111) as u32)
            } else {
                Err(Error::InvalidByte(byte))
            };
        }

        match self.source.next() {
            None => Err(Error::InputTruncated),
            Some(Err(e)) => Err(Error::SourceError(e)),
            Some(Ok(_)) => unreachable!(),
        }
    }

    fn complete_char(&mut self, lead: u8) -> Result<char, Error<E>> {
        let a = lead as u32; // Let's name the bytes in the sequence

        let result = if a & 0b1000_0000 == 0 {
            Ok(a)
        } else if lead & 0b1110_0000 == 0b1100_0000 {
            let b = self.continuation()?;
            Ok((a & 0b0001_1111) << 6 | b)
        } else if a & 0b1111_0000 == 0b1110_0000 {
            let b = self.continuation()?;
            let c = self.continuation()?;
            Ok((a & 0b0000_1111) << 12 | b << 6 | c)
        } else if a & 0b1111_1000 == 0b1111_0000 {
            let b = self.continuation()?;
            let c = self.continuation()?;
            let d = self.continuation()?;
            Ok((a & 0b0000_0111) << 18 | b << 12 | c << 6 | d)
        } else {
            Err(Error::InvalidByte(lead))
        };

        Ok(char::try_from(result?).unwrap())
    }
}

#[derive(Debug, Clone)]
pub enum Error<E> {
    InvalidByte(u8),
    InputTruncated,
    SourceError(E),
}

impl<E: Display> Display for Error<E> {
    fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
        match self {
            Self::InvalidByte(b) => write!(f, "invalid byte 0x{:x}", b),
            Self::InputTruncated => write!(f, "character truncated"),
            Self::SourceError(e) => e.fmt(f),
        }
    }
}

fn main() -> std::io::Result<()> {
    for (index, value) in ReadUtf8::from(std::fs::File::open("test.txt")?).enumerate() {
        match value {
            Ok(c) => print!("{}", c),

            Err(e) => {
                print!("\u{fffd}");
                eprintln!("offset {}: {}", index, e);
            }
        }
    }

    Ok(())
}

Seed7

The library utf8.s7i provides the functions openUtf8 and getc. When a file has been opened with openUtf8, the function getc reads UTF-8 characters from the file. To allow writing Unicode characters to standard output, the file STD_UTF8_OUT is used.

$ include "seed7_05.s7i";
  include "utf8.s7i";

const proc: main is func
  local
    var file: inFile is STD_NULL;
    var char: ch is ' ';
  begin
    OUT := STD_UTF8_OUT;
    inFile := openUtf8("readAFileCharacterByCharacterUtf8.in", "r");
    if inFile <> STD_NULL then
      while hasNext(inFile) do
        ch := getc(inFile);
        writeln("got character " <& ch <& " [U+" <& ord(ch) radix 16 <& "]");
      end while;
      close(inFile);
    end if;
  end func;

When the input file readAFileCharacterByCharacterUtf8.in contains the characters aă€⼥ the output is:

got character a [U+61]
got character ă [U+103]
got character € [U+20ac]
got character ⼥ [U+2f25]

Sidef

var file = File('input.txt')        # the input file contains: "aă€⼥"
var fh = file.open_r                # equivalent with: file.open('<:utf8')
fh.each_char { |char|
    printf("got character #{char} [U+%04x]\n", char.ord)
}

got character a [U+0061]
got character ă [U+0103]
got character € [U+20ac]
got character ⼥ [U+2f25]

Smalltalk

|utfStream|
utfStream := 'input' asFilename readStream asUTF8EncodedStream.
[utfStream atEnd] whileFalse:[
    Transcript showCR:'got char ',utfStream next.
].
utfStream close.

Tcl

To read a single character from a file, use:

set ch [read $channel 1]

This will read multiple bytes sufficient to obtain a Unicode character if a suitable encoding has been configured on the channel. For binary channels, this will always consume exactly one byte. However, the low-level channel buffering logic may consume more than one byte (which only really matters where the channel is being handed on to another process and the channel is over a file descriptor that doesn't support the lseek OS call); the extent of buffering can be controlled via:

fconfigure $channel -buffersize $byteCount

When the channel is only being accessed from Tcl (or via Tcl's C API) it is not normally necessary to adjust this option.
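
The distinction between an encoded channel and a binary channel can be illustrated with a rough Python analogue (a sketch only, not Tcl behaviour): a file opened in text mode with an encoding returns a whole character from read(1), while the same file opened in binary mode returns a single byte. The temporary file and its contents are made up for the example.

```python
import os
import tempfile

# Create a small UTF-8 file whose first character (the euro sign) occupies three bytes.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write("€1".encode("utf-8"))
    path = tmp.name

# Text mode with an explicit encoding: read(1) returns one whole character.
with open(path, encoding="utf-8") as text_file:
    first_char = text_file.read(1)

# Binary mode: read(1) returns exactly one byte.
with open(path, "rb") as binary_file:
    first_byte = binary_file.read(1)

os.remove(path)
print(repr(first_char), repr(first_byte))  # prints '€' b'\xe2'
```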

Wren

import "io" for File

File.open("input.txt") { |file|
    var offset = 0
    var char = "" // stores each byte read till we have a complete UTF encoded character
    while (true) {
        var b = file.readBytes(1, offset)
        if (b == "") return // end of stream
        char = char + b
        if (char.codePoints[0] >= 0) { // a UTF encoded character is complete
            System.write(char) // print it
            char = "" // reset store
        }
        offset = offset + 1
    }
}

zkl

zkl doesn't know much about UTF-8 or Unicode but is able to test whether a string or number is valid UTF-8 or not. This code uses that to build a state machine to decode a byte stream into UTF-8 characters.

fcn readUTF8c(chr,s=""){ // transform UTF-8 character stream
   s+=chr;
   try{ s.len(8); return(s) }
   catch{ if(s.len()>6) throw(__exception) } // 6 bytes max for UTF-8
   return(Void.Again,s);  // call me again with s & another character
}

Used to modify a zkl iterator, it can consume any stream-able (files, strings, lists, etc) and provides support for foreach, map, look ahead, push back, etc.

fcn utf8Walker(obj){
   obj.walker(3)  // read characters
   .tweak(readUTF8c)
}

s:="-->\u20AC123";  // --> e2,82,ac,31,32,33 == -->€123
utf8Walker(s).walk().println();

w:=utf8Walker(Data(Void,s,"\n")); // Data is a byte bucket
foreach c in (utf8Walker(Data(Void,s,"\n"))){ print(c) }

utf8Walker(Data(Void,0xe2,0x82,"123456")).walk().println(); // € is short one byte

L("-","-",">","€","1","2","3")
-->€123
VM#1 caught this unhandled exception:
   ValueError : Invalid UTF-8 string

If you wish to push a UTF-8 stream through one or more functions, you can use the same state machine:

stream:=Data(Void,s,"\n").howza(3); // character stream
stream.pump(List,readUTF8c,"print")

-->€123

and returns a list of the eight UTF-8 characters (with newline). Or, if file "foo.txt" contains the characters:

File("foo.txt","rb").howza(3).pump(List,readUTF8c,"print");

produces the same result.
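
The accumulate-bytes-until-valid state machine can also be sketched in Python, where the standard library's incremental UTF-8 decoder keeps the partial-character state. This is an illustrative sketch under that substitution, not a translation of the zkl code; the generator name utf8_chars is made up.

```python
import codecs
import io


def utf8_chars(byte_stream):
    """Yield one decoded character at a time, reading the stream one byte at a time."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    while True:
        b = byte_stream.read(1)
        if not b:
            # final=True raises UnicodeDecodeError if the stream ended mid-character
            decoder.decode(b"", final=True)
            return
        ch = decoder.decode(b)  # empty string until a full character has arrived
        if ch:
            yield ch


data = "-->\u20AC123".encode("utf-8")  # same sample string as above
print(list(utf8_chars(io.BytesIO(data))))
# prints ['-', '-', '>', '€', '1', '2', '3']
```

Feeding it the truncated sequence 0xE2 0x82 followed by "123456" raises UnicodeDecodeError, which mirrors the "Invalid UTF-8 string" exception shown in the zkl output.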


Source: http://www.rosettacode.org/wiki/Read_a_file_character_by_character/UTF8