Golang UTF8 Package – Text Encoding


In this blog, we will learn about the Golang UTF8 Package and Character Encoding in Programming Languages.

Go's unicode/utf8 package provides several useful functions for querying and manipulating strings and byte slices ([]byte) that hold UTF-8 encoded text.

First of all, let's understand the difference between ASCII and UTF-8 encoding.

ASCII vs UTF8 Encoding

In the early days of computing, 128 characters were considered enough, so each character was given a 7-bit code. When such a code is stored in an 8-bit byte, the leading bit is 0 (historically it was left unused or used as a parity bit).

2^7 = 128

This Encoding was called ASCII (American Standard Code for Information Interchange).

ASCII Example:

Capital A has an ASCII value of 65. Its 7-bit binary representation is 1000001; stored in a full byte, it is:

01000001
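The same value can be checked from Go. This is only a small sketch; isASCII is a hypothetical helper written for this post, not part of any package.

```go
package main

import "fmt"

// isASCII reports whether b fits in 7 bits, i.e. lies in the ASCII range 0-127.
// (Hypothetical helper for illustration.)
func isASCII(b byte) bool {
	return b < 0x80
}

func main() {
	c := byte('A')
	// Print the character, its decimal value, and its 8-bit binary form.
	fmt.Printf("%c = %d = %08b (ASCII: %v)\n", c, c, c, isASCII(c))
	// A = 65 = 01000001 (ASCII: true)
}
```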

Today we need far more characters: people write text, and even code, in many languages other than English. How is that possible?

Unicode is a standard that assigns a number (a code point) to almost every character used in the world. UTF-8 (8-bit Unicode Transformation Format), defined by the Unicode Standard, is a character encoding that can represent all 1,112,064 valid Unicode code points.

UTF-8 was designed by Ken Thompson and Rob Pike, who are also among the creators of the Go programming language. This is one reason why Go source code is UTF-8 text.

UTF-8 is a variable-width character encoding that uses one to four bytes per character. It is backward compatible with ASCII: the 128 ASCII characters keep their values and take a single byte each, so any ASCII text is already valid UTF-8.

All other characters take two to four bytes.
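To see the variable width in practice, here is a minimal sketch that uses utf8.RuneLen from the standard library to report the encoded size of one character from each width class. The widths helper and the sample characters are my own choices for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// widths returns the UTF-8 encoded length, in bytes, of each rune.
// (Hypothetical helper for illustration.)
func widths(rs ...rune) []int {
	out := make([]int, len(rs))
	for i, r := range rs {
		out[i] = utf8.RuneLen(r)
	}
	return out
}

func main() {
	// 'A', e-acute (U+00E9), Devanagari KA (U+0915), an emoji (U+1F600).
	rs := []rune{'A', '\u00E9', '\u0915', '\U0001F600'}
	for i, w := range widths(rs...) {
		fmt.Printf("U+%04X: %d byte(s)\n", rs[i], w)
	}
	// U+0041: 1 byte(s)
	// U+00E9: 2 byte(s)
	// U+0915: 3 byte(s)
	// U+1F600: 4 byte(s)
}
```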

Characters that take two or more bytes share a pattern. The first byte starts with as many 1 bits as there are bytes in the encoding, followed by a 0. For example, the first byte of a two-byte character looks like this:

Byte1 = 110xxxxx

Every following byte (a continuation byte) starts with 10:

Byte2 = 10xxxxxx
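The prefix rule can be sketched as a small classifier. byteKind is a hypothetical helper that looks only at the leading bits of a byte; it does not check whether the whole sequence is valid UTF-8 (a few byte values that match a lead pattern, such as 0xC0, never occur in valid text).

```go
package main

import "fmt"

// byteKind classifies a single byte by its leading bits only.
// (Hypothetical helper for illustration; it is not a validator.)
func byteKind(b byte) string {
	switch {
	case b&0x80 == 0x00: // 0xxxxxxx
		return "ASCII (1-byte character)"
	case b&0xC0 == 0x80: // 10xxxxxx
		return "continuation byte"
	case b&0xE0 == 0xC0: // 110xxxxx
		return "first byte of a 2-byte character"
	case b&0xF0 == 0xE0: // 1110xxxx
		return "first byte of a 3-byte character"
	case b&0xF8 == 0xF0: // 11110xxx
		return "first byte of a 4-byte character"
	}
	return "not used in UTF-8"
}

func main() {
	// Walk over the raw bytes of a short mixed string.
	for _, b := range []byte("A न") {
		fmt.Printf("%08b  %s\n", b, byteKind(b))
	}
}
```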

Visit the Wikipedia page on UTF-8 to learn more about the encoding.

Devanagari (Hindi) UTF8 Code

In this blog, I will use examples in both Hindi and English. For reference, see the Devanagari Unicode code chart.

Golang UTF8 Package DecodeRune

func DecodeRune(p []byte) (r rune, size int)

DecodeRune decodes the first UTF-8 encoded character in the byte slice p and returns it as a rune, together with the number of bytes its encoding takes.

If p is empty, DecodeRune returns (RuneError, 0). If the encoding is invalid, it returns (RuneError, 1).
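The two error cases can be told apart by the returned size. The sketch below assumes a hypothetical helper, describe, that labels each outcome. Note that a correctly encoded U+FFFD character also decodes to RuneError, but with a size of 3, which is why the size has to be checked as well.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe decodes the first rune of p and labels the result.
// (Hypothetical helper for illustration.)
func describe(p []byte) string {
	r, size := utf8.DecodeRune(p)
	switch {
	case r == utf8.RuneError && size == 0:
		return "empty input"
	case r == utf8.RuneError && size == 1:
		return "invalid encoding"
	}
	return fmt.Sprintf("U+%04X (%d bytes)", r, size)
}

func main() {
	fmt.Println(describe(nil))                 // empty input
	fmt.Println(describe([]byte{0xff}))        // invalid encoding
	fmt.Println(describe([]byte{0xE0, 0xA4}))  // invalid encoding (truncated)
	fmt.Println(describe([]byte("\u0928")))    // U+0928 (3 bytes)
}
```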

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	str := []byte("नमस्ते दुनिया") // Hello World in Hindi

	for len(str) > 0 { // loop until every rune has been consumed
		r, size := utf8.DecodeRune(str)
		fmt.Printf("%c %v bytes\n", r, size)

		str = str[size:]
	}
}

Output (the line that shows only "1 bytes" is the space between the two words):

न 3 bytes
म 3 bytes
स 3 bytes
् 3 bytes
त 3 bytes
े 3 bytes
1 bytes
द 3 bytes
ु 3 bytes
न 3 bytes
ि 3 bytes
य 3 bytes
ा 3 bytes

DecodeRuneInString

func DecodeRuneInString(s string) (r rune, size int)

The DecodeRuneInString function is like DecodeRune, but its input is a string.
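As a short sketch of how this can be used, the hypothetical helper splitFirst below peels the first character off a string without cutting through its bytes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// splitFirst returns the first rune of s and the rest of the string.
// For an empty string it returns (utf8.RuneError, "").
// (Hypothetical helper for illustration.)
func splitFirst(s string) (rune, string) {
	r, size := utf8.DecodeRuneInString(s)
	return r, s[size:]
}

func main() {
	r, rest := splitFirst("नमस्ते")
	fmt.Printf("first: %c, rest: %s\n", r, rest)
}
```

A for ... range loop over a string performs the same decoding for you; calling DecodeRuneInString directly is mainly useful when you need the size or want to stop early.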

Golang UTF8 Package DecodeLastRune

func DecodeLastRune(p []byte) (r rune, size int)

The DecodeLastRune function decodes the last UTF-8 encoded character in the byte slice p and returns it as a rune, together with the number of bytes its encoding takes.

If p is empty, DecodeLastRune returns (RuneError, 0).

Otherwise, if the encoding is invalid, the function returns (RuneError, 1).

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	str := []byte("नमस्ते दुनिया") // Hello World in Hindi

	for len(str) > 0 { // loop until every rune has been consumed
		r, size := utf8.DecodeLastRune(str)
		fmt.Printf("UTF8 Code: %v %c %v bytes\n", r, r, size)

		str = str[:len(str)-size]
	}
}

Output (the number printed is the rune's Unicode code point in decimal):

UTF8 Code: 2366 ा 3 bytes
UTF8 Code: 2351 य 3 bytes
UTF8 Code: 2367 ि 3 bytes
UTF8 Code: 2344 न 3 bytes
UTF8 Code: 2369 ु 3 bytes
UTF8 Code: 2342 द 3 bytes
UTF8 Code: 32   1 bytes
UTF8 Code: 2375 े 3 bytes
UTF8 Code: 2340 त 3 bytes
UTF8 Code: 2381 ् 3 bytes
UTF8 Code: 2360 स 3 bytes
UTF8 Code: 2350 म 3 bytes
UTF8 Code: 2344 न 3 bytes

DecodeLastRuneInString

func DecodeLastRuneInString(s string) (r rune, size int)

The DecodeLastRuneInString function is like DecodeLastRune, but its input is a string.

Example:

	str := "नमस्ते दुनिया" // Hello World in Hindi

	for len(str) > 0 {
		r, size := utf8.DecodeLastRuneInString(str)
		fmt.Printf("UTF8 Code: %v %c %v bytes\n", r, r, size)

		str = str[:len(str)-size]
	}

Output:

UTF8 Code: 2366 ा 3 bytes
UTF8 Code: 2351 य 3 bytes
UTF8 Code: 2367 ि 3 bytes
UTF8 Code: 2344 न 3 bytes
UTF8 Code: 2369 ु 3 bytes
UTF8 Code: 2342 द 3 bytes
UTF8 Code: 32   1 bytes
UTF8 Code: 2375 े 3 bytes
UTF8 Code: 2340 त 3 bytes
UTF8 Code: 2381 ् 3 bytes
UTF8 Code: 2360 स 3 bytes
UTF8 Code: 2350 म 3 bytes
UTF8 Code: 2344 न 3 bytes
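One practical use of decoding from the end is a "backspace" that removes the last character without splitting a multi-byte encoding. The helper dropLastRune below is a hypothetical sketch of that idea. It removes one rune, which is not always one visible character: Devanagari vowel signs such as ा are separate runes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// dropLastRune returns s without its final rune.
// An empty string is returned unchanged.
// (Hypothetical helper for illustration.)
func dropLastRune(s string) string {
	_, size := utf8.DecodeLastRuneInString(s)
	return s[:len(s)-size]
}

func main() {
	s := "दुनिया"
	fmt.Println(len(s), len(dropLastRune(s))) // 18 15
}
```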

Golang UTF8 Package EncodeRune

func EncodeRune(b []byte, r rune) int

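The EncodeRune function writes the UTF-8 encoding of the rune r into the byte slice b and returns the number of bytes written. The slice must be large enough to hold the encoding; utf8.UTFMax (4) bytes is always sufficient.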

Example:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	r := 'क' // DEVANAGARI LETTER KA
	b := make([]byte, 3)

	n := utf8.EncodeRune(b, r)

	// Println adds the space between operands and a trailing newline.
	fmt.Println("Byte Array :", b)
	fmt.Println("Number of Bytes Written:", n)
}

Output:

Byte Array : [224 164 149]
Number of Bytes Written: 3

Explanation:

The Hindi letter क (DEVANAGARI LETTER KA) takes 3 bytes in UTF-8. The output shows those bytes. Let's dive deeper into the binary to see how the letter is encoded.

Binary of 224:

11100000

Binary of 164:

10100100

Binary of 149:

10010101

According to the UTF-8 encoding rule, the first byte of a multi-byte character starts with as many 1 bits as the character has bytes, followed by a 0: here that prefix is 1110, for three bytes. Each of the remaining bytes starts with 10. The bits after these prefixes carry the code point itself.
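To make the rule concrete, the sketch below strips the prefixes from the three bytes by hand and joins the remaining bits (0000, 100100, 010101) back into a code point. decode3 is a hypothetical helper that does no validation; real code should use utf8.DecodeRune.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decode3 rebuilds a code point from a 3-byte UTF-8 sequence by hand.
// It masks off the 1110 and 10 prefixes and concatenates the payload bits.
// (Hypothetical helper for illustration; no validation is performed.)
func decode3(b0, b1, b2 byte) rune {
	return rune(b0&0x0F)<<12 | rune(b1&0x3F)<<6 | rune(b2&0x3F)
}

func main() {
	r := decode3(224, 164, 149)
	fmt.Printf("%c U+%04X %d\n", r, r, r) // क U+0915 2325

	// Cross-check against the standard library.
	std, _ := utf8.DecodeRune([]byte{224, 164, 149})
	fmt.Println(r == std) // true
}
```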

Golang UTF8 RuneCount

func RuneCount(b []byte) int

The RuneCount function returns the number of runes in a byte slice. Invalid or truncated encodings are counted as single runes of width 1 byte.

Example:

func main() {
	b := []byte("Hello, दुनिया") // World in Hindi
	fmt.Println(b)
	fmt.Println("bytes =", len(b))
	fmt.Println("runes =", utf8.RuneCount(b))
}

Output:

[72 101 108 108 111 44 32 224 164 166 224 165 129 224 164 168 224 164 191 224 164 175 224 164 190]
bytes = 25
runes = 13

In the output, the byte slice has 25 elements, but the text contains only 13 runes.

The first seven bytes, 72 through 32, hold the ASCII text "Hello, " (including the space). The remaining 18 bytes hold the UTF-8 encoding of the Hindi word.

The slice is this long because each of the six Devanagari runes takes 3 bytes to encode.

Golang UTF8 RuneCountInString

func RuneCountInString(s string) (n int)

The RuneCountInString function is like RuneCount, but its input is a string.
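A common pitfall is that len on a string counts bytes, not characters. The sketch below puts the two numbers side by side; counts is a hypothetical helper written for this post.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the length of s in bytes and in runes.
// (Hypothetical helper for illustration.)
func counts(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	b, r := counts("Hello, दुनिया")
	fmt.Println("bytes =", b) // bytes = 25
	fmt.Println("runes =", r) // runes = 13
}
```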

Golang UTF8 Valid

func Valid(b []byte) bool

The Valid function returns true if the byte slice consists entirely of valid UTF-8 encoded runes, and false otherwise.

func main() {
	valid := []byte("Hello, दुनिया") // World in Hindi
	invalid := []byte{0xff, 0xfe, 0xfd}

	fmt.Println(utf8.Valid(valid))
	fmt.Println(utf8.Valid(invalid))
}

Output:

true
false

ValidString works the same way as Valid, but takes a string as input. ValidRune takes a single rune and reports whether it can be legally encoded as UTF-8; runes that are out of range or a surrogate half are not valid.
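Here is a short sketch of both functions. It also includes a hypothetical helper, firstInvalid, that is not part of the package: it shows one way to find where a string stops being valid, using the (RuneError, 1) result described earlier.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstInvalid returns the byte index of the first invalid UTF-8
// sequence in s, or -1 if s is entirely valid.
// (Hypothetical helper for illustration.)
func firstInvalid(s string) int {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		if r == utf8.RuneError && size == 1 {
			return i
		}
		i += size
	}
	return -1
}

func main() {
	fmt.Println(utf8.ValidString("Hello, दुनिया")) // true
	fmt.Println(utf8.ValidString("a\xffb"))        // false
	fmt.Println(utf8.ValidRune('A'))               // true
	fmt.Println(utf8.ValidRune(0xD800))            // false: surrogate half
	fmt.Println(utf8.ValidRune(0x110000))          // false: beyond U+10FFFF
	fmt.Println(firstInvalid("a\xffb"))            // 1
}
```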

Hope you like it!

Also, read Why Golang is called the future of Server-side language?

Learn more about Golang UTF8 Package from the official Documentation.
