Recently I compared Unicode handling in Rust, Swift and Go out of my own curiosity. I'm sharing it here in the hope that someone finds it useful:

Get bytes representing utf8-encoding of string
Only ASCII characters map 1:1 to their utf8-encoding. Everything else expands to multiple bytes.
https://en.wikipedia.org/wiki/UTF-8#Description
Rust
line.bytes()
Swift
line.utf8
Go
line // slice of bytes
// assuming line is valid utf8, which is not enforced
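As a quick illustration of the byte expansion (a runnable sketch in Go; the sample string is arbitrary):

```go
package main

import "fmt"

func main() {
	line := "héllo" // "é" is U+00E9, which UTF-8 encodes as two bytes
	fmt.Println(len(line)) // len counts bytes, not characters: 6, not 5
	for i := 0; i < len(line); i++ {
		fmt.Printf("%#x ", line[i]) // 0x68 0xc3 0xa9 0x6c 0x6c 0x6f
	}
	fmt.Println()
}
```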
Get Unicode codepoints of string
Most characters and emojis consist of a single codepoint, but some are made up of multiple codepoints.
Unless it is guaranteed that the text contains only single-codepoint characters, this is not a safe way to iterate over what users would consider characters.
A codepoint is a value in the range U+0000 to U+10FFFF, usually stored internally as a 4-byte u32 or i32 but exposed through different APIs (Rust's char, Swift's Unicode.Scalar, Go's rune).
Rust
line.chars()
// https://doc.rust-lang.org/std/primitive.char.html
Swift
line.unicodeScalars
// https://developer.apple.com/documentation/swift/unicode/scalar
Go
[]rune(line)
// or iterate with range
for index, runeValue := range line {
fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}
// https://go.dev/blog/strings
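To see why codepoint iteration can split a user-perceived character, here is a small runnable sketch in Go (the sample string is arbitrary):

```go
package main

import "fmt"

func main() {
	// "a̐" looks like one character but is two codepoints:
	// U+0061 (a) followed by U+0310 (combining candrabindu).
	line := "a\u0310"
	runes := []rune(line)
	fmt.Println(len(runes)) // 2 codepoints, though readers see one character
	for _, r := range runes {
		fmt.Printf("%#U\n", r) // U+0061 'a', then U+0310
	}
}
```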
Get extended grapheme clusters of string
What a reader would actually consider to be a character. E.g., this character consists of two codepoints but is one grapheme cluster: a̐
Rust
use unicode_segmentation::UnicodeSegmentation;
line.graphemes(true)
Swift
for ch in line {
print(ch)
}
// This is the default view - just iterate over string (or map, filter etc.)
// In Swift, a `Character` is a grapheme cluster.
// https://developer.apple.com/documentation/swift/string#Accessing-String-Elements
Go
// https://pkg.go.dev/github.com/rivo/uniseg
Normalize strings
A character like é can be represented in different forms: either as one codepoint (U+00E9) or as a combination of e + ◌́ (U+0065, U+0301).
Some characters are defined multiple times with different names: Ω can be found as "greek capital letter omega" (U+03A9) and as "ohm sign" (U+2126).
Normalization converts a string to use only one of those forms and is required to compare strings consistently.
Rust
use unicode_normalization::UnicodeNormalization;
line.nfc()
line.nfd()
Swift
line.precomposedStringWithCanonicalMapping
line.decomposedStringWithCanonicalMapping
Go
// https://pkg.go.dev/golang.org/x/text/unicode/norm
Remove diacritics
This can be considered a destructive form of normalization, which can be useful in some cases.
Rust
use diacritics::remove_diacritics;
remove_diacritics(line)
Swift
line.applyingTransform(.stripDiacritics, reverse: false)
// and others to transform between alphabets etc.
// https://developer.apple.com/documentation/Foundation/StringTransform
You probably want ICU4X if you're working with Unicode in Rust. It's fast, has a tolerable overhead, and its lead developers have experience doing i18n work at Mozilla and Google and are involved with the Unicode Consortium.
Worth noting that the addition of the interlinear annotation characters was quite controversial, with many commenting that this simply is not plain text and as such does not belong in Unicode. I'm not clear on how it made it in anyway, but it sure seems like the Unicode Consortium now somewhat agrees, as while they haven't formally deprecated the characters, they have kind of discouraged their use.
Previous discussion: https://news.ycombinator.com/item?id=13149705
And don't miss [this comment](https://news.ycombinator.com/item?id=13149912). The future is now!
Superscript:
Lowercase: ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ
Uppercase: ᴬᴮᴰᴱᴳᴴᴵᴶᴷᴸᴹᴺᴼᴾᴿᵀᵁⱽᵂ
No lowercase q, and no uppercase C, F, Q, S, X, Y, or Z. And depending on the font, it might be worse.