Strings

  • A String in Swift is a collection of Character values,
  • where a Character is what a human reader of a text would perceive as a single character, regardless of how many Unicode scalars it’s composed of.
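    • In other words, whatever a person perceives as one character counts as one character, no matter how much storage it technically takes up.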

The cost:

  • Why can’t I write str[999] to access a string’s 1,000th character?
  • Why doesn’t str[idx+1] get the next character?
  • Why can’t I loop over a range of Character values such as "a"..."z"?
  • String does not support random access,
    • i.e. jumping to an arbitrary character is not an O(1) operation.
    • It can’t be — when characters have variable width, the string doesn’t know where the nth character is stored without looking at all the characters that come before it.
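
A minimal sketch of what you end up writing instead of the three things above. The sample string and variable names are made up for illustration, and the ASCII-based loop is just one possible workaround:

```swift
let str = "Pok\u{00E9}mon"

// `str[3]` does not compile: String has no Int subscript.
// Walk to the character instead; this is O(n) in the offset.
let fourthIndex = str.index(str.startIndex, offsetBy: 3)
let fourth = str[fourthIndex]                    // "é"

// `fourthIndex + 1` does not compile either; ask the string for the next index.
let fifth = str[str.index(after: fourthIndex)]   // "m"

// Character is not Strideable, so "a"..."z" cannot be iterated directly.
// One workaround: iterate the ASCII values and convert each one back.
let lowercase = (UInt8(ascii: "a")...UInt8(ascii: "z"))
    .map { Character(Unicode.Scalar($0)) }
```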

Unicode

  • ASCII strings were a sequence of integers between 0 and 127.
    • If you stored them in an 8-bit byte, you even had a bit to spare!
    • Because every character had a fixed size, ASCII strings could be random access.
  • ISO 8859 takes the extra bit and defines 16 different encodings above the ASCII range, such as:
    • Part 1 (ISO 8859-1, aka Latin-1), covering several Western European languages (not enough for Vietnamese, which uses a large number of diacritics);
    • Part 5, covering languages that use the Cyrillic alphabet.
    • Part 6 (Latin/Arabic) (not enough)
    • Part 7 (Latin/Greek)
    • Part 9 (Turkish)

When you run out of room with a fixed-width encoding, you have a choice:

  • either increase the size,
  • or switch to variable-width encoding.

Some terms:

  • The basic building block of Unicode is the code point: an integer value in the Unicode code space,
    • which ranges from 0 to 0x10FFFF (in decimal notation: 1,114,111).
  • As of Unicode 12.1 (published in May 2019), only about 138,000 of the 1.1 million available code points are in use,
    • so there’s a lot of room for more emoji.
    • Code points are commonly written in hex notation with a “U+” prefix.
      • For example, the euro sign is at code point U+20AC (or 8364 in decimal).
  • Unicode scalars are almost, but not quite, the same as code points.
    • They’re all the code points except the 2,048 surrogate code points in the range 0xD800 to 0xDFFF
      • (which are used by the UTF-16 encoding to represent code points greater than 65,535).
    • Scalars are represented in Swift string literals as "\u{xxxx}", where xxxx represents hex digits.
    • So the euro sign can be written in Swift as either "€" or "\u{20AC}".
      • The corresponding Swift type is Unicode.Scalar, which is a wrapper around a UInt32 value.
  • The same Unicode data (i.e. a sequence of scalars) can be encoded with different encodings, UTF-8 and UTF-16 being the most common ones.
    • The smallest entity in an encoding is called a code unit.
    • The UTF-8 encoding has 8-bit-wide code units and UTF-16 has 16-bit-wide code units.
    • UTF-8 is backward compatible with ASCII, which is why it has largely replaced ASCII.
    • Code units are different from code points or scalars because a single scalar is often encoded with multiple code units.
    • UTF-8 takes one to four code units (one to four bytes) to encode a single scalar, whereas UTF-16 takes either one or two code units (two or four bytes).
    • Swift represents UTF-8 and UTF-16 code units as UInt8 and UInt16 values, respectively (aliased as Unicode.UTF8.CodeUnit and Unicode.UTF16.CodeUnit).
  • To represent each scalar by a single code unit, you’d need a 21-bit encoding scheme, which usually gets rounded up to 32-bit and is called UTF-32.
    • This is what Unicode.Scalar does in Swift.
    • Unicode is still a variable-width format when it comes to “characters.” (What the eye sees on screen as a single character may need several scalars.)
    • The Unicode term for such a user-perceived character is (extended) grapheme cluster.
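
To make the terms above concrete, here is a small sketch using the euro sign from the notes. The variable names are illustrative only:

```swift
let euro = "\u{20AC}"                        // "€"

// One scalar, whose numeric value is the code point.
let scalars = Array(euro.unicodeScalars)
let codePoint = scalars[0].value             // 0x20AC (8364)

// The same scalar takes a different number of code units per encoding.
let utf8Units = Array(euro.utf8)             // [0xE2, 0x82, 0xAC]
let utf16Units = Array(euro.utf16)           // [0x20AC]

// Surrogate code points are not scalars, so the failable initializer returns nil.
let surrogate = Unicode.Scalar(0xD800 as UInt32)   // nil
let valid = Unicode.Scalar(0x20AC as UInt32)       // the scalar for "€"
```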

Grapheme Clusters and Canonical Equivalence

Combining Marks

A quick way to see how String handles Unicode data is to look at the two different ways to write é.

  • Unicode defines U+00E9, Latin small letter e with acute, as a single value.
  • But you can also write it as the plain letter e, followed by U+0301, combining acute accent.
  • A string built either way, such as “résumé”, looks exactly the same to the user.
    • Its length is 6 in both cases.
    • They would be what the Unicode specification describes as canonically equivalent.
import Foundation // needed for the NSString part below

let single = "Pok\u{00E9}mon" // Pokémon
let double = "Poke\u{0301}mon" // Pokémon

single.count // 7
double.count // 7

single == double // true

// The difference shows up at the Unicode scalar level
single.unicodeScalars.count // 7
double.unicodeScalars.count // 8

// Objective-C's NSString (in Foundation) doesn't apply canonical equivalence, so the two differ:
let nssingle = single as NSString
nssingle.length // 7
let nsdouble = double as NSString
nsdouble.length // 8
nssingle == nsdouble // false


  • In the case of NSString, this will do a literal comparison on the level of UTF-16 code units rather than one accounting for equivalent but differently composed characters.
  • Most string APIs in other languages work this way too.
  • If you really want to perform a canonical comparison of two NSStrings, you must use NSString.compare(_:) (rather than ==).
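
A short sketch of the compare(_:) route, reusing the two Pokémon spellings. It assumes Foundation's default (non-literal) comparison behavior; the variable names are made up:

```swift
import Foundation

let composed = "Pok\u{00E9}mon"      // é as a single scalar
let decomposed = "Poke\u{0301}mon"   // e followed by a combining accent
let nsComposed = NSString(string: composed)

// length counts UTF-16 code units, so the two spellings differ.
let composedLength = nsComposed.length                       // 7
let decomposedLength = NSString(string: decomposed).length   // 8

// compare(_:) is called without the .literal option here, so
// canonically equivalent sequences compare as the same text.
let sameText = nsComposed.compare(decomposed) == .orderedSame
```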

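Why does Unicode support more than one representation of the same character at all? Precomposed characters are what keep the Unicode code point range compatible with Latin-1, which already contains characters such as é and ñ. They can be a bit of a pain to deal with, but they make conversion between the two encodings quick and simple.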

// CR+LF is a single Character.
let crlf = "\r\n"
crlf.count // 1

Emoji

  • Many emoji are assigned Unicode scalars that don’t fit in a single UTF-16 code unit.
  • For example, the emoji “🚀” is assigned the code point U+1F680, which UTF-16 encodes as two code units (a surrogate pair): 0xD83D and 0xDE80.
  • Languages that represent strings as collections of UTF-16 code units, such as Java or C#, would say that the string “🚀” is two “characters” long.
  • What matters is how the string is exposed to the program, not how it’s stored in memory.
  • Since Swift 5, Swift uses UTF-8 as its internal encoding for native strings, but that’s an implementation detail.
  • Other emoji are composed of multiple scalars. An emoji flag is a combination of two regional indicator symbols that correspond to an ISO country code. Swift treats the flag correctly as one Character.
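
A small sketch of how the same emoji looks through Swift's different views. Escapes are used so the scalars are unambiguous; the names are illustrative:

```swift
let rocket = "\u{1F680}"                        // "🚀"

let characterCount = rocket.count               // 1
let scalarCount = rocket.unicodeScalars.count   // 1
let utf16Units = Array(rocket.utf16)            // [0xD83D, 0xDE80], a surrogate pair
let utf8Count = rocket.utf8.count               // 4

// A flag is two regional indicator scalars but a single Character.
let flag = "\u{1F1E7}\u{1F1F7}"                 // regional indicators B + R
let flagCharacters = flag.count                 // 1
let flagScalars = flag.unicodeScalars.count     // 2
```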

To inspect the Unicode scalars a string is composed of, use the unicodeScalars view.

let flags = "\u{1F1E7}\u{1F1F7}\u{1F1F3}\u{1F1FF}\u{1F1F3}\u{1F1FF}" // 🇧🇷🇳🇿🇳🇿
flags.count // 3
flags.unicodeScalars.map {
    "U+\(String($0.value, radix: 16, uppercase: true))"
}
// ["U+1F1E7", "U+1F1F7", "U+1F1F3", "U+1F1FF", "U+1F1F3", "U+1F1FF"]
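  • Some emoji, such as families or professions, come in too many variations (gender, number of people, skin tone) for each one to get its own code point. Unicode solves this by specifying that these emoji are actually sequences of multiple emoji, combined with the invisible zero-width joiner (ZWJ) character (U+200D).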

You can think of the ZWJ as an invisible hyphen that glues the emoji together.

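let family1 = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}" // 👨‍👩‍👧‍👦, the combined family emoji
let family2 = "👨\u{200D}👩\u{200D}👧\u{200D}👦" // the same sequence, built from four emoji joined by ZWJs

family1 == family2 // true
family1.count // 1
family2.count // 1
family1.unicodeScalars.count // 7 (4 emoji + 3 ZWJs)
family2.unicodeScalars.count // 7
// 👩‍🚒 = 👩 + ZWJ + 🚒
// 👧🏽 = 👧 + 🏽 (the skin tone modifier follows the base emoji directly, without a ZWJ)

family1.utf16.count // 11
family1.utf8.count // 25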

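Rendering these sequences as a single glyph is the job of the operating system. On Apple platforms in 2017, the glyphs the OS includes are a subset of the sequences that the Unicode standard lists as recommended for general interchange (RGI).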

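  • If the system has no glyph for a sequence, the sequence is displayed split into its parts.
  • That only affects display, though; count is still 1.
  • You can’t guarantee that what the user sees is exactly what you saw during development.
  • A language that doesn’t handle (extended) grapheme clusters well runs into all sorts of problems, for example when reversing a string.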
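
The string-reversal problem can be sketched like this, comparing a reversal by Character with a naive reversal by Unicode scalar. The naive version only imitates what a scalar-based language would do, and the names are made up:

```swift
let word = "Poke\u{0301}mon"   // é written as e + combining accent

// Reversing Characters keeps each grapheme cluster intact.
let byCharacter = String(word.reversed())   // "nomékoP"

// Reversing scalars strands the accent: it now follows "m" instead of "e".
let byScalar = String(String.UnicodeScalarView(word.unicodeScalars.reversed()))
```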

Strings and Collections

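Because of grapheme clusters, String deliberately did not conform to Collection before Swift 4, since the conformance could lead to surprising behavior (which is why the characters view was introduced). For example, two separately appended characters that happen to form one grapheme cluster are counted as a single character. From Swift 4 on, giving up so much functionality for the sake of these edge cases was judged not worth it, so String conforms to Collection and all of the Collection methods are available.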
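
A minimal sketch of that edge case, with a made-up example word: appending a combining mark changes the string without changing its count.

```swift
var word = "cafe"
let before = word.count         // 4

// The combining accent merges with the final "e" instead of adding a Character.
word += "\u{0301}"
let after = word.count          // still 4
let lastCharacter = word.last   // "é"

// Collection conformance means the usual collection APIs apply directly.
let firstTwo = word.prefix(2)   // "ca"
```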

Bidirectional, Not Random Access

  • String is not a random-access collection.
  • String conforms only to BidirectionalCollection.
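    • You can start at either end of a string and move forward or backward; the code looks at how adjacent characters combine and skips over the right number of bytes.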

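To walk through all prefixes of a string, you can loop over integer offsets. The downside is that every prefix call walks the string again from the start, so the time complexity is O(n^2).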

(0...count).map(prefix)

The indices property, on the other hand, yields the index of every character, which brings the time complexity down to O(n).

[""] + indices.map { index in self[...index] }
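
Both one-liners only make sense inside an extension on String. A runnable sketch might look like this; the property names are invented here:

```swift
extension String {
    /// All prefixes via integer offsets. Each prefix(_:) call walks
    /// the string from the start, so this is O(n^2) overall.
    var allPrefixesByOffset: [Substring] {
        return (0...count).map { prefix($0) }
    }

    /// All prefixes via indices. Each slice reuses an index that is
    /// already known, so no repeated walking is needed.
    var allPrefixesByIndex: [Substring] {
        return [""] + indices.map { index in self[...index] }
    }
}

let prefixes = "abc".allPrefixesByIndex   // ["", "a", "ab", "abc"]
```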

Range-Replaceable, Not Mutable

  • String also conforms to RangeReplaceableCollection.
  • This means that you can replace a range of characters with another string (even an empty string, which amounts to removeSubrange).
var greeting = "Hello, world!"
if let comma = greeting.firstIndex(of: ",") {
    greeting[..<comma] // Hello
    greeting.replaceSubrange(comma..., with: " again.")
}
greeting // Hello again.
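  • MutableCollection is a classic collection feature, yet String doesn’t conform to it.
  • You can’t replace a single character through subscript assignment. The reason, once again, is variable-width characters.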

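Because the characters in a string can have variable width, changing the width of one element would mean shuffling the later elements up or down in memory. On top of that, every index after the insertion point would be invalidated by that shuffling, which is just as unintuitive. For these reasons, you must use replaceSubrange even when you only want to change a single element.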
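
A small sketch of changing a single character this way, using a one-element range. The sample words are made up:

```swift
var replaced = "cat"

// `replaced[i] = "u"` does not compile: String is not a MutableCollection.
if let i = replaced.firstIndex(of: "a") {
    // A one-element range stands in for single-element assignment.
    replaced.replaceSubrange(i...i, with: "u")
}
// replaced is now "cut"

var removed = "cat"
if let i = removed.firstIndex(of: "a") {
    // The replacement may differ in length; an empty one deletes the range.
    removed.replaceSubrange(i...i, with: "")
}
// removed is now "ct"
```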

String Indices

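The previous section already made use of string indices. Remember: in Swift you can’t fetch a character with an integer subscript such as string[5], and getting the nth character can’t be done in O(1) constant time. The string has to be walked from the start until the nth character is reached, which is O(n).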

  • String.Index, the index type used by String and its views, is an opaque value that essentially stores a byte offset into the string’s in-memory representation (usually UTF-8).
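  • Computing the index of the nth character is still an O(n) operation when you have to start from the beginning of the string.
  • But once you have a valid index, subscripting the string with it takes only O(1) time.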
let s = "abcdef"
let second = s.index(after: s.startIndex)
s[second] // b

let sixth = s.index(second, offsetBy: 4)
s[sixth] // f

// Bounds protection: returns nil instead of crashing when the offset runs past the limit
let safeIdx = s.index(s.startIndex, offsetBy: 400, limitedBy: s.endIndex)
safeIdx // nil

// prefix
s[..<s.index(s.startIndex, offsetBy: 4)] // abcd
s.prefix(4) // abcd
// suffix
s[s.index(s.endIndex, offsetBy: -2)..<s.endIndex] // ef
s.suffix(2) // ef
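
If integer offsets are needed often, the index arithmetic above can be wrapped in a small helper. This is a hypothetical extension, not part of the standard library, and each call still costs O(n):

```swift
extension String {
    /// Hypothetical helper: returns the Character at an integer offset,
    /// or nil when the offset is out of bounds. Walks from startIndex, so O(n).
    func character(atOffset offset: Int) -> Character? {
        guard offset >= 0,
              let i = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              i < endIndex   // offset == count lands exactly on endIndex
        else { return nil }
        return self[i]
    }
}

let sample = "abcdef"
let second = sample.character(atOffset: 1)      // "b"
let missing = sample.character(atOffset: 400)   // nil
```

Keeping the helper's O(n) cost visible in its name or documentation is a design choice worth considering, since an Int subscript would make it look like constant-time access.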