Portable UTF-8: Demo

In this post I will demonstrate the utf8_* family of functions from the Portable UTF-8 library.

Background

UTF-8 encoded characters are multi-byte characters that take a variable number of bytes, between 1 and 4. This affects every string operation, from basic to advanced, because PHP's native string functions operate on single bytes and are not multi-byte aware. Used on UTF-8 text, they can cut a character in the middle, which results in broken characters and unreadable text. Portable UTF-8 solves this by performing all string operations at the character level rather than at the byte level.
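To make the byte-versus-character problem concrete, here is a small sketch that uses only PHP built-ins and does not need Portable UTF-8. The closure name `$isValidUtf8` is my own; it relies on the fact that preg_match() with the /u modifier returns false when the subject is not valid UTF-8.

```php
<?php
// The Urdu word "گیا" written as raw bytes: three characters, two bytes each.
$word = "\xDA\xAF\xDB\x8C\xD8\xA7";

// substr() counts bytes, so asking for "one character" returns half of one.
$firstByte = substr($word, 0, 1);

// Illustrative helper: true only if $s is well-formed UTF-8.
$isValidUtf8 = static function (string $s): bool {
    return preg_match('//u', $s) === 1;
};

var_dump(strlen($word));            // int(6)
var_dump($isValidUtf8($word));      // bool(true)
var_dump($isValidUtf8($firstByte)); // bool(false)
```

The last line is the "broken character" case: a single byte of a two-byte sequence is no longer valid text.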

More help

If you need more help after reading this, have a look at Portable UTF-8's original page, which explains the library and gives a complete list of the available functions. The source code is also fully documented, and more demonstrations may be listed on the original page. You can also ask below.

For demonstration purposes, I am using a multi-byte string copied from BBC Urdu's news page.

    $string = 'Story Title: مریخ پر پانی کی موجودگی کا حتمی ثبوت مل گیا';

Including the Library

It's this simple. No configuration is needed.

    include( '/path/to/portable-utf8-v1-1.php' );

Removing Invalid Characters

The function utf8_clean removes all byte sequences that do not form a valid UTF-8 character. Using it is highly recommended on all input from HTML forms and any other source outside your control. This keeps invalid encodings out of your application and helps against XSS attacks that rely on malformed byte sequences (output escaping is still required):

    $string = utf8_clean( $string );
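The idea behind this kind of cleaning can be sketched with a byte-level regular expression. This is not the library's implementation: `sketch_clean` is a hypothetical name, and the pattern is simply the standard description of well-formed UTF-8 byte sequences (no overlong forms, no surrogates, nothing above U+10FFFF). Everything that does not match is dropped.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_clean(string $bytes): string
{
    // No /u modifier here: the pattern must look at raw bytes.
    $wellFormed = '/
        [\x00-\x7F]                         # 1 byte: ASCII
      | [\xC2-\xDF][\x80-\xBF]              # 2 bytes
      | \xE0[\xA0-\xBF][\x80-\xBF]          # 3 bytes, excluding overlong forms
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}   # 3 bytes
      | \xED[\x80-\x9F][\x80-\xBF]          # 3 bytes, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}       # 4 bytes, planes 1 to 3
      | [\xF1-\xF3][\x80-\xBF]{3}           # 4 bytes, planes 4 to 15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}       # 4 bytes, plane 16
    /x';

    // Collect every well-formed sequence and glue them back together.
    preg_match_all($wellFormed, $bytes, $matches);
    return implode('', $matches[0]);
}

echo sketch_clean("ab\xFFcd"), "\n"; // abcd
```

Valid text passes through unchanged, so it is safe to run on every input.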

String Length

Comparison of utf8_strlen with strlen:

    echo utf8_strlen( $string );
    echo strlen( $string );

    //Output:
    //utf8_strlen    = 56    Number of characters in $string
    //strlen         = 90    Number of bytes in $string
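Where do the two numbers come from? In UTF-8 every character has exactly one lead byte, and all following bytes of that character have the bit pattern 10xxxxxx. Counting the bytes that are not continuation bytes therefore gives the number of characters. The sketch below shows this under the assumption that the input is valid UTF-8; `sketch_strlen` is a hypothetical name, not the library's function.

```php
<?php
// Hypothetical helper; assumes $s is valid UTF-8.
function sketch_strlen(string $s): int
{
    $count = 0;
    for ($i = 0, $bytes = strlen($s); $i < $bytes; $i++) {
        // Continuation bytes look like 10xxxxxx; everything else starts a character.
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

$string = 'Story Title: مریخ پر پانی کی موجودگی کا حتمی ثبوت مل گیا';

echo sketch_strlen($string), "\n"; // 56
echo strlen($string), "\n";        // 90
```

The difference of 34 is the number of continuation bytes: the string has 34 Urdu letters, each stored in two bytes.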

Creating SEO friendly URL Slug

Set the third parameter of utf8_url_slug to true to enable transliteration (not recommended). If you set a maximum character limit for the slug, it will cut words (as shown in the last two output examples below). Future versions may add support for word-boundary detection to avoid cutting words.

    echo utf8_url_slug( $string );
    echo utf8_url_slug( $string , 40 );
    echo utf8_url_slug( $string , 50 );

    //Output:
    //story-title-مریخ-پر-پانی-کی-موجودگی-کا-حتمی-ثبوت-مل-گیا          Complete string
    //story-title-مریخ-پر-پانی-کی-موجودگی-کا-ح                         Slug cut at 40 characters
    //story-title-مریخ-پر-پانی-کی-موجودگی-کا-حتمی-ثبوت-م               Slug cut at 50 characters
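Until the library handles word boundaries itself, the cut can be moved back to the previous hyphen by hand. The following is only a sketch of that idea built on PHP's PCRE functions; `sketch_slug` and its parameters are hypothetical and it does not reproduce utf8_url_slug exactly. It assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper, not part of Portable UTF-8.
function sketch_slug(string $text, int $maxChars = 0): string
{
    // Turn every run of characters that are not letters, marks or digits into one hyphen.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $text);
    // Lower-case ASCII letters only; strtr() maps single bytes, so multi-byte characters are untouched.
    $slug = strtr($slug, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
    $slug = trim($slug, '-');

    if ($maxChars > 0) {
        // With /u, "." matches one whole character, so this keeps at most $maxChars characters.
        preg_match('/^.{0,' . $maxChars . '}/us', $slug, $m);
        $cut = $m[0];
        // If the cut landed inside a word, back up to the previous hyphen.
        if (strlen($cut) < strlen($slug) && $slug[strlen($cut)] !== '-') {
            $pos = strrpos($cut, '-');
            if ($pos !== false) {
                $cut = substr($cut, 0, $pos);
            }
        }
        $slug = rtrim($cut, '-');
    }
    return $slug;
}

echo sketch_slug('Story Title: مریخ پر'), "\n";     // story-title-مریخ-پر
echo sketch_slug('Story Title: مریخ پر', 14), "\n"; // story-title
echo sketch_slug('Story Title: مریخ پر', 16), "\n"; // story-title-مریخ
```

Using substr() on the position of a hyphen is safe here because the hyphen is a single ASCII byte and can never fall inside a multi-byte character.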

Unicode Code Point of a Character

utf8_ord returns the code point of the first valid UTF-8 character in the string.

    echo utf8_ord( $string );
    echo utf8_ord( 'گ' );

    //Output:
    //       83      Code point of S
    //     1711      Code point of گ
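For the curious, this is roughly how a code point is recovered from UTF-8 bytes: the lead byte says how many bytes follow, and the payload bits of all bytes are joined together. The function below is a sketch of that decoding, not the library's code; `sketch_ord` is a hypothetical name and it assumes the string is non-empty and starts with a well-formed sequence.

```php
<?php
// Hypothetical helper; assumes $char starts with a well-formed UTF-8 sequence.
function sketch_ord(string $char): int
{
    $b0 = ord($char[0]);
    if ($b0 < 0x80) {            // 0xxxxxxx: plain ASCII
        return $b0;
    }
    if (($b0 & 0xE0) === 0xC0) { // 110xxxxx + 1 continuation byte
        return (($b0 & 0x1F) << 6) | (ord($char[1]) & 0x3F);
    }
    if (($b0 & 0xF0) === 0xE0) { // 1110xxxx + 2 continuation bytes
        return (($b0 & 0x0F) << 12)
             | ((ord($char[1]) & 0x3F) << 6)
             | (ord($char[2]) & 0x3F);
    }
    if (($b0 & 0xF8) === 0xF0) { // 11110xxx + 3 continuation bytes
        return (($b0 & 0x07) << 18)
             | ((ord($char[1]) & 0x3F) << 12)
             | ((ord($char[2]) & 0x3F) << 6)
             | (ord($char[3]) & 0x3F);
    }
    throw new InvalidArgumentException('Not a UTF-8 lead byte');
}

echo sketch_ord('S'), "\n";        // 83
echo sketch_ord("\xDA\xAF"), "\n"; // 1711 (the two bytes of گ)
```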

Make a Character from Code Point

utf8_chr is the inverse of utf8_ord: it returns the UTF-8 character for a given code point.

    echo utf8_chr( 83 );
    echo utf8_chr( 1711 );

    //Output:
    //      S
    //      گ
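Going the other way is the same bit arithmetic in reverse: pick the sequence length from the size of the code point, then spread its bits over a lead byte and continuation bytes. This is a sketch only; `sketch_chr` is a hypothetical name, and it assumes the argument is a valid Unicode code point (it does not reject surrogates or values above U+10FFFF).

```php
<?php
// Hypothetical helper; assumes $cp is a valid Unicode code point.
function sketch_chr(int $cp): string
{
    if ($cp < 0x80) {     // fits in 7 bits: one byte
        return chr($cp);
    }
    if ($cp < 0x800) {    // fits in 11 bits: two bytes
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {  // fits in 16 bits: three bytes
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // Everything else: four bytes
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo sketch_chr(83), "\n";            // S
echo bin2hex(sketch_chr(1711)), "\n"; // daaf (the two bytes of گ)
```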

Getting Part of a String

utf8_substr takes a start position and an optional length, both counted in characters; negative values count from the end of the string.

    echo utf8_substr( $string , 6 );
    echo utf8_substr( $string , 6 , 4 );
    echo utf8_substr( $string , 6 , -4 );
    echo utf8_substr( $string , -4 );

    //Output:
    //Title: مریخ پر پانی کی موجودگی کا حتمی ثبوت مل گیا
    //Titl
    //Title: مریخ پر پانی کی موجودگی کا حتمی ثبوت مل
    // گیا
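One simple way to get this behaviour with built-ins is to split the string into an array of characters and slice the array, since array_slice() treats negative offsets and lengths the same way substr() does. This is my own sketch, not how the library does it; `sketch_substr` is a hypothetical name and the approach trades speed for simplicity.

```php
<?php
// Hypothetical helper; throws if $s is not valid UTF-8.
function sketch_substr(string $s, int $start, ?int $length = null): string
{
    // With /u, the empty pattern splits between whole characters, not bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    return implode('', array_slice($chars, $start, $length));
}

$urdu  = "\xDA\xAF\xDB\x8C\xD8\xA7"; // the word "گیا" as raw bytes
$title = 'Story Title: ' . $urdu;

echo sketch_substr($title, 6, 4), "\n";  // Titl
echo sketch_substr($title, 6, -4), "\n"; // Title:
echo sketch_substr($title, -4), "\n";    // " گیا" (a space plus the three Urdu letters)
```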

Getting Code Points of all Characters

    print_r( utf8_codepoints( $string ) );         //Integer code points
    print_r( utf8_codepoints( $string , true ) );  //Unicode-style code points

    //Output: Integer code points
    //Array
    //(
    //    [0] => 83
    //    [1] => 116
    //    [2] => 111
    //    [3] => 114
    //    [4] => 121
    //    [5] => 32
    //    [6] => 84
    //    [7] => 105
    //    [8] => 116
    //    [9] => 108
    //    [10] => 101
    //    [11] => 58
    //    [12] => 32
    //    [13] => 1605
    //    [14] => 1585
    //    [15] => 1740
    //    [16] => 1582
    //    [17] => 32
    //    [18] => 1662
    //    [19] => 1585
    //    [20] => 32
    //    [21] => 1662
    //    [22] => 1575
    //    [23] => 1606
    //    [24] => 1740
    //    [25] => 32
    //    [26] => 1705
    //    [27] => 1740
    //    [28] => 32
    //    [29] => 1605
    //    [30] => 1608
    //    [31] => 1580
    //    [32] => 1608
    //    [33] => 1583
    //    [34] => 1711
    //    [35] => 1740
    //    [36] => 32
    //    [37] => 1705
    //    [38] => 1575
    //    [39] => 32
    //    [40] => 1581
    //    [41] => 1578
    //    [42] => 1605
    //    [43] => 1740
    //    [44] => 32
    //    [45] => 1579
    //    [46] => 1576
    //    [47] => 1608
    //    [48] => 1578
    //    [49] => 32
    //    [50] => 1605
    //    [51] => 1604
    //    [52] => 32
    //    [53] => 1711
    //    [54] => 1740
    //    [55] => 1575
    //)

    //Output: Hexadecimal U+xxxx style code points
    //Array
    //(
    //    [0] => U+0053
    //    [1] => U+0074
    //    [2] => U+006f
    //    [3] => U+0072
    //    [4] => U+0079
    //    [5] => U+0020
    //    [6] => U+0054
    //    [7] => U+0069
    //    [8] => U+0074
    //    [9] => U+006c
    //    [10] => U+0065
    //    [11] => U+003a
    //    [12] => U+0020
    //    [13] => U+0645
    //    [14] => U+0631
    //    [15] => U+06cc
    //    [16] => U+062e
    //    [17] => U+0020
    //    [18] => U+067e
    //    [19] => U+0631
    //    [20] => U+0020
    //    [21] => U+067e
    //    [22] => U+0627
    //    [23] => U+0646
    //    [24] => U+06cc
    //    [25] => U+0020
    //    [26] => U+06a9
    //    [27] => U+06cc
    //    [28] => U+0020
    //    [29] => U+0645
    //    [30] => U+0648
    //    [31] => U+062c
    //    [32] => U+0648
    //    [33] => U+062f
    //    [34] => U+06af
    //    [35] => U+06cc
    //    [36] => U+0020
    //    [37] => U+06a9
    //    [38] => U+0627
    //    [39] => U+0020
    //    [40] => U+062d
    //    [41] => U+062a
    //    [42] => U+0645
    //    [43] => U+06cc
    //    [44] => U+0020
    //    [45] => U+062b
    //    [46] => U+0628
    //    [47] => U+0648
    //    [48] => U+062a
    //    [49] => U+0020
    //    [50] => U+0645
    //    [51] => U+0644
    //    [52] => U+0020
    //    [53] => U+06af
    //    [54] => U+06cc
    //    [55] => U+0627
    //)
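The two listings hold the same numbers. The second form is just each integer written in hexadecimal, zero-padded to four digits and prefixed with U+. A quick sketch with sprintf() on a few values taken from the first listing shows the relation (the variable names are my own):

```php
<?php
// A few integer code points from the listing above: S, t, م, ی
$codePoints = [83, 116, 1605, 1740];

// %04x prints lowercase hexadecimal, padded with zeros to at least four digits.
$formatted = array_map(
    static function (int $cp): string {
        return sprintf('U+%04x', $cp);
    },
    $codePoints
);

print_r($formatted); // U+0053, U+0074, U+0645, U+06cc
```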

String Split

Comparison of utf8_split with str_split:

    print_r( utf8_split( $string ) );
    print_r( str_split( $string ) );

    //Output: utf8_split - Correct character handling
    //Array
    //(
    //    [0] => S
    //    [1] => t
    //    [2] => o
    //    [3] => r
    //    [4] => y
    //    [5] =>  
    //    [6] => T
    //    [7] => i
    //    [8] => t
    //    [9] => l
    //    [10] => e
    //    [11] => :
    //    [12] =>  
    //    [13] => م
    //    [14] => ر
    //    [15] => ی
    //    [16] => خ
    //    [17] =>  
    //    [18] => پ
    //    [19] => ر
    //    [20] =>  
    //    [21] => پ
    //    [22] => ا
    //    [23] => ن
    //    [24] => ی
    //    [25] =>  
    //    [26] => ک
    //    [27] => ی
    //    [28] =>  
    //    [29] => م
    //    [30] => و
    //    [31] => ج
    //    [32] => و
    //    [33] => د
    //    [34] => گ
    //    [35] => ی
    //    [36] =>  
    //    [37] => ک
    //    [38] => ا
    //    [39] =>  
    //    [40] => ح
    //    [41] => ت
    //    [42] => م
    //    [43] => ی
    //    [44] =>  
    //    [45] => ث
    //    [46] => ب
    //    [47] => و
    //    [48] => ت
    //    [49] =>  
    //    [50] => م
    //    [51] => ل
    //    [52] =>  
    //    [53] => گ
    //    [54] => ی
    //    [55] => ا
    //)

    //Output: str_split - Broken characters - Incorrect character handling
    //Array
    //(
    //    [0] => S
    //    [1] => t
    //    [2] => o
    //    [3] => r
    //    [4] => y
    //    [5] =>  
    //    [6] => T
    //    [7] => i
    //    [8] => t
    //    [9] => l
    //    [10] => e
    //    [11] => :
    //    [12] =>  
    //    [13] => �
    //    [14] => �
    //    [15] => �
    //    [16] => �
    //    [17] => �
    //    [18] => �
    //    [19] => �
    //    [20] => �
    //    [21] =>  
    //    [22] => �
    //    [23] => �
    //    [24] => �
    //    [25] => �
    //    [26] =>  
    //    [27] => �
    //    [28] => �
    //    [29] => �
    //    [30] => �
    //    [31] => �
    //    [32] => �
    //    [33] => �
    //    [34] => �
    //    [35] =>  
    //    [36] => �
    //    [37] => �
    //    [38] => �
    //    [39] => �
    //    [40] =>  
    //    [41] => �
    //    [42] => �
    //    [43] => �
    //    [44] => �
    //    [45] => �
    //    [46] => �
    //    [47] => �
    //    [48] => �
    //    [49] => �
    //    [50] => �
    //    [51] => �
    //    [52] => �
    //    [53] => �
    //    [54] => �
    //    [55] =>  
    //    [56] => �
    //    [57] => �
    //    [58] => �
    //    [59] => �
    //    [60] =>  
    //    [61] => �
    //    [62] => �
    //    [63] => �
    //    [64] => �
    //    [65] => �
    //    [66] => �
    //    [67] => �
    //    [68] => �
    //    [69] =>  
    //    [70] => �
    //    [71] => �
    //    [72] => �
    //    [73] => �
    //    [74] => �
    //    [75] => �
    //    [76] => �
    //    [77] => �
    //    [78] =>  
    //    [79] => �
    //    [80] => �
    //    [81] => �
    //    [82] => �
    //    [83] =>  
    //    [84] => �
    //    [85] => �
    //    [86] => �
    //    [87] => �
    //    [88] => �
    //    [89] => �
    //)
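The same contrast can be reproduced without the library. In the sketch below, preg_split() with the /u modifier stands in for a character-aware split (this is my substitute, not necessarily what utf8_split does internally), and bin2hex() makes the lone bytes produced by str_split() visible instead of printing them as �.

```php
<?php
// The Urdu word "گیا" written as raw bytes: three characters, six bytes.
$word = "\xDA\xAF\xDB\x8C\xD8\xA7";

// Character-aware split: the empty pattern with /u splits between whole characters.
$chars = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

// Byte-wise split: every element is one byte, i.e. half of a character.
$bytes = str_split($word);

echo count($chars), "\n";                               // 3
echo count($bytes), "\n";                               // 6
echo implode(' ', array_map('bin2hex', $bytes)), "\n";  // da af db 8c d8 a7
```

Each element of `$bytes` is on its own an invalid UTF-8 sequence, which is why a browser shows the replacement character for it.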


