Portable UTF-8: Demo

In this post I demonstrate how the utf8_* family of functions from the Portable UTF-8 library works.

Background

UTF-8 encoded characters are multi-byte characters that take a variable number of bytes, between 1 and 4. This affects every string operation, from basic to advanced, because PHP's native string functions operate on single bytes and are not multi-byte aware, which results in broken characters and unreadable text. Portable UTF-8 solves this by performing all string operations at the character level rather than at the byte level.
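
To see the problem in isolation, here is a minimal sketch that uses only PHP's built-in functions, not Portable UTF-8. It assumes the source file is saved as UTF-8; the variable names are mine and purely illustrative.

```php
<?php
// A single Urdu letter: one character, but two bytes in UTF-8.
$char = 'گ';

// strlen() counts bytes, not characters.
$byteCount = strlen($char);

// substr() also works on bytes, so asking for "one character"
// keeps only the first byte of the two-byte sequence.
$broken = substr($char, 0, 1);

// With the /u modifier, preg_match() returns false when the subject
// is not valid UTF-8, which makes it usable as a validity check.
$wholeIsValid  = preg_match('//u', $char) === 1;
$brokenIsValid = preg_match('//u', $broken) === 1;

var_dump($byteCount);      // int(2)
var_dump($wholeIsValid);   // bool(true)
var_dump($brokenIsValid);  // bool(false)
```

The half-cut byte in $broken is exactly the kind of output that shows up as unreadable text in a browser.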

More help

If you need more help after reading this, have a look at Portable UTF-8’s original post, which explains the library and gives a complete list of available functions. The source code is also fully documented, and more demonstrations may be listed on the original page. You can also ask below.

For demo purposes, I am using a multi-byte string copied from BBC Urdu’s news page.

$string = 'Story Title: مریخ پر پانی کی موجودگی کا حتمی ثبوت مل گیا';

Including the Library

It’s this simple. No configuration.

include( '/path/to/portable-utf8.php' );

Removing Invalid Characters

The function utf8_clean removes all byte sequences that do not form a valid UTF-8 character. Using it is highly recommended on all input from HTML forms and any other source outside your control. This prevents invalid encodings and helps defend against XSS:

$string = utf8_clean( $string );
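
I am not reproducing the library's own implementation here. As a rough sketch of the idea, the hypothetical function demo_strip_invalid_utf8 below keeps only byte sequences that are well-formed UTF-8 and silently drops everything else. The function name and the exact pattern are my assumptions, not part of Portable UTF-8.

```php
<?php
// Hypothetical helper, not the library's code: keep only well-formed
// UTF-8 sequences and drop any other bytes.
function demo_strip_invalid_utf8(string $bytes): string
{
    // Byte-level pattern (no /u modifier) for one valid UTF-8 sequence.
    $valid = '/[\x00-\x7F]'                          // 1 byte: ASCII
           . '|[\xC2-\xDF][\x80-\xBF]'               // 2 bytes
           . '|\xE0[\xA0-\xBF][\x80-\xBF]'           // 3 bytes, no overlongs
           . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}'    // 3 bytes
           . '|\xED[\x80-\x9F][\x80-\xBF]'           // 3 bytes, no surrogates
           . '|\xF0[\x90-\xBF][\x80-\xBF]{2}'        // 4 bytes, no overlongs
           . '|[\xF1-\xF3][\x80-\xBF]{3}'            // 4 bytes
           . '|\xF4[\x80-\x8F][\x80-\xBF]{2}/';      // 4 bytes, up to U+10FFFF

    // Bytes that match no alternative are skipped by the matcher.
    preg_match_all($valid, $bytes, $matches);

    return implode('', $matches[0]);
}

// A stray lead byte (\xC3) in the middle of ASCII text is removed.
echo demo_strip_invalid_utf8("ab\xC3cd"); // abcd
```

Valid text passes through unchanged, so running a filter like this on every input is harmless for well-formed strings.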

String Length

Comparison of utf8_strlen with strlen:

echo utf8_strlen( $string );
echo strlen( $string );
 
//Output:
//utf8_strlen    = 56    Number of characters in $string
//strlen         = 90    Number of bytes in $string
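
Why the two numbers differ can be shown with a small sketch. In UTF-8 every continuation byte has the bit pattern 10xxxxxx, so counting the bytes that are not continuation bytes gives the number of characters. The function demo_char_count is a hypothetical name of mine, it assumes valid UTF-8 input, and it is not necessarily how utf8_strlen works internally.

```php
<?php
// Hypothetical helper: count characters by skipping continuation bytes.
// Assumes $s is valid UTF-8.
function demo_char_count(string $s): int
{
    $count = 0;
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        // Continuation bytes look like 10xxxxxx; everything else
        // starts a new character.
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

$word = 'مریخ';
echo demo_char_count($word), "\n"; // 4
echo strlen($word), "\n";          // 8
```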

Creating SEO friendly URL Slug

Set the third parameter of utf8_url_slug to true to enable transliteration (not recommended). If you set a maximum character limit for the slug, it may cut a word in the middle (as shown in the last two output examples below). Future versions may add word-boundary detection to avoid this.

echo utf8_url_slug( $string );
echo utf8_url_slug( $string , 40 );
echo utf8_url_slug( $string , 50 );
 
//Output
//story-title-مریخ-پر-پانی-کی-موجودگی-کا-حتمی-ثبوت-مل-گیا          Complete string
//story-title-مریخ-پر-پانی-کی-موجودگی-کا-ح                         40-character slug
//story-title-مریخ-پر-پانی-کی-موجودگی-کا-حتمی-ثبوت-م               50-character slug
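
To illustrate the idea, and what a word-aware cut could look like, here is a sketch built on PHP's PCRE functions only. Both demo_slug and demo_cut_slug_at_word are hypothetical helpers of mine; they are not the library's code and they do not describe how a future version would actually do it. Both assume valid UTF-8 input.

```php
<?php
// Hypothetical helper: build a slug while keeping non-Latin letters.
function demo_slug(string $text): string
{
    // Lowercase ASCII letters only; other scripts are left untouched.
    $text = strtr($text, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');

    // Replace every run of characters that are not letters, marks
    // or digits with a single hyphen.
    $slug = preg_replace('/[^\p{L}\p{M}\p{N}]+/u', '-', $text);

    return trim($slug, '-');
}

// Hypothetical helper: cut to $max characters without splitting a word.
function demo_cut_slug_at_word(string $slug, int $max): string
{
    // Split into characters, not bytes.
    $chars = preg_split('//u', $slug, -1, PREG_SPLIT_NO_EMPTY);
    if (count($chars) <= $max) {
        return $slug;
    }

    $cut = implode('', array_slice($chars, 0, $max));

    // If the next character is not a hyphen, the cut fell inside a
    // word, so drop the partial word (when there is an earlier hyphen).
    if ($chars[$max] !== '-') {
        $pos = strrpos($cut, '-');
        if ($pos !== false) {
            $cut = substr($cut, 0, $pos);
        }
    }

    return rtrim($cut, '-');
}

$slug = demo_slug('Story Title: مریخ پر');
echo $slug, "\n";                            // story-title-مریخ-پر
echo demo_cut_slug_at_word($slug, 14), "\n"; // story-title
```

The trade-off is that a word-aware slug can end up noticeably shorter than the requested limit.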

Unicode Code Point of a Character

utf8_ord returns the code point of the first valid UTF-8 character in the string.

echo utf8_ord( $string );
echo utf8_ord( 'گ' );
 
//Output:
//       83      Code Point of S
//     1711      Code Point of گ
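
For the curious, this is roughly what decoding a code point involves. The sketch below, with the hypothetical name demo_ord, reads the lead byte to find the sequence length and then combines the payload bits of the following bytes. It assumes the string is non-empty and starts with a valid UTF-8 sequence, and it is not the library's implementation.

```php
<?php
// Hypothetical helper: code point of the first character.
// Assumes $s is non-empty and begins with a valid UTF-8 sequence.
function demo_ord(string $s): int
{
    $b0 = ord($s[0]);

    if ($b0 < 0x80) {                       // 0xxxxxxx: 1 byte
        return $b0;
    }
    if (($b0 & 0xE0) === 0xC0) {            // 110xxxxx: 2 bytes
        return (($b0 & 0x1F) << 6)
             | (ord($s[1]) & 0x3F);
    }
    if (($b0 & 0xF0) === 0xE0) {            // 1110xxxx: 3 bytes
        return (($b0 & 0x0F) << 12)
             | ((ord($s[1]) & 0x3F) << 6)
             | (ord($s[2]) & 0x3F);
    }
    // 11110xxx: 4 bytes
    return (($b0 & 0x07) << 18)
         | ((ord($s[1]) & 0x3F) << 12)
         | ((ord($s[2]) & 0x3F) << 6)
         | (ord($s[3]) & 0x3F);
}

echo demo_ord('Story'), "\n"; // 83
echo demo_ord('گ'), "\n";     // 1711
```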

Make a Character from Code Point

echo utf8_chr( 83 );
echo utf8_chr( 1711 );
 
//Output:
//      S
//      گ
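
Going the other way is the same bit arithmetic in reverse. The hypothetical demo_chr below is a sketch, not the library's code: it picks the sequence length from the size of the code point and spreads the bits over a lead byte and continuation bytes. It assumes a valid code point between 0 and 0x10FFFF and does not reject surrogates.

```php
<?php
// Hypothetical helper: encode one code point as UTF-8 bytes.
// Assumes 0 <= $cp <= 0x10FFFF; surrogates are not checked.
function demo_chr(int $cp): string
{
    if ($cp < 0x80) {                       // 1 byte
        return chr($cp);
    }
    if ($cp < 0x800) {                      // 2 bytes
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {                    // 3 bytes
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo demo_chr(83), "\n";            // S
echo bin2hex(demo_chr(1711)), "\n"; // daaf
```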

Getting part of a String

echo utf8_substr( $string , 6 );
echo utf8_substr( $string , 6 , 4 );
echo utf8_substr( $string , 6 , -4 );
echo utf8_substr( $string , -4 );
 
//Output:
//Title: مریخ پر پانی کی موجودگی کا حتمی ثبوت مل گیا
//Titl
//Title: مریخ پر پانی کی موجودگی کا حتمی ثبوت مل
// گیا
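
The offsets and lengths above count characters, and negative values count from the end, just like PHP's own substr does for bytes. A minimal way to get that behaviour with built-in functions is to split the string into characters first and slice the resulting array. The function demo_substr is a hypothetical name of mine, it assumes valid UTF-8 input, and it is a sketch rather than the library's implementation.

```php
<?php
// Hypothetical helper: character-level substring.
// Assumes $s is valid UTF-8.
function demo_substr(string $s, int $offset, ?int $length = null): string
{
    // The /u modifier makes the empty pattern split between
    // characters instead of between bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

    // array_slice() treats negative offsets and lengths
    // as counted from the end of the array.
    return implode('', array_slice($chars, $offset, $length));
}

$s = 'Title: گیا';
echo demo_substr($s, 0, 5), "\n";  // Title
echo demo_substr($s, 0, -4), "\n"; // Title:
```

Splitting the whole string is wasteful for long inputs, so treat this as an illustration of the semantics only.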

Getting Code Points of all Characters

    print_r( utf8_codepoints( $string ) );        //Integer Code Points
    print_r( utf8_codepoints( $string , true ) );  //Unicode Style Code Points
     
    //Output: Integer Code Points
    //Array
    //(
    //    [0] => 83
    //    [1] => 116
    //    [2] => 111
    //    [3] => 114
    //    [4] => 121
    //    [5] => 32
    //    [6] => 84
    //    [7] => 105
    //    [8] => 116
    //    [9] => 108
    //    [10] => 101
    //    [11] => 58
    //    [12] => 32
    //    [13] => 1605
    //    [14] => 1585
    //    [15] => 1740
    //    [16] => 1582
    //    [17] => 32
    //    [18] => 1662
    //    [19] => 1585
    //    [20] => 32
    //    [21] => 1662
    //    [22] => 1575
    //    [23] => 1606
    //    [24] => 1740
    //    [25] => 32
    //    [26] => 1705
    //    [27] => 1740
    //    [28] => 32
    //    [29] => 1605
    //    [30] => 1608
    //    [31] => 1580
    //    [32] => 1608
    //    [33] => 1583
    //    [34] => 1711
    //    [35] => 1740
    //    [36] => 32
    //    [37] => 1705
    //    [38] => 1575
    //    [39] => 32
    //    [40] => 1581
    //    [41] => 1578
    //    [42] => 1605
    //    [43] => 1740
    //    [44] => 32
    //    [45] => 1579
    //    [46] => 1576
    //    [47] => 1608
    //    [48] => 1578
    //    [49] => 32
    //    [50] => 1605
    //    [51] => 1604
    //    [52] => 32
    //    [53] => 1711
    //    [54] => 1740
    //    [55] => 1575
    //)
    //Output: Hexadecimal U+xxxx style Code Points
    //Array
    //(
    //    [0] => U+0053
    //    [1] => U+0074
    //    [2] => U+006f
    //    [3] => U+0072
    //    [4] => U+0079
    //    [5] => U+0020
    //    [6] => U+0054
    //    [7] => U+0069
    //    [8] => U+0074
    //    [9] => U+006c
    //    [10] => U+0065
    //    [11] => U+003a
    //    [12] => U+0020
    //    [13] => U+0645
    //    [14] => U+0631
    //    [15] => U+06cc
    //    [16] => U+062e
    //    [17] => U+0020
    //    [18] => U+067e
    //    [19] => U+0631
    //    [20] => U+0020
    //    [21] => U+067e
    //    [22] => U+0627
    //    [23] => U+0646
    //    [24] => U+06cc
    //    [25] => U+0020
    //    [26] => U+06a9
    //    [27] => U+06cc
    //    [28] => U+0020
    //    [29] => U+0645
    //    [30] => U+0648
    //    [31] => U+062c
    //    [32] => U+0648
    //    [33] => U+062f
    //    [34] => U+06af
    //    [35] => U+06cc
    //    [36] => U+0020
    //    [37] => U+06a9
    //    [38] => U+0627
    //    [39] => U+0020
    //    [40] => U+062d
    //    [41] => U+062a
    //    [42] => U+0645
    //    [43] => U+06cc
    //    [44] => U+0020
    //    [45] => U+062b
    //    [46] => U+0628
    //    [47] => U+0648
    //    [48] => U+062a
    //    [49] => U+0020
    //    [50] => U+0645
    //    [51] => U+0644
    //    [52] => U+0020
    //    [53] => U+06af
    //    [54] => U+06cc
    //    [55] => U+0627
    //)
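
The two output styles carry the same information: the U+xxxx form is the integer written in hexadecimal, padded to at least four digits. The sketch below converts between them with two hypothetical helpers of mine, demo_format_codepoints and demo_parse_codepoint. It uses lowercase hex digits to match the output above.

```php
<?php
// Hypothetical helper: integers to U+xxxx strings (lowercase hex).
function demo_format_codepoints(array $codepoints): array
{
    return array_map(
        function (int $cp): string {
            return sprintf('U+%04x', $cp);
        },
        $codepoints
    );
}

// Hypothetical helper: a U+xxxx string back to an integer.
function demo_parse_codepoint(string $notation): int
{
    // Drop the leading "U+" and read the rest as hexadecimal.
    return (int) hexdec(substr($notation, 2));
}

print_r(demo_format_codepoints([83, 1711]));
echo demo_parse_codepoint('U+06af'), "\n"; // 1711
```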

String Split

Comparison of utf8_split with str_split:

    print_r( utf8_split( $string ) );
    print_r( str_split( $string ) );
     
    //Output: utf8_split - Correct character handling
    //Array
    //(
    //    [0] => S
    //    [1] => t
    //    [2] => o
    //    [3] => r
    //    [4] => y
    //    [5] =>  
    //    [6] => T
    //    [7] => i
    //    [8] => t
    //    [9] => l
    //    [10] => e
    //    [11] => :
    //    [12] =>  
    //    [13] => م
    //    [14] => ر
    //    [15] => ی
    //    [16] => خ
    //    [17] =>  
    //    [18] => پ
    //    [19] => ر
    //    [20] =>  
    //    [21] => پ
    //    [22] => ا
    //    [23] => ن
    //    [24] => ی
    //    [25] =>  
    //    [26] => ک
    //    [27] => ی
    //    [28] =>  
    //    [29] => م
    //    [30] => و
    //    [31] => ج
    //    [32] => و
    //    [33] => د
    //    [34] => گ
    //    [35] => ی
    //    [36] =>  
    //    [37] => ک
    //    [38] => ا
    //    [39] =>  
    //    [40] => ح
    //    [41] => ت
    //    [42] => م
    //    [43] => ی
    //    [44] =>  
    //    [45] => ث
    //    [46] => ب
    //    [47] => و
    //    [48] => ت
    //    [49] =>  
    //    [50] => م
    //    [51] => ل
    //    [52] =>  
    //    [53] => گ
    //    [54] => ی
    //    [55] => ا
    //)
     
    //Output: str_split - Broken Characters - Incorrect Character Handling
    //Array
    //(
    //    [0] => S
    //    [1] => t
    //    [2] => o
    //    [3] => r
    //    [4] => y
    //    [5] =>  
    //    [6] => T
    //    [7] => i
    //    [8] => t
    //    [9] => l
    //    [10] => e
    //    [11] => :
    //    [12] =>  
    //    [13] => �
    //    [14] => �
    //    [15] => �
    //    [16] => �
    //    [17] => �
    //    [18] => �
    //    [19] => �
    //    [20] => �
    //    [21] =>  
    //    [22] => �
    //    [23] => �
    //    [24] => �
    //    [25] => �
    //    [26] =>  
    //    [27] => �
    //    [28] => �
    //    [29] => �
    //    [30] => �
    //    [31] => �
    //    [32] => �
    //    [33] => �
    //    [34] => �
    //    [35] =>  
    //    [36] => �
    //    [37] => �
    //    [38] => �
    //    [39] => �
    //    [40] =>  
    //    [41] => �
    //    [42] => �
    //    [43] => �
    //    [44] => �
    //    [45] => �
    //    [46] => �
    //    [47] => �
    //    [48] => �
    //    [49] => �
    //    [50] => �
    //    [51] => �
    //    [52] => �
    //    [53] => �
    //    [54] => �
    //    [55] =>  
    //    [56] => �
    //    [57] => �
    //    [58] => �
    //    [59] => �
    //    [60] =>  
    //    [61] => �
    //    [62] => �
    //    [63] => �
    //    [64] => �
    //    [65] => �
    //    [66] => �
    //    [67] => �
    //    [68] => �
    //    [69] =>  
    //    [70] => �
    //    [71] => �
    //    [72] => �
    //    [73] => �
    //    [74] => �
    //    [75] => �
    //    [76] => �
    //    [77] => �
    //    [78] =>  
    //    [79] => �
    //    [80] => �
    //    [81] => �
    //    [82] => �
    //    [83] =>  
    //    [84] => �
    //    [85] => �
    //    [86] => �
    //    [87] => �
    //    [88] => �
    //    [89] => �
    //)
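
If you want to try the same comparison without the library, the following sketch gets character-level splitting from PHP's PCRE functions. The function demo_split is a hypothetical name of mine and is not how utf8_split is necessarily implemented. It assumes valid UTF-8 input and a chunk length of at least 1.

```php
<?php
// Hypothetical helper: split into chunks of $length characters.
// Assumes $s is valid UTF-8 and $length >= 1.
function demo_split(string $s, int $length = 1): array
{
    // Split between characters (not bytes) thanks to the /u modifier.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

    // Group the characters and glue each group back into a string.
    return array_map(
        function (array $chunk): string {
            return implode('', $chunk);
        },
        array_chunk($chars, $length)
    );
}

$word = 'گیا';
echo count(demo_split($word)), "\n"; // 3
echo count(str_split($word)), "\n";  // 6
```

Each of the three Urdu letters is two bytes long, which is why str_split returns six broken pieces.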