Portable UTF-8: Demo

In this post I will demonstrate how the utf8_* family of functions from the Portable UTF-8 library works.

Background

UTF-8 encoded characters are multi-byte characters that take a variable number of bytes, between 1 and 4. This affects every string operation, from basic to advanced, because PHP's native string handling functions operate on single bytes and do not understand multi-byte characters. The result is broken characters and unreadable text. Portable UTF-8 solves this by performing all string operations at the character level rather than at the byte level.
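To make the "1 to 4 bytes" point concrete, here is a small sketch that uses only PHP's built-in strlen, which counts bytes. The four sample characters are my own picks for illustration and are written as \u{...} escapes (PHP 7+) so the file encoding does not matter.

```php
<?php
// One sample character for each possible UTF-8 length (illustrative picks).
$samples = [
    "A",          // U+0041, Latin letter
    "\u{0627}",   // U+0627, Arabic letter alef (used in Urdu)
    "\u{20AC}",   // U+20AC, euro sign
    "\u{1F600}",  // U+1F600, emoji
];

// strlen() counts bytes, not characters.
$byteCounts = array_map('strlen', $samples);

foreach ($samples as $i => $char) {
    printf("%s takes %d byte(s)\n", $char, $byteCounts[$i]);
}
```

Each of these is a single character to a reader, yet the byte counts differ, which is why byte-oriented functions go wrong on such text.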

More help

If you need more help after reading this, have a look at Portable UTF-8’s original post, which explains the library and gives a complete list of available functions. The source code is also fully documented, and more demonstrations may be listed on the original page. You can also ask below.

For demo purposes, I am using a multi-byte string copied from BBC Urdu’s news page.

Including the Library

It's this simple. No configuration is needed.

Removing Invalid Characters

The function utf8_clean removes all byte sequences that do not form a valid UTF-8 character. Using this function is highly recommended on all input from HTML forms and any other source outside your control. This prevents invalid encodings and helps you defend against XSS:
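To illustrate what "removing invalid byte sequences" means, here is a minimal sketch that keeps only well-formed UTF-8 sequences using a byte-level regular expression. The function name strip_invalid_utf8 is hypothetical, and this is not necessarily how utf8_clean is implemented; it only shows the idea.

```php
<?php
// Hypothetical helper: keep only byte sequences that form valid UTF-8 characters.
function strip_invalid_utf8(string $input): string
{
    // Byte-level pattern (no "u" modifier) listing every well-formed UTF-8 sequence.
    $valid = '/
          [\x00-\x7F]                        # 1 byte: ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, excluding overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, excluding overlong forms
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, up to U+10FFFF
    /x';

    // Collect every valid sequence and glue them back together;
    // anything the pattern cannot match is silently dropped.
    preg_match_all($valid, $input, $matches);

    return implode('', $matches[0]);
}

// "\xFF" can never appear in UTF-8, so it is removed; the valid "é" survives.
$cleaned = strip_invalid_utf8("caf\xC3\xA9 \xFF ok");
```

Dropping bad bytes is one policy among several; replacing them with U+FFFD is another common choice.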

String Length

Comparison of utf8_strlen with strlen:
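As a stand-in for the comparison, the sketch below contrasts strlen with a character count obtained from PHP's built-in PCRE functions under the u modifier. It does not call utf8_strlen itself, and the sample word is my own short Urdu word, not the BBC string mentioned above.

```php
<?php
// Stand-in sample: the word "Urdu" written in Urdu script (4 characters).
$word = "\u{0627}\u{0631}\u{062F}\u{0648}";

// Byte count: each of these letters takes 2 bytes in UTF-8.
$bytes = strlen($word);

// Character count: with the "u" modifier, "." matches one whole character.
$chars = preg_match_all('/./su', $word);

printf("bytes: %d, characters: %d\n", $bytes, $chars); // bytes: 8, characters: 4
```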

Creating an SEO-Friendly URL Slug

Set the third parameter of utf8_url_slug to true to enable transliteration (not recommended). If you set a maximum character limit for the slug, the slug is cut at that length even if this falls in the middle of a word. Future versions may add word boundary detection to avoid cutting words.
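The word-cut behaviour is easy to see in a rough sketch. The function sketch_slug and its parameters are hypothetical and are not the signature of utf8_url_slug; it simply replaces runs of non-letter, non-digit characters with a hyphen and then cuts at a character limit, without transliteration.

```php
<?php
// Hypothetical slug helper, for illustration only (not the library's function).
function sketch_slug(string $text, int $maxLength = 0): string
{
    // Replace every run of characters that are not letters or digits with "-".
    $slug = preg_replace('/[^\p{L}\p{N}]+/u', '-', $text);
    $slug = trim((string) $slug, '-');

    if ($maxLength > 0) {
        // Cut by characters, not bytes, so no character is torn apart.
        $chars = preg_split('//u', $slug, -1, PREG_SPLIT_NO_EMPTY);
        $slug  = rtrim(implode('', array_slice($chars, 0, $maxLength)), '-');
    }

    return $slug;
}

$title = "UTF-8 \u{0627}\u{0631}\u{062F}\u{0648} demo";

$full = sketch_slug($title);     // all three words kept, joined by hyphens
$cut  = sketch_slug($title, 8);  // limit falls inside the Urdu word, so it is cut
```

With a limit of 8 characters, only the first two letters of the Urdu word survive, which is exactly the kind of cut that word boundary detection would avoid.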

Unicode Code Point of a Character

Returns the code point of the first legal UTF-8 character in the string.
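For readers curious about what getting a code point involves, here is a small sketch that decodes the first character by hand. The name first_code_point is hypothetical, and unlike the library function, this sketch assumes the input is already valid UTF-8 rather than skipping illegal bytes.

```php
<?php
// Hypothetical helper: code point of the first character, or null if there is none.
// Assumes valid UTF-8; preg_match() fails on invalid input, which also yields null.
function first_code_point(string $s): ?int
{
    if (!preg_match('/./su', $s, $m)) {
        return null;
    }

    $bytes = array_values(unpack('C*', $m[0]));
    $count = count($bytes);

    // A 1-byte character is its own code point.
    if ($count === 1) {
        return $bytes[0];
    }

    // The lead byte stores its payload below the length marker bits.
    $codePoint = $bytes[0] & (0xFF >> ($count + 1));

    // Each continuation byte contributes its low 6 bits.
    for ($i = 1; $i < $count; $i++) {
        $codePoint = ($codePoint << 6) | ($bytes[$i] & 0x3F);
    }

    return $codePoint;
}

$cp = first_code_point("\u{0627}bc"); // 0x0627 (1575), Arabic letter alef
```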

Make a Character from Code Point
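The reverse direction can be sketched with plain bit operations. The function name code_point_to_utf8 is hypothetical and only illustrates the encoding rules; it is not taken from the library.

```php
<?php
// Hypothetical helper: encode one Unicode code point as a UTF-8 string.
function code_point_to_utf8(int $cp): string
{
    // Reject values outside Unicode and the surrogate range, which UTF-8 forbids.
    if ($cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)) {
        throw new InvalidArgumentException("Not a valid code point: $cp");
    }

    if ($cp < 0x80) {            // 1 byte
        return chr($cp);
    }
    if ($cp < 0x800) {           // 2 bytes
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {         // 3 bytes
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }

    return chr(0xF0 | ($cp >> 18))   // 4 bytes
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

$alef = code_point_to_utf8(0x0627); // Arabic letter alef, 2 bytes
```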

Getting Part of a String
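The difference between cutting by bytes and cutting by characters can be sketched as follows. The helper char_substr is hypothetical and built on PHP's PCRE functions; it is not the library's own substring function, and the sample word is my own.

```php
<?php
// Hypothetical helper: substring by character position instead of byte offset.
function char_substr(string $s, int $start, ?int $length = null): string
{
    // Split into whole characters; returns false if the input is not valid UTF-8.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return '';
    }

    return implode('', array_slice($chars, $start, $length));
}

// Stand-in sample: a 4-letter Urdu word, 2 bytes per letter.
$word = "\u{0627}\u{0631}\u{062F}\u{0648}";

// Byte-based: 3 bytes is one and a half letters, so the result ends mid-character.
$broken = substr($word, 0, 3);

// Character-based: the second and third letters, intact.
$middle = char_substr($word, 1, 2);
```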

Getting Code Points of all Characters
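One way to list the code points of a whole string is a single pass over its bytes. The sketch below uses a hypothetical function named code_points and assumes the input is valid UTF-8 (for example, already cleaned); it is not the library's implementation.

```php
<?php
// Hypothetical helper: list the code points of a valid UTF-8 string.
function code_points(string $s): array
{
    $out  = [];
    $cp   = 0; // code point being assembled
    $need = 0; // continuation bytes still expected

    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        $byte = ord($s[$i]);

        if ($byte < 0x80) {                    // ASCII, complete on its own
            $out[] = $byte;
        } elseif (($byte & 0xC0) === 0x80) {   // continuation byte: add 6 bits
            $cp = ($cp << 6) | ($byte & 0x3F);
            if (--$need === 0) {
                $out[] = $cp;
            }
        } elseif (($byte & 0xE0) === 0xC0) {   // lead byte of a 2-byte character
            $cp = $byte & 0x1F;
            $need = 1;
        } elseif (($byte & 0xF0) === 0xE0) {   // lead byte of a 3-byte character
            $cp = $byte & 0x0F;
            $need = 2;
        } else {                               // lead byte of a 4-byte character
            $cp = $byte & 0x07;
            $need = 3;
        }
    }

    return $out;
}

$points = code_points("A\u{0627}\u{20AC}\u{1F600}");
```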

String Split

Comparison of utf8_split with str_split:
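As a stand-in for this comparison, the sketch below sets str_split against a character-aware split done with PHP's built-in preg_split and the u modifier. It does not call utf8_split itself, and the sample word is my own.

```php
<?php
// Stand-in sample: a 4-letter Urdu word, 2 bytes per letter.
$word = "\u{0627}\u{0631}\u{062F}\u{0648}";

// str_split() cuts after every byte, so each letter is torn into two halves.
$byByte = str_split($word);

// An empty pattern with the "u" modifier splits between whole characters.
$byChar = preg_split('//u', $word, -1, PREG_SPLIT_NO_EMPTY);

printf("str_split: %d pieces, character split: %d pieces\n", count($byByte), count($byChar));
// str_split: 8 pieces, character split: 4 pieces
```

None of the eight pieces from str_split is a valid UTF-8 character on its own, which is why printing them yields unreadable output.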