Portable UTF-8 – A Lightweight Library for Unicode Handling in PHP

Portable UTF-8 is a Unicode-aware alternative to PHP’s native string handling API. It is written in PHP and works without mbstring, iconv, UTF-8 support in PCRE, or any other library. Its benefits are that it is lightweight, fast, easy to use and easy to bundle, and because it has no dependencies, it runs on any PHP installation.

Download: Portable UTF-8 v1.3
Release notes | License

Portable UTF-8 on the web

Why Portable UTF-8?

PHP 7 and earlier versions have no native Unicode support: the native string functions operate on bytes, not characters. To bridge the gap, several extensions exist, such as mbstring, iconv and intl.
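
To make the gap concrete, here is a minimal sketch that uses only PHP's built-in strlen() and strrev(). The variable names are illustrative, and the string is written with byte escapes so the example does not depend on the file's encoding.

```php
<?php
// "é" is a single character but two bytes in UTF-8 (0xC3 0xA9).
$word = "caf\xC3\xA9"; // "café"

// strlen() counts bytes, so it reports 5 for a 4-character word.
$byteLength = strlen($word);

// strrev() reverses bytes, so the two bytes of "é" end up swapped
// and the result is no longer valid UTF-8.
$reversed = strrev($word);

echo $byteLength, "\n";        // 5
echo bin2hex($reversed), "\n"; // a9c3666163
```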

The problem with mbstring and the others is that most of the time you cannot guarantee that a specific extension is present on a server. If you rely on one of them, your application is no longer portable. The problem becomes even more severe for open source applications that have to run on different servers with different configurations. With this in mind, I decided to write a library:

  • that anyone can use with a simple include;
  • that facilitates working with UTF-8 independently of third-party extensions;
  • that can be bundled freely and used for all purposes (commercial or non-commercial); and
  • that works everywhere, on all platforms and in all environments, without any external support.

Requirements & Recommendations

  • No extensions are required to run this library. Portable UTF-8 only needs the PCRE library, which has been available by default since PHP 4.2.0 and cannot be disabled as of PHP 5.3.0. Support for the u modifier (UTF-8 mode) in PCRE is not required.
  • PHP 4.2 is the minimum requirement; all later versions work with Portable UTF-8.
  • To speed up string handling, it is recommended that you have mbstring or iconv available on your server, as well as a recent version of the PCRE library.
  • Although Portable UTF-8 is easy to use, moving from the native API to Portable UTF-8 may not be straightforward for everyone. In many cases, a native function may be all that you need.
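
The idea of using an extension when it is present and falling back to plain PHP otherwise can be sketched as follows. This is an illustration, not the library's code: the demo_ function names are hypothetical, and only built-in PHP functions (extension_loaded, preg_match, preg_match_all, mb_strlen) are called.

```php
<?php
// Illustrative helpers only. The demo_ names are hypothetical, not library functions.

function demo_has_mbstring(): bool
{
    return extension_loaded('mbstring');
}

function demo_has_iconv(): bool
{
    return extension_loaded('iconv');
}

function demo_pcre_has_utf8(): bool
{
    // With the u modifier, "." matches the two-byte character "ñ" as one unit.
    // Without UTF-8 support in PCRE the call fails or does not match.
    return @preg_match('/^.$/u', "\xC3\xB1") === 1;
}

function demo_char_count(string $str): int
{
    if (demo_has_mbstring()) {
        // Fast path: let the extension count characters.
        return mb_strlen($str, 'UTF-8');
    }
    // Fallback: count every byte that is not a continuation byte (0x80-0xBF).
    // No u modifier is needed because the pattern works on raw bytes.
    return (int) preg_match_all('/[\x00-\x7F\xC0-\xFF]/', $str);
}

echo demo_char_count("caf\xC3\xA9"), "\n"; // 4
```

Both paths agree for valid UTF-8 input; for malformed input their results may differ.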

List of Functions

Here are the functions that the library currently implements. (You may help grow this library by contributing a function through email or through an online code sharing tool.)

Since v 1.3

  • utf8_str_replace – UTF-8 aware replace all occurrences of a string with another string.
  • utf8_str_repeat – Repeat a UTF-8 encoded string.
  • utf8_str_pad – Pad a UTF-8 string to given length with another string.
  • utf8_strrpos – Find position of last occurrence of a char in a UTF-8 string.
  • utf8_remove_duplicates – Removes duplicate occurrences of a string in another string.
  • utf8_ws – Returns an array of Unicode White Space characters.
  • utf8_trim_util (for internal use) – Prepares a string and given chars for trim operations.
  • utf8_trim – Strip whitespace or other characters from beginning and end of a UTF-8 string.
  • utf8_ltrim – Strip whitespace or other characters from beginning of a UTF-8 string.
  • utf8_rtrim – Strip whitespace or other characters from end of a UTF-8 string.
  • utf8_strtolower – Make a UTF-8 string Lower Case.
  • utf8_strtoupper – Make a UTF-8 string Upper Case.
  • utf8_case_table – Returns an array of all lower and upper case UTF-8 encoded characters.
  • utf8_ucfirst – Makes a string’s first character uppercase.
  • utf8_lcfirst – Makes a string’s first character lowercase.
  • utf8_ucwords – Uppercase the first character of each word in a string.
  • utf8_stripos – Find position of first occurrence of a case-insensitive string.
  • utf8_strripos – Find position of last occurrence of a case-insensitive string.
  • mbstring_loaded – Checks whether mbstring is available on the server.
  • iconv_loaded – Checks whether iconv is available on the server.
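
To show why UTF-8 aware versions of functions such as str_pad are needed, here is a minimal sketch of right-padding by character count. It is not the library's implementation: demo_split and demo_str_pad are hypothetical names, the input is assumed to be valid UTF-8, and only right-padding is handled.

```php
<?php
// Hypothetical helpers for illustration; input is assumed to be valid UTF-8.

// Split a string into characters without the u modifier: a character is
// either one ASCII byte or a lead byte followed by its continuation bytes.
function demo_split(string $str): array
{
    preg_match_all('/[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]*/', $str, $matches);
    return $matches[0];
}

// Pad on the right until the string is $length characters (not bytes) long.
function demo_str_pad(string $str, int $length, string $pad = ' '): string
{
    $padChars = demo_split($pad);
    $missing  = $length - count(demo_split($str));
    if ($missing <= 0 || $padChars === []) {
        return $str;
    }
    for ($i = 0; $i < $missing; $i++) {
        // Cycle through the pad string one whole character at a time.
        $str .= $padChars[$i % count($padChars)];
    }
    return $str;
}

// "café" padded to 6 characters with "•" (U+2022, three bytes in UTF-8).
$padded = demo_str_pad("caf\xC3\xA9", 6, "\xE2\x80\xA2");

// The native str_pad() counts bytes, so it would append only the first
// byte of "•" here and leave an invalid sequence at the end.
$native = str_pad("caf\xC3\xA9", 6, "\xE2\x80\xA2");
```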

Since v 1.2

  • utf8_string – Makes a string from UTF-8 code points.
  • utf8_substr_count – Counts the number of substring occurrences.
  • is_ascii – Checks if a string is 7 bit ASCII.
  • utf8_range – Returns an array of characters between two code points (int or hex) or UTF-8 characters.
  • utf8_hash – Generates a hash/string of random UTF-8 characters.
  • utf8_chr_map – Applies callback to all UTF-8 characters.
  • utf8_access (alternative of $string[$i]) – Provides a way to access individual UTF-8 characters.
  • utf8_str_sort – Sorts all characters of a string in ascending or descending order of their code points.
  • utf8_strip_tags – Removes HTML tags from string.
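
Two of the ideas in this list, the 7-bit ASCII check and character access as an alternative to $string[$i], can be sketched in plain PHP as follows. The demo_ names and signatures are hypothetical and are not taken from the library; valid UTF-8 input is assumed.

```php
<?php
// Hypothetical helpers for illustration; input is assumed to be valid UTF-8.

// A string is 7-bit ASCII when it contains no byte above 0x7F.
function demo_is_ascii(string $str): bool
{
    return preg_match('/[^\x00-\x7F]/', $str) === 0;
}

// Return the character (not the byte) at position $index, or null if the
// index is out of range.
function demo_char_at(string $str, int $index): ?string
{
    if (demo_is_ascii($str)) {
        // In pure ASCII, byte offsets and character offsets coincide.
        return ($index >= 0 && $index < strlen($str)) ? $str[$index] : null;
    }
    // Otherwise split into characters: one ASCII byte, or a lead byte
    // followed by its continuation bytes.
    preg_match_all('/[\x00-\x7F]|[\xC0-\xFF][\x80-\xBF]*/', $str, $matches);
    return $matches[0][$index] ?? null;
}

$text = "caf\xC3\xA9!"; // "café!"

// $text[3] is only the first byte of "é"; demo_char_at() returns both bytes.
$byte = $text[3];
$char = demo_char_at($text, 3);
```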

Since v 1.1

  • utf8_ord – Returns Unicode Code Point of UTF-8 encoded character.
  • utf8_chr – Opposite of utf8_ord. Accepts a Unicode Code Point and returns the corresponding UTF-8 encoded character.
  • utf8_strlen – Returns number of UTF-8 characters in the string.
  • utf8_split – Breaks a string into an array of UTF-8 character(s).
  • utf8_chunk_split – Splits a UTF-8 encoded string into smaller chunks of specified length.
  • utf8_substr – Accepts a UTF-8 encoded string and returns a part of it.
  • utf8_rev – UTF-8 aware string reverse.
  • utf8_strpos – Finds the position of a string in another string, and returns the offset UTF-8 character count.
  • utf8_max – Accepts array or string and returns a character with maximum Code Point.
  • utf8_min – Opposite of utf8_max.
  • utf8_word_count – Counts the number of words in a UTF-8 encoded string.
  • utf8_str_shuffle – Shuffles all characters of a UTF-8 encoded string.
  • pcre_utf8_support – Checks if the u modifier is available that enables UTF-8 support in PCRE functions.
  • is_utf8 – Checks if a string is UTF-8 encoded.
  • utf8_url_slug – Creates a UTF-8 encoded URL Slug allowing safe Non-ASCII characters in SEO friendly URLs.
  • utf8_clean – Removes invalid byte sequences from a UTF-8 encoded string.
  • utf8_fits_inside – Checks if the character length of a string is less than or equal to a specific size. Useful for MySQL INSERT.
  • utf8_chr_size_list – Returns an array containing number of bytes (1-4) taken by each UTF-8 encoded character.
  • utf8_max_chr_width – Takes a string and returns the maximum character width of any character in the string. Ranges from 1 to 4.
  • utf8_single_chr_html_encode – Encodes a Unicode character such as Ӓ into its numeric HTML entity form, &#1234;.
  • utf8_html_encode – Same as utf8_single_chr_html_encode, but applies to a whole string and creates a stream of encoded sequences.
  • utf8_bom – Returns the UTF-8 Byte Order Mark (BOM) Character.
  • is_bom – Accepts a multi-byte character and checks whether it is BOM or not.
  • utf8_file_has_bom – Checks if a UTF-8 encoded file has a BOM (at the start).
  • utf8_string_has_bom – Checks if a string starts with BOM.
  • utf8_add_bom_to_string – Prepends BOM character to a string.
  • utf8_count_chars – Accepts a single string argument and returns details of characters in that string.
  • utf8_codepoints – Accepts a string and returns Code Points of all of its characters as integer (e.g. 1740) or as string (e.g. U+06CC).
  • utf8_int_to_hex – Accepts an integer and converts to U+xxxx Unicode representation. Through an optional parameter, you can choose the preferred output format, i.e. U+xxxx, \uxxxx or unformatted hexadecimal.
  • utf8_hex_to_int – Accepts a Code Point as U+xxxx, \uxxxx or plain hexadecimal, and converts to integer.
  • utf8_chr_to_hex – Accepts a UTF-8 encoded character and returns Code Point as U+xxxx. Through an optional parameter, you can choose the preferred output format, i.e. U+xxxx, \uxxxx or unformatted hexadecimal.
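
The conversions between code points and UTF-8 bytes that several of these functions describe can be sketched without any extension. The following is an illustration under stated assumptions, not the library's implementation: the demo_ names are hypothetical, demo_ord expects one non-empty, valid UTF-8 character, and neither function validates its input.

```php
<?php
// Hypothetical helpers for illustration; no input validation is performed.

// Code point -> UTF-8 bytes (1 to 4 bytes, depending on the code point).
function demo_chr(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// First UTF-8 character of $chr -> code point.
function demo_ord(string $chr): int
{
    $b0 = ord($chr[0]);
    if ($b0 < 0x80) {
        return $b0;
    }
    if ($b0 < 0xE0) {
        return (($b0 & 0x1F) << 6) | (ord($chr[1]) & 0x3F);
    }
    if ($b0 < 0xF0) {
        return (($b0 & 0x0F) << 12)
             | ((ord($chr[1]) & 0x3F) << 6)
             | (ord($chr[2]) & 0x3F);
    }
    return (($b0 & 0x07) << 18)
         | ((ord($chr[1]) & 0x3F) << 12)
         | ((ord($chr[2]) & 0x3F) << 6)
         | (ord($chr[3]) & 0x3F);
}

// Numeric HTML entity for a single character.
function demo_html_encode_char(string $chr): string
{
    return '&#' . demo_ord($chr) . ';';
}

// True when the string starts with the UTF-8 BOM bytes EF BB BF.
function demo_has_bom(string $str): bool
{
    return strncmp($str, "\xEF\xBB\xBF", 3) === 0;
}

echo demo_ord("\xDB\x8C"), "\n";              // 1740 (U+06CC)
echo demo_html_encode_char("\xD3\x92"), "\n"; // &#1234;
```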

More functions will come in upcoming versions of the library. Keep visiting this page.
