Portable UTF-8 - A Lightweight Library for Unicode Handling in PHP

The Portable UTF-8 library is a Unicode-aware alternative to PHP's native string handling API. It is written in PHP and works without mbstring, iconv, UTF-8 support in PCRE, or any other library. Portable UTF-8 is lightweight, fast, easy to use and easy to bundle, and because it has no dependencies it works wherever PHP does.


Download: Portable UTF-8 - License
Current Version: 1.3 - release notes


Why Portable UTF-8?

PHP 5 and earlier versions have no native Unicode support. PHP 6 or 7 [1], for which Unicode support has been promised, may be years away. To bridge the gap, several extensions exist, such as mbstring, iconv and intl.

The problem with mbstring and the others is that most of the time you cannot be sure that a specific one is present on a server. If you rely on one of them, your application is no longer portable. The problem becomes even more severe for open source applications that have to run on different servers with different configurations. With this in mind, I decided to write a library:
  • that anyone can use with a simple include;
  • that facilitates working with UTF-8 independently of third-party extensions;
  • that can be bundled freely and used for all purposes (commercial or non-commercial); and
  • that, MOST IMPORTANTLY, works everywhere, on all platforms and in all environments, without any external support.

Requirements & Recommendations

  • No extensions are required to run this library. Portable UTF-8 only needs the PCRE library, which is available by default since PHP 4.2.0 and cannot be disabled since PHP 5.3.0. Support for the u modifier in PCRE (UTF-8 mode) is not required.
  • PHP 4.2 is the minimum requirement, and all later versions are fine with Portable UTF-8.
  • To speed up string handling, it is recommended that you have mbstring or iconv available on your server, as well as the latest version of the PCRE library.
  • Although Portable UTF-8 is easy to use, moving from the native API to Portable UTF-8 may not be straightforward for everyone. It is highly recommended that you do not update your scripts to include Portable UTF-8, or replace or change anything, before you understand the reason and the consequences. Most of the time, a native function may be all you need.
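
To make the "optional extensions" idea concrete, here is a minimal sketch of how a character count could prefer mbstring, then PCRE's u modifier, and finally fall back to plain PHP. The function names (sketch_pcre_has_utf8, sketch_strlen_fallback, sketch_strlen) are made up for this illustration and are not the library's own code; the sketch also assumes the input is valid UTF-8.

```php
<?php
// Hypothetical helper names for illustration; this is not the library's own code.

// True when PCRE understands the u modifier (UTF-8 mode).
function sketch_pcre_has_utf8(): bool
{
    // "\xC3\xB1" is U+00F1 in two bytes; in UTF-8 mode "." matches it as one character.
    return @preg_match('/^.$/u', "\xC3\xB1") === 1;
}

// Pure-PHP fallback: every character contains exactly one byte
// that is not a continuation byte (10xxxxxx), so count those.
function sketch_strlen_fallback(string $s): int
{
    $count = 0;
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        if ((ord($s[$i]) & 0xC0) !== 0x80) {
            $count++;
        }
    }
    return $count;
}

// Count characters, preferring the fastest backend that is available.
function sketch_strlen(string $s): int
{
    if (function_exists('mb_strlen')) {
        return mb_strlen($s, 'UTF-8');
    }
    if (sketch_pcre_has_utf8()) {
        return (int) preg_match_all('/./us', $s);
    }
    return sketch_strlen_fallback($s);
}

$word = "h\xC3\xA9llo"; // "héllo": 5 characters stored in 6 bytes
echo strlen($word), ' bytes, ', sketch_strlen($word), " characters\n";
```

The native strlen counts bytes, which is why the two numbers differ for any string that contains non-ASCII characters.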

List of Functions

Here are the functions that the library currently implements. (You may help grow this library by contributing a function through email or through an online code snippet tool.)

@since v 1.3

| Function Name | For | Details |
|---|---|---|
| utf8_str_replace | str_replace | UTF-8 aware replace all occurrences of a string with another string. |
| utf8_str_repeat | str_repeat | Repeat a UTF-8 encoded string. |
| utf8_str_pad | str_pad | Pad a UTF-8 string to given length with another string. |
| utf8_strrpos | strrpos | Find position of last occurrence of a char in a UTF-8 string. |
| utf8_remove_duplicates | | Removes duplicate occurrences of a string in another string. |
| utf8_ws | | Returns an array of Unicode White Space characters. |
| utf8_trim_util | | For internal use - Prepares a string and given chars for trim operations. |
| utf8_trim | trim | Strip white space or other characters from beginning and end of a UTF-8 string. |
| utf8_ltrim | ltrim | Strip whitespace or other characters from beginning of a UTF-8 string. |
| utf8_rtrim | rtrim | Strip whitespace or other characters from end of a UTF-8 string. |
| utf8_strtolower | strtolower | Make a UTF-8 string Lower Case. |
| utf8_strtoupper | strtoupper | Make a UTF-8 string Upper Case. |
| utf8_case_table | | Returns an array of all lower and upper case UTF-8 encoded characters. |
| utf8_ucfirst | ucfirst | Makes string's first char Uppercase. |
| utf8_lcfirst | lcfirst | Makes string's first char Lowercase. |
| utf8_ucwords | ucwords | Uppercase the first character of each word in a string. |
| utf8_stripos | stripos | Find position of first occurrence of a case-insensitive string. |
| utf8_strripos | strripos | Find position of last occurrence of a case-insensitive string. |
| mbstring_loaded | | Checks whether mbstring is available on the server. |
| iconv_loaded | | Checks whether iconv is available on the server. |
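
As an illustration of why functions such as utf8_str_pad are needed, here is a rough sketch of padding that counts characters instead of bytes. The helpers sketch_chars and sketch_pad_right are hypothetical names invented for this example, they are not how the library itself is written, and they assume valid UTF-8 input.

```php
<?php
// Hypothetical helpers for illustration; not the library's implementation.

// Split a valid UTF-8 string into characters without mbstring or the u modifier:
// one lead byte (ASCII or 11xxxxxx) followed by its continuation bytes (10xxxxxx).
function sketch_chars(string $s): array
{
    preg_match_all('/[\x00-\x7F\xC0-\xFF][\x80-\xBF]*/', $s, $matches);
    return $matches[0];
}

// Pad $s on the right until it is $length characters long.
function sketch_pad_right(string $s, int $length, string $pad = ' '): string
{
    $padChars = sketch_chars($pad);
    $missing  = $length - count(sketch_chars($s));
    if ($missing <= 0 || $padChars === []) {
        return $s; // already long enough, or nothing to pad with
    }
    for ($i = 0; $i < $missing; $i++) {
        // Cycle through the pad string one whole character at a time.
        $s .= $padChars[$i % count($padChars)];
    }
    return $s;
}

$e = "\xC3\xA9"; // "é": 1 character, 2 bytes
echo str_pad($e, 3, '-'), "\n";          // native: pads to 3 bytes
echo sketch_pad_right($e, 3, '-'), "\n"; // sketch: pads to 3 characters
```

The native str_pad sees "é" as two units and adds a single dash, while the character-aware version adds two.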
@since v 1.2

| Function Name | For | Details |
|---|---|---|
| utf8_string | | Makes a string from UTF-8 code points. |
| utf8_substr_count | substr_count | Count the number of sub string occurrences. |
| is_ascii | | Checks if a string is 7 bit ASCII. |
| utf8_range | range | Returns an array of characters between two codepoints (int or hex) or UTF8 characters. |
| utf8_hash | | Generates a hash/string of random UTF-8 characters. |
| utf8_chr_map | | Applies callback to all UTF-8 characters. |
| utf8_access | $string[$i] | Provides a way to access individual UTF-8 characters. |
| utf8_str_sort | | Sort ascending/descending with respect to codepoints of all characters. |
| utf8_strip_tags | strip_tags | Removes HTML tags from string. |
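
To show what building a string from code points and a 7-bit ASCII check involve, here is a small sketch that uses only native PHP. The names sketch_chr, sketch_string and sketch_is_ascii are hypothetical, chosen for this example only, and the encoder assumes code points in the range 0 to 0x10FFFF.

```php
<?php
// Hypothetical helpers for illustration; not the library's implementation.

// Encode one Unicode code point (0..0x10FFFF assumed) as UTF-8 bytes.
function sketch_chr(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp); // 0xxxxxxx
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F)); // 110xxxxx 10xxxxxx
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Build a UTF-8 string from a list of code points.
function sketch_string(array $codePoints): string
{
    return implode('', array_map('sketch_chr', $codePoints));
}

// True when the string contains no byte above 0x7F.
function sketch_is_ascii(string $s): bool
{
    return preg_match('/[^\x00-\x7F]/', $s) === 0;
}

$s = sketch_string([0x48, 0xE9, 0x6CC]); // "H", "é", U+06CC
echo bin2hex($s), "\n";
var_dump(sketch_is_ascii('plain text'), sketch_is_ascii($s));
```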
@since v 1.1

| Function Name | For | Details |
|---|---|---|
| utf8_ord | ord | Returns Unicode Code Point of UTF-8 encoded character. |
| utf8_chr | chr | Opposite of utf8_ord. Accepts a Unicode Code Point and returns the corresponding UTF-8 encoded character. |
| utf8_strlen | strlen | Returns number of UTF-8 characters in the string. |
| utf8_split | str_split | Breaks a string into an array of UTF-8 character(s). |
| utf8_chunk_split | chunk_split | Splits a UTF-8 encoded string into smaller chunks of specified length. For base64, use the native chunk_split. |
| utf8_substr | substr | Accepts a UTF-8 encoded string and returns a part of it. |
| utf8_rev | strrev | UTF-8 aware string reverse. |
| utf8_strpos | strpos | Finds the position of a string in another string, and returns the offset as a UTF-8 character count. |
| utf8_max | max | Accepts an array or a string and returns the character with the maximum Code Point. |
| utf8_min | min | Opposite of utf8_max. |
| utf8_word_count | str_word_count | Counts the number of words in a UTF-8 encoded string. |
| utf8_str_shuffle | | Shuffles all characters of a UTF-8 encoded string. |
| pcre_utf8_support | | Checks if the u modifier is available that enables UTF-8 support in PCRE functions. |
| is_utf8 | | Checks if a string is UTF-8 encoded. |
| utf8_url_slug | | Creates a UTF-8 encoded URL Slug allowing safe Non-ASCII characters in SEO friendly URLs. |
| utf8_clean | | Removes invalid byte sequences from a UTF-8 encoded string. |
| utf8_fits_inside | | Checks if the character length of a string is less than or equal to a specific size. Useful for MySQL INSERT. |
| utf8_chr_size_list | | Returns an array containing the number of bytes (1-4) taken by each UTF-8 encoded character. |
| utf8_max_chr_width | | Takes a string and returns the maximum character width of any character in the string. Ranges from 1 to 4. |
| utf8_single_chr_html_encode | | Encodes a Unicode character like Ӓ into its HTML numeric entity form (&#1234;). |
| utf8_html_encode | | Same as utf8_single_chr_html_encode, but applies to a whole string and creates a stream of encoded sequences. |
| utf8_bom | | Returns the UTF-8 Byte Order Mark (BOM) Character. |
| is_bom | | Accepts a multi-byte character and tells whether it is BOM or not. |
| utf8_file_has_bom | | Checks if a UTF-8 encoded file has a BOM (at the start). |
| utf8_string_has_bom | | Checks if a string starts with BOM. |
| utf8_add_bom_to_string | | Prepends BOM character to a string. |
| utf8_count_chars | | Accepts a single string argument and returns details of characters in that string. |
| utf8_codepoints | | Accepts a string and returns Code Points of all of its characters as integer (e.g. 1740) or as string (e.g. U+06CC). |
| utf8_int_to_hex | | Accepts an integer and converts it to U+xxxx Unicode representation. Through an optional parameter, you can choose the preferred output format, i.e. U+xxxx, \uxxxx or unformatted hexadecimal. |
| utf8_hex_to_int | | Accepts a Code Point as U+xxxx, \uxxxx or plain hexadecimal, and converts it to integer. |
| utf8_chr_to_hex | | Accepts a UTF-8 encoded character and returns its Code Point as U+xxxx. Through an optional parameter, you can choose the preferred output format, i.e. U+xxxx, \uxxxx or unformatted hexadecimal. |
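
The code point and BOM entries above can be illustrated with a short sketch in native PHP. The names sketch_ord, sketch_int_to_hex and sketch_has_bom are hypothetical and only stand in for the ideas; the library's real signatures and optional parameters may differ. The decoder assumes it is handed a single, valid UTF-8 character.

```php
<?php
// Hypothetical helpers for illustration; not the library's implementation.

// Decode one valid UTF-8 encoded character into its Unicode code point.
function sketch_ord(string $chr): int
{
    $b0 = ord($chr[0]);
    if ($b0 < 0x80) {
        return $b0; // single byte (ASCII)
    }
    if (($b0 & 0xE0) === 0xC0) { // two-byte sequence
        return (($b0 & 0x1F) << 6) | (ord($chr[1]) & 0x3F);
    }
    if (($b0 & 0xF0) === 0xE0) { // three-byte sequence
        return (($b0 & 0x0F) << 12)
             | ((ord($chr[1]) & 0x3F) << 6)
             | (ord($chr[2]) & 0x3F);
    }
    // four-byte sequence
    return (($b0 & 0x07) << 18)
         | ((ord($chr[1]) & 0x3F) << 12)
         | ((ord($chr[2]) & 0x3F) << 6)
         | (ord($chr[3]) & 0x3F);
}

// Format a code point in U+xxxx notation (at least four hex digits).
function sketch_int_to_hex(int $cp): string
{
    return sprintf('U+%04X', $cp);
}

// True when the string starts with the UTF-8 BOM bytes EF BB BF.
function sketch_has_bom(string $s): bool
{
    return strncmp($s, "\xEF\xBB\xBF", 3) === 0;
}

$cp = sketch_ord("\xDB\x8C"); // the two bytes of U+06CC
echo $cp, ' ', sketch_int_to_hex($cp), "\n";
var_dump(sketch_has_bom("\xEF\xBB\xBFhello"), sketch_has_bom('hello'));
```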

More functions to come in the upcoming versions of the library. Keep visiting this page, or subscribe via RSS for updates.

Portable UTF-8 on the web




  1. Portable UTF-8 v 1.2 released
  2. Portable UTF-8: Demo
  3. Portable UTF-8 v 1.3 released