How to validate ASCII Text - PHP

ASCII is one of the earliest character encoding schemes, providing a way of encoding control characters, commonly used symbols, letters and digits. Many modern character encoding schemes extend the ASCII character set with additional characters, which maintains backward compatibility while allowing more international characters to be encoded.

ASCII uses a 7-bit character set with at most 128 code points, from decimal 0 (hex 0x00) to decimal 127 (hex 0x7F). A byte (octet) consists of 8 bits and can store values 0 through 255. ASCII does not use values beyond 7 bits, but other character encoding schemes do. ISO 8859-1, for example, uses an 8-bit character set and so provides more characters than ASCII does.
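
To make the byte ranges concrete, here is a small sketch that prints the byte values of the same kind of text in three encodings. The helper name byte_values() is made up for this illustration and is not part of PHP.

```php
<?php
// Hypothetical helper for illustration: return the unsigned value of every
// byte in a string as a plain list of integers.
function byte_values(string $s): array
{
    // unpack('C*', ...) yields one unsigned char per byte (keys start at 1);
    // it returns an empty array for an empty string.
    $bytes = unpack('C*', $s);
    return $bytes ? array_values($bytes) : [];
}

$ascii  = 'A';         // plain ASCII letter: one byte, 0x41
$latin1 = "\xE9";      // "é" in ISO 8859-1: one byte, 0xE9
$utf8   = "\xC3\xA9";  // "é" in UTF-8: two bytes, 0xC3 0xA9

echo implode(' ', byte_values($ascii)), PHP_EOL;   // 65
echo implode(' ', byte_values($latin1)), PHP_EOL;  // 233
echo implode(' ', byte_values($utf8)), PHP_EOL;    // 195 169
```

Only the first string stays within 0 to 127. Both encodings of "é" use byte values above 127, which ASCII never does.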

Validating ASCII

If we iterate through a binary string and look at just the 8th (most significant) bit of each byte, we can easily tell whether the string is ASCII or not. If the 8th bit is set in any byte, the string contains a value above 127 and therefore uses a character encoding other than ASCII.

OK, to do the validation with simple iteration:

    function is_ascii( $string = '' ) {
        $num = 0;
        // Walk the string one byte at a time until we run past the end.
        while( isset( $string[$num] ) ) {
            // 0x80 masks the 8th (most significant) bit of the byte.
            if( ord( $string[$num] ) & 0x80 ) {
                return false;
            }
            $num++;
        }
        return true;
    }

The function takes $string as its only argument and checks each byte in it. The while loop continues as long as there are bytes left in $string. The expression ord($string[$num]) & 0x80 checks whether the 8th bit is on in the current byte. If it is, validation fails and the function returns false, because the byte does not represent a valid ASCII character.
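
The bit test can be seen in isolation in the following sketch. The helper high_bit_set() is a hypothetical name introduced only for this example.

```php
<?php
// Hypothetical helper for illustration: true when the most significant
// (8th) bit of the first byte of $byte is set.
function high_bit_set(string $byte): bool
{
    // 0x80 is binary 10000000, so the AND keeps only the 8th bit.
    return (ord($byte) & 0x80) !== 0;
}

printf("%08b\n", ord('A'));     // 01000001 (8th bit off)
printf("%08b\n", ord("\xE9"));  // 11101001 (8th bit on)

var_dump(high_bit_set('A'));     // bool(false)
var_dump(high_bit_set("\xE9"));  // bool(true)
```

Every value from 0 to 127 has its 8th bit off, and every value from 128 to 255 has it on. One AND against 0x80 is therefore enough to classify a byte.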

The above function gets a little slower for long strings. So, here is a faster alternative based on a regular expression:

    function is_ascii( $string = '' ) {
        return ( bool ) ! preg_match( '/[\\x80-\\xff]+/' , $string );
    }

This function takes $string and checks whether any of its bytes falls within the range 128-255. If so, the function returns false.

Both functions use different methods but are identical in what they do. Both return true if the $string passed is valid 7-bit ASCII, and return false otherwise.
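
As a quick sanity check, the sketch below runs both approaches over a few sample strings. The two functions are renamed is_ascii_loop() and is_ascii_regex() here (names chosen for this example) so that they can live in the same script. Type declarations are added and the regular expression is written in a slightly shorter but equivalent form.

```php
<?php
// Loop-based check, renamed so both versions can coexist in one script.
function is_ascii_loop(string $string = ''): bool
{
    $num = 0;
    while (isset($string[$num])) {
        if (ord($string[$num]) & 0x80) {
            return false;
        }
        $num++;
    }
    return true;
}

// Regex-based check, likewise renamed. Without the /u modifier the
// pattern is matched byte by byte.
function is_ascii_regex(string $string = ''): bool
{
    return !preg_match('/[\x80-\xff]/', $string);
}

// Each pair is [input, expected result].
$samples = [
    ['', true],
    ['Hello, World', true],
    ["tab\tnewline\n", true],   // control characters are still ASCII
    ["caf\xE9", false],         // "café" in ISO 8859-1
    ["caf\xC3\xA9", false],     // "café" in UTF-8
];

foreach ($samples as [$input, $expected]) {
    $ok = is_ascii_loop($input) === $expected
       && is_ascii_regex($input) === $expected;
    echo $ok ? 'ok' : 'MISMATCH', PHP_EOL;
}
```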

is_ascii() may be wrong

is_ascii() may give a misleading answer, because character encodings overlap. ASCII forms a subset of many modern character encoding schemes. For example, a valid ASCII string also validates as UTF-8, Windows-1252 and ISO 8859-1 (just to name a few). This is why many tools dealing with internationalization do not offer an easy or reliable way to detect the character encoding of a string. Any string that validates as ASCII is likely to be ASCII, but there is a chance that the author or creator of the string intends it to be treated differently (as UTF-8, for example). To avoid such problems, it is recommended that you:
  1. Apply the check to a reasonably large sample of characters when validating character encoding. The longer a string is, the greater the variety of characters it is likely to contain, which makes validation more accurate and reliable;
  2. Look for HTML tags that declare the encoding chosen by the document creator, if you are working with HTML documents. Often only the document creator knows how the document is encoded;
  3. Use the charset in the HTTP Content-Type header to get the encoding of documents served through HTTP, for the same reason as mentioned above;
  4. Look for a file signature. A UTF-8 file may begin with a Byte Order Mark that identifies the file as UTF-8. Note, however, that a file with a Byte Order Mark will never validate as ASCII, because the mark is a 3-byte UTF-8 encoded character (bytes 0xEF 0xBB 0xBF) that lies outside the range of 7-bit ASCII.
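
Recommendations 3 and 4 above can be sketched as follows. Both helper names, has_utf8_bom() and charset_from_content_type(), are hypothetical and introduced only for this illustration. The header parsing is deliberately simple and does not try to cover every form a Content-Type value can take.

```php
<?php
// Hypothetical helper: true if $data starts with the UTF-8 Byte Order Mark
// (the three bytes 0xEF 0xBB 0xBF).
function has_utf8_bom(string $data): bool
{
    return strncmp($data, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: pull the charset parameter out of a Content-Type
// header value, lowercased, or return null when none is present.
function charset_from_content_type(string $header): ?string
{
    if (preg_match('/charset\s*=\s*"?([^\s;"]+)/i', $header, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

var_dump(has_utf8_bom("\xEF\xBB\xBFhello"));  // bool(true)
var_dump(has_utf8_bom('hello'));              // bool(false)

var_dump(charset_from_content_type('text/html; charset=UTF-8'));  // string(5) "utf-8"
var_dump(charset_from_content_type('text/plain'));                // NULL
```

A declared charset or a Byte Order Mark tells you what the author intended, which a byte-level check like is_ascii() cannot.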





© 2012-2017 PageConfig.com