How to validate ASCII Text – PHP

ASCII is one of the earliest character encoding schemes. It provides a way of encoding control characters, commonly used symbols, letters and digits. Many modern character encoding schemes extend the ASCII character set with additional characters, which maintains backward compatibility while allowing more international characters to be encoded.

ASCII uses a 7-bit character set with at most 128 code points, from decimal 0 (hex 0x00) to decimal 127 (hex 0x7F). A byte (octet) consists of 8 bits and can store values 0 through 255. ASCII does not use code points beyond 7 bits, but other character encoding schemes do. ISO 8859-1, for example, uses an 8-bit character set and so provides more characters than ASCII does.

Validating ASCII

If we iterate through a binary string and look at just the 8th (most significant) bit of each byte, we can easily tell whether the string can be ASCII. If the 8th bit is set in any byte, the string is certainly not pure ASCII; it uses some other character encoding.
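To make the bit test concrete, here is a minimal sketch of the check on a single byte. The helper name high_bit_set is hypothetical and used only for illustration; it assumes a non-empty input string and looks at its first byte only.

```php
<?php
// Hypothetical helper (not from the article): reports whether the 8th,
// most significant bit of the first byte of $byte is set.
// Assumes $byte is a non-empty string.
function high_bit_set(string $byte): bool
{
    // 0x80 is binary 1000 0000, so the AND keeps only the 8th bit.
    // The parentheses matter: !== binds tighter than & in PHP.
    return (ord($byte[0]) & 0x80) !== 0;
}

$plain    = high_bit_set('A');     // false: 'A' is 0x41 (0100 0001)
$boundary = high_bit_set("\x7F");  // false: 0x7F (0111 1111) is the last ASCII code point
$extended = high_bit_set("\xE9");  // true: 0xE9 (1110 1001) lies outside the 7-bit range
```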

The validation can be done with simple iteration.

An iterative function for this takes a string as its only argument and checks each byte in $string. A while loop continues until it reaches the end of $string. The expression ord($string[$num]) & 0x80 checks whether the 8th bit is on in the current byte. If so, the validation fails and the function returns false, because the byte does not represent a valid ASCII character.
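The loop described above can be sketched roughly as follows. This is a reconstruction from the description, not necessarily the exact listing the author had in mind. The name is_ascii and the variable $num come from the article, while $length and the type declarations are assumptions.

```php
<?php
// Returns true if every byte of $string is in the 7-bit ASCII range
// (0x00 to 0x7F), false as soon as a byte with the 8th bit set is found.
function is_ascii(string $string): bool
{
    $num = 0;
    $length = strlen($string); // strlen() counts bytes, not characters

    while ($num < $length) {
        // 0x80 masks the 8th bit; a non-zero result means a non-ASCII byte.
        if (ord($string[$num]) & 0x80) {
            return false;
        }
        $num++;
    }

    return true;
}
```

With this sketch, is_ascii("Hello, world") returns true, while is_ascii("caf\xC3\xA9") (the UTF-8 bytes of "café") returns false. An empty string contains no offending byte and therefore also returns true.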

The iterative function gets slower for very large strings, so a faster alternative is based on a regular expression.

This function takes $string and checks, byte by byte, whether any of its bytes falls within the range 128 to 255. If so, the function returns false. The two functions use different methods but are identical in what they do: both return true if the $string passed is valid 7-bit ASCII, and false otherwise.
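A regular-expression version along these lines might look like the sketch below. The name is_ascii_regex is hypothetical, chosen here only to keep it distinct from the iterative function. The pattern is one way of expressing the 128 to 255 byte range, not necessarily the exact pattern the author used.

```php
<?php
// Returns true if $string contains no byte in the range 0x80 to 0xFF.
function is_ascii_regex(string $string): bool
{
    // In a single-quoted PHP string, \x80 and \xFF reach PCRE unchanged.
    // Without the /u modifier the subject is matched as raw bytes, so the
    // character class covers byte values 128 through 255.
    // preg_match() returns 0 when nothing matches, i.e. the string is ASCII.
    return preg_match('/[\x80-\xFF]/', $string) === 0;
}
```

The strict comparison with 0 means that a preg_match() failure (which returns false) is reported as "not ASCII" rather than silently passing.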

Character Encodings Overlap

is_ascii() may give a misleading answer, because character encodings overlap. ASCII forms a subset of many modern character encoding schemes. For example, a valid ASCII string is also valid UTF-8, Windows-1252 and ISO 8859-1 (to name just a few). This is why many tools dealing with internationalization do not offer an easy or reliable way to detect the character encoding of a string. Although a string that validates as ASCII is likely to be ASCII, the author or creator of the string may intend it to be treated differently (as UTF-8, for example). To avoid such problems, the following is recommended when validating character encoding:

  • Apply the check to a reasonably large sample of characters. The longer a string is, the greater the variety of characters it is likely to contain, which makes validation more accurate and reliable;
  • If you are working with HTML documents, look for the HTML meta tag to see the encoding declared by the document creator. Often only the document creator knows how the document is encoded;
  • For documents served through HTTP, use the charset parameter in the HTTP Content-Type header to get the encoding of the document;
  • Look for a file signature. A UTF-8 file may start with a Byte Order Mark that identifies the file as UTF-8. Note that a file with a Byte Order Mark would never validate as ASCII anyway, because the Byte Order Mark is a 3-byte UTF-8 encoded character (bytes EF BB BF) that is beyond the scope of 7-bit ASCII.
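Two of the hints above, the Content-Type charset and the Byte Order Mark, are easy to check in code. The following is a minimal sketch under stated assumptions: the function names has_utf8_bom and charset_from_content_type are hypothetical, and the header is assumed to be passed in as a plain string such as "text/html; charset=UTF-8".

```php
<?php
// Hypothetical helper: true if $data starts with the UTF-8 Byte Order Mark
// (the three bytes EF BB BF).
function has_utf8_bom(string $data): bool
{
    return strncmp($data, "\xEF\xBB\xBF", 3) === 0;
}

// Hypothetical helper: extracts the charset parameter from a Content-Type
// header value, or returns null when no charset is declared.
function charset_from_content_type(string $header): ?string
{
    // Accepts both charset=UTF-8 and charset="UTF-8", case-insensitively.
    if (preg_match('/;\s*charset\s*=\s*"?([^";\s]+)/i', $header, $matches) === 1) {
        return $matches[1];
    }
    return null;
}
```

For example, charset_from_content_type('text/html; charset=UTF-8') returns 'UTF-8', and has_utf8_bom("\xEF\xBB\xBFhello") returns true. A declared charset or a signature, when present, is a stronger signal of the author's intent than a byte-level ASCII check alone.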