Validating for real alphabetic
Validation is an essential part of any application. We need to check that the data entered is within the range of data we can handle, not only for security purposes, but also to make sure that it is something we can process.
Not so long ago, I had to do a very common validation: alphabetic characters. Most of us developers would have just created a regular expression against the set A-Za-z, or maybe used another set like \w. Well, this does not always give us what we really want.
In my case, I had to validate for more than just A-Z. That is, my application had to allow for different languages whose alphabets extend the basic 26-letter Latin one.
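A quick check makes the problem concrete. This is a minimal sketch; the sample inputs and the variable names asciiOnly and wordChars are only for illustration:

```javascript
// Two patterns that are commonly used for "alphabetic" input.
var asciiOnly = /^[A-Za-z]+$/;
var wordChars = /^\w+$/;

// Without any flags, \w is just [A-Za-z0-9_] in JavaScript.
console.log(asciiOnly.test("Strauss")); // true
console.log(asciiOnly.test("Strauß"));  // false: ß is outside A-Z
console.log(wordChars.test("Gödel"));   // false: ö is not a \w character
console.log(wordChars.test("user_42")); // true: digits and _ are accepted
```

So \w is both too strict (it rejects accented letters) and too loose (it lets digits and underscores through).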
Accented vowels
Sure, I could add the accented vowels: á, é, í, ó, ú. Well, that’s the acute accent. We have the grave one: à. We have the circumflex: â. Diaeresis: ä. Oh, wait. There are even more. Suddenly, it is too much to remember or to write by hand.
Wait, there’s more…
And not only vowels. It seems that consonants can also be accented. ý. ñ. š. ç.
Oh, there are even more letters. In German, for example, the “ss” letter combination evolved into ß. There are ligatures such as œ and æ, and extra letters such as þ.
These are all, believe it or not, part of the extended Latin alphabet. So, if I wanted Johann Strauß, Kurt Gödel or Maria Skłodowska (later known as Mrs. Curie) to have a user account in my application, I needed to allow this type of input.
Languages provide a tool for that
Some languages do provide a tool for that. For instance, Perl provides the \X operator. This matches a whole Unicode grapheme cluster (a base character plus any combining marks), whatever the script. Anyway, this is a little more than we actually want to achieve.
Another tool some languages provide is the \p{} and \P{} operators, which match (or exclude) characters by Unicode property, such as \p{L} for any letter. This goes for Perl and .NET, and Java supports them too. More information on these special features can be read at the Unicode section of Regex Tutorial.
However, if you’re trying to build a rich web 2.0 application, then you need to have this working in JavaScript too. Of course, server-side validation still needs to be done, but a rich user experience demands that we do not wait for a round trip to the server before giving the user an “Invalid name” message or something alike.
What can we do in JavaScript?
JavaScript does support the \uXXXX escape to match a specific Unicode codepoint. Knowing that, I took a quick look through the Unicode Block Listing and gathered all the codepoints that were part of the Latin or extended Latin alphabet. Here’s what I found:
- From block: C0 Controls and Basic Latin (U+0000–007F)
\u0041-\u005A
\u0061-\u007A
- From block: Latin-1 Supplement (U+0080–00FF)
\u00C0-\u00F6
\u00F8-\u00FF
- From block: Latin Extended-A (U+0100–017F), Latin Extended-B (U+0180–024F), IPA Extensions (U+0250–02AF)
\u0100-\u02AF
- From block: Latin Extended Additional (U+1E00–1EFF)
\u1E00-\u1EFF
- From block: Latin Extended-C (U+2C60–2C7F)
\u2C60-\u2C7F
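To double-check where a given letter falls in these ranges, it helps to print its codepoint. A minimal sketch, where the helper name codePointLabel is made up for this example and an ES2017 engine is assumed (for codePointAt and padStart):

```javascript
// Returns the codepoint of the first character of a string, formatted as U+XXXX.
function codePointLabel(ch) {
  var hex = ch.codePointAt(0).toString(16).toUpperCase();
  return "U+" + hex.padStart(4, "0");
}

console.log(codePointLabel("ß")); // U+00DF
console.log(codePointLabel("ö")); // U+00F6
console.log(codePointLabel("ł")); // U+0142
```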
In case you wonder why the Latin-1 Supplement ranges leave out the \u00F7 codepoint, it’s because it is the division sign (÷). The same goes for \u00D7, the multiplication sign (×), which sits in the middle of \u00C0-\u00F6, so in the expression that range is split into \u00C0-\u00D6 and \u00D8-\u00F6.
OK. Making this all one RegExp (the space at the end of the character class is there on purpose):
var regex = /^[\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u00FF\u0100-\u02AF\u1E00-\u1EFF\u2C60-\u2C7F ]+$/;
Let’s simplify it a little bit (\u00FF and \u0100 are consecutive, so they can be merged into one single range):
var regex = /^[\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02AF\u1E00-\u1EFF\u2C60-\u2C7F ]+$/;
And there it is! You can try it out at the JavaScript Regular Expression Tester!
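Here is one way the expression could be wrapped for use in a form check. This is only a sketch: the function name isLatinName is hypothetical, an ES2015 engine is assumed for normalize, and the character class is written so that it skips both × (\u00D7) and ÷ (\u00F7). The normalization step is an extra precaution, not something the expression requires: it folds a letter typed as a base character plus a combining accent into the single precomposed codepoint that the ranges cover.

```javascript
// Character class from this post, skipping × (U+00D7) and ÷ (U+00F7).
var latinName = /^[\u0041-\u005A\u0061-\u007A\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u02AF\u1E00-\u1EFF\u2C60-\u2C7F ]+$/;

// Hypothetical helper: true when the input holds only Latin letters and spaces.
function isLatinName(input) {
  // NFC turns "o" + U+0308 (combining diaeresis) into the single codepoint ö.
  return latinName.test(input.normalize("NFC"));
}

console.log(isLatinName("Johann Strauß"));    // true
console.log(isLatinName("Maria Skłodowska")); // true
console.log(isLatinName("Kurt Go\u0308del")); // true after normalization
console.log(isLatinName("R2-D2"));            // false: digits and hyphen
console.log(isLatinName(""));                 // false: + needs at least one character
```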
By the way, this expression should work in other languages as well, as long as their regular expression flavor supports the \uXXXX escape.
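For what it is worth, newer JavaScript engines (ES2018 and later) also accept the \p{} operators mentioned earlier when the u flag is set. A sketch of a shorter alternative, assuming such an engine; note that \p{Script=Latin} is not identical to the hand-picked ranges (it covers every character Unicode assigns to the Latin script), so treat it as a related option rather than a drop-in replacement:

```javascript
// Requires the u flag; without it, \p{...} is not a property escape.
var latinScript = /^[\p{Script=Latin} ]+$/u;

console.log(latinScript.test("Kurt Gödel")); // true
console.log(latinScript.test("Strauß"));     // true
console.log(latinScript.test("Smith42"));    // false: digits are not Latin script
console.log(latinScript.test("Иван"));       // false: Cyrillic script
```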