Alpha's Manifesto

A black and white figure's thought-hive

Link del día: Bytecode for Dummies

Para aquellos que desarrollamos sobre lo que se llaman plataformas (como por ejemplo .NET o Java), sabemos que el código que nosotros escribimos no se compila a lenguaje de máquina realmente, sino que se compila en algún lenguaje intermedio que luego es interpretado para una mejor ejecución en la máquina apropiada sobre la que esté corriendo la plataforma.

El punto que muchos dejamos de lado es saber interpretar ese lenguaje intermedio. Este lenguaje muchas veces puede proveernos información muy válida sobre problemas de performance que puede sufrir nuestra aplicación, usos de memoria no liberados, o incluso de la forma en la que se realizan llamadas al sistema operativo.

Charles Nutter realizó una presentación llamada JVM Bytecode for Dummies (and for the rest of you all) que explica detalladamente cómo podemos iniciarnos en este mundo. Él se enfocó en el bytecode de la máquina virtual de Java, pero esto es aplicable a otras máquinas virtuales y a otras plataformas también. Puede que al principio nos maree un poco con ejemplos algo complejos, pero luego la teoría va tomando color hasta ser bastante tangible y podemos entender cómo el bytecode realmente refleja nuestro código. Mejor aún, podemos directamente programar con bytecode y aprovecharnos de eso mismo.

Soy un zorrinito interpretado.

Validating for real alphabetic

Validation is an essential part of any application. We need to check that the data entered is in the range of the set of data we can handle. And not only security purposes, but also to make sure that is is into what we can process.

Not so long ago, I had to make a very common validation: Alphabetic Characters. Most of us developers would have just created a regular expression against the set A-Za-z or maybe using another set like \w. Well, this does not always gives us what we really want.

In my case, I had to validate for more than just A-Z. This is, my application should allow for different languages where the alphabet was extended from the basic Latin 26-letters.

Accented vowels

Sure, I could add the accented vowels. Á, É, Í, Ó, Ú, á, é, í, ó, ú. Well, that’s the acute accent. We have the grave one: à. We have the circumflex: â. Diaeresis: ä. Oh, wait. They’re even more. Suddenly, too much to remember or to manually write.

Wait, there’s more…

And not only vowels. It seems that consonants can also be accented. ý. ñ. š. ç.

Oh, there are even more letters. In German, for example, the “ss” letter combination evolved to ß. Those are ligatures: œ. þ. æ.

These are all, believe it or not, part of the Extended Latin Alphabet. So, if I wanted Johann Strauß, Kurt Gödel or Maria Skłodowska (later known as Mrs. Curie) to have a user in my application, I needed to allow this type of entrances.

Languages provide a tool for that

Some languages do provide a tool for that. For instance, Perl provides the \X operator. This matches any unicode character. Anyway, this is a little more than we want to actually achieve.

Other tool languages provide is the \p{} and \P{} operators. This goes for Perl and .NET. I think Java also does. More information on these special features can be read at the Unicode section of Regex Tutorial.

However, if you’re trying to have a rich web 2.0 application, then you need to have this working in JavaScript too. Of course, server side validations need to be made, but still, a rich user experience demands that we do not wait to go to the server until we give the user a “Invalid name” message or something alike.

What can we do in JavaScript?

JavaScript does provide support for the \uXXXX operator to match a specific unicode codepoint. Knowing that, I made a quick look trought the Unicode Block Listing, and gathered all those points that where part of the Latin or extended Latin alphabet. Here’s what I found:

In case you wonder why the range Latin-1 Supplement leaves out the \u00F7 codepoint, it’s because it is a division symbol.

Ok. Making this all one RegExp (I added a space at the end, that is on purpose):

var regex = new RegExp(/^[\u0041-\u005A\u0061-\u007A\u00C0-\u00F6\u00F8-\u00FF\u0100-\u02AF\u1E00-\u1EFF\u2C60-\u2C7F ]+$/);

Let’s simplify it a little bit (\u00FF and \u0100 are consecutives, we can include them in one single range).

var regex = new RegExp(/^[\u0041-\u005A\u0061-\u007A\u00C0-\u00F6\u00F8-\u02AF\u1E00-\u1EFF\u2C60-\u2C7F ]+$/);

And there it is! You can try it out at the JavaScript Regular Expression Tester!

By the way, this expression should work on other languages as well.