Exploring programming in Thamil (not English) through Clojure

Or: A clear example of what macros can do

Introduction

I started working on a library called clj-thamil that I envision as a general-purpose library for Thamil language computing (ex: mobile & web input method), but a slight excursion in that work has led me to some very deep, intriguing ideas, some of which are technical and some of which are socio-cultural. But they all fit together in my mind: Clojure, macros, opportunity and diversity (in computing), and the non-English-speaking world.

I think that the implications are things that we should all think about. But if nothing else, hopefully you can read this account and understand something about macros: the kind of power they uniquely provide, and at least one good use case where they are necessary.

Technical Aspects

How does one even begin??

I tried starting on this Thamil language project a year ago, but I immediately shelved it and left it alone for a large majority of that time. Why? I couldn’t find an editor that would support programming and typing of Thamil characters properly at the same time.

(FYI: The standard spelling is Tamil since British colonial times, but it is pronounced “Thamil”.)

I’m using Mac OS X, which has supported Unicode well. Thamil, like other South & Southeast Asian languages, is set up in Unicode so that most of its letters [human language elements] require two or more characters [computer memory storage type]. Character is not synonymous with letter. For example, the letter கி in Unicode is the combination of the characters க + ி. But the rendering of the computer character ி on its own is not an actual letter in the Thamil language. Also, both characters have to be treated together as a unit, not only by the OS but also by the applications that render the text (basically, the whole stack between storage and user interface), for க + ி to be recognized as being side-by-side and converted into a different shape, கி. Many Mac-native applications and editors like TextEdit handle the character-combo letters by default. Many programming-specific editors are cross-platform and/or non-native, so even their ports to Mac OS X don’t use the OS support required for proper rendering. None of Emacs for OS X, Eclipse, IntelliJ, jEdit, or a couple of other programming text editors “worked”, that is, got OS support to combine characters. I basically gave up, but 9 months later, I tried Aquamacs on a whim, and it worked!
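
The letter-versus-character distinction can be checked at a REPL. This is a minimal sketch, assuming a JVM Clojure REPL; the name `letter` is only illustrative, and the string is written with Unicode escapes so that it does not depend on how your editor renders it:

```clojure
;; The single Thamil letter கி, spelled out as its two Unicode characters:
;; U+0B95 (க) followed by U+0BBF (the vowel sign ி).
(def letter "\u0B95\u0BBF")

;; One letter on screen, but two characters in storage.
(count letter)        ; => 2

;; The numeric code point of each stored character.
(mapv int letter)     ; => [2965 3007]
```

The rendering stack has to combine those two stored characters into one shape, which is the step that most of the editors I tried did not perform.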

Java, Unicode, and Clojure

Java was designed to support Unicode from the beginning. By that, they mean that instead of a character being an 8-bit ASCII element, characters in Java are 16 bits wide, as defined by the original Unicode spec. Since Clojure emits bytecode that runs on the JVM, it also supports Unicode by default. What that means is that you can use symbols (‘variable names’) whose characters come from ranges designated for other languages without problems. So the following works fine:

(def π 3.14159)

From functions to macros

Clojure, like any language that supports a functional programming paradigm, has functions as first-class values. The interesting part is that first-class values can be stored: we can take any function and create a new binding (associating a value with a new ‘name’) whose value is the original function.

(def ιитєяρσѕє interpose)
(ιитєяρσѕє "," ["one", "two", "three"])

So now we can ‘translate’ function names, even if superficially. So how far can we go? The core library of Clojure operations comes from special forms, functions, and macros. Special forms and macros can’t resolve to values, though, so we can’t rebind them the way we just did with interpose. But if we can find a different way to “translate” them, we can use that technique to pull off a fairly extensive translation of Clojure from English to an entirely different human language (aka “natural language”).
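
To see the difference concretely, here is a small sketch. The helper `compiles?` and the names `my-interpose`, `my-when`, and `my-if` are hypothetical, made up for this illustration; `eval` is used only so that the failures can be caught instead of aborting the whole file:

```clojure
(defn compiles?
  "Hypothetical helper: returns true if `form` evaluates without error."
  [form]
  (try
    (eval form)
    true
    (catch Throwable _ false)))

;; A function is a value, so it can be bound to a new name.
(compiles? '(def my-interpose interpose))   ; => true

;; A macro is not a value, so the compiler rejects this.
(compiles? '(def my-when when))             ; => false

;; A special form does not resolve to a value either.
(compiles? '(def my-if if))                 ; => false
```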

What are macros?

In essence, a macro is a special type of function where the input is some block of code, and the output is a block of code. As a result, macros are run in a special way: they are run on the code blocks before the contents of those code blocks get evaluated.

This enables macros to abstract out code repetition in ways that regular functions can’t. Basically, if you see code repetition that can’t be removed by better code design or by refactoring the repetitious code into a new function, then a macro will be your answer. My favorite example is the with-open macro, which gracefully handles the try/finally cleanup for I/O objects with minimal code. doto and its ‘fancier’ cousins, the threading macros -> (“thread first”) and ->> (“thread last”), are also good examples.
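
As a rough illustration of “code in, code out”, here is a tiny macro. The name `unless` is hypothetical (clojure.core already offers when-not for real use); the point is only to show that the macro receives its arguments as unevaluated code:

```clojure
(defmacro unless
  "Hypothetical example: run `body` only when `condition` is falsey."
  [condition & body]
  `(if ~condition nil (do ~@body)))

;; macroexpand-1 shows the code that the macro produces, without running it.
(macroexpand-1 '(unless false (println "hi")))
;; => (if false nil (do (println "hi")))

;; The body is never evaluated here, so the exception is never thrown.
;; A regular function could not do this, because its arguments are
;; evaluated before the function is called.
(unless true (throw (Exception. "never runs")))
;; => nil
```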

Translating macros and special forms using macros

Macros operate on code at a ‘higher’ level than your regular function: we’re looking at the input code blocks to the macro as a bunch of shapes that we manipulate using the macro. Basically, we’re looking at the text of the code and treating it as data to operate on, before we take the result and then evaluate it like regular code.

So at this level on which macros operate, we can do the following to pull off our ‘translation’ idea for the special forms and macros: create a macro that takes whatever was given to it and pass it along verbatim to some other special form/macro.

As an example, take the Thamil word for ‘if’, which is ‘எனில்’. I want to create a macro where whatever I pass to ‘எனில்’ gets passed verbatim to ‘if’. That is, whatever follows ‘எனில்’ in the parentheses, (எனில் . .. …), instead gets passed to ‘if’ in its own parentheses, (if . .. …). And it turns out to be simple:

(defmacro எனில்
  [& body]
  `(if ~@body))

The above macro definition says to take the code block(s) passed to ‘எனில்’, package them up into a sequence of shapes of code called ‘body’, and then unwrap the code shapes into a call to ‘if’.
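
You can check the pass-through behavior with macroexpand-1. The sketch below repeats the macro definition so that it runs on its own; the test expression (pos? 1) and the keywords :yes and :no are just placeholders that I picked for the illustration:

```clojure
(defmacro எனில்
  [& body]
  `(if ~@body))

;; The expansion is an ordinary call to `if` with the same arguments.
(macroexpand-1 '(எனில் (pos? 1) :yes :no))
;; => (if (pos? 1) :yes :no)

;; And evaluating the Thamil form behaves exactly like `if`.
(எனில் (pos? 1) :yes :no)
;; => :yes
```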

So we’re done! Right? All we have to do is just list out all of the functions, macros, and special forms to translate in this manner, and we will be done:

(def எடு take)
(def விடு drop)
...
(defmacro எனில்
  [& body]
  `(if ~@body))
(defmacro வரையறு
  [& body]
  `(def ~@body))
...

Macros, macros, everywhere!

That seems tedious. Inefficient. There is a lot of repetitive code here (the “def”, the “defmacro”, the shape of the defmacro definition, etc.). And we can’t really write a function to refactor out the repetitive code. But I just said that this is the kind of case that a macro can solve.

Once you strip out the repetitive code, all you are left with is:

take எடு
drop விடு
...
if எனில்
def வரையறு
...

This looks like a couple of maps, which makes sense. We’re associating an English word with a corresponding Thamil one. We need to represent the words as symbols so that the Thamil words don’t get evaluated. Putting a single quote (') in front of a word converts it into its symbol form:

{'take 'எடு
 'drop 'விடு
 ...}
{'if 'எனில்
 'def 'வரையறு
 ...}

I’ll fast-forward through the details and say that you can see the final macros that take a map of symbols (symbol of the English name mapping to the symbol of the Thamil name). And you can see the progression of steps that it took to get there in the linked slides.
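
To give a feel for the shape of such macros without walking through every step, here is a minimal sketch. It is not the actual clj-thamil code: the names `translate-fns` and `translate-forms` are hypothetical, and for simplicity the map is passed as a literal of bare (unquoted) symbols, since a macro receives its argument unevaluated anyway.

```clojure
(defmacro translate-fns
  "Hypothetical helper: for each [english thamil] pair in the map,
  emit (def thamil english). Only works when `english` names a
  function, because a function is a value."
  [sym-map]
  `(do ~@(for [[en ta] sym-map]
           `(def ~ta ~en))))

(defmacro translate-forms
  "Hypothetical helper: for each [english thamil] pair in the map,
  emit a pass-through macro named `thamil` that hands its arguments
  verbatim to `english` (a macro or special form)."
  [sym-map]
  `(do ~@(for [[en ta] sym-map]
           `(defmacro ~ta [& body#]
              (cons '~en body#)))))

(translate-fns {take எடு
                drop விடு})

(translate-forms {if  எனில்
                  def வரையறு})

;; Everything below uses only the translated names. The var name and
;; the two keywords are illustrative choices for this example.
(வரையறு எண்கள் [1 2 3 4 5])

(எனில் (= [1 2] (எடு 2 எண்கள்))
  :சரி
  :தவறு)
;; => :சரி
```

With helpers of this shape, adding another translated name means adding one more pair to a map, and nothing else.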

The final results — programming in Thamil

And here is a namespace of functions that are written in Thamil that do basic natural language operations (pluralizing a noun, adding noun case suffixes). The pluralizing function looks like this:

(வரையறு-செயல்கூறு பன்மை
  "ஒரு சொல்லை அதன் பன்மை வடிவத்தில் ஆக்குதல்
  takes a word and pluralizes it"
  [சொல்]
  (வைத்துக்கொள் [எழுத்துகள் (சரம்->எழுத்துகள் சொல்)]
    (பொறுத்து

     ;; (fmt/seq-prefix? (புரட்டு சொல்) (புரட்டு "கள்"))
     (பின்னொட்டா? சொல் "கள்")
     சொல்

     (= "ம்" (கடைசி எழுத்துகள்))
     (செயல்படுத்து சரம் (தொடு (கடைசியின்றி எழுத்துகள்) ["ங்கள்"]))

     (மற்றும் (= 1 (எண்ணு எழுத்துகள்))
            (நெடிலா? சொல்))
     (சரம் சொல் "க்கள்")

     (மற்றும் (= 2 (எண்ணு எழுத்துகள்))
            (ஒவ்வொன்றுமா? அடையாளம் (விவரி குறிலா? எழுத்துகள்)))
     (சரம் சொல் "க்கள்")

     (மற்றும் (= 2 (எண்ணு எழுத்துகள்))
            (குறிலா? (முதல் எழுத்துகள்))
            (= "ல்" (இரண்டாம் எழுத்துகள்)))
     (சரம் (முதல் எழுத்துகள்) "ற்கள்")

     (மற்றும் (= 2 (எண்ணு எழுத்துகள்))
            (குறிலா? (முதல் எழுத்துகள்))
            (= "ள்" (இரண்டாம் எழுத்துகள்)))
     (சரம் (முதல் எழுத்துகள்) "ட்கள்")

     :அன்றி
     (சரம் சொல் "கள்"))))

Commentary on macros and state

Because macros aren’t values like numbers, strings, and functions are, you can’t compose them. Once you use a macro, you might end up having to use more macros around it (ex: you can’t pass it around to existing higher-order functions). Our use case is an example of that. So use macros sparingly, as a last resort. Prefer using functions — they compose and can be passed as arguments to other functions. This is why I have a separate macro for translating function names, even though the macro for translating the names of macros and special forms alone is sufficient.
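
A short sketch of the difference, assuming the எடு alias for take from earlier (redefined here so the snippet runs on its own):

```clojure
;; A translated function is still a value, so it works with
;; higher-order functions such as partial and map.
(def எடு take)

(map (partial எடு 2) [[1 2 3] [4 5 6]])
;; => ((1 2) (4 5))

;; `and` is a macro, so (map and ...) does not compile. To pass it
;; around, it has to be wrapped in a function first.
(map (fn [a b] (and a b)) [true false] [true true])
;; => (true false)
```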

While the benefit of only needing a map of symbols can be viewed as simplicity or elegance, it is the result of an instinct about programming in general imparted by Clojure’s design to try to isolate state and operate on it with a toolset of composable functions. It’s a mindset that keeps paying dividends.

Technical implications

Since the only Thamil-specific information required to effect the “translation” is stored in just 2 maps, does this mean that we can use the same strategy for any other language? Sure! Why not? As far as Java is concerned, all of the characters it sees when it parses code are 16-bit Unicode characters/codepoints. It doesn’t know which range the codepoints fall in, or even how they have to be handled by the OS and applications to appear properly. So, nothing is Thamil-specific.

Also, it’s important enough to be worth pointing out, even if it is obvious to you, that none of the macro code here required modifying Clojure as a language, or the Clojure parser or compiler. This is all “user-level” code. And yet, we’ve created what is truly an entirely new programming language. I can create code that is entirely in Thamil without knowing that Clojure / Lisp exists underneath. Cascalog is another favorite example of mine of what creating a new language on top of Lisp looks like that is written using “user-level” Lisp code, even though it doesn’t quite syntactically resemble the core Clojure / Lisp that it is based on. The power to shape your language to suit your needs, even if it starts looking like another language, is the power that macros give you. And this is why Paul Graham’s book about Lisp is called On Lisp — the title emphasizes that Lisp lets you write new languages on top of Lisp.

Technical gaps and future possibilities

The method for translation is not a true translation, as you can tell. It’s cosmetic. So there are a few places where our abstraction fails:

  • Clojure is based on Java (it runs on the Java Virtual Machine).
    Since Java is written entirely in English, any Java interop from
    Clojure will require English. Also, stack traces and error messages
    will all be in English.
  • The translation of functions is done by assigning existing Clojure
    functions to Thamil symbols because functions are values. This means
    evaluating the value of a Thamil symbol referring to a function
    will use the name of its value, that is, the (English) name of the
    Clojure function.
  • The namespace bootstrapping problem — in order to use Thamil names
    in a namespace, you need to ‘import’ (require, in Clojure parlance)
    the namespace that contains the translations (here, clj-thamil.core).
    But until those translations are imported (‘required’), they aren’t available, so
    the require statement has to be in English. If namespace
    sounds like a weird concept, think of it like a package, module, or
    file.
  • Things like literals (true/false, special keywords in Clojure macros
    like :as, :refer, :keys) would have to be translated at read time.
    Numbers represented in other languages’ numerals would need their
    own logic to interpret. The boolean values true and false are tricky
    since they represent the Java values, so if they are returned by a
    Clojure function, how could you change that behavior? Change the
    Clojure function to return a different, equivalent value? Then
    create your own implementations of translations of true?, false?,
    nil?, and if to use your new booleans (and redefine if to point to
    your translated if)? At which point, you would need to re-evaluate
    all of the functions/macros that use if (ex: when, if-not, and)
    before re-evaluating your translations.
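
The function-name gap is easy to observe, and the same experiment shows a related gap: the docstring lives on the var, not on the function, so a plain def does not carry it over. The sketch below copies metadata by hand with alter-meta!; that is just one possible approach that I am assuming for illustration, not something clj-thamil is confirmed to do:

```clojure
;; Alias a core function under a Thamil name, as in the text.
(def எடு take)

;; The alias and the original are the very same function object ...
(identical? எடு take)                                  ; => true

;; ... so its printed form still carries the English class name.
(boolean (re-find #"clojure\.core\$take" (str எடு)))   ; => true

;; The new var did not inherit the docstring.
(:doc (meta #'எடு))                                    ; => nil

;; One possible fix: copy selected metadata over by hand.
(alter-meta! #'எடு merge (select-keys (meta #'take) [:doc :arglists]))
(string? (:doc (meta #'எடு)))                          ; => true
```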

Some of these issues might be solved by modifying the Clojure reader, which some projects already do. Another idea is to localize the source code for Clojure itself somehow. I would consider exploring how far the first idea can take you. The second approach seems like it would be near-comprehensive, but also a lot of difficult work that risks obsolescence when the language changes. Fortunately, Clojure as a language is “stable” as I see it: the design is carefully thought out and controlled in a consistent and cohesive way. Changes are usually additions to the language or implementation details, making most code forward-compatible (including all of the code used here).

Social and Cultural Importance

There are a lot of implications of creating the ability to program in another human language which, I think, in the balance is a net positive for the world. The most obvious point is that English is not the primary language for most of the world.

For all the kids in the non-English speaking world, especially the ones in non-Western / non-developed countries, learning to program means having to learn and think in a second language in order to learn programming and write code in a programming language. Even in a place like Southern India, which is a hotspot for programming work, this creates a challenge for kids who do not enjoy the privilege of access to good English education but who still want to program (and get lucrative jobs). The divide is clear; even the state government of Tamil Nadu, where Thamil speakers live, which also creates the Thamil language textbooks and distributes them for free to all grade school students, uses screenshots of the default English interface of basic computer software as part of its computer/technology books (at least when I last checked). Of course, the hands-on classes would be more of the same. Students who aren’t fluent in English by their teenage years manage by memorizing which clicks of which icons and UI elements do what they need. The presence of an error dialog box may tell you that something is wrong, but being able to read the text of the error message, comprehend it (along with the jargon), and take actions accordingly is a different task altogether.

The task of learning programming is hard enough. It is a technical area that requires learning a separate vocabulary. It is an abstract subject that is not necessarily easy to explain. Having programming in someone’s first language allows that person to deal with only the concepts of programming when learning it. And through different human languages, we may open up different approaches to programming than what we get through just English. What does it mean to write code that mimics human language when your language isn’t subject-verb-object (SVO), but instead subject-object-verb (SOV), as is most common in the world’s languages? Does OOP make more intuitive sense to people who speak SOV languages? What about Clojure/Lisp? In my limited experience of programming in Thamil in Clojure, it feels pretty similar. Human languages that start with a verb are rare, so in one way, you could say that Lisp is equally strange to most people. But the fact that there is less syntax to learn, that the rules of the language are few and simple, and that the code that you write fits the contours of the problem domain that you’re trying to solve, all contribute to the experience of Clojure in Thamil being similar to Clojure in English.

The tech industry, as epitomized by Silicon Valley, has recently been contemplating its lack of diversity: an overwhelming number of people in leading companies and startups are white and/or male. I’m happy to see the small, growing, wonderful efforts to address the inequity in various programming circles. But the clj-thamil project has helped me take a step back and think about not only the programmers who lack the privileges of others in an American context, but also those who do not enjoy such privileges in a global context: their language, their region’s wealth, and their personal wealth.

The privileges that we enjoy in the English-speaking world should not enable us to rationalize away these differences and privileges, though. Some people might think that, perhaps, the world would be better off if everyone were to speak one language. But suppose we did. Which language would be that one language? Chinese, because it is spoken in the most populous country? Or English, because it is spoken by more people and in disproportionately wealthy countries (a legacy of unfair colonial conquests)? Esperanto, an artificial language that inherits many aspects from European languages? There is no way to decide which language is universal without establishing more inequity. Also, selecting a universal language would erase cultural and geographic knowledge (and diversity in lifestyles!). And barring these concerns, if there were magically some agreeable universal language, and given a medium that could globally connect the world instantly (ex: the internet), that universal language would still fragment along geographic and socio-economic lines because humans naturally maintain differences to mark these distinctions.

Along the lines of what Bret Victor said in Inventing on Principle, I hope that we can properly enable programming for all the people around the world in the language that they think most easily in, since that would be a form of expression that we are opening up and allowing to flourish.

  • http://www.ezhillang.org/ Write Code in Tamil!

    Elango, wonderful blog post on your efforts in programming in Tamil. I will first congratulate you on the creation of this macro framework, and building a library on top of it as proof of concept.

    Lisp also has the affordance of being grammatically more similar to Tamil than to English in structure, i.e. predicate-expression-based structures. CLISP and JavaScript also offer non-ASCII identifiers and can achieve somewhat the same effect as the macros you have pointed out.

    My efforts have been in Ezhil, http://ezhillang.org and Open-Tamil among other things. The socio-cultural need to open up world of programming to vernacular speakers has continued to haunt technology penetration.

    • http://www.elangocheran.com Elango

      Thanks! A JavaScript library would be great. I’ve definitely been thinking about that for a while, given that there’s ClojureScript, but haven’t looked into it much so far. There is still a lot of work to be done to bring Tamil into technology.

  • http://www.shanth.tk/ Shanthakumar

    *sigh*
    “Thamizh” -> “Thamil” -> “Tamil”

  • Mark Engelberg

    It doesn’t look like this approach will capture the doc-string and other metadata for the functions, because that info is attached to the var, not the function itself. With a little more work on the macro, you should be able to move over the metadata as well (see the potemkin library for examples).

    • http://www.elangocheran.com/ Elango Cheran

      Thanks for the info, it looks useful. I’ll try to work that in.