An ag(e)ing hacker, Luca Saiu's blog
2013-08-23 12:54 (last update: 2023-09-29 17:57) Luca Saiu

A practical GNU epsilon tutorial

Several people wrote me in private about epsilon, requesting a tutorial or at least some practical information about how to run the thing. I’m very happy about the feedback I received, from people I already knew and from others who had just discovered my work on the net. I hadn’t anticipated it: thanks. The insterest I saw motivated me to write this informal, hands-on introduction.

Of course I’ll also prepare reference documentation, but not at this time: I won’t actually start working on a proper manual until the language reaches a reasonably stable state. However I plan to adapt this post into a chapter to be kept up to date and added to the “official” manual, in a first part named Tutorial Introduction.

I’m very interested in your feedback about my presentation style: please write to epsilon-devel or to me personally if you have constructive suggestions.

In this post color is significant: unless you’re already an epsilon expert you will follow much more easily using a graphic web browser supporting CSS. You should see this text in red, this other text in green and the latest one in yellow. This posts also uses Unicode subscripts and, since I’m already requiring all of this anyway, a couple Greek letters as well. I apologize to the users of w3m, links and lynx: it’s for this one time only.

Audience

This post doesn’t belong to the series of short articles I plan to write “for myself” about current developments in epsilon; the idea of this introduction is just to show interested people how to start, so that they can explore the system themselves.

The current implementation can be played with, but is far from friendly. In its current state I can only recommend it in good conscience to experienced programmers who already know some Lisp dialect (Scheme, Common Lisp or Emacs Lisp), and can competently use a GNU/Linux system.

Lexical conventions

The name “ε” comes from the Mathematical Analysis naming convention for small variables. When I’m limited to traditional character sets or if I feel less smug I write it in Latin characters as “epsilon”; in any case, to convey the idea of the language smallness, I always write “epsilon” lowercase, even at the beginning of a sentence. The names of the core language ε₀ and the high-level personality ε₁ can be written in Latin characters as epsilon0 and epsilon1 (again, always lowercase) or abbreviated into e0 and e1 when part of program identifiers.

Rationale and introduction

A sentence by Chuck Moore resonates with me: current software is shameful1. It feels strangely right for being such an extreme idea.

My ideal environment is different from Mr Moore’s, but I share his opinion about most software, which is indeed bloated and overcomplicated. More importantly, we’re developing things the wrong way: our tools are at the same time too complex, not powerful enough, and not suitable to formal automatic reasoning.

In my vision a programming language should be extensible, so that the user may bind it to her needs and to the problem. The language should be built upon a very small and simple core using powerful syntactic extension capabilities, such as macros. This way, a program written using extensions is automatically rewritten into the core language, to be then executed or compiled. A program should be able to inspect its own state, including procedures and globals, and self-modify.

With some important philosophical differences explained in my PhD thesis introduction, the idea is accomplishing the vision expressed by Guy Steele in his marvelous 1998 talk Growing a Language, which hasn’t materialized yet out of the Lisp world. You can read a transcription at the address https://www.cs.virginia.edu/~evans/cs655/readings/steele.pdf, but I highly recommend the video recording of the original talk. Download it, and watch it as soon as you have an hour to spare:

youtube-dl www.youtube.com/watch?v=_ahvzDzKdB0

The ε core language, an imperative first-order language, is called ε₀. It’s inconvenient for humans to use for serious problems, yet simple and potentially very efficient. Using ε₀ I’ve built ε₁, a higher-level language with a Lispy feel, but untyped — not dynamically typed like Lisp: untyped, like Forth and machine language.

ε₁ is defined with macros and expression-to-expression transformations, in a way such that any piece of ε₁ code is automatically rewritten into ε₀ before execution. This strategy simplifies the system a lot, since it only needs an interpreter or compiler for ε₀: everything else, including the ε₁ extension machinery, can be run as ε₀ code — it doesn’t matter if machine-generated.

ε₁ is usable as a programming language, if a little unforgiving to beginners. Within the architecture of epsilon, ε₁ can be thought of as one personality2 built on top of the core language, useful to build more high-level personalities: the user is free to radically change the language and build something potentially very different, for example including static or dynamic typing, object orientation, or continuations. If some user dislikes ε₁, she’s free to change it into something else which rewrites into ε₀: Lispy features such as s-expressions are not hardwired in ε₀, and can be replaced with something else as well if a user so prefers.

ε₀ and ε₁ programs can self-modify, creating or removing procedures and global variables at run time. This adds power, but makes the language difficult to reason about and compile in the general case; however in the particular case of a program eventually reaching a “static” (which is to say, not self-modifying) state, we can use ordinary compilation techniques to translate the final static ε₀ program into native code, and recover efficiency — I’ll write about my compilation hacks in some forthcoming post.

In a style similar to compilation, a program can unexec to a file, to be later loaded and interpreted, possibly with a different runtime library or on another machine. The unexec operation entails freezing the current dynamic state of the program (including procedures), to be resumed later.

Thanks to reflection, unexecing and compiling don’t need to be special language features, but are instead ordinary procedures accessing the currently active global definitions as data structures. This is part of a general pattern: what in other languages is hardwired, in ε becomes defined by the user and changeable: in a sense I chose to move complexity from the language definition, which is to say Mathematics, to much more flexible code.

I call ε the whole system or set of languages/layers, starting from ε₀ and ε₁, up to the higher-level personalities (not yet existing at this point) on top of them.

ε is radical and somewhat subversive: since mainstream languages aren’t up to the task and aren’t getting better, I propose an open-ended system which the user can grow in any direction. Don’t worry about compatibility: dialect proliferation is good. When the right ideas emerge we can think of standardizing a personality, or some subset of it3 — but not now. First we need language experimentation, again: the software crisis isn’t anywhere near solved. Let the people play.

My PhD thesis

I’ve been pondering about programming languages for a long time, and epsilon has actually had several implementations already4, each less naïve and more ambitious than the previous. The current rewrite has also been the topic of my recently completed PhD thesis. My thesis describes the core language in close detail, and gives a pretty good overview of extension mechanisms. Of course a thesis is not software documentation: it contains a formal mathematical description of the language and its properties, and as such the bulk of it isn’t conceived for the end user.

At the beginning I was thinking to make my thesis very accessible: in my original intentions a formally-minded programmer would’ve been able to follow the treatment, without any specific prerequisites. That attempt ultimately failed: describing epsilon requires some mathematical sophistication, and since the design contains non-standard choices I’ve developed a non-standard mathematical notation, which many people find heavyweight.
I like another phrase which I’ve heard said by some Forther, I think Greg Bailey or Jeff Fox if not Chuck Moore himself: simple but not trivial. And that can certainly be said of ε: it’s simple — very simple — but still not trivial. Despite the inherent non-triviality of the matter, some chapters of my thesis remain pretty accessible to motivated programmers, particularly Chapter 1, the commentary part of Chapter 2, the whole Chapter 3 and most of Chapter 5.

The text is available in color for on-screen reading at the address http://ageinghacker.net/publications/luca-saiu--phd-thesis-color.pdf, or black-and-white, for printing, at http://ageinghacker.net/publications/luca-saiu--phd-thesis-bw.pdf.

I spent a long time working on this document, and I’d be happy if others could get something out of it. Have a look.

Implementation, and the relation beteen ε₀ and ε₁

I implemented ε in itself; in particular, I used ε₀ to implement an ε₀ interpreter with its reflective data structures (what these structures are will become clear below), and syntactic extension mechanisms: macros and transforms. This first part of the implementation was painful to write since ε₀ is by design a weak language, without much abstraction power.

Then of course I built ε₁ using ε₀ with macros and transforms: and each new syntactic form I added became immediately available to define another one. In a somewhat arbitrary fashion, I called “ε₁” the final state of this extension process, blessing it as a personality.
ε₁ contains all of ε₀ plus syntactic extension mechanisms, and a library of syntactic extensions including variadic syntax, blocks, closures, imperative loops, sum types, pattern matching, futures, unexec, and so on — none of these features is present in ε₀, and the distance between ε₀ and ε₁ feels very wide.

The idea of building ε₁ in ε₀ is not only aesthetically pleasing; it is a way of associating a formal specification to ε₁. Since I gave a formal mathematical specification of ε₀ (in my thesis), everything defined in it inherits its rigorously formal nature; but, again, ε₁ is defined with code rather than logic rules and equations.

Just to complete the picture: the runtime is of course written in C. In particular, I have two garbage collectors written by me, one of which parallel, not yet integrated. The code generated by the native compilers I’m working on (compilers are written in ε₁, of course) also require a little assembly code. Right now I support MIPS and x86_64; more backends will come including bytecode for a virtual machine, as a fallback case.

The bootstrap problem

How did I implement ε₀, to run it the first time? I used GNU Guile (http://www.gnu.org/software/guile) for bootstrapping, extended with a little C to implement ε data as a SMOB5, yielding what I call guile+whatever. The implementation still depends on Guile at the present time, but only for relatively minor functionality such as s-expression parsing and printing. I’ll have to re-implement this functionality in ε itself, remove Guile, and then the system will be completely self-hosting.

Don’t get me wrong: I love Guile; I really do. It’s an awesome system, well documented, with a good library, and getting faster. I also like the people working on it. But I need to remove Guile as a dependency: ε is not Scheme (ε₀ is much more minimal than Scheme), and it should stand on its own. In particular, ε can be used as a very low level language if the user prefers so, and I plan to support very small systems6 as compilation targets.

Setup

I assume you’re running a GNU/Linux system. I develop and regularly test on little-endian MIPS and x86_64; however any other GNU system should work as well, including GNU/Hurd. I suppose the software also works on BSD systems, with few or no changes.

You need (including development packages, if you don’t compile dependencies yourself):

  • GNU Guile, version 1.8.x or 2.0.x (Guile 1.8 is faster with epsilon, because of my ugly macro hacks);
  • Bohem’s garbage collector: http://www.hpl.hp.com/personal/Hans_Boehm/gc/;
  • GNU libunistring;
  • GNU Bison;
  • flex;
  • GNU autoconf and automake;
  • GNU Texinfo (but you don’t need TeX);
  • A reasonable compilation environment: GCC, make, the GNU libc headers.
  • bzr, for getting the sources

Get a copy of the epsilon trunk from the bzr repository on Savannah (http://savannah.gnu.org/bzr/?group=epsilon):

bzr branch bzr://bzr.savannah.gnu.org/epsilon/trunk epsilon-trunk
cd epsilon-trunk

Generate the configuration machinery, configure and compile:

./autogen.sh && ./configure && make && echo SUCCESS

If you have all the dependencies listed above everything should work, and after a little while you should see the SUCCESS message. You don’t need to install.

As this is the first time you use the system, you have to bootstrap it from Guile. Enter the bootstrap source directory (which, at the current time, actually contains much more than what’s needed for bootstrap; yes, I should rename it).

cd bootstrap/scheme

Enter guile+whatever. If you’re using Guile 1.8.x, type:

../../bin/guile+whatever

If instead you’re using Guile 2.0.x, type:

../../bin/guile+whatever --no-auto-compile

You need the option --no-auto-compile because of a couple dirty kludges I did with Guile macros, which prevent bytecode compilation. This is not Guile’s fault: Guile 2 is great, and also a tremendous improvement over Guile 1, but unfortunately I’ve abused the language. It’s not worth fixing that part: you’ll never need to touch it, and I’ll drop the Guile dependency anyway.

At guile+whatever’s prompt, type:

(load "bootstrap.scm")

This will take a while: the system builds itself from a temporary ε₀ implementation written with Scheme macros, before unexecing. You don’t need to understand the details.

If there are no error messages, you can test the bootstrapped system. Exit guile+whatever by pressing Ctrl+D, re-enter guile+whatever like above (don’t forget the --no-auto-compile option on Guile 2.0.x), and at the prompt type:

(load "quick-start.scm")

This should be much faster.

You won’t need to bootstrap again unless you modify some file under bootstrap/scheme/. So from the next time, if the sources haven’t changed, you can simply come back to bootstrap/scheme/, run guile+whatever (with --no-auto-compile if needed) and execute (load "quick-start.scm") from the prompt.

Writing more comfortably, from guile+whatever and Emacs

It’s nearly always a good idea to enable Guile’s readline support. Execute these lines at guile+whatever’s prompt:

(use-modules (ice-9 readline))
(activate-readline)

If you add the two lines to your ~/.guile, they will be executed automatically when you enter Guile, and guile+whatever as well.

I strongly recommend the Emacs major mode for epsilon (actually ε₁) which I derived from Scheme mode. I find it already useful for indentation and font locking, even if at this stage it can’t interact with external processes yet: you’ll have to explicitly kill&yank from the editor to the guile+whatever REPL.
Visit epsilon-trunk/emacs/epsilon.el and do M-x eval-buffer. You can toggle the major mode when you’re visiting an epsilon file with M-x epsilon-mode. If you like ParEdit (http://www.emacswiki.org/emacs/ParEdit) for editing s-expressions you can enable that as well.

A screenshot of Emacs with epsilon mode

Visit epsilon-trunk/bootstrap/scheme/core.e and epsilon-trunk/bootstrap/scheme/epsilon1.scm: they will be useful to keep around for reference while playing with predefined procedure.

Basics of ε₁

We’re now goning to play with ε₁. As part of this first introduction we’ll only show already existing forms, without discussing how the user can define her own macros and transforms.
Notice, however, that the symbol names starting with the conventional prefix ‘e1:’ are all defined as macros (in some cases also relying on transforms): essentially all forms occurring in user expressions don’t come predefined in ε₀.

You’ll need ε₁ also in the future to do more complicated things such as defining new extensions, and possibly even to write another personality to replace ε₁ itself: using ε₀ only is way too cumbersome in practice, and I won’t cover ε₀ at all here.

guile+whatever is an extended Guile, so Scheme is still available:

(+ 1 2)
⇒ 3

Scheme has been very useful for me during the initial implementation, where I had to carefully mix languages; but it’s more of a source of confusion for users at this point. You’ll have to live with the nuisance of having two different interpreters on the same REPL, for the time being.

The stuff values are made of: fixnums, pointers, buffers

We want to distinguish our data and operations from Scheme’s predefined versions, which are incompatible with ours. By convention, we name our global identifiers with prefixes associated to informal “name spaces”: for example our sum operation over fixnums (integers small enough in modulo to fit within a machine register, possibly minus a few reserved bits) is fixnum:+, different from Guile’s + which operates on Guile’s s-expressions. At the current time we also need7 to wrap toplevel expressions within the e1:toplevel Guile macro, to tell the system that what’s contained should be evaluated in ε₁, not Guile. Again, remember that this is just a convention I adopt: I’ve simply decided to use the character ‘:’ to delimit prefixes, even if ‘:’ is not special in any way and can be used anywhere within symbol names.

I won’t waste time describing ε₁’s syntax: all ε₁ forms are encoded as s-expressions, like in Lisp, and in fact ε₁’s syntax is very similar to Lisp’s: if you can read some Lisp dialect you can read ε₁, even if it’s not compatible with any particular Lisp dialect. As for semantics, ε₁ evaluation is call-by-value, left-to-right; tail calls don’t consume stack space.

3
⇒ 3 ;; no e1:toplevel: again this was Scheme, not ε₁
(e1:toplevel 3)
⇒ 3
(e1:toplevel (fixnum:+ 1 2))
⇒ 3

The REPL prints the result in color, to make it clear that it’s an ε value. Unboxed objects are green, boxed objects red (or yellow when they have already been printed as part of the same data structure). Notice that all unboxed objects are shown as fixnums.

[To do: I think this should be moved]

(e1:toplevel (e1:begin ;; like begin in Scheme and progn in other Lisps
               1              ;; first run this, ignoring result(s)…
               (fixnum:+ 2 3) ;; …then this, again ignoring result(s)…
               8))            ;; …till the last form: return its result(s)
⇒ 8

Let’s try cons:make, a procedure allocating two-element buffers:

(e1:toplevel (cons:make 7 9))
⇒ 0x2af7350[7 9] ;; a pair, also called a cons
(e1:toplevel (cons:make 7 9))
⇒ 0x3636fe0[7 9] ;; another pair (different pointer)
(e1:toplevel (cons:make 1 (cons:make 10 20)))
⇒ 0x306b240[1 0x306b200[10 20]] ;; a pair with another pair inside

The red or yellow hexadecimal numbers represent pointers, and of course their specific values may vary from machine to machine and from execution to execution. The elements between brackets right after each “red” pointer represent the epsilon objects contained in the pointed buffer; such elements in their turn may be boxed or unboxed. In order to avoid potentially infinite printings, yellow pointers are printed without their elements, which are already known anyway. We’ll encounter a yellow pointer soon.

These objects printed in colors are the whatever part of guile+whatever, whatever meaning untyped: the implementation needs to represent “whatevers” differently from Guile’s predefined s-expressions, which carry tags at runtime; in guile+whatever, an epsilon datum is one of a wealth of s-expressions cases: in “pure” ε implementations such as the one I’ll get after removing the Guile dependency, there are ε data only. Later I’ll show two already existing examples of pure ε implementations, not depending on Guile. The idea is that the same program can be run with different runtime libraries, representing objects in a different way: some will be more efficient, others more forgiving to the programmer.

The form e1:define adds or replaces a toplevel definition: like in Scheme8, the same definition form works for both procedures and non-procedures, according to the shape of the first parameter:

  • (e1:define x expression1expressionm) defines x as a global, the result of expressionm, obtained after first evaluating the previous expressions for side effects.
  • (e1:define (x y1yn) expression1expressionm) defines x as a procedure taking the formals y1yn, and having the sequence expression1expressionm as body. The procedure has the same result(s)9 as the last expression in its body.

Definition themselves return zero results. Notice that global definitions are expressions: they can occur anywhere expressions can. Differently from Scheme’s define, ε₁’s e1:define always affects global bindings, even if it’s used within a deeply-nested expression.

As a special case for convenience you are allowed not to wrap a toplevel e1:define in guile+whatever within e1:toplevel; so for example (e1:toplevel (e1:define x 10)) can be also written more simply as (e1:define x 10).

So, let’s define a global variable:

(e1:define c (cons:make 100 200))
⇒ ;; zero results

c is now a pair holding two fixnums:

(e1:toplevel c)
⇒ 0x2b71c80[100 200]
(e1:toplevel c)
⇒ 0x2b71c80[100 200] ;; the same object, of course: same pointer

All boxed objects are mutable10. The procedure buffer:set! takes three parameters: the object to update, a 0-based field index, and the new value for the field.

ε follows the Scheme convention of using a ‘!’ suffix when naming procedures used for their side effects.

(e1:toplevel (buffer:set! c 0 75))
⇒ ;; zero results
(e1:toplevel c)
⇒ 0x2b71c80[75 200] ;; the pointer didn't change

With mutation it’s easy to build an object referring itself, which yields our first “yellow” pointer:

(e1:toplevel (buffer:set! c 1 c))
⇒ ;; zero results
(e1:toplevel c)
⇒ 0x2b71c80[75 0x2b71c80] ;; c is now cyclic

Of course buffer:set! isn’t limited to pairs. You can create a buffer of any size with the procedure buffer:make, which takes the number of elements as parameter, and update it:

(e1:define b (buffer:make 4)) ;; make a buffer of four cells;; zero results
(e1:toplevel b)
⇒ 0x149ae10[127 127 127 127]

guile+whatever currently fills every cell which haven’t been explicitly initialized with the 127 fixnum.

We can update the buffer now named b, and look up its elements with buffer:get:

(e1:toplevel (buffer:set! b 2 99)) ;; make the third element (indexed 2) be 99;; zero results
(e1:toplevel b)
⇒ 0x149ae10[127 127 99 127] ;; same pointer, updated element
(e1:toplevel (buffer:get b 0))
⇒ 127
(e1:toplevel (buffer:get b 2))
⇒ 99

Fixnums and buffers are the stuff ε values are made of: every datum in ε, (including even reflective objects such as expressions: see below) is in fact encoded using only fixnums and buffers.

You can write booleans as #t and #f, like in Scheme; but in fact any non-#f value is taken as true, and #f is just another notation for 0:

(e1:toplevel #f)
⇒ 0 ;; the false value #f is just the 0 fixnum
(e1:toplevel #t)
⇒ 1 ;; any non-0 value is "true", but using #t is cleaner

In ε₀ characters are just fixnums as well:

(e1:toplevel #\a) ;; lowercase 'a' character (Scheme notation)97 ;; Unicode code point, same as in ASCII

Error situations in ε₁

What happens if we go out of bounds in a buffer access?

(e1:toplevel (buffer:get b 50))
error→ out of bounds memory access

Nothing good happens, of course. Errors are not recoverable in ε₁: there is no exception or condition mechanism at this level. What you should do is to prevent error situations from ever happening.

(e1:toplevel (fixnum:/ 10 0)) ;; divide ten by zero
error→ division by zero

The runtime of guile+whatever is relatively friendly: it prints out an error message, then goes back to the REPL; but that is not necessarily the case for efficent runtimes: in the same situation a runtime conceived for speed may crash with a Segmentation Fault, give a wrong result, or silently affect the system state: for example if you’re writing out of the bounds of a buffer, an efficient runtime may silently corrupt some unrelated data structure.

Personalities at a level higher than ε₁ will provide error handing facilities, but here I made the choice to sacrifice any feature which may impact performance or minimality. The details become more apparent when looking at ε₀.

Slightly higher-level data structures: vectors, strings, boxes, tuples, records

Manipulating buffers using only buffer:make, buffer:get and buffer:set! can become tedious. That’s why I also provide convenient syntax to make initialized buffers of any fixed size, called tuples:

(e1:toplevel (tuple:make 10 b #t))
⇒ 0x21f0850[10 0x149ae10[127 127 99 127] 1] ;; a three-element buffer
(e1:toplevel (tuple:make 10 b #t 20 #f))
⇒ 0x21f82c0[10 0x149ae10[127 127 99 127] 1 20 0] ;; a five-element buffer
(e1:toplevel (tuple:make (fixnum:+ 2 2)
                         (fixnum:1+ 2))) ;; as 1+ in Lisp: the successor of 20x167f5f0[4 3] ;; arguments are evaluated

The result of evaluating a tuple is just an ordinary buffer: the result of tuple:make can’t be distinguished by a buffer made by buffer:make and filled by buffer:set! calls.

As another convenience feature for working on buffers, records provide a way of accessing fields by name in fixed-size buffers:

(e1:toplevel (record:define point x y))
⊣ Defining the procedure point...
⊣ Defining the procedure point-make-uninitialized...
⊣ Defining the procedure point-explode...
⊣ Defining the procedure point-explode-from-second-element...
⊣ Defining the procedure point-get-x...
⊣ Defining the procedure point-with-x...
⊣ Defining the procedure point-set-x!...
⊣ Defining the procedure point-get-y...
⊣ Defining the procedure point-with-y...
⊣ Defining the procedure point-set-y!...
⇒ ;; zero results

Records are a good example of this general way of using the system: when we called record:define to define the record type point, the effect was to automatically generate useful procedures to work on points. Such procedures aren’t special: they work on buffers and fixnums, and you could write them by hand as well; but having them automatically generated when defining a record is a good use of syntactic abstraction: I defined record:define (as a macro) once and for all, and from now you can use it as if it were primitive, thinking at a higher level:

(e1:define p (point 10 20))
⇒ ;; zero results
(e1:toplevel p)
⇒ 0x24c3830[10 20]
(e1:toplevel (point-explode p))
⇒ 10 ;; point-explode yields two results...20 ;; ...you don't know how to use them yet
(e1:toplevel (point-get-x p))
⇒ 10
(e1:toplevel (point-set-x! p (fixnum:+ (point-get-x p) 100)))
⇒ ;; zero results
(e1:toplevel p)
⇒ 0x24c3830[110 20]
(e1:toplevel (point-with-y p 57)) ;; make a copy of p with a different y0x268fa30[110 57] ;; different address
(e1:toplevel p)
⇒ 0x24c3830[110 20] ;; p didn't change

If you only use the procedures automatically generated by record:define to work on point data structures, you can ignore their internal representation — in this case just the relative order of x and y. However ε₁ never hides information, by design: the underlying representation is always visible and accessible by buffer:get and buffer:set!. Using these generic buffer accessors indiscriminately doesn’t sound like a good idea in general, but the system won’t stop you: you’re supposed to know what you’re doing.

Vectors are more or less what you would expect: collections of objects (any object, possibly with different shapes), which can be addressed by index:

(e1:define v (vector:make 10))
(e1:toplevel v)
⇒ 0x72ec30[10 127 127 127 127 127 127 127 127 127 127]

The underlying implementation is obvious: vectors are just buffers where the first element holds the number of “payload” elements — which is to say, the number of cells excluding the first cell itself.

vector:get and vector:set! work on payload indices, “skipping” the first element. Again, you could use buffer:get and buffer:set! with incremented indices instead, but that will only be worth the trouble if you’re desperate for optimization.

(e1:toplevel (vector:get v 0))
⇒ 127 ;; not 10: this is the first payload element
(e1:toplevel (vector:get v 2))
⇒ 127
(e1:toplevel (vector:set! v 2 324))
(e1:toplevel (vector:get v 2))
⇒ 324
(e1:toplevel v)
⇒ 0x72ec30[10 127 127 324 127 127 127 127 127 127 127]

As for generic buffers, there’s no guarantee that bounds are checked, in general. Because it’s a development tool guile+whatever actually checks and in case of failure prints a reasonable error message; but you shouldn’t expect even that from the efficient runtimes.

Since the length is stored within the data structure, you can read it back:

(e1:toplevel (vector:length v))
⇒ 10

You can concatenate vectors:

(e1:toplevel (vector:append v (vector:make 3)))
⇒ 0x72efb0[13 127 127 324 127 127 127 127 127 127 127 127 127 127]
(e1:define w (vector:make 2))
(e1:toplevel (vector:set! w 0 4))
(e1:toplevel w)
⇒ 0x70c240[2 4 127]
(e1:toplevel (vector:append w w w w)) ;; any number of parameters!0x6c9850[8 4 127 4 127 4 127 4 127]

Notice however that vector:append doesn’t deep-clone elements:

(e1:define vv (vector:make 2))
(e1:toplevel (vector:set! vv 0 (tuple:make 1 2 3)))
(e1:toplevel vv)
⇒ 0x240d7f0[2 0x2433d80[1 2 3] 127]
(e1:toplevel (vector:append vv (vector:make 2)))
⇒ 0x1bad9f0[4 0x2433d80[1 2 3] 127 127 127] ;; shares 0x2433d80 with vv

If ε₀ characters are just fixnums, as you may guess strings are just vectors of fixnums:

(e1:toplevel "aa")
⇒ 0x16032b0[2 97 97] ;; two characters: #\a and #\a
(e1:toplevel (string:append "aa" "bb")) ;; string:append is just an alias0x15d7a00[4 97 97 98 98]
(e1:toplevel (vector:append "aa" "bb")) ;; this works just as well0x16c0520[4 97 97 98 98]
(e1:toplevel (string:length "foobar")) ;; an alias as well6

Fixnums have a wide enough range to cover all Unicode code points, which is good, but I don’t want to get into the UTF/UCF encoding craziness: so I always use this internal one-fixnum-per-character representation. The implementation uses the nice GNU libunistring (http://www.gnu.org/software/libunistring) by Bruno Haible for input and output.

(e1:toplevel (string:write "Hello there!\n"))
⇒ ;; zero results
⊣ Hello there!

As a temporary limitation of guile+whatever, you shouldn’t expect to see output until you also write a final newline character.

ε₁ vectors are not resizable: re-allocating a vector entails changing its address in memory, hence its identity. You can implement resizable vectors, if you want them, by adding a level of indirection: the data structure pointer refers a single-cell buffer pointing to a vector like the one above, which you can replace on resize. This is what I actually did for hash tables, which are implemented as a vector which must be able to accommodate more elements without raising the fill factor over a certain threshold.

This idea of one-element buffers adding a level of indirection for mutable structures is generally useful. I call box11 such a one-element buffer. I’ve defined procedures to allocate, lookup and update boxes:

(e1:define b1 (box:make 10))
(e1:toplevel b1)
⇒ 0x799280[10] ;; just a one-element buffer
(e1:toplevel (box:get b1))
⇒ 10
(e1:toplevel (box:set! b1 45))
⇒ ;; zero results
(e1:toplevel (box:get b1))
⇒ 45
(e1:toplevel b1)
⇒ 0x799280[45] ;; same address.  Here it's important!

I’ve hinted at boxes maintaining an “identity” for a vector which can be replaced with a resized version. That’s quite easy, if you keep in mind that box:make just allocates a box holding the word you pass, be it a fixnum or a pointer — but the content is not recursively cloned:

(e1:toplevel v)
⇒ 0x72ec30[10 127 127 324 127 127 127 127 127 127 127] ;; v is 0x72ec30
(e1:define b2 (box:make v)) ;; make a box pointing to the vector
(e1:toplevel b2)
⇒ 0x79a5e0[0x72ec30[10 127 127 324 127 127 127 127 127 127 127]] ;; same vector: 0x72ec30
(e1:toplevel (box:get b2))
⇒ 0x72ec30[10 127 127 324 127 127 127 127 127 127 127]
(e1:toplevel (vector:set! (box:get b2) 0 4)) ;; update the vector
(e1:toplevel (box:get b2))
⇒ 0x72ec30[10 4 127 324 127 127 127 127 127 127 127] ;; same (now updated) vector
(e1:toplevel v)
⇒ 0x72ec30[10 4 127 324 127 127 127 127 127 127 127] ;; notice the 4
(e1:toplevel (box:set! b2 w)) ;; replace the pointer in the box
(e1:toplevel b2)
⇒ 0x79a5e0[0x70c240[2 4 127]] ;; same box, different content
(e1:toplevel v)
⇒ 0x72ec30[10 4 127 324 127 127 127 127 127 127 127] ;; still 0x72ec30

Exercise: implement resizable-vector:make, resizable-vector:get, resizable-vector:set!, resizable-vector:length, resizable-vector:resize!, using boxes and vectors.

Equality and boxedness tags

I always try to be precise when speaking about pointers and shared data, because the idea is important for equality. The procedure whatever:eq? corresponds to eq in Common Lisp or Emacs Lisp and eq? in Scheme12: it’s an equality by identity.
Again, we follow the Scheme naming convention according to which a name ending in ‘?’ identifies a procedure returning a boolean.

Let’s look at how whatever:eq? behaves. Unboxed objects are easy to compare:

(e1:toplevel (whatever:eq? 7 3))
⇒ 0 ;; the word 7 is different from the word 3
(e1:toplevel (whatever:eq? 3 3))
⇒ 1 ;; the word 3 is equal to the word 3
(e1:toplevel (whatever:eq? 97 #\a))
⇒ 1 ;; the fixnum 97 is the fixnum 97

Notice that whatever:eq? is an ordinary procedure, so its receives its parameters already evaluated (as usual, call-by-value left-to-right):

(e1:toplevel (whatever:eq? (fixnum:+ 2 2)
                           4))
⇒ 1 ;; the word 4 is equal to the word 4

In case of boxed objects, whatever:eq? only looks at the two pointers:

(e1:define t1 (tuple:make 1 2 3))
(e1:toplevel t1)
⇒ 0x15e5200[1 2 3]
(e1:define t2 (tuple:make 1 2 3))
(e1:toplevel t2)
⇒ 0x183a0c0[1 2 3] ;; same content as t1
(e1:toplevel (whatever:eq? t1 t2))
⇒ 0 ;; 0x15e5200 is different from 0x183a0c0: not the same object

What happens if we compare a boxed object with a fixnum having the same value as the pointer? Let’s see:

(e1:toplevel t1)
⇒ 0x15e5200[1 2 3] ;; t1 is 0x15e5200
#x15e5200 ;; use Guile to print the address in decimal
⇒ 22958592
(e1:toplevel (whatever:eq? 22958592 t1))
⇒ 1 ;; the fixnum is "the same" as the pointer

So the answer is that whatever:eq? doesn’t distinguish between pointers and non-pointers when comparing.

This issue is actually deeper than it looks. I’ve already made clear that in ε, differently from Lisp, objects doesn’t carry their “type”: an integer, a boolean or a character are represented in the exact same way. However, you might wonder how the system can print its nice object “dumps” distinguishing pointers from non-pointers, also writing buffers with the correct number of elements.

The answer is that, even if there are no types, guile+whatever represents boxedness tags: for each ε object the runtime keeps track of its fixnum-vs.-pointer nature, and also associates a word containing its length to each buffer. This is not guaranteed to happen in all runtimes: later we will see an example of a more efficient runtime which doesn’t store this information. With that runtime, of course, it won’t be possible to dump objects using our color notation, or in any other way: numbers and pointers are indistinguishable, in the general case.

Even if it’s possible to represent boxedness tags without too much overhead, I like the idea of not depending on them, particularly when compiling for very small targets where code size counts; for this reason, ε permits you to use them if you prefer so, but you can do away with them if you want to build something really lean and minimal. Fixnums may be more narrow on efficient runtimes representing boxedness tags: a good implementation reserves one bit for this information within each fixnum/pointer.

There are procedures to access boxedness tags, which always fail in runtimes not representing them: boxedness:fixnum?, boxedness:buffer? and boxedness:buffer-length. If you want your program to run on all runtimes, don’t use them.

boxedness:fixnum? returns a non-0 value if and only if its parameter is a fixnum, which is to say a non-pointer:

(e1:define n 10)
(e1:define b (box:make 10))
(e1:toplevel (boxedness:fixnum? n))
⇒ 1
(e1:toplevel (boxedness:fixnum? b))
⇒ 0

Conversely, boxedness:buffer? returns a non-0 value if and only if its parameter is a pointer:

(e1:toplevel (boxedness:buffer? n))
⇒ 0
(e1:toplevel (boxedness:buffer? b))
⇒ 1

boxedness:buffer-length returns the length of the buffer pointed by its argument. You shouldn’t ever pass it a non-pointer: guile+whatever will fail with an error message, but as usual efficient runtimes may just crash:

(e1:toplevel (boxedness:buffer-length (buffer:make 4)))
⇒ 4
(e1:toplevel (boxedness:buffer-length b))
⇒ 1 ;; one element
(e1:toplevel (boxedness:buffer-length n))
error→ size of non-pointer ;; likely crash, on efficient runtimes

At the cost of being obnoxious let me stress again that boxedness:fixnum?, boxedness:buffer? and boxedness:buffer-length are only available on runtimes which represent boxedness tags: calling any of them on a runtime without boxedness tags yields a failure situation, or again a crash.

Lists, and simple programming examples

Lists are really nothing new: you just obtain them by chaining conses13, which is to say two-element buffers, by convention right-deep, like in Lisp. By convention the empty list is 0, which can’t be confused with a pointer on modern machines14, where memory addresses never have very low values.

More formally, we could state that (by induction) a list is either the empty list 0 or a cons containing an element on the left and another list on the right.

You can use 0 and cons:make or tuple:make to make lists, one cons at a time:

(e1:toplevel 0)
⇒ 0 ;; the empty list
(e1:toplevel (cons:make 10 (cons:make 20 0)))
⇒ 0x14670b0[10 0x14670f0[20 0]] ;; a list containing 10 and 20

Of course the system doesn’t force you to nest on the right side: for example we wouldn’t call a “list” this left-deep structure, which is the mirror image of the previous example:

(e1:toplevel (cons:make (cons:make 0 20) 10))
⇒ 0x14a0050[0x14a0010[0 20] 10]

Even if the left-deep nested pair above is still a perfectly valid memory data structure, the ε₁ convenience syntax and procedures work on ordinary lists, which are right-deep and terminated with 0. When used on non-list structures, predefined functions for lists may fail or give unexpected results.

Given a list, you can check if it’s empty or obtain its head (first element) and tail (list of all the elements except the first) with the procedures list:null?, list:head and list:tail:

(e1:define my-list (cons:make 100 (cons:make 200 0)))
(e1:toplevel (list:null? my-list))
⇒ 0
(e1:toplevel (list:null? 0))
⇒ 1 ;; the empty list is empty
(e1:toplevel (list:head my-list))
⇒ 100
(e1:toplevel (list:tail my-list))
⇒ 0x7ddb50[200 0]

list:head and list:tail don’t allocate new buffers: they just return what’s contained in the given cons. In other words, they are accessors:

(e1:toplevel my-list)
⇒ 0x7ddc50[100 0x7ddb50[200 0]] ;; 0x7ddb50, as the tail above
(e1:toplevel (list:tail my-list))
⇒ 0x7ddb50[200 0] ;; 0x7ddb50 again

list:head and list:tail are very fast (they are essentially calls to buffer:get with index 0 or 1), but as usual in ε₁ they don’t check for errors: you shouldn’t ever call them on an empty list — and using them on something different than a cons which is part of a list is probably a bad idea.

(e1:toplevel (list:head 0))
error→ buffer:get on a non-buffer --- or simply crash
(e1:toplevel (list:tail (tuple:make 1 2 3)))
⇒ 2 ;; not a list!  This isn't an error, but using list:tail on non-lists is confusing

I’ve defined several utility procedures working on lists. One of them is the one-parameter procedure list:iota (name inspired by APL, thru Guile and MIT Scheme) returns a list holding all fixnums from 0 included to the argument excluded.

(e1:toplevel (list:iota 0))
⇒ 0 ;; the empty list
(e1:toplevel (list:iota 2))
⇒ 0x14a8100[0 0x14a80a0[1 0]]
(e1:toplevel (list:iota 10))
⇒ 0x14b02d0[0 0x14b0270[1 0x14b0230[...]]]

The REPL didn’t print the complete structure, because we hit the depth limit. We can raise or eliminate the limit by calling set-whatever-dump-maximum-depth!, which is currently a Guile procedure — which means that we can’t use e1:toplevel. We can give the procedure either a natural number, of #f to mean “no limit”:

(set-whatever-dump-maximum-depth! #f)
(e1:toplevel (list:iota 10))
⇒ 0x14b85d0[0 0x14b8570[1 0x14b8530[2 0x14b84f0[3 0x14b8490[4 0x14b8430[5 0x14b83d0[6 0x14b8370[7 0x14b8310[8 0x14b82b0[9 0]]]]]]]]]]

A generalization of list:iota is list:range, a procedure taking two fixnum parameters and returning a list of fixnums, from the the first to the second, both included — or an empty list if the first parameter is greater than the second:

(e1:toplevel (list:range 10 25))
⇒ 0x12701d0[10 0x11ec730[11 0x11ec6d0[12 0xde34d0[13 0x105d290[14 0x105d230[15 0x11a82e0[16 0x891af0[17 0x1208e70[18 0x90e010[19 0x90dfb0[20 0x65ed00[21 0x65eca0[22 0x76c900[23 0x6d1830[24 0x7f7830[25 0]]]]]]]]]]]]]]]]

The e1:length procedure takes a list and returns its length, as a fixnum:

(e1:toplevel (list:length (list:iota 10000)))
⇒ 10000 ;; the first 10000 naturals are 10000 in number
(e1:toplevel (list:length 0))
⇒ 0 ;; the empty list has zero elements

Re-defining list utility procedures is a good programming exercise. Let’s implement mylength, our own version of list:length.

mylength will be a recursive procedure doing a case analysis on its parameter. You’ll need a conditional. ε₁ has a good variety of Lisp-style conditionals, including:

  • one-way: e1:when, e1:unless;
  • one- or two-way: e1:if;
  • multi-way: e1:cond, e1:case.

Syntax and semantics follow Scheme conventions, apart from the usual ‘e1:’ prefix.

Our length procedure has only two cases: empty list, or non-empty list. The length of an empty list is zero, and the length of a non-empty list is one plus the length of its tail. We can compute “one plus” with either the two-parameter fixnum:+ (giving it 1 as one parameter) or with the one-parameter fixnum:1+.

As a naming convention, I sometimes use “plural” variable names such as as, bs, xs and ys for list objects.

(e1:define (mylength xs)
  (e1:if (list:null? xs)
    0
    (fixnum:1+ (mylength (list:tail xs)))))

Does it work?

(e1:toplevel (mylength (list:iota 100)))
⇒ 100

It seems to work. But if we test it with a bigger list, the thing fails:

(e1:toplevel (mylength (list:iota 1000000))) ;; dangerous: don't try this
error→ some strange error or crash

I was careful to make list:iota tail-recursive but mylength is clearly not (the recursive call occurs in a non-tail position, as the argument of a fixnum:1+ call). Since mylength consumies an unbounded quantity of stack space proportional to the list length, it isn’t really usable on large arguments.

It’s easy to redefine mylength to be tail-recursive:

(e1:define (mylength xs)
  (mylength-acc xs 0))
(e1:define (mylength-acc xs acc)
  (e1:if (list:null? xs)
    acc
    (mylength-acc (list:tail xs) (fixnum:1+ acc))))

The new mylength can compute the length of any list in constant space:

(e1:toplevel (mylength (list:iota 1000000)))
⇒ 1000000

As a further example, let’s compute the last element of the given list. There are three cases: an empty list (on which we fail), a one-element list, (of which we know the last element), or a list with two or more elements:

(e1:define (mylast xs)
  (e1:cond ((list:null? xs)
            (e1:error "mylast of empty list"))
           ((list:null? (list:tail xs))
            (list:head xs))
           (else ;; #t works as well, like t in Common Lisp/Emacs Lisp
            (mylast (list:tail xs)))))

The only recursive call is already in tail position.

You should think of e1:myerror as a procedure which potentially crashes the system or brings it into an unrecoverable “failure state”. However, before crashing, e1:error will at least print an error message. If you want to optimize at the cost of being even more unforgiving than e1:error, you can remove the first e1:cond case:

(e1:define (mylast xs)
  (e1:cond ((list:null? (list:tail xs))
            (list:head xs))
           (else
            (mylast (list:tail xs)))))

A two-way e1:cond may look better as an e1:if:

(e1:define (mylast xs)
  (e1:if (list:null? (list:tail xs))
    (list:head xs))
    (mylast (list:tail xs)))

We may want to factor away the two calls to list:tail, even if in practice they aren’t that expensive and the implementation is not yet particularly efficient15. The idea, of course, is using a let block. ε₁ has Lisp-style e1:let and e1:let*:

(e1:toplevel (e1:let ((a 1)
                      (b 2))
               (fixnum:+ a b)))
⇒ 3 ;; e1:let* would behave just as e1:let here
(e1:define g 100)
(e1:toplevel (e1:let ((g 10)
                      (b g)) ;; this sees the global g
               (fixnum:+ g b)))
⇒ 110
(e1:toplevel (e1:let* ((g 10)
                       (b g)) ;; this sees the local g
               (fixnum:+ g b)))
⇒ 20

Using a block, the fast but unfriendly version of mylast becomes:

(e1:define (mylast xs)
  (e1:let ((tail (list:tail xs)))
    (e1:if (list:null? tail)
      (list:head xs)
      (mylast tail))))

If you want to do exercises, you can re-implement in ε₁ any of the list procedures in core.e and epsilon1.scm, using recursion. You’ll see that the source files are divided into “sections”, delimited by a line full of semicolons plus another comment line containing a title. Look at sections containing the word “List” in their names.

When testing your code, you’ll probably want to use the ε₁ form list:list, which resembles Lisp’s list:

(e1:toplevel (list:list 1 (fixnum:+ 10 20) 435 -2))
⇒ 0xcfbbc0[1 0xcfc000[30 0x840650[435 0xe59640[-2 0]]]]
(e1:toplevel (mylast (list:list 1 2 3 4 5)))
⇒ 5

Digression: a look at ε₀

list:list may be convenient, but it’s nothing very deep: it’s just a macro which expands to an expression using list:cons as many times as needed, with the correct nesting. Even without knowing anything about macros, we can check this with the Guile debugging procedure meta:macroexpand (notice the quote):

(meta:macroexpand '(list:list 1 (fixnum:+ 10 20) 435 -2))
⊣ [call list:cons 1₆₈₆₀₀ [call list:cons [call fixnum:+ 10₆₈₆₀₁ 20₆₈₆₀₂]₆₈₆₀₃ [call list:cons 435₆₈₆₀₄ [call list:cons -2₆₈₆₀₅ list:nil₆₈₆₀₆]₆₈₆₀₇]₆₈₆₀₈]₆₈₆₀₉]₆₈₆₁₀

What’s that strange notation? Well, for the first time you’re looking at ε₀ expressions, which I intentionally made visually distinct, using brackets. In fact when an ε₁ macro call is completely expanded, it always yields its result as an ε₀ expression, ready to be executed or compiled. Even if you don’t know ε₀ yet, you can already read this: the expression consists in nested procedure calls, using as arguments either fixnum literals, or list:nil; list:nil is just a global variable, defined as 0. The subscript numbers are handles, unique identifiers attached to each ε₀ expression. list:cons is just an alias of cons:make.

Just to have a peek at ε₀, let’s look at the definitions of some simple procedures, with their bodies already expanded to ε₀ expressions:

(meta:print-procedure-definition 'list:length)
⊣ Formals: (x)
⊣ [call list:length-acc x₂₆₆ 0₂₆₇]₂₆₈
(meta:print-procedure-definition 'list:length-acc)
⊣ Formals: (x acc)
⊣ [if x₂₆₉ ∈ {0} then acc₂₇₀ else [call list:length-acc [call list:tail x₂₇₁]₂₇₂ [call fixnum:1+ acc₂₇₃]₂₇₄]₂₇₅]₂₇₆

The only conditional in ε₀ is a slightly unusual “[if econstants then e else e]”; the reason is efficiency: such conditionals are easy to nest and compile using jump tables or comparison trees, which may be an important optimization for some styles of programming. See how small these handles are? The list length procedures were defined very early in the bootstrap process.

You don’t really need to do anything with ε₀ at this time: the only thing you should remember is that each piece of ε₁ code is always translated into ε₀ before execution, compilation, or even procedure definition. The expanded ε₀ code will usually be longer and less human-friendly than the corresponding ε₁ version, but much easier to execute and analyze.

Now that you’ve seen meta:print-procedure-definition, we can use it to look at how list:null? works:

(meta:print-procedure-definition 'list:null?)
⊣ Formals: (list)
⊣ [call whatever:zero? list₂₅₈]₂₅₉

whatever:zero? computes what you’d expect, but its definition may look unusual:

(meta:print-procedure-definition 'whatever:zero?)
⊣ Formals: (x)
⊣ [call e1:not x₂₅₈]₂₅₉

e1:not is also interesting, by the way:

(meta:print-procedure-definition 'e1:not)
⊣ Formals: (condition)
⊣ [if condition₅₀₈₈ ∈ {0} then 1₅₀₈₇ else 0₅₀₈₉]₅₀₉₀

Now, can you explain why whatever:zero? and list:null? are essentially aliases for e1:not?

Practical programming in ε₁

[To do: fill this]

Sums

Look again at our formal definition of lists: by induction, a list is either the empty list or a cons containing an element and another list.

This data structure definition by alternative cases which are allowed to be recursive is a useful idea, more general than lists. For example, let’s say that we want to define binary trees. We might say that “by induction, a binary tree is either empty or a non-empty triple containing a left (sub-)tree, a root element, and a right (sub-)tree”.

This kind of structure definition is popular in the functional programming community where it’s known as sum or sum-of-products — a “product” being an n-uple, and a “sum” being a union of disjoint sets where each element of the union conceptually carries a tag representing the origin set.

Differently from most functional languages and as usual in ε₁, the system won’t force you to respect any “typing” constraint: for example storing something different from a tree in the left field of a tree is confusing, but you can do it if you want.

Like I did for records, I provide a form to automatically generate the needed constructor and accessor procedures, given a sum definition.

Here’s a definition in ε₁ for our sample binary tree, having two cases: the empty case with no fields, and the non-empty case having fields named left, root and right:

(e1:toplevel (sum:define tree
               (empty)
               (non-empty left root right)))
⊣ Defining the procedure tree-empty...
⊣ Defining the procedure tree-empty?...
⊣ Defining the procedure tree-empty-explode...
⊣ Defining the procedure tree-non-empty...
⊣ Defining the procedure tree-non-empty-make-uninitialized...
⊣ Defining the procedure tree-non-empty-explode...
⊣ Defining the procedure tree-non-empty-explode-from-second-element...
⊣ Defining the procedure tree-non-empty-get-left...
⊣ Defining the procedure tree-non-empty-with-left...
⊣ Defining the procedure tree-non-empty-set-left!...
⊣ Defining the procedure tree-non-empty-get-root...
⊣ Defining the procedure tree-non-empty-with-root...
⊣ Defining the procedure tree-non-empty-set-root!...
⊣ Defining the procedure tree-non-empty-get-right...
⊣ Defining the procedure tree-non-empty-with-right...
⊣ Defining the procedure tree-non-empty-set-right!...
⊣ Defining the procedure tree-non-empty?...

Of course trees are just buffers and fixnums. We can look at their memory representation by building two small samples: an empty tree, and a non-empty tree having 42 as its root and empty left and right subtrees:

(e1:toplevel (tree-empty))
⇒ 0
(e1:toplevel (tree-non-empty (tree-empty) 42 (tree-empty))) ;;  420x12376e0[0 42 0]                                         ;; /  \

The representation in memory is efficient, and very similar to what we already saw about lists: the empty tree is 0, and non-empty trees point to three-element buffers containing in order left subtree, root, and right subtree.

Of course we can build trees of any size:

(e1:define t (tree-non-empty (tree-non-empty (tree-empty) 1 (tree-empty))   ;;   2
                             2                                              ;;  / \
                             (tree-non-empty (tree-empty) 3 (tree-empty)))) ;; 1  3
(e1:toplevel t)                                                             ;;/ \/ \0x1021e80[0xfe0040[0 1 0] 2 0xfdff20[0 3 0]]

Since sums usually have more than one case, it’s useful to query the case of a given object: of course you shouldn’t use tree procedures on anything but trees, but for any given sum all cases (in this case empty and non-empty) are always safe for case-querying:

(e1:toplevel (tree-empty? (tree-empty)))
⇒ 1 ;; the empty tree is in fact empty
(e1:toplevel (tree-non-empty? (tree-empty)))
⇒ 0 ;; the empty tree isn't non-empty
(e1:toplevel (tree-non-empty? t))
⇒ 1

If you are sure that the object is the appropriate case of a sum you can use accessors, which are essentially the same as for records:

(e1:toplevel (tree-non-empty-get-right t))
⇒ 0xfdff20[0 3 0]
(e1:toplevel (tree-non-empty-get-root t))
⇒ 2
(e1:toplevel (tree-non-empty-get-root (tree-non-empty-get-left t)))
⇒ 1 ;; the root of the left subtree of t

Since you can’t resize existing buffers or update unboxed objects, you can’t in general mutate an object from a case to another. However in ε₁ you can always mutate sum fields, and you get handy procedures for that:

(e1:toplevel (tree-non-empty-set-right! t (tree-empty)))
;; t's left subtree is now empty
(e1:toplevel t)
⇒ 0x1021e80[0xfe0040[0 1 0] 2 0]

Like for records, you can make copies with a different field:

(e1:toplevel (tree-non-empty-with-left t (tree-empty)))
⇒ 0x1a202a0[0 2 0]
(e1:toplevel t)
⇒ 0x1021e80[0xfe0040[0 1 0] 2 0] ;; t didn't change

Well-designed recursive sums lend themselves particularly well to recursive programming, since case analysis tends to follow the structure of sum cases, with recursive calls occurring on recursive substructures.

As a simple example, let the height of a tree be zero for empty trees; and for non-empty trees, let it be one plus the height of the tallest subtree (left or right). Easy enough:

(e1:define (height t)
  (e1:if (tree-empty? t)
    0
    (e1:let ((left-height (height (tree-non-empty-get-left t)))
             (right-height (height (tree-non-empty-get-right t))))
      (fixnum:1+ (e1:if (fixnum:> left-height right-height)
                   left-height
                   right-height)))))

It works:

(e1:toplevel (height t))
⇒ 2 ;; the left subtree has height one, the right one is empty

height was a little clumsy to write. We can make the definition much nicer if we recognize that the inner e1:if above is just the computation of the maximum between two fixnums.

There is already a fixnum:max procedure; but even if you forgot about that, you could redefine a maximum procedure for fixnums by yourself:

(e1:define (my-max a b)
  (e1:if (fixnum:> a b)
    a
    b))

And so, thanks to procedural abstraction, we can refactor height to be much more readable:

(e1:define (height t)
  (e1:if (tree-empty? t)
    0
    (fixnum:1+ (fixnum:max (height (tree-non-empty-get-left t))
                           (height (tree-non-empty-get-right t))))))

Sums work nicely, but they potentially hide a tricky representation problem. Think once more about case-querying; how can the system distinguish one case from another, for our trees? Sums must work on every runtime, so case-querying procedures can’t rely on boxedness tags.

What distinguishes an empty tree from a non-empty tree? More generally, what distinguishes a boxed sum case from an unboxed case, and one case from another?

(e1:toplevel (sum:define s1
               (a)
               (b)
               (c)))
⊣ ;; [procedures are automatically defined, as usual]

The s1 sum has three cases: a, b and c, all with zero fields. In practice an s1 object is nothing more than an enum type in C: every s1 is a, or b, or c; nothing more. The implementation is also just as trivial:

(e1:toplevel (s1-a))
⇒ 0
(e1:toplevel (s1-b))
⇒ 1
(e1:toplevel (s1-c))
⇒ 2

So cases with no arguments can always be represented as unboxed objects, distinct from one another. The constructors for such cases always return the same results when called multiple times:

(e1:toplevel (s1-a))
⇒ 0 ;; again
(e1:toplevel (s1-a))
⇒ 0 ;; no pointers: there is really only one a

Notice also that 0 is the representation of both the a case of s1, and of the empty case of tree: it’s impossible to distinguish them from one another, even you can always tell apart different cases of the same sum.

Let’s define a sum having more than one case with fields:

(e1:toplevel (sum:define s2
               (a m n)
               (b q)))
⊣ ;; [automatic procedure definitions]

An s2 may be either an a, containing two fields m and n, or a b, containing one field q. Ok, but how are s2’s represented in memory?

(e1:toplevel (s2-a 100 200))
⇒ 0x19ba510[0 100 200]
(e1:toplevel (s2-a 100 200))
⇒ 0x19c0410[0 100 200] ;; different pointer!
(e1:toplevel (s2-b 500))
⇒ 0x19c5f70[1 500]

Of course both cases are boxed now; but the interesting difference is how the first work of each object is now a tag identifying the case: either 0 for a, or 1 for b. Of course you don’t need to remember the presence of the tag if you use the automatically-generated accessors for working with fields:

(e1:toplevel (s2-b-get-q (s2-b 600)))
⇒ 600 ;; you don't need to remember that q is the second field
(e1:toplevel (s2-a? (s2-b 700)))
⇒ 0
(e1:toplevel (s2-b? (s2-b 800)))
⇒ 1

Sums only need to contain tag words when they have more than one boxed case.

Unboxed cases can always be distinguished from pointers even without boxedness tags, because they are small numbers: no modern system will allocate a heap buffer at an address such as 0, 1, 2 or even 1024, which is still a reasonable upper bound on the number of cases, even in pathological situations. Thanks to this fact, we can afford a much more efficient representation for cases with no fields.

We have already said that lists are a sum: indeed, epsilon1.scm contains the definition:

(e1:toplevel (sum:define list:list
               (nil)
               (cons head tail)))

Of course there are convenience aliases for common operations; for example list:head, which we’ve already shown, is easier to write than the automatically-generated name ‘list:list-cons-get-head’; however the underlying representation exactly follows our description above: one unboxed case, 0, plus one boxed case with two fields. Tag words aren’t needed, so list conses can be represented in a compact way in memory, using only two words per cons. Since some lists may contain a very large number of elements, saving one word per element can make quite a difference for performance.

An open sum is a sum to which you can add more cases later on. Since there’s no way to tell how many boxed cases will be needed in the end, all boxed cases have to contain a tag word; apart from this slight inefficiency, open sums work just like non-open sums.

Let’s define an open sum named os. You can use form sum:define-open (which of course is defined as a macro as well) just like sum:define:

(e1:toplevel (sum:define-open os
               (x)
               (y a)))
⊣ ;; automatic procedure definitions...
(e1:toplevel (os-y 68))
⇒ 0x901640[1 68]

We gave our sum os two cases named x and y. Let’s add one more, z:

(e1:toplevel (sum:extend-open os
               (z a)))
⊣ ;; automatic procedure definitions...
(e1:toplevel (os-z 6))
⇒ 0x8a7608[2 6]
(e1:toplevel (os-y 6))
⇒ 0x8a5bd0[1 6] ;; "old" cases still work

We can keep adding cases even after we’ve started to make instances, as you already saw. Let’s add two more, t1 and t2:

(e1:toplevel (sum:extend-open os
               (t1 a b c)
               (t2)))
⊣ ;; automatic procedure definitions...
(e1:toplevel (os-t1 10 20 30))
⇒ 0xcdc3d8[3 10 20 30]
(e1:toplevel (os-y 7))
⇒ 0x8a30e8[1 7]

A programming example: structural equality with boxedness tags

At this point you should be ready to do some simple ε₁ programming, trying to come up with an implementation before reading my code.

If we accept to rely on boxedness tags, we can have a structural equality procedure similar to equal in Common Lisp and Emacs Lisp or to equal? in Scheme. Boxedness tags are necessary, since we can’t know the object shape in advance, and if we are to dereference pointers to compare corresponding buffers, we also need to recognize them as pointers, and obtain buffer lengths.

Defining the procedure whatever:equal? is a good programming exercise. It only relies on Lisp-style if, cond, let, and, not (all available in ε₁ with the prefix ‘e1:’), plus trivial arithmetics (such procedures are already available for fixnums, with the prefix ‘fixnum:’).

Our whatever:equal? is a typical recursive procedure of two arguments, performing case analysis. There are three possible cases:

  • both arguments are fixnums: we can use whatever:eq?;
  • both arguments are pointers (the most difficult case): we use a helper function;
  • one argument is a fixnum, the other is a pointer: the arguments are different.

Even if we haven’t written the helper procedure whatever:buffer-equal? yet, we can already define the main procedure:

(e1:define (whatever:equal? a b)
  (e1:let ((fixnum-a (boxedness:fixnum? a))
           (fixnum-b (boxedness:fixnum? b)))
    (e1:cond ((e1:and fixnum-a fixnum-b)
              (whatever:eq? a b))
             ((e1:and (e1:not fixnum-a) (e1:not fixnum-b))
              (whatever:buffer-equal? a b))
             (else
              #f))))

Two buffers are equal when they have the same length, and all corresponding elements are equal.

There are several ways of defining whatever:buffer-equal?, but the easiest is probably using another helper procedure (this time recursive), checking the array content from a given index to the end. However, we don’t even need to look at the elements if the two buffers have different lengths:

(e1:define (whatever:buffer-equal? pointer-1 pointer-2)
  (e1:let ((length-1 (boxedness:buffer-length pointer-1))
           (length-2 (boxedness:buffer-length pointer-2)))
    (e1:if (whatever:eq? length-1 length-2) ;; lengths are fixnums
      (whatever:buffer-equal-from-length? pointer-1 pointer-2 0 length-1)
      #f)))

And now, finally, the recursive helper function. It returns #f immediately if it finds a difference at some point, in which case it’s useless to look at the rest. If the procedure reaches the buffer ends without having found a difference, it concludes that the buffers were equal — whatever:buffer-equal-from-to? is only used on buffers of the same length, so we don’t have to worry about reaching the end on just one of the two buffers.
Notice that whatever:buffer-equal-from-to? recursively calls whatever:equal? when comparing buffer elements: this is reasonable, because the two corresponding elements can themselves be any combination of fixnums and pointers.

(e1:define (whatever:buffer-equal-from-length? pointer-1 pointer-2 from length)
  (e1:cond ((whatever:eq? from length)
            #t)
           ((whatever:equal? (buffer:get pointer-1 from)
                             (buffer:get pointer-2 from))
            (whatever:buffer-equal-from-length? pointer-1
                                                pointer-2
                                                (fixnum:1+ from)
                                                length))
           (else
            #f)))

We can now test whatever:equal?:

(e1:toplevel (whatever:equal? (tuple:make 1 (tuple:make 43 56))
                              (tuple:make 1 (tuple:make 43 56) 1)))
⇒ 0
(e1:toplevel (whatever:equal? (tuple:make 1 (tuple:make 43 56) 1)
                              (tuple:make 1 (tuple:make 43 56) 2)))
⇒ 0
(e1:toplevel (whatever:equal? (tuple:make 1 (tuple:make 43 56) 1)
                              (tuple:make 1 (tuple:make 43 7) 1)))
⇒ 0
(e1:toplevel (whatever:equal? (tuple:make 1 (tuple:make 43 56) 1)
                              (tuple:make 1 (tuple:make 43 56) 1)))
⇒ 1

If you want to do another similar exercise by yourself you can write a deep-cloning procedure, a hashing procedure, or a lexicographic comparison procedure.

You could also try to improve whatever:equal so that it never loops on cyclic structures, but that’s much more difficult (Hint: you need two associative data structures using pointers as keys and sets of pointers as data, to keep track of tentative and proved equalities).

A look at reflective data structures

The ε₁ form e1:value resembles Lisp’s quote, and serves to tell the system that we are interested in some immediate constant as a data structure; it is mainly useful for symbols: (e1:value foo) evaluates to the symbol foo as a data structure, which of course isn’t the same as the variable named foo. On the other hand (e1:value 42) works exactly like 42 in ε₁.

Let’s have a look at the symbol foo as a data structure:

(e1:toplevel (e1:value foo))
⇒ 0x1484770[0x9a60b0[3 102 111 111] 0 127 0 0 0 0 0 0 0 0]

It’s a large boxed data structure — don’t worry, you won’t need to remember all fields, or their positions.
The first field is the string 0x9a60b0[3 102 111 111], containing its length followed by the characters 102, 111 and 111; which is to say "foo", the symbol name. The other fields contain the symbol value as a global, a procedure, a macro, and other such information. Since foo isn’t the global name of anything, there is nothing very interesting to see: we just have 0 in fields which otherwise would usually point to boxed data. The 127 fixnum is actually irrelevant, and could be any other value; that field holds the value associated to foo as a global variable, if any: the previous field is a flag saying whether such value exists, and in this case it’s 0, because there’s no global variable named foo. The flag is necessary since any value, boxed or unboxed, including 0 or 127, is a potentially valid value for a global variable.

Of course named symbols are unique after interning, like in Lisp. Therefore we can check whether two symbols are equal with whatever:eq? — in practice by comparing their pointers, which is a very fast operation:

(e1:toplevel (e1:value foo)) ;; same symbol name as before...0x1484770[0x9a60b0[3 102 111 111] 0 127 0 0 0 0 0 0 0 0] ;; ...same pointer!
(e1:toplevel (whatever:eq? (e1:value foo)
                           (e1:value foo)))
⇒ 1 ;; 0x1484770 is equal to 0x1484770

Now let’s define a global named ‘foo’, then look at the symbol again:

(e1:define foo 671)
(e1:toplevel (e1:value foo))
⇒ 0x1484770[0x9a60b0[3 102 111 111] 1 671 0 0 0 0 0 0 0 0]

What happened shouldn’t be surprising at this point: the symbol now has a global value, so the flag changed from 0 to 1; and the field following it changed from 127, which was unused with the flag set to 0, to the appropriate value 671.

Of course the symbol field holds its global value. Formal parameters or local variables named ‘foo’ don’t affect it:

(e1:define (bar foo)
  (fixnum:+ foo foo))
(e1:toplevel (bar 57))
⇒ 114
(e1:toplevel (e1:let* ((a 10)
                       (foo (+ a 2)))
               foo))
⇒ 12
(e1:toplevel (e1:value foo)) ;; nothing changed in foo0x1484770[0x9a60b0[3 102 111 111] 1 671 0 0 0 0 0 0 0 0]

To make things more interesting, now let’s define a procedure named ‘foo’. Our procedure will be a trivial zero-argument constant function, always returning 82.

(e1:define (foo)
  82)
(e1:toplevel (e1:value foo))
⇒ 0x1484770[0x9a60b0[3 102 111 111] 1 671 0 0xd74d98[1 68979 82] 0 0 0 0 0 0]

You can immediately notice that the global binding to 671 survived: this means that a symbol can hold a global non-procedure and a global procedure at the same time, like in Common Lisp and Emacs Lisp, but differently from Scheme:

(e1:toplevel (foo))
⇒ 82
(e1:toplevel foo)
⇒ 671

This possibility of mapping different global entities (non-procedures, procedures, macros, ...) to the same name is a natural consequence of how symbols work.

You can also see that our (simple) procedure doesn’t look very hard to read in the data structure: what changed is just the new buffer 0xd74d98[1 68979 82]. Can we understand it? It turns out that yes, we can read it quite easily. Since the buffer contains the fixnum 82, you may guess that the buffer at 0xd74d98 represents the procedure body; indeed it does.

Look for the section named “Expressions as an open sum type” in epsilon1.scm. You’ll find this code:

(e1:toplevel (sum:define-open e0:expression
               (variable handle name)
               (value handle content)
               (bundle handle items)
               (primitive handle name actuals)
               (let handle bound-variables bound-expression body)
               (call handle procedure-name actuals)
               (call-indirect handle procedure-expression actuals)
               (if-in handle discriminand values then-branch else-branch)
               (fork handle procedure-name actuals)
               (join handle future)))

The section title says it all: ε₀ expressions are an open sum. The sum cases you see above are everything which remains after macroexpanding and transforming; some cases which are added later are always rewritten away before execution or compilation, until you get only ε₀ expressions as specified above; there is really nothing more.

[Move this: Let me stress again how simple ε₀ expressions are: at ten cases total, this is much simpler than any mainstream language including “small” languages such as Scheme and SML, and much smaller than ε₁ as well.]

I won’t explain every case now; you can already understand several of them. You should notice that all cases are boxed, and they all contain a handle as their first element: you already saw handles before, printed as subscript numbers by the debugging facility.

Now you have all the information you need to understand 0xd74d98[1 68979 82] as an e0:expression sum: it has a tag word 1, hence it’s the second case from the beginning, value; it contains a handle with value 68979, and a content with value 82.

You can check this reasoning with meta:print-procedure-definition:

(meta:print-procedure-definition 'foo)
⊣ Formals: ()
⊣ 82₆₈₉₇₉

Let’s redefine the procedure foo as something slightly more complex: let’s make it the identity function, taking one parameter and returning it unchanged. Now, to make our dump a little easier to read, I’ll use foo as the formal parameter name as well.

(e1:define (foo foo)
  foo)

You shouldn’t worry about foo being associated at the same time with a global, a procedure and a parameter: foo as occurs in the body can’t be a procedure reference, and in ε₁ (like in any other reasonable language) a parameter binding takes precedence over a global binding.

(e1:toplevel foo)
⇒ 671 ;; our non-procedure global
(e1:toplevel (foo 3))
⇒ 3

Everything works. Let’s look at the symbol again:

(e1:toplevel (e1:value foo))
⇒ 0x1484770[0x9a60b0[3 102 111 111] 1 671 0xf44178[0x1484770 0] 0xf44638[0 69017 0x1484770] 0 0 0 0 0 0]

There are two differences with respect to the previous version:

  • right after the global value 671 there is the buffer 0xf44178[0x1484770 0] instead of 0;
  • the procedure body has changed: now it’s 0xf44638[0 69017 0x1484770].

The first difference shows that a symbol field contains a list of formal parameters, as symbols. When the procedure had zero parameters the list was empty, now it’s a one-element list containing 0x1484770, which is foo itself.
The second change is even easier to follow: the procedure body is now the first case (tag 0) of e0:expression, named variable: and the variable is 0x1484770 — once more, the symbol foo. Finally the new expression got a fresh handle, 69017.

(meta:print-procedure-definition 'foo)
⊣ Formals: (foo)
⊣ foo₆₉₀₁₇

Within the e0:expression sum, variables are always represented as symbols, and sequences (be they variables, values or expressions) as lists. Of course all sub-expressions are e0:expression sums: the e0:expression sum is recursive.

As another example, let’s see how a slightly more complicated expression translates to ε₀. The e1:bundle form serves to return multiple results, like values in Scheme and Common Lisp:

(e1:define (my-procedure)
  (e1:bundle 60
             70))
(e1:toplevel (my-procedure))
⇒ 6070

Good, my-procedure works. Let’s look at its body:

(e1:toplevel (e1:value my-procedure))
⇒ 0xb56ab8[0xb3d3b0[12 109 121 45 112 114 111 99 101 100 117 114 101] 0 127 0 0xaf14e8[2 75167 0xaf14b8[0x1a597c8[1 75165 60] 0xaf3bd8[0xaf3bb8[1 75166 70] 0]]] 0 0 0 0 0 0]

We got the information we wanted, but we saw the whole symbol as well. If we only want the procedure body, we can use the procedure state:procedure-get-body, which takes a symbol and returns the body of the procedure named after it. Actually state:procedure-get-body is a simple selector looking up the given buffer at the appropriate index — notice that 0 isn’t a valid value for any e0:expression case, so a 0 result represents the absence of a procedure binding; we don’t need a separate flag as for non-procedure bindings.

(e1:toplevel (state:procedure-get-body (e1:value my-procedure)))
⇒ 0xaf14e8[2 75167 0xaf14b8[0x1a597c8[1 75165 60] 0xaf3bd8[0xaf3bb8[1 75166 70] 0]]]
(e1:toplevel (state:procedure-get-body (e1:value this-is-not-a-procedure-name)))
⇒ 0
(e1:toplevel (e1:value this-is-not-a-procedure-name))
⇒ 0x43d05e0[0x43cf7f0[28 116 104 105 115 45 105 115 45 110 111 116 45 97 45 112 114 111 99 101 100 117 114 101 45 110 97 109 101] 0 127 0 0 0 0 0 0 0 0] ;; the body field is 0

So we’ve seen that the bundle expression indeed contains a list of value expressions. Let’s check:

(meta:print-procedure-definition 'my-procedure)
⊣ Formals: ()
⊣ [bundle 60₇₅₁₆₅ 70₇₅₁₆₆]₇₅₁₆₇

Thanks to their structure, it’s easy to work on expressions with recursive procedures: notably it’s easy to write an ε₀ interpreter in ε₁, and in fact you can find one in core.e — actually written in a very restricted subset of ε₁, very close to ε₀. Much in the same way, it’s not hard to write a compiler in ε₁ which translates ε₀ into native code — and I did that as well: see compiler.e.

In order for a compiler to work, it has to access all global definitions. As you saw, global procedures and non-procedures for a given symbol are easy to find; but where do you find all interned symbols? The answer, of course, is in the symbol table.

It may look a little frightening because of its size, but in the end the symbol table (a hash table mapping symbol names as strings into symbols) is just a data structure like any others, made of fixnums and pointers. Its written dump is very large, and not by accident: starting from the symbol table you can reach every symbol, and symbols point to their global definitions as globals and procedures... Since procedure formals and bodies contain symbols, you’ll see many “yellow” pointers as well.

(e1:toplevel symbol:table)
⇒ 0x946950[0x946990[4096 1993 0xc095b0[0xc09670[0xa1d780[0] 0xa1d700[0xa1d780 0 127 0 0 0 0 0 0 0 0]] 0xc095f0[0xc09630[0x980980[10 99 111 110 100 105 116 105 111 110 115] 0x980900[0x980980 0 127 0 0 0 0 0 0 0 0]] 0]] 0xc09530[0xc09570[0xb7dfa0[10 98 111 117 110 100 45 102 111 114 109] 0xb7df20[0xb7dfa0 0 127 0 0 0 0 0 0 0 0]] 0] 0 0 0 0xc094b0[0xc094f0[0xa17200[14 115 117 109 58 100 101 115 99 114 105 112 116 111 114] 0xa8fad0[0xa17200 0 127 0xa90200[0xa17110[0xa17190[7 116 114 105 118 105 97 108] 0 127 0 0 0 0 0 0 0 0] 0xa90240[0xa16fd0[0xa17050[17 99 97 115 101 115 45 115 101 120 112 114 101 115 115 105 111 110] 0 127 0 0 0 0 0 0 0 0] 0]] 0xa8fb50[4 18816 0xa901c0[0xa8fc80[0xa8fd00[3 95 57 52] 0 127 0 0 0 0 0 0 0 0] 0] 0xa900f0[5 18815 0x95ca20[0x95cd00[11 98 117 102 102 101 114 58 109 97 107 101] 0 127 0x95ccc0[0x95cbc0[0x95cc40[10 101 108 101 109 101 110 116 45 110 111] 0 127 0 0 0 0 0 0 0 0] 0] 0x95caf0[3 188 0x95ca20 0x95cb40[0x95cb80[0 187 0x95cbc0] 0]] 0 0 0x95caa0[1 1 0 0 4] 0 0 0] 0xa90140[0xa90180[1 18814 2] 0]] 0xa8fba0[4 18813 0 0xa8ff20[5 18806 0x95c0b0[0x95c410[18 98 117 102 102 101 114 58 105 110 105 116 105 97 108 105 122 101 33] 0 127 0x95c350[0x954760[0x9547e0[6 98 117 102 102 101 114] 0 127 0 0 0 0 0 0 0 0] 0x95c390[0x954640[0x9546c0[5 105 110 100 101 120] 0 127 0 0 0 0 0 0 0 0] 0x95c3d0[0x95b270[0x95b2f0[11 110 101 119 45 101 108 101 109 101 110 116] 0 127 0 0 0 0 0 0 0 0] 0]]] 0x95c180[3 201 0x95c0b0 0x95c1d0[0x95c310[0 198 0x954760] 0x95c210[0x95c2d0[0 199 0x954640] 0x95c250[0x95c290[0 200 0x95b270] 0]]]] 0 0 0x95c130[3 0 1 0 9] 0 0 0] 0xa8ff70[0xa900b0[0 18803 0xa8fc80] 0xa8ffb0[0xa90070[1 18804 0] 0xa8fff0[0xa90030[0 18805 0xa17110] 0]]]] 0xa8fbf0[4 18812 0 0xa8fd50[5 18810 0x95c0b0 0xa8fda0[0xa8fee0[0 18807 0xa8fc80] 0xa8fde0[0xa8fea0[1 18808 1] 0xa8fe20[0xa8fe60[0 18809 0xa16fd0] 0]]]] 0xa8fc40[0 18811 0xa8fc80]]]] 0 0 0 0 0 0]] 0] 0 0xc09430[0xc09470[0xb12540[11 115 116 97 116 101 58 101 114 114 111 114] 0xb124c0[0xb12540 0 127 0 0 0 0 0 0 0 0]] 0] 0 0 0 0 0xc093b0[0xc093f0[0xb12220[11 115 116 97 116 101 58 97 112 112 108 121] 0xb121a0[0xb12220 0 127 0 0 0 0 0 0 0 0]] 0] 0 0 0 0 0 0 0 0 0 0xc09230[0xc09270[0xc09330[10 112 114 105 109 105 116 105 118 101 63] 0xc092b0[0xc09330 0 127 0 0 0 0 0 0 0 0]] 0] 0xc091b0[0xc091f0[0xb14850[27 115 116 97 116 101 58 97 100 100 45 112 114 105 109 105 116 105 118 101 45 115 121 109 98 111 108 115] 0xb13a50[0xb14850 0 127 0xb147d0[0x9c59d0[0x9c5a50[6 115 97 108 105 115 116] 0 127 0 0 0 0 0 0 0 0] 0xb14810[0x963680[0x963700[3 97 99 99] 0 127 0 0 0 0 0 0 0 0] 0]] 0xb13ad0[7 2757 0xb14790[0 2738 0x9c59d0] 0xb14750[0 0] 0xb14710[0 2739 0x963680] 0xb13b30[4 2756 0xb146d0[0xb0cc70[0xb0ccf0[12 102 105 114 115 116 45 115 121 109 98 111 108] 0 127 0 0 0 0 0 0 0 0] 0] 0xb14570[5 2742 0x6b2d50[0x65cb40[8 99 111 110 115 58 99 100 114] 0 127 0x6f91a0[0x957840[0x9578c0[4 99 111 110 115] 0 127 0 0 0 0 0 0 0 0] 0] 0x61e700[5 250 0x957670[0x957950[12 99 111 110 115 58 103 101 116 45 99 100 114] 0 127 0x957910[0x957840 0] 0x9576f0[5 235 0x954460[0x9548c0[10 98 117 102 102 101 114 58 103 101 116] 0 127 0x954840[0x954760 0x954880[0x954640 0]] 0x954530[3 193 0x954460 0x954580[0x954720[0 191 0x954760] 0x9545c0[0x954600[0 192 0x954640] 0]]] 0 0 0x9544e0[2 1 0 0 7] 0 0 0] 0x957740[0x957800[0 233 0x957840] 0x957780[0x9577c0[1 234 1] 0]]] 0
;;; [...] this is just the beginning.  Try it!

There are procedures which walk the symbol:table structure and return a list containing all the symbols bound to procedures or to non-procedures; these are handy to get “global” information about the program state:

(e1:toplevel (state:global-names))
⇒ ;; [huge output]
(e1:toplevel (state:procedure-names))
⇒ ;; [different huge output]

e1:define is just a macro!

Many of the details above aren’t important in practice: you won’t need to remember the field order in a symbol object, or to parse an ε₀ expression data structure at a glance; you certainly don’t need to remember such encodings by heart. But what is important is that all of this can be done, by accessing reflective data as ordinary data structures: in principle you could define a complicated procedure using only buffer:make, buffer:get and buffer:set!. In fact e1:define is nothing magic: it is itself a macro, which can define a procedure by macroexpanding the procedure definition as provided by the user until it reduces to a piece of ε₀ code, than sets some symbol field. The result of macroexpanding an e1:define use will be a large ε₀ expression which when executed builds another ε₀ expression, to be associated to a symbol with buffer:set!.

All of this should explain why no case of ε₀ expressions looks like a global definition: indeed, global definitions aren’t in the core language at all: there’s no need for them if we can mutate symbol fields. By mutating such fields at run time, we can even write self-modifying programs in a natural way: e1:define expressions can occur in any context where another ε₁ expression can occur, even in deeply nested code: differently from most languages ε₁ has no separate “toplevel forms”, but only expressions: and expression syntactically include one another.

As a slightly less trivial example of a global definition, let month->days return the number of days in the given month (as a 0-based index), for a non-leap year. We can use a Lisp-style e1:case form (which compares by identity):

(e1:define (month->days month-index)
  (e1:case month-index
           ((8 3 5 10) ;; 30 days hath September, April, June and November
            30)
           ((1) ;; …excepting February alone…
            28)
           (else ;; …all the rest have thirty-one
            31)))

Of course the e1:case form isn’t in ε₀: rather it’s a macro which gets rewritten into simple ε₀ conditionals:

(meta:print-procedure-definition 'month->days)
⊣ Formals: (month-index)
⊣ [let [_416] be month-index₇₅₃₅₁ in [if _416₇₅₃₅₂ ∈ {8, 3, 5, 10} then 30₇₅₃₅₃ else [if _416₇₅₃₅₄ ∈ {1} then 28₇₅₃₅₅ else 31₇₅₃₅₆]₇₅₃₅₇]₇₅₃₅₈]₇₅₃₅₉

Notice that we’ve never called month->days yet: e1:define performed the expansion once and for all at definition time.

The non-procedure case of e1:define is similar: e1:define macroexpands what the user supplied, obtains an ε₀ expression, evaluates it, and sets (another) symbol field.

S-expressions

We’ve looked at the expansion result, which is to say ε₀ expressions; now, without looking at the expansion process itself, it’s time to focus on the source of this translation: what exactly is it that e1:define (and e1:toplevel as well) eventually rewrites to an ε₀ expression?

The answer is an s-expression: for example (e1:define qwe (fixnum:+ 1 2 3 4)) has to turn, somehow, the s-expression (fixnum:+ 1 2 3 4) into an expression: but (fixnum:+ 1 2 3 4) is just a data structure: in Lisp you would call it a “list”, containing a “symbol” and four “numbers”; in ε₁, to show more clearly that this is a (potentially) syntactic object, different from the lists we saw before, we call it an s-list (pronounced: ess-list). This s-list happens to contain an s-symbol and four s-fixnums.
We say that ε₁ types can be injected into s-expressions: for example fixnums are injected as s-fixnums, booleans as s-booleans, the empty list as the empty s-list, and so on.

If you want to see the expansion of an s-expression into an expression, you can use the Guile procedure meta:macroexpand (it’s a Guile procedure: mind the quote, and don’t use e1:toplevel). You can see, for example, that fixnum:+ can accept any number of arguments in ε₁, but when rewritten in ε₀ fixnum:+ calls become simple (and efficient) two-argument calls, possibly nested: [To do: shall I hint at how this works? Later, I’d say]

(meta:macroexpand '(fixnum:+ 1 2 3 4))
⊣ [call fixnum:+ [call fixnum:+ [call fixnum:+ 1₇₅₃₇₃ 2₇₅₃₇₄]₇₅₃₇₅ 3₇₅₃₇₆]₇₅₃₇₇ 4₇₅₃₇₈]₇₅₃₇₉
(meta:macroexpand '(fixnum:+))
⊣ 0₇₅₃₈₀
(meta:macroexpand '(fixnum:+ 20))
⊣ 20₇₅₃₈₁

This is a fundamental difference between ε and Lisp: ε is not homoiconic; which is to say, in ε the data structures used to represent syntax (s-conses, s-symbols, s-fixnums, the empty s-list) are distinct from expressions: one notable feature of Lisp dialect is that data structures and expressions is the lack of this distinction.

Quoting and quasiquoting, in ε₁, yield an expression evaluating to the given s-expression; after evaluation, the result is the s-expression as supplied. Let’s see:

(e1:toplevel 100)
⇒ 100 ;; 100 as a fixnum
(e1:toplevel '100)
⇒ 0xb44cf8[2 100] ;; 100 as an s-fixnum
(e1:toplevel '120)
⇒ 0xb3f030[2 120] ;; 120 as an s-fixnum

You may already have guessed why an s-fixnum is represented as a boxed object, containing a word which precedes the actual fixnum value: s-expressions are a sum — an open sum, in fact16. 2 happens to be the tag associated to the s-fixnum case of s-expressions. Other types injected into s-expressions have each their own tag:

(e1:toplevel '())
⇒ 0x1a46830[0 0] ;; empty s-list
(e1:toplevel '#t)
⇒ 0xb5b170[1 1]
(e1:toplevel '#f)
⇒ 0x1743368[1 0] ;; all s-booleans have the same tag
(e1:toplevel '"abcdefghijkl")
⇒ 0x1671b68[5 0xb973a8[12 97 98 99 100 101 102 103 104 105 106 107 108]]
(e1:toplevel 'abcdefghijkl)
⇒ 0xb65100[6 0x1701f98[0x1702e40[12 97 98 99 100 101 102 103 104 105 106 107 108] 0 127 0 0 0 0 0 0 0 0]]

We call s-conses the injection of conses, in this case meant in the restricted sense of pairs of s-expressions only:

(e1:toplevel (cons:make 100 200))
⇒ 0x19c6aa8[100 200] ;; not an s-expression
(e1:toplevel '(100 . 200)) ;; an s-cons of two s-fixnums0x91a4a0[3 0x953cb8[0x91dc30[2 100] 0x920d18[2 200]]]

An s-list is either the empty s-list or an s-cons whose right side is an s-list. We call s-car and s-cdr the left-hand side and right-hand side of a cons; by our definition of s-cons, both are always s-expressions.

(e1:toplevel '(1 . (2 . ()))) ;; an s-cons which is also an s-list0x92a1c8[3 0x99ade0[0x9a0a50[2 1] 0x99adc0[3 0x99a3c8[0x9a0a70[2 2] 0x99a3a8[0 0]]]]]
(e1:toplevel '(1 2)) ;; just another way of writing '(1 . (2 . ()))0x916c00[3 0x916be0[0x91aa68[2 1] 0x916bc0[3 0x91aac8[0x91aa88[2 2] 0x91aaa8[0 0]]]]]
(e1:toplevel '(quux 2)) ;; this happens to map to an ε₀ expression…0x16f0aa8[3 0x16f0ab8[0x16f0b68[6 0x164c0f8[0x174f320[4 113 117 117 120] 0 127 0 0 0 0 0 0 0 0]] 0x16f0ac8[3 0x16f0b08[0x16f0b28[2 2] 0x16f0b18[0 0]]]]]
(e1:toplevel '(e1:if)) ;; …this doesn't, but it's still a valid s-expression;; [huge structure: e1:if is bound to a macro, indirectly referring many symbols]

You’ve already seen that the empty s-list has tag 0, s-booleans have tag 1, s-fixnums 2, s-strings 5 and s-symbols 6. But of course you don’t have to remember such trivialities by heart: there are case-querying procedures for injected types. Case-querying procedures work on any s-expression, but of course not on non-s-expression ε data:

(e1:toplevel (sexpression:cons? '0))
⇒ 0 ;; the 0 s-fixnum isn't an s-cons
(e1:toplevel (sexpression:fixnum? '0))
⇒ 1 ;; the 0 s-fixnum is an s-fixnum
(e1:toplevel (sexpression:fixnum? '#f))
⇒ 0 ;; the #f s-boolean isn't an s-fixnum
(e1:toplevel (sexpression:boolean? '#f))
⇒ 1 ;; the #f s-boolean is in fact an s-boolean
(e1:toplevel (sexpression:symbol? 'foobar))
⇒ 1
(e1:toplevel (sexpression:cons? 7))
error→ failure or crash: there's no s-expression tag in 7

There are also selector procedures for s-conses substructures; we call s-car the left element of an s-cons, and s-cdr its right element. Selectors are defined for s-car, s-cdr and their compositions up to length four (for example the s-caddr is the s-car of the s-cdr of the s-cdr). Mutators are also available for destructively changing the s-car and s-cdr of a given s-cons.

(e1:toplevel (sexpression:car '(2 . 3)))
⇒ 0x103e7b8[2 2] ;; 2 as an s-fixnum
(e1:toplevel (sexpression:cdr '(2 . 3)))
⇒ 0x1031178[2 3] ;; 3 as an s-fixnum
(e1:toplevel (sexpression:cdddr '(1 2 3)))
⇒ 0x73a1c0[0 0] ;; empty s-list
(e1:define sc '(1 . 2))
(e1:toplevel sc)
⇒ 0x6ff890[3 0x6ff870[0x6ff830[2 1] 0x6ff850[2 2]]]
(e1:toplevel (sexpression:set-car! sc '5))
(e1:toplevel sc)
⇒ 0x6ff890[3 0x6ff870[0x6ff830[2 5] 0x6ff850[2 2]]] ;; the s-car changed
(e1:toplevel (sexpression:set-cdr! sc sc))
(e1:toplevel sc)
⇒ 0x6ff890[3 0x6ff870[0x6ff830[2 5] 0x6ff890]] ;; circular s-cons: (5 . (5 . (5 . …)))

S-expression constructors are actually injection procedures or injectors, useful for converting ordinary untagged data into s-expressions;

(e1:toplevel (sexpression:inject-boolean #t))
⇒ 0x33d26e0[1 1]
(e1:toplevel (sexpression:inject-fixnum 42))
⇒ 0x33c8020[2 42]
(e1:toplevel (sexpression:inject-symbol (e0:value blah-blah)))
⇒ 0x3405c90[6 0x33de740[0x33de230[9 98 108 97 104 45 98 108 97 104] 0 127 0 0 0 0 0 0 0 0]]

sexpression:cons is a convenience procedure constructing an s-cons holding the two given s-expressions.
It’s your responsibility to ensure that s-conses only contain s-expressions. If you put something else in an s-cons field (for example if you forget an injector call), you’ll obtain a data structure which will likely cause problems later:

(e1:toplevel (sexpression:cons '1 '#f))
⇒ 0x3440d40[3 0x3440dd0[0x3440ef0[2 1] 0x3440e60[1 0]]]
(e1:toplevel (sexpression:cons '1 #f)) ;; say we forget the quote0x33d3660[3 0x33d41a0[0x33d4230[2 1] 0]] ;; not an s-cons: 0 isn't an s-expression 

Of course ordinary ε data don’t carry any type information, so it’s your responsibility to use the correct injector:

(e1:toplevel (sexpression:inject-boolean 0))
⇒ 0x3410340[1 0] ;; a boolean s-expression #f

Ejection procedures or ejectors return the content of a given s-expression. The universal ejector sexpression:eject works on any s-expression case; case-specific ejectors check that the s-expression case is correct, then call e1:error on case mismatch, or behave like the universal ejector otherwise.

(e1:toplevel (sexpression:eject '1))
⇒ 1
(e1:toplevel (sexpression:eject-fixnum '1))
⇒ 1
(e1:toplevel (sexpression:eject '#t))
⇒ 1 ;; just the same!
(e1:toplevel (sexpression:eject-fixnum '#t))
error→ fail or crash

Ejecting an s-cons yields a pair of s-expressions: the pair itself has no tag, but the left and right side are the same objects contained in the original s-cons, tag and all.

(e1:define my-s-cons
  '(() . #f))
(e1:toplevel my-s-cons)
⇒ 0x34503c0[3 0x3450380[0x3450300[0 0] 0x3450340[1 0]]]
(e1:toplevel (sexpression:eject-cons my-s-cons))
⇒ 0x3450380[0x3450300[0 0] 0x3450340[1 0]] ;; same pointers

I’ve already said that quoting expands to an ε₀ expression building the given s-expression as a constant data structure. Now you can understand the expansion of '(1 2):

(meta:macroexpand ''(1 2)) ;; two quotes: you're studying '(1 2)
⊣ [call sexpression:cons [call sexpression:make 2₆₉₀₈₂ 1₆₉₀₈₃]₆₉₀₈₄ [call sexpression:cons [call sexpression:make 2₆₉₀₈₅ 2₆₉₀₈₆]₆₉₀₈₇ [call sexpression:make 0₆₉₀₈₈ 0₆₉₀₈₉]₆₉₀₉₀]₆₉₀₉₁]₆₉₀₉₂

It macroexpands to an ε₀ expression building the s-expression (1 2). We can confirm this by looking at the result of the evaluation of '(1 2) in ε₁:

(e1:toplevel '(1 2))
⇒ 0x346ba10[3 0x346b9d0[0x346b890[2 1] 0x346b990[3 0x346b950[0x346b8d0[2 2] 0x346b910[0 0]]]]]

Indeed it’s the s-expression (1 2), also written as (1 . (2 . ())): an s-cons holding the s-fixnum 1 and another s-cons, holding the s-fixnum 2 and the empty s-list.

Notice that quoting is different from e1:value: for example, '42 yields the s-fixnum 42, while (e1:value 42) yields the fixnum 42:

(e1:toplevel '42)
⇒ 0x819cb30[2 42]
(e1:toplevel (e1:value 42))
⇒ 42

At this point quasiquoting shouldn’t be very surprising to you, even if it’s slighly less convenient than in Lisp because of the need for injectors. For example `(1 ,(sexpression:inject-fixnum 2)) macroexpands to an ε₀ expression building the s-expression (1 2), which can also be written as (1 . (2 . ())):

(e1:toplevel `(1 ,(sexpression:inject-fixnum 2)))
⇒ 0x348be00[3 0x348bdc0[0x348ba40[2 1] 0x348bd00[3 0x348bcc0[0x348bb40[2 2] 0x348bc00[0 0]]]]]

Of course the thing gets more interesting if we use non-constants in the unquoted parts:

(e1:define my-s-boolean '#t)
(e1:toplevel `(1 . ,my-s-boolean))
⇒ 0x7e92380[3 0x7e92340[0x7e921c0[2 1] 0x7e859a0[1 1]]] ;; the s-expression (1 . #t)
(e1:define my-s-list '(#f))
(e1:toplevel my-s-list)
⇒ 0x7edd230[3 0x7edd1f0[0x7edd170[1 0] 0x7edd1b0[0 0]]] ;; the s-expression (#f) or (#f . ())
(e1:toplevel `(,my-s-list))
⇒ 0x7ef9fe0[3 0x7ef9fa0[0x7edd230[3 0x7edd1f0[0x7edd170[1 0] 0x7edd1b0[0 0]]] 0x7ef9ee0[0 0]]] ;; the s-expression ((#f)) or ((#f . ()) . ())
(e1:toplevel `(,@my-s-list)) ;; unquote splicing0x7f070f0[3 0x7f070b0[0x7edd170[1 0] 0x7f06ff0[0 0]]] ;; the s-expression (#f), again

Don’t forget to ensure that your unquoted values are always s-expressions. If you omit the needed injection or quoting, you’ll get a non-s-expression data structure, which will likely to cause problems later:

(e1:toplevel `(,(sexpression:inject-fixnum 1)))
⇒ 0x817e3d0[3 0x817e390[0x817e210[2 1] 0x817e2d0[0 0]]] ;; the s-expression (1)
(e1:toplevel `(,'1))
⇒ 0x818a6d0[3 0x818a690[0x818a510[2 1] 0x818a5d0[0 0]]] ;; the same: '1 yields an s-expression
(e1:toplevel `(,1))
⇒ 0x8195290[3 0x8195250[1 0x8195190[0 0]]] ;; not a valid s-expression: 1 isn't an s-fixnum!

This will be unsurprising for you at this point: quoting and quasiquoting are not magic notations at the core of the language: they are defined in ε₁ as well, as macros. S-expressions are not the core data structure, being built as they are on buffers and fixnums.

It would be possible to build a different personality, completely replacing s-expression-based syntax with something else, possibly using ε₁ for the implementation. I wouldn’t accept such a change in the official version of ε without convincing evidence of some gain in extensibility, but you can certainly attempt that if you want: it’s free software; play with it.

What’s the point of s-expressions?

There are two reasons to have s-expressions in ε₁:

  • representing user syntax
  • selective dynamic typing

The first point is obvious: you use s-expressions for writing ε₁ programs, and — you haven’t seen any definitions yet, but it’s easy to guess — macros to manipulate them. S-expressions are quite a nice way of encoding syntax: in particular they are not ambiguous to parse, they are reasonably compact, and they lend themselves to automatic manipulation. Quasiquoting makes it easy to write macros in which only some sub-structures are computed; s-expression syntax is very regular and easy to learn, even if it looks “different” from conventional languages.

The second point is more interesting, because it shows a fundamental difference between ε and Lisp: in ε objects have no runtime types: 0 may be a fixnum, a sum, a boolean, the empty list; but if you decide to use s-expression for some data structure, then you gain the flexibility of Lisp (or other dynamically-typed languages), only where you want it: s-expression tags are a form of runtime type representation.

As you saw ε’s s-expressions aren’t very efficient: even s-fixnums have to be allocated on the heap, and there is no unboxed case at all. I might decide to change this in the future, ideally by optimizing some more generic pattern in the representation of sums. In any case, making s-expressions more efficient would affect the readability of dumps — which might be still acceptable; Forth, for example, follows an untyped model in which there is really no way of distinguishing a pointer from a non-pointer: even without any notion of boxedness tags, Forthers are happy with the power of their language. They may be right, and destroying our nice colored dumps may be acceptable; I’m currently not strong enough at Forth for judging myself in a competent way.

Speaking of printed representations, s-expressions also have the advantage of being easy to print. The elements of the s-list (0 #f ()), for example, are all printed differently, which may be useful. At the present time ε₁ has no s-expression printer, even if Jérémie Koenig has kindly contributed a nice pretty-printer, that I still have to integrate.

S-expressions are also easy to parse, and I’ll have to add that support as well; I have a working prototype written in another language, to translate by machine into ε₁, or re-implement. Implementing this s-expression frontend is the only step needed before dropping Guile as a dependency and let ε be completely self-hosting. In this case the scanner, which will have to be extensible, is way more complex than the parser: here’s a complete attributed grammar to recognize s-expressions; it’s an LL(1) grammar, which I can easily code by hand as a recursive descent parser like I already did in the prototype.

Just in case it weren’t obvious this is an attributed grammar, not ε₁ code:

s-expression ::=
  atom                { atom }
| ( rest              { rest }
| prefix s-expression { s-cons(lookup(prefix), s-cons(s-expression, ())) }

rest ::=
  )                 { () }
| . s-expression )  { s-expression }
| s-expression rest { s-cons(s-expression, rest) }

Even in ε₁’s current state without a working parser and printer we can still play with s-expressions in a relatively comfortable way. The trick is using the Guile conversion procedures I’ve written to translate between ε₁’s and Guile’s incompatible s-expressions: guile-sexpression->sexpression and sexpression->guile-sexpression.
Since we’re speaking about Guile procedures, we’ll have to call them out of e1:toplevel; and since they are procedures and not macros, we need to quote their arguments when appropriate in Scheme.

(guile-sexpression->sexpression 15)
⇒ 0x52e0f20[2 15] ;; not unlike quoting 15 in ε₁
(e1:toplevel '15)
⇒ 0x52e8800[2 15] ;; …see?
(guile-sexpression->sexpression 'rhfgkjsdfg)
⇒ 0x52f2220[6 0x52ea9d0[0x52ea450[10 114 104 102 103 107 106 115 100 102 103] 0 127 0 0 0 0 0 0 0 0]]
(e1:toplevel 'rhfgkjsdfg)
⇒ 0x52f2220[6 0x52ea9d0[0x52ea450[10 114 104 102 103 107 106 115 100 102 103] 0 127 0 0 0 0 0 0 0 0]]

Data conversion between Guile and ε₁

In a similar vein to s-expression conversion procedures, I also defined conversion procedures for ε₁ data different from s-expressions. For example, guile-sexpression->whatever (I should probably rename it) returns a given Guile datum converted into its ε₁ counterpart, as long as it’s unboxed:

(guile-sexpression->whatever 67)
⇒ 67
(guile-sexpression->whatever #\C)
⇒ 67 ;; big C is Unicode code point 67
(guile-sexpression->whatever #f)
⇒ 0

Converting in the other direction requires a different procedure per Guile s-expression case; and of course the procedure parameter has to be an ε₁ object, which may force us to use e1:toplevel:

(whatever->guile-boolean (e1:toplevel 1))
⇒ #t
(whatever->guile-fixnum (e1:toplevel 1))
⇒ 1

In practice it’s particularly useful to convert symbols between Guile and ε₁, since dumps easily get big for globally-bound symbols:

(symbol->guile-symbol (e1:toplevel (e1:value symbol:table)))
⇒ symbol:table
(guile-symbol->symbol 'aaaaaa)
⇒ 0x536050[0x3271170[6 97 97 97 97 97 97] 0 127 0 0 0 0 0 0 0 0]
(guile-symbol->symbol 'aaaaaa)
⇒ 0x536050[0x3271170[6 97 97 97 97 97 97] 0 127 0 0 0 0 0 0 0 0] ;; the same

Converting lists from ε₁ to Guile is easy if we accept having Guile lists of ε whatevers. If we want to convert the elements as well, we have to use Guile map (called mapcar in other Lisps):

(list->guile-list (e1:toplevel (list:list 1 2 3 4 5)))
⇒ (1 2 3 4 5)
(map whatever->guile-fixnum (list->guile-list (e1:toplevel (list:list 1 2 3 4 5))))
⇒ (1 2 3 4 5)

Going the other way essentially forces use to use some mapping, because we can’t have whatevers referring arbitrary Guile s-expressions — I’ve intentionally forbidden it, because it would not be useful and would make some bugs very difficult to catch at bootstrapping time. So guile-list->list, counter-intuitively, takes two parameters: the Guile list, and an element conversion procedure:

(guile-list->list '(1 2 3) guile-sexpression->whatever)
⇒ 0x3270ad0[1 0x3270a90[2 0x3270a50[3 0]]]

You can see which procedures are available in bootstrap/scheme/conversion.scm.

Of course the very need for this translation between Guile and ε₁ data will go away once I write a proper frontend for ε.
On the other hand the distinction between ε₁ untagged data and ε₁ s-expressions is useful, and will remain. It’s perfectly reasonable to think of making a different personality without such a distinction — essentially a Lisp — but such decisions don’t belong in ε₁, which I want to remain low-level enough to be efficient, and to be extensible into various directions without being committed to limiting choices.

[To do: quoting and quasiquoting: refer e1:value as mentioned before]

More advanced ε₁ topics

[To do: fill this]

Unexec

Let’s define the Fibonacci series, in its naïve exponential-time form typical of micro-benchmarks:

(e1:define (fibo n)
  (e1:if (fixnum:< n 2)
    n
    (fixnum:+ (fibo (fixnum:- n 2))
              (fibo (fixnum:1- n)))))

Check that it works, paying attention to how long this takes to compute:

(e1:toplevel (fibo 32))
⇒ 2178309

Now let’s try something cool. We want to freeze the current global state (including globals and procedures) and unexec it into an image which, when execed, will compute (fibo 32) and print the result followed by a newline. Easy:

(e1:toplevel (e1:unexec "/tmp/u"
               (fixnum:write (fibo 32))
               (string:write "\n")))
⇒ ;; zero results

This was fast: clearly the system didn’t compute (fibo 32) this time. In fact it just created the file /tmp/u, which will perform the slow computation when it’s run, starting from the global state we’ve saved — which, crucially, includes the definition of fibo.

Open a new terminal, without closing the guile+whatever session. From the new terminal enter the epsilon-trunk directory, then try execing:

./bin/epsilon-image-interpreter-smob /tmp/u
⊣ 2178309

This was about as slow as the interactive system above: in fact epsilon-image-interpreter-smob represents ε data structures using Guile SMOBs, just as guile+whatever.

But we can exec the same unexeced image with a more efficient runtime:

./bin/epsilon-image-interpreter-tagged /tmp/u
⊣ 2178309

This was much faster. But we can do even better17:

./bin/epsilon-image-interpreter-untagged /tmp/u
⊣ 2178309

epsilon-image-interpreter-tagged represents boxedness tags, but epsilon-image-interpreter-untagged doesn’t. This implies that when execing an unexeced image with epsilon-image-interpreter-untagged, you can’t unexec again; when execing with epsilon-image-interpreter-tagged (or epsilon-image-interpreter-smob), you can.

Let’s try (back on the interactive system):

(e1:toplevel (e1:unexec "/tmp/u2"
               (string:write "Now unexecing into /tmp/u3\n")
               (e1:unexec "/tmp/u3"
                 (fixnum:write (fibo 32))
                 (string:write "\n"))))
⇒ ;; zero results

Verify that you have no /tmp/u3. Exec /tmp/u2 with epsilon-image-interpreter-tagged, and /tmp/u3 will appear. Exec that, and it will compute fibo. Notice that you can use epsilon-image-interpreter-untagged on /tmp/u3, because that doesn’t need boxedness tags; but if you try running

./bin/epsilon-image-interpreter-untagged /tmp/u2

you’ll get an error.

Unexec limitations

[To do: unexec limitations: external state such as open files, pointer identity]

ε compilers will have the same interface as unexec. They are still a little clumsy to use, so I won’t cover them here. However as soon as the interface matures compiling will feel a lot like unexecing, except that it will be slower at generating, and much faster at executing.

How unexec works

[To do: fill this]

Closures

[To do: fill this]

Imperative loops

[To do: fill this]

Pattern matching

[To do: fill this]

Keyword parameters

[To do: fill this]

Promises

[To do: fill this]

Futures

[To do: fill this]

Explore the source

At this point you should be able to understand essentially all of core.e, in which ε₀ is defined using itself along with reflective structures, and much of epsilon1.scm, in which ε₁ features are defined one by one. You will not understand the reasons behind some apparently gratuitous complications without reading my thesis (they mostly have to do with bootstrapping from Guile), but the intent should be clear.

epsilon1.scm is a case where you can feel very clearly how the language expressivity accelerates: the beginning is very clumsy, but after each extension is defined, it can be used to define the next one; at the end of the file, ε₁ has become pretty powerful.

Even compiler.e is not overly difficult, except for some kludgish solutions in the frontend part.

Conclusion

[To do: You’ve learned... That was actually simple: all the difficulty was in understanding the “big picture” rationale, and in distinguishing ε from Guile, which is just a transient implementation problem. Using ε₁ itself is easy.]

[To do: I can write another tutorial on syntactic abstraction if there is interest; but later. Let me do some development as well]

For more information

[To do: ε₁ doesn’t feel very minimalistic]

You’re welcome to subscribe to epsilon’s mailing list at https://lists.gnu.org/mailman/listinfo/epsilon-devel and ask questions publicly. At the present time there is no -bug or -help mailing list, so epsilon-devel is appropriate for any kind of discussion about epsilon. I may activate other lists if there is more traffic in the future.

[Move this: ε₁’s syntax is encoded by s-expressions, as in Lisp dialects (there is a crucial difference, but we will not see it now), to make syntactic extension easier.]

[Move this: [Maybe: whatevers can only refer other whatevers]]

[Move this: final unexec possibly without GC]

[Move this: means of abstraction, means of combination]

[Move this: Every ε₁ form you saw (everything with the ‘e1:’ prefix) is defined as a macro]

[Move this: Pattern-matching is on sums, tuples and records; not on s-expressions, like Guile’s (ice-9 match) module.]

[Move this: A footnote about addresses and moving GCs: it’s clear that the current GC is not moving, since we’ve assumed that object address don’t “spontaneously” change. This is definitely useful for debugging; however I might relax this constraint in the future. Currently, I do use addresses as keys in associative structures sometimes, which of course is only reliable with non-movable addresses.]

[Move this: e1:define is in ε₁: we don’t strictly need to have it predefined. in fact, we can simulate its behavior.]

[Move this: No error handling in ε₁: you should prevent errors, not recover from them.]

[Move this: No pointer arithmetics]

[Move this: S-expressions are for dynamic typing: just like boxedness tags, you can use them if you want]

[Move this: Show something which fails, near the beginning.]

[Move this: tail calls]

[Move this: Say "detour" instead of "digression"]

[Move this: more data structures: hint at hashes. I also implemented association lists18, and in fact hash buckets are alists.]

[Move this: I’m already requiring a CSS-capable browser, so I can write “λ-calculus” with a real “λ”.]

[Move this: I’ve heard some Forthers call this kind of reflection facility “browsing”. I don’t know how standard this term is.]

— Luca Saiu, 2013-08-23 12:54 (last update: 2023-09-29 17:57)

Tags:
english, epsilon, gnu, guile, hacking, tutorial

Next post Previous post

You might want to go to the main blog index (Atom feedfeeds for every post: Atom 1.0, RSS 2.0) or to my web site https://ageinghacker.net.

[my photo]
Luca Saiu

The opinions I express here are my own and do not necessarily reflect the beliefs or policies of my employer or for that matter of anyone else. In case you felt that the public statement of my thoughts threatened your warm sense of security and your emotional stability feel free to leave at any time.
The system does not support user comments and probably never will. Anyway you can contact me if you want to discuss some topic with me. I might update my posts if you provide interesting insights.

You might be interested in my web site https://ageinghacker.net.


Copyright © 2009, 2011-2014, 2017, 2018, 2021-2024 Luca Saiu
Verbatim copying and redistribution of this entire page are permitted in any medium without royalties, provided this notice is preserved.
This page was generated by
trivialblog. trivialblog is free software, available under the GNU GPL.
Tag icon copyright information is available in this file.


Footnotes

(1)

See http://www.colorforth.com/cf.htm. Mr Moore was writing about conventional software as compared to colorForth, his minimal Forth dialect. ε and colorForth are very different systems, but they share much in terms of philosophy and rationale. I believe both to be local optima in the language design space, following McCarthy’s intuition of http://www-formal.stanford.edu/jmc/lisp20th.html.

(2)

Some research operating systems are based on micro-kernels on which several different libraries libraries emulate several different operating system APIs, at the same time: each library is called a “personality”.

(3)

As an example of a similar initiative, the SRFI effort (http://srfi.schemers.org/) in the Scheme community is working well: that may be the closest thing yet to Steele’s vision of a maintainer coordinating users’ improvements. And it’s very loose as an organization, which is good.

(4)

The first one, from 2001, was only my second compiler. My early attempts look a little embarrassing now.

(5)

See http://www.gnu.org/software/guile/manual/html_node/Defining-New-Types-_0028Smobs_0029.html. My implementation of the whatever “type” (static typing proponents are already throwing a fit at this point) is very inefficient; even sticking to SMOBs, there are better solutions than mine. I stuck with something simple; there’s no point in optimizing now, because I’ll soon get rid of the “whatever” runtime.

(6)

If you search for my name in recent Usenet messages, you’ll find a cool sub-project of mine, still in the planning stage.

(7)

In some cases using the macro may not actually be necessary, but recognizing all such cases requires knowledge of the implementation. Always using e1:toplevel at the top level for ε₁ forms is never an error.

(8)

This syntax description is very informal: body forms may be zero or more, in either case. A more correct definition should actually use dot notation: since parameters are zero or more and body expressions are zero or more, we can express the two cases as (e1:define x . expressions) and (e1:define (x . formals) . expressions). I’m trying to be friendly to less expert Lispers, avoiding dot notation in this first introduction.

(9)

I’ll not cover multiple results here.

(10)

Before functional programming fundamentalists jump at my throat: I’m not saying that it’s a good idea to use mutation everywhere; I like mostly functional programming myself. But in some cases, imperative mutation provides just the right kind of modularity, as shown for example in SICP: http://mitpress.mit.edu/sicp/full-text/sicp/book/node53.html.
I have some experience of purely functional programming: I implemented an entire canonical LR(1) parser generator in the second implementation of ε, which was purely functional. It was very painful to write, and the result didn’t come out particularly beautiful, simple or more maintainable. It was definitely longer than an alternative with assignments could’ve been.

Back to the main point, competent programmers don’t need to be protected from themselves. If you want a bondage language, I don’t think you’ll like ε₀ and ε₁.

(11)

Scheme’s SRFI 111 specifies a feature with essentially the same interface, even if not based on exposed untyped buffers: http://srfi.schemers.org/srfi-111/. Boxes are also similar to ML’s refs, of course without the type restrictions. I’ve used a different name because I don’t like unnecessary abbreviations, and the word “box” is just as long as ref.

(12)

According to the specifications, Scheme systems aren’t technically forced to implement eq? as an equality by identity. But I don’t want to be anal, and we all know that in practice eq? just compares two words. As for ε, by talking about data structures at a low level with pointers, buffers and fixnums, I can afford the luxury of stating what really happens in the implemented system.

(13)

I often use the name “cons” to speak of generic pairs, as it’s common in the Lisp jargon. In the ML and Haskell communities, “conses” are just used in lists. But since there’s absolutely no difference between the two kinds of pairs in ε₁, I adopt the Lisp convention.

(14)

I can also support older or less conventional machines such as VAXen or the target of my cool sub-project, which violate this rule: The idea is to have the heap start at some fixed address higher than 0: every number lower than the minimum will always be a non-pointer for our purposes, even independently of boxedness tags.

(15)

The current implementation of local variable is naïve, and the code will actually run slower after this change. However, that’s hardly the point. I will make the implementation efficient in the future.

(16)

Only conceptually for the time being, for reasons of convenience while bootstrapping. But the implementation will probably change soon.

(17)

epsilon-image-interpreter-tagged can be easily optimized, and made nearly as fast as epsilon-image-interpreter-untagged.

(18)

But their implementation will change in the future: they can be more efficient, and efficiency is important when a lot of elements are alive — this is the case of many hash tables, containing a large number of (small) buckets.