Index: generic/asm/README
===================================================================
diff -u
--- generic/asm/README	(revision 0)
+++ generic/asm/README	(revision f5d6cc8d42ce2199f8185f6b3617f82be805a76b)
@@ -0,0 +1,240 @@
+
+The current implementation of the Tcl-byte-code engine is built around
+a large switch statement (the technique is called switch-threading).
+It is actually quite hard to understand and to de-mangle, since there
+are many goto-operations for optimization, etc. There exists several
+other implementation techniques for execution engines coming from the
+field of direct threaded code. In particular i invested into
+differences between switch-threading, call-threading and label
+threading. implementation mechanisms around, namely call-threading and
+label-threading (direct threading). 
+
+First of all, threaded.c is a short introductory example, showing in a
+small concise program the differences of these techniques.  On my
+machine i got the following results:
+
+1: 0 seconds, 94591 usec
+2: 0 seconds, 84749 usec
+3: 0 seconds, 59811 usec
+
+[1] switch threading
+[2] call threading
+[3] label threading 
+
+This experiment shows that the implementation of switch-threading 
+is the slowest, followed by call threading and label threading. 
+However, the implementation of label-threading depends on a 
+non-standard C extension (label address operator) needed for 
+computed gotos. However, the label address operator is supported 
+by at least three different compilers (supported on GCC, 
+clang and IBM's XL C/C++).
+
+Based on this study i built an experimental new code-engine for tcl
+(integrated in nsf/nx) based on the following ideas:
+
+- Use Tcl's basic implementation elements as base elements of the
+  execution-engine: Tcl_Obj, Tcl_ObjCmdProc (objProcs).
+
+- Instead of defining a "byte code", implement the code essentially as
+  an array of objProcs, were as well the argument vectors of the
+  objProcs are built as far as possible at compile time. This makes it
+  possible to implement quickly an execution engine for all of Tcl.
+
+- By using objProcs as basic mechanism, every tcl-command
+  implementation will become effectively an "instruction" of the code
+  engine. By defining new objProcs, the command set of the engine will
+  be extended automatically. Specialized instructions with the same
+  format can be registered depending on syntactic properties of the
+  source (e.g. not all argument test will be required, part of the
+  syntax errors can be caught at compile time, on could implement
+  specialized versions for Tcl's "set" for setting and reading
+  variables, etc.).
+
+- The code engine is designed for allowing different resolving
+  strategies for cmds (objProcs): come commands could be resolved at
+  compile-time, some commands must be resolved at runtime. Similar
+  experiments can be made in the object oriented case for objects and
+  methods. It is not completely clear, if the binding time flexibility
+  should be an option for a production version of the execution
+  engine, since this would cause differences in the semantics of Tcl
+  (when commands are redefined). However, this option provides some
+  good evidence for potential speedups.
+
+- The current byte code engine performs double dispatches on object
+  oriented calls (one dispatch on the object, a second dispatch on the
+  method). The new execution engine allows direct dispatch on methods.
+
+- Exploit typing information: NSF/NX provides "optionally typing"
+  information (e.g. x:int). This optional typing information can be
+  used at runtime as well for optimizing the code.
+
+- Provide an intermediate representation ("Tcl assembler") instead of
+  "hard wiring" instructions to Tcl commands as implemented in
+  Tcl. The Tcl assembler makes it easier to implement high-level
+  optimizations to provide different implementations for e.g. a
+  sequence of tcl instructions. Note that a Tcl compiler (a program
+  translating Tcl source into Tcl assembler) can be implemented in Tcl
+  as well. Furthermore, for some high-performance requirements, it is
+  possible to simply write Tcl assembly code by hand (this is at least
+  an alternative to C-coding).  Note that the Tcl assembler is not
+  machine dependent, the instructions are those of the "abstract Tcl
+  machine". Currently, the syntax of the Tcl assembler is a nested Tcl
+  list.
+
+- As explained above, label-threading is the fastest implementation
+  technique of the compared approaches, but it is not supported by all
+  c-compilers. Implemented execution engines are sufficiently
+  different to make maintenance of two (or more) execution engines
+  hard. In order to address this issue, i implemented an assembly code
+  generator and an execution code generator in NX from a common,
+  declarative source. The assembly code and the execution engines can
+  be included in the source. Depending on the C-compiler, the "right"
+  include can be chosen. Note that it is certainly possible to
+  implement some more different (optimized) execution engines as well.
+
+- Replacing for Tcl the complete byte-code engine is quite a large
+  task. In order to experiment with the new execution engines i
+  implemented asm-proc and asm-method as counterparts of "proc" and
+  "method", which will take in their body the tcl assembly
+  language. When such a command is called, the appropriate execution
+  engine is called. This way, the new execution engine can co-exist
+  with Tcl's classical byte-code engine, it should not be as well
+  quite easy to follow this approach in jimtcl.
+
+- In cases where the new tcl-compilation and assembly generation
+  fails, one could fall back to the basic implementation (e.g. tcl
+  byte code engine).
+  
+Some preliminary results:
+
+  - If one executes just the objProcs (e.g. for "set" Tcl_SetObjCmd),
+    the old code engine is faster, but the new ones "work" as well. It
+    is certainly possible to provide new optimized instructions for
+    certain tasks (accessing/setting variables, increment operations
+    etc.) using the same format.
+
+  - Using such instructions, both new execution engines (for
+    call-threading and label-threading) show to be much faster than
+    Tcl's current bytecode engine.
+
+  - The current implementation based on call-threading is for the
+    example below up to about three times faster than the
+    Tcl-implementation, the label-threading variant is about 5 times
+    faster.
+
+Current shortcomings / work in progress: 
+
+  - The current implementation is not re-entrant (no stack
+    integration, proc local variables are currently stored in the
+    assembly proc structure, not on the stack).
+
+  - For allowing "uplevels" and "upvars" and compiled locals, we will
+    need in Tcl the proc call frames, which will certainly cost some
+    performance. For high-performance situations, these frames could
+    be made optional (these are not available if one would code the
+    procs straightforward in C). 
+
+    Note that the example below with the for loop and incr, reentrancy
+    and stack frames are not required/needed at all (since the code
+    does neither call a proc (or eval) and the code is not recursive -
+    it is not clear, how often such optimization would be possible).
+
+  - The instruction set is just sufficient for the about 30 examples
+    in my current regression test set. The instruction set is just
+    experimental, and not sufficiently thoughtfully designed.
+
+  - The assembly language is very primitive: one has a set of "slot"
+    declarations which are constants or variables. Later references
+    address these slots via their indices (starting with
+    0). Similarly, the instructions are numbered as well (the code is
+    a C array of instructions). A "goto" simply addresses the next
+    instruction via the array index. All "declarations" of local vars
+    have to be written before the "instructions".
+
+  - It is not clear, how much Tcl compatibility is needed or wanted
+    (interface to compiled locals, variable-/call-traces, etc.)
+
+  - No effort so far on compiling Tcl source into Tcl assembly.
+
+Below is a short example in Tcl (and the Tcl byte code engine) and in
+Tcl assembly in two implementation variants, one with Tcl_Objs, one
+with integers (in case the typing information is available or
+inferable from the Tcl source). 
+
+The timing results based on clang under Mac OS X:
+
+Call Threading:
+asm/t.001:	6.64
+asm/t.002:	4.04
+asm/t.003:	2.46
+
+Label Threading:
+asm/t.001:	6.58
+asm/t.002:	2.74
+asm/t.003:	1.33
+
+I'll push the version over the holidays to our public git repository.
+
+All the best
+-gustaf
+
+
+==================================================
+package req nx::test
+nx::Test parameter count 100000
+
+proc sum10.tcl {} {
+  set sum 0
+  for {set i 0} {$i < 100} {incr i} {
+    incr sum $i
+  }
+  return $sum
+}
+
+# Implementation in Tcl assembly, using tcl-objs for "sum", "i" and
+# the constants
+
+nsf::asm::proc sum10.asm1 {} {
+  {obj sum}
+  {obj i}
+  {obj 0}
+  {obj 1}
+  {obj 100}
+  {obj 0}
+  {var obj 0} 
+  {var obj 1}              
+  {duplicateObj slot 6 obj 2}  
+  {duplicateObj slot 7 obj 5}  
+  {leIntObj slot 4 slot 7}
+  {jumpTrue instruction 7}
+  {incrObj slot 6 slot 7} 
+  {incrObj slot 7 slot 3} 
+  {jump instruction 2}
+  {setResult slot 6}
+}
+
+# Implementation in Tcl assembly, using integers for "sum", "i" and
+# the constants
+
+nsf::asm::proc sum10.asm2 {} {
+  {obj sum}
+  {obj i}
+  {integer int 1}
+  {integer int 100}
+  {integer int 0}
+  {integer int 0}
+  {setInt slot 4 int 0}
+  {setInt slot 5 int 0}
+  {leInt slot 3 slot 5}
+  {jumpTrue instruction 7}
+  {incrInt slot 4 slot 5} 
+  {incrInt slot 5 slot 2} 
+  {jump instruction 2}
+  {setResultInt slot 4}
+}
+
+? {sum10.tcl} "4950"
+? {sum10.asm1} "4950"
+? {sum10.asm2} "4950"
+
+