Wednesday 14 March 2018

Tcl/Tk vs Trampolines

This is a note on a problem I encountered while working on the House Management example domain in my ColdFrame project.
TL;DR: If you write a Tcl callback function in Ada, declare it at library level.
The failure occurred in a Tcl/Tk-based demo program, which displays a cutaway view of a baronial mansion in Michigan with "lamps" on the landings, controlled by "buttons", where the interaction with the Ada application is via mouse clicks on the "buttons", which cause Tcl callbacks.
The last time I'm sure this worked was in 2014. Now (March 2018), with a new macOS (High Sierra) and a much more recent Ada compiler (GCC 7.1.0) but the same release of Tcl/Tk (the macOS-provided one, release 8.5), the image is displayed but the program crashes without ever getting to the Ada code at all (variously Storage_Error, SIGSEGV, SIGBUS, SIGILL).
macOS provides debug versions of the Tcl, Tk libraries, but only as dynamic libraries, named for example Tcl_debug instead of plain Tcl. You can tell the dynamic loader dyld to load the debug version by setting the symbol DYLD_IMAGE_SUFFIX:
DYLD_IMAGE_SUFFIX=_debug ./stairwell_demo
but this didn't help: there seems to be a macOS issue with GDB and dynamic libraries, and GDB couldn't display anything useful in the stack trace.
I tried GCC 8.0.1 (prerelease): same problem.
I tried ActiveTCL (currently version 8.6); same problem.
I tried macOS Sierra : same problem.
I tried on Debian Jessie with GNAT GPL 2017: same problem.
Huh.
At this point, I thought to try building a static debug version of the Tcl library myself, and downloaded Tcl 8.5.19 from Sourceforge (hoping that the system Tk, 8.5.9, would work with this; it did). I built it, using
$ cd unix/
$ ./configure --disable-shared --enable-symbols --enable-threads
$ make
and installed just libtcl8.5.a and tcl8.5/... in /usr/local/lib, not wanting to complicate my life with possibly incompatible executables in /usr/local/bin.
Now, link with
   Linker_Options :=
     (
      "-ltk8.5",
      "/usr/local/lib/libtcl8.5.a",
      "-framework", "CoreFoundation"
     );
(where the Core Foundation framework would previously have been automatically loaded by the dynamic version of Tcl) and run: now the debugger stops at the signal and can show something useful:
Program received signal SIGSEGV, Segmentation fault.
0x00007ffeefbff731 in ?? ()
(gdb) bt
#0  0x00007ffeefbff731 in ?? ()
#1  0x000000010007a3f6 in TclInvokeStringCommand (clientData=0x101060710,
    interp=0x10203b010, objc=3, objv=0x100810ee0)
    at /Users/simon/tmp/tcl8.5.19/unix/../generic/tclBasic.c:2197
#2  0x000000010007c390 in TclEvalObjvInternal (interp=0x10203b010, objc=3,
    objv=0x100810ee0, command=0x7ffeefbff0a8 "pushButton 0 1["00"]",
    length=14, flags=0)
    at /Users/simon/tmp/tcl8.5.19/unix/../generic/tclBasic.c:3734
Looking up a frame, we see
(gdb) up
#1  0x000000010007a3f6 in TclInvokeStringCommand (clientData=0x101060710,
    interp=0x10203b010, objc=3, objv=0x100810ee0)
    at /Users/simon/tmp/tcl8.5.19/unix/../generic/tclBasic.c:2197
2197     result = (*cmdPtr->proc)(cmdPtr->clientData, interp, objc, argv);
(gdb) p cmdPtr->proc
$1 = (Tcl_CmdProc *) 0x7ffeefbff731
so the program is calling what should be the callback for the pushButton 0 1 command at address 0x00007ffeefbff731 and crashing.
This address is clearly not the address of the Ada Push_Button_Command function, but instead it's on the stack. (I did wonder why it's odd, but it's an x86_64 processor so instructions don't have to be aligned). The code there starts
(gdb) disass 0x7ffeefbff731,0x7ffeefbff750
Dump of assembler code from 0x7ffeefbff731 to 0x7ffeefbff750:
   0x00007ffeefbff731: idivl  0x7ffeef(%rdi)
   0x00007ffeefbff737: add    %al,0x10(%rax)
   0x00007ffeefbff73a: add    %al,(%rax)
   0x00007ffeefbff73c: add    %eax,(%rax)
   0x00007ffeefbff73e: add    %al,(%rax)
   0x00007ffeefbff740: jo     0x7ffeefbff739
   0x00007ffeefbff742: mov    $0x7ffeef,%edi
   0x00007ffeefbff747: add    %al,(%rax)
   0x00007ffeefbff749: add    %al,(%rax)
   0x00007ffeefbff74b: add    (%rcx),%al
   0x00007ffeefbff74d: add    %al,(%rax)
   0x00007ffeefbff74f: add    %al,(%rax)
It's actually the address of a trampoline. In the GNAT Reference Manual it says
[A trampoline] is a structure that is built on the stack and contains dynamic code to be executed at run time. On some targets, a trampoline is built for the following features: Access, Unrestricted_Access, or Address of a nested subprogram; nested task bodies; primitive operations of nested tagged types. Trampolines do not work on machines that prevent execution of stack data. For example, on windows systems, enabling DEP (data execution protection) will cause trampolines to raise an exception. Trampolines are also quite slow at run time.
On many targets, trampolines have been largely eliminated. Look at the version of system.ads for your target - if it has Always_Compatible_Rep equal to False, then trampolines are largely eliminated. In particular, a trampoline is built for the following features: Address of a nested subprogram; Access or Unrestricted_Access of a nested subprogram, but only if pragma Favor_Top_Level applies, or the access type has a foreign-language convention; primitive operations of nested tagged types.
and, although Always_Compatible_Rep is indeed equal to False, and I've never come across pragma Favor_Top_Level, the function concerned is (a) nested, and (b) called via Tcl and so has a foreign language convention.
Rewriting the application so as to put the callback code into a library-level package turns out to mean that trampolines are not required (in spite of the GRM note above):
Breakpoint 2, stairwell_demo_application__push_button_command (clientdata=0,
    interp=0x102042210, argc=3, argv=0x10204fc50)
    at /Users/simon/coldframe/examples/House_Management.impl/stairwell_demo_application.adb:72
72    function Push_Button_Command
(gdb) bt
#0  stairwell_demo_application__push_button_command (clientdata=0,
    interp=0x102042210, argc=3, argv=0x10204fc50)
    at /Users/simon/coldframe/examples/House_Management.impl/stairwell_demo_application.adb:72
#1  0x000000010007b266 in TclInvokeStringCommand (clientData=0x100853310,
    interp=0x102042210, objc=3, objv=0x10204fae0)
    at /Users/simon/tmp/tcl8.5.19/unix/../generic/tclBasic.c:2197
#2  0x000000010007d200 in TclEvalObjvInternal (interp=0x102042210, objc=3,
    objv=0x10204fae0, command=0x7ffeefbff118 "pushButton 0 1["00"]",
    length=14, flags=0)
    at /Users/simon/tmp/tcl8.5.19/unix/../generic/tclBasic.c:3734
and
(gdb) up
#1  0x000000010007b266 in TclInvokeStringCommand (clientData=0x100853310,
    interp=0x102042210, objc=3, objv=0x10204fae0)
    at /Users/simon/tmp/tcl8.5.19/unix/../generic/tclBasic.c:2197
2197     result = (*cmdPtr->proc)(cmdPtr->clientData, interp, objc, argv);
(gdb) p cmdPtr->proc
$1 = (Tcl_CmdProc *) 0x100027caa <stairwell_demo_application__push_button_command>
Hey presto!

(Afterthought: I set
pragma Restrictions (No_Implicit_Dynamic_Code);
on the new package spec, and there were no errors.)























1 comment:

  1. Worth noting that disabling DEP should allow detecting this gotcha. I remember, decades ago, Robert Dewar talking about the use of trampolines in GNAT. This was during the Ada 95 requirements development. I thought about suggesting a no trampolines requirement, but figured it was implicit--there were CPUs even then where executing code on the stack was next to impossible. (Some CPUs had no easy way to clean up only the L1I cache so it would fetch the new code. Completely clearing the cache is painfully slow.)

    ReplyDelete