From: ·········@gmail.com
Subject: ( picoVerse-:( I want to generate machine code ) )
Date: 
Message-ID: <1172024764.815683.84140@v45g2000cwv.googlegroups.com>
I want to generate machine code and stick it into a byte vector
( ByteArray in Smalltalk ) and then pass it to something to have it
evaluated.  I looked into the Intel op code manuals but I find it
confusing.  It looks something like :

<opCode><ModR/M><SIB>

where each one of these things takes up a Byte.

When they write the contents of the opCode etc they index the bits
from right to left starting at 0 and going to 7.  bit 0 is the least
significant bit and bit 7 is the most significant.

Now,

If you apply this same right to left addressing scheme to the 3 bytes
above then you get that <SIB> should be at a lower address than the
<opCode>.  But this can't be right because the CPU uses the opCode to
decide what comes next, doesn't it?  The IP instruction pointer is
increasing in value generally right?  Not decreasing( excluding
jumps ).  So the <opCode> should be at a lower address than the <SIB>
which should be at a higher address.  The CPU sees the opCode and uses
it to decide how many bytes might come next.  right?  So the addresses
of the bits are increasing from right to left within the bytes but the
addresses of the bytes are increasing from left to right.  It's bass
ackwards and no wonder they don't explain it if it's true.  Who cares
if it's unusable just so it looks good.  You're just supposed to
know.  or something.

Now suppose the opCode is 3 bytes long.  And is listed in the manual
in hex as 0F 3A 78 .  So which of these three bytes comes first in
RAM?  Which one has the lowest address?  It seems like we are supposed
to assume that 0F comes first and has the lowest address with 78
coming last and having the highest address of the three bytes.  Again
the bit addresses are increasing from right to left within the bytes
but the byte addresses are increasing from left to right.  It would be
nice if everything went left to right or right to left so things were
consistent.

But if you did that then perhaps you would have to write 78 3A 0F if
byte addresses were going to increase from right to left same as the
bit addresses within the bytes.  But then the actual bits should be
reversed too shouldn't they?  So you would get I don't know what:

87 A3 F0

no that's not right either.  Is it.  If all the bits are reversed you
get:

0F 3A 78 = 00001111  00111010  01111000
so backwards:
00011110  01011100  11110000
which is
1E 5C F0

I know that there are many languages that produce machine code.  Where
do they get their information about how to do it?   In Smalltalk or
Lisp I want to do it like:


( label := ( c move:( Double )into:( EAX )from:( EBX asAddress ) )
asAddress )
. . . . .
( c move:( Double )into:( EAX asAddress )from:( EBX ) )
( c jumpOnZeroTo:( label ) )
( c move:( Double )into:( EAX )from:( EBP asBase + 3 ) )


etc.  In other words it's assembler but it's written out more self
documenting.  Like in Smalltalk.  Or Lisp.  Where c could be either a
CPU simulator or a machine code generator depending on if you were
debugging or not.

So where can I get help about how to generate the machine code?  Which
byte goes where in RAM?  How do I read these Intel manuals?  etc.  Is
there a forum for this somewhere?  I have searched on the web and for
books but I have come up largely dry.

-Kjell

From: Wolfram Fenske
Subject: Re: ( picoVerse-:( I want to generate machine code ) )
Date: 
Message-ID: <1172077199.387356.31390@p10g2000cwp.googlegroups.com>
·········@gmail.com writes:

> I want to generate machine code and stick it into a byte vector
> ( ByteArray in Smalltalk ) and then pass it to something to have it
> evaluated.  I looked into the Intel op code manuals but I find it
> confusing.  It looks something like :
>
> <opCode><Mod/rn><SIB>
>
> where each one of these things takes up a Byte.
>
> When they write the contents of the opCode etc they index the bits
> from right to left starting at 0 and going to 7.  bit 0 is the least
> significant bit and bit 7 is the most significant.
>
> Now,
>
> If you apply this same right to left addressing scheme to the 3 bytes
> above

You don't.  The bytes are ordered from left to right: the leftmost
byte in the manual goes at the lowest address.  However, if there's a
multi-byte integer constant in the instruction, say, 0x11223344 (44
being the least significant byte), it will look like "0x44 0x33 0x22
0x11" in memory on an x86 machine (little-endian order).
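
To make the byte order concrete, here is a minimal C sketch of writing
an opcode followed by a 32-bit immediate into a byte vector.  The
function names are made up for illustration.  The opcode 0x68 for
"push imm32" is the one that shows up in the disassembly further down
(68 11 01 00 00).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value at dst[0..3], least significant byte first.
   This is the order x86 expects for immediates and displacements.
   Returns the number of bytes written (always 4). */
static size_t put_u32_le(uint8_t *dst, uint32_t v)
{
    assert(dst != NULL);
    for (size_t i = 0; i < 4; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
    return 4;
}

/* Encode "push $imm32": the opcode byte goes at the lowest address,
   the immediate follows it.  Returns the number of bytes written. */
size_t encode_push_imm32(uint8_t *dst, uint32_t imm)
{
    assert(dst != NULL);
    dst[0] = 0x68;
    return 1 + put_u32_le(dst + 1, imm);
}
```

Note that nothing is reversed by hand here: the shift-and-mask loop
picks out the bytes by significance, and the index into dst decides
the address.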

[...]

> Now suppose the opCode is 3 bytes long.  And is listed in the manual
> in hex as 0F 3A 78 .  So which of these three bytes comes first in
> RAM?

0x0F

[...]

> But if you did that then perhaps you would have to write 78 3A 0F if
> byte addresses were going to increase from right to left same as the
> bit addresses within the bytes.  But then the actual bits should be
> reversed too shouldn't they?  So you would get I don't know what:
>
> 87 A3 F0
>
> no that's not right either.  Is it.

No, it isn't.  You don't reverse the octets or the bits.  You're
making it harder than it is.  Assuming that the bit and byte order of
the machine where you create the instruction vector are the same as on
the machine where you execute it, this is not a problem.
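
The bit numbers in the manual only say which bits of a byte's value
hold which field; they are not addresses.  A rough C sketch of packing
the ModR/M byte with shifts follows.  The function and enum names are
my own, not Intel's.  The example instruction is the "mov
%eax,0xffffffe0(%ebp)" from the disassembly further down, which
objdump shows as 89 45 e0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register numbers as used in the reg and rm fields (32-bit registers). */
enum { EAX = 0, ECX = 1, EDX = 2, EBX = 3, ESP = 4, EBP = 5, ESI = 6, EDI = 7 };

/* Pack the ModR/M fields: mod in bits 7..6, reg in bits 5..3,
   rm in bits 2..0. */
uint8_t modrm(unsigned mod, unsigned reg, unsigned rm)
{
    assert(mod < 4 && reg < 8 && rm < 8);
    return (uint8_t)((mod << 6) | (reg << 3) | rm);
}

/* Encode "mov %reg,disp8(%base)": opcode 0x89, a ModR/M byte with
   mod = 01, then the one-byte displacement.  base == ESP is rejected
   because that form needs an extra SIB byte.
   Returns the number of bytes written. */
size_t encode_store_disp8(uint8_t *dst, unsigned reg, unsigned base, int8_t disp)
{
    assert(dst != NULL && base != ESP);
    dst[0] = 0x89;                /* lowest address: the opcode */
    dst[1] = modrm(1, reg, base); /* next: the ModR/M byte */
    dst[2] = (uint8_t)disp;       /* last: the displacement */
    return 3;
}
```

Calling encode_store_disp8(buf, EAX, EBP, -0x20) writes 89 45 e0 into
buf, in that order.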

[...]

> I know that there are many languages that produce machine code.  Where
> do they get their information about how to do it?

Someone hard-coded it into the compiler?

> In Smalltalk or Lisp I want to do it like:
>
>
> ( label := ( c move:( Double )into:( EAX )from:( EBX asAddress ) )
> asAddress )
> . . . . .
> ( c move:( Double )into:( EAX asAddress )from:( EBX ) )
> ( c jumpOnZeroTo:( label ) )
> ( c move:( Double )into:( EAX )from:( EBP asBase + 3 ) )
>
>
> etc.  In other words it's assembler but it's written out more self
> documenting.

If you're actually going to program in that syntax, I predict your
fingers will be bloody stumps before you ever finish "Hello World!".
There's a reason why assembler syntax is so terse: you have to write
such a lot of it.  If you want to make your life easier I suggest you
program support for control constructs (if, for, ...), function calls
and other idioms.  E. g. you write (call 'foo eax ebx 5) and it
expands into

--8<---------------cut here---------------start------------->8---
  pushl $0x5
  push  %ebx
  push  %eax
  call  (int)&foo - [address of the instruction following this call]
  add   $0xc,%esp
--8<---------------cut here---------------end--------------->8---
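
As a sketch only, the expansion above could be emitted into a byte
vector like this in C.  The function name and the "base" parameter
(the address the vector will run at) are assumptions for illustration.
The operand of "call" is relative to the address of the instruction
that follows the call.  The push and add encodings match what objdump
prints elsewhere in this thread; 0xE8 is the opcode for a near call
with a 32-bit relative offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store v at dst[0..3], least significant byte first; returns 4. */
static size_t put_u32_le(uint8_t *dst, uint32_t v)
{
    for (size_t i = 0; i < 4; i++)
        dst[i] = (uint8_t)(v >> (8 * i));
    return 4;
}

/* Emit: push $imm ; push %ebx ; push %eax ; call target ; add $0xc,%esp
   code   - output buffer, at least 15 bytes
   base   - address at which code[0] will execute (hypothetical)
   target - absolute address of the function to call
   Returns the number of bytes written. */
size_t emit_call3(uint8_t *code, uint32_t base, uint32_t target, uint32_t imm)
{
    size_t n = 0;
    assert(code != NULL);

    code[n++] = 0x68;                    /* push $imm32 */
    n += put_u32_le(code + n, imm);
    code[n++] = 0x50 + 3;                /* push %ebx (0x50 + register number) */
    code[n++] = 0x50 + 0;                /* push %eax */

    code[n++] = 0xE8;                    /* call rel32 */
    uint32_t next = base + (uint32_t)n + 4;  /* address right after the call */
    n += put_u32_le(code + n, target - next);

    code[n++] = 0x83;                    /* add $0xc,%esp: drop the 3 arguments */
    code[n++] = 0xC4;
    code[n++] = 0x0C;
    return n;
}
```

The subtraction is done on unsigned 32-bit values, so a call to a lower
address wraps around to the right two's complement offset.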

> So where can I get help about how to generate the machine code?

I'm currently working on a just-in-time compiler.  I was also trying
to learn something about this stuff but I didn't find a lot of useful
information.  Maybe I wasn't looking hard enough.  So what I did was
write dummy programs in C and use Gnu's "objdump" to disassemble the
output [1] and see which byte sequences were generated.  Inline
assembler in C is also helpful.  Say you want to find out how to
compare two doubles.  You could write this:

--8<---------------cut here---------------start------------->8---
  asm("pushl $0x111;\n""addl $0x4,%%esp;\n"
      :		// output
      :		// input
      :"%eax","%ebx","%ecx","%edx","%edi","%esi" // clobbered registers
      );

  res = double_value < 3.0;

  asm("pushl $0x222;\n""addl $0x4,%%esp;\n"
      :		// output
      :		// input
      :"%eax","%ebx","%ecx","%edx","%edi","%esi" // clobbered registers
      );
--8<---------------cut here---------------end--------------->8---

In the disassembly you look for the marker 0x111 (a byte sequence that
usually won't be generated by anything else) and there you are:

--8<---------------cut here---------------start------------->8---
68 11 01 00 00          push   $0x111           ; our start marker
83 c4 04                add    $0x4,%esp        ; (end of marker)
dd 45 d0                fldl   0xffffffd0(%ebp) ; load double_value
dd 05 b8 89 04 08       fldl   0x80489b8        ; load 3.0
da e9                   fucompp                 ; compare
df e0                   fnstsw %ax              ; store the FPU
                                                ; status word in %ax
f6 c4 45                test   $0x45,%ah        ; test the status word
0f 94 c0                sete   %al              ; set %al to the value
                                                ; of the ZERO flag
0f b6 c0                movzbl %al,%eax         ; convert %al to an
                                                ; int
89 45 e0                mov    %eax,0xffffffe0(%ebp) ; store %eax in res
68 22 02 00 00          push   $0x222           ; our end marker
83 c4 04                add    $0x4,%esp
--8<---------------cut here---------------end--------------->8---
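
If you have the raw bytes rather than a disassembly listing, the same
marker trick can be automated.  Below is a rough C sketch that looks
for "push $start; add $0x4,%esp" and the following "push $end" and
reports the bytes in between.  All names are hypothetical.  It assumes
the markers are encoded exactly as in the listing above (68 followed
by the little-endian tag, then 83 c4 04).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the offset of the first occurrence of pat in buf, or -1. */
static long find_bytes(const uint8_t *buf, size_t len,
                       const uint8_t *pat, size_t plen)
{
    if (plen == 0 || plen > len)
        return -1;
    for (size_t i = 0; i + plen <= len; i++)
        if (memcmp(buf + i, pat, plen) == 0)
            return (long)i;
    return -1;
}

/* Write the 5-byte encoding of "push $tag" to out. */
static void make_push(uint8_t *out, uint32_t tag)
{
    out[0] = 0x68;
    for (size_t i = 0; i < 4; i++)
        out[1 + i] = (uint8_t)(tag >> (8 * i));
}

/* Find the code between the start marker and the end marker.
   On success stores the payload's offset and length and returns 1;
   returns 0 if either marker is missing. */
int between_markers(const uint8_t *buf, size_t len,
                    uint32_t start_tag, uint32_t end_tag,
                    size_t *off, size_t *n)
{
    uint8_t start[8] = { 0, 0, 0, 0, 0, 0x83, 0xC4, 0x04 }; /* push; add $4,%esp */
    uint8_t end[5];
    assert(buf != NULL && off != NULL && n != NULL);
    make_push(start, start_tag);
    make_push(end, end_tag);

    long s = find_bytes(buf, len, start, sizeof start);
    if (s < 0)
        return 0;
    size_t body = (size_t)s + sizeof start;
    long e = find_bytes(buf + body, len - body, end, sizeof end);
    if (e < 0)
        return 0;
    *off = body;
    *n = (size_t)e;
    return 1;
}
```

Run over the bytes of the listing above with tags 0x111 and 0x222,
this reports the 25 bytes from "dd 45 d0" through "89 45 e0".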

> Which byte goes where in RAM?  How do I read these Intel manuals?
> etc.

I don't know.  I didn't find these manuals too helpful, either.

> Is there a forum for this somewhere?

Perhaps comp.compilers or some assembler discussion group.
comp.lang.lisp isn't the right place.


Footnotes:
[1]  It might be confusing at first that the Gnu tools don't use
     Intel's assembler syntax, but AT&T syntax, which puts the source
     operand first and the destination last.  So "mov eax,ecx" in
     Intel syntax will become "mov %ecx,%eax".  You'll get used to
     it.

--
Wolfram Fenske

A: Yes.
>Q: Are you sure?
>>A: Because it reverses the logical flow of conversation.
>>>Q: Why is top posting frowned upon?
From: Vassil Nikolov
Subject: [off-topic] examining the C compiler's output (Ex: ( picoVerse-:( I want to generate machine code ) ))
Date: 
Message-ID: <yy8vejoid8z8.fsf_-_@eskimo.com>
On 21 Feb 2007 08:59:59 -0800, "Wolfram Fenske" <·····@gmx.net> said:
| ...
| write dummy programs in C and use Gnu's "objdump" to disassemble the
| output

  But why not simply use the C compiler's -S switch?

  ---Vassil.


-- 
Our programs do not have bugs; it is just that the users' expectations
differ from the way they are implemented.
From: Wolfram Fenske
Subject: Re: examining the C compiler's output (Ex: ( picoVerse-:( I want to generate machine code ) ))
Date: 
Message-ID: <1172145252.067714.82730@s48g2000cws.googlegroups.com>
On Feb 22, 7:03 am, Vassil Nikolov <···············@pobox.com> wrote:
> On 21 Feb 2007 08:59:59 -0800, "Wolfram Fenske" <····@gmx.net> said:
> | ...
> | write dummy programs in C and use Gnu's "objdump" to disassemble the
> | output
>
>   But why not simply use the C compiler's -S switch?

Thanks for reminding me.  Yes, "gcc -S" is also helpful, but it
doesn't tell you which byte sequences are generated from the assembly
statements.  Using "objdump -d" you can see that e. g. "push %edi" is
0x57, "push %esi" is 0x56.  For a just-in-time compiler you need to
know this, assuming you write everything yourself [1].


Wolfram

Footnotes:
[1]  One could also use libraries like VCODE or Gnu Lightning, but I
     don't think this is what the OP had in mind.
From: Vassil Nikolov
Subject: Re: examining the C compiler's output (Ex: ( picoVerse-:( I want to generate machine code ) ))
Date: 
Message-ID: <yy8vlkipy3fw.fsf@eskimo.com>
On 22 Feb 2007 03:54:12 -0800, "Wolfram Fenske" <·····@gmx.net> said:
| Thanks for reminding me.  Yes, "gcc -S" is also helpful, but it
| doesn't tell you which byte sequences are generated from the assembly
| statements.  Using "objdump -d" you can see that e. g. "push %edi" is
| 0x57, "push %esi" is 0x56.  For a just-in-time compiler you need to
| know this, assuming you write everything yourself

  I've never found myself in such a situation, but even then, would
  running "as" with a suitable listing option on the -S output yield
  the same information while preserving (useful) comments generated by
  the C compiler?

  ---Vassil.


From: Wolfram Fenske
Subject: Re: examining the C compiler's output (Ex: ( picoVerse-:( I want to generate machine code ) ))
Date: 
Message-ID: <1172205045.100661.180770@q2g2000cwa.googlegroups.com>
On Feb 23, 4:10 am, Vassil Nikolov <···············@pobox.com> wrote:
> On 22 Feb 2007 03:54:12 -0800, "Wolfram Fenske" <····@gmx.net> said:
> | Thanks for reminding me.  Yes, "gcc -S" is also helpful, but it
> | doesn't tell you which byte sequences are generated from the assembly
> | statements.  Using "objdump -d" you can see that e. g. "push %edi" is
> | 0x57, "push %esi" is 0x56.  For a just-in-time compiler you need to
> | know this, assuming you write everything yourself
>
>   I've never found myself in such a situation, but even then, would
>   running as with a suitable listing option on -S output yield the same
>   information while preserving (useful) comments generated by the C
>   compiler?

Probably.  I have to say, I'm not really an assembly language
programmer and I've never used gcc's -S switch or "as" before.  Right
now I know all the opcodes I need to know.  I'll remember to try your
suggestion next time I want to know which instruction generates which
byte sequence.


Wolfram