“ARM Assembly Adventures” Dennis Chen and Michael Bocamazo
What did you do?
We “investigated” optimization of the accuracy of fixed point math functions on the ARM CPU family.
Computer representations of numbers are never entirely accurate, they just get “close enough”, since we only have a finite amount of space, or bits, to represent numbers in. Similarly, mathematical functions that we often call in programs without questioning, like cos(x) or x/y, really use algorithms to return an answer that is close enough to the value we're looking for. In our project, we implemented some of these mathematical functions ourselves to understand them better. Ideally we would have then optimized the accuracy of these mathematical functions by tweaking constants that are involved in the algorithms, but for this project, we reached for the proverbial moon, our arms weren’t long enough, shot ourselves in the face, and fell down a very tall staircase headfirst. One could say we overscoped.
More specifically, for this project we chose a specific numerical representation (fixed point math), and a specific architecture (ARM) to work with. This limited us to using certain algorithms, and to writing our code using a instruction set that is specific to ARM CPUs. In the absence of tools to run and evaluate example ARM assembly programs, we implemented several basic ARM instructions in python so we could accurately translate between ARM and python and run python code that emulated an ARM CPU. ARM assembly instructions like add,subtract,multiply,udiv,lshift,etc… were implemented in python. Afterwards, we used these instructions as building blocks to write an IQ30 fixed point division function and evaluated it's accuracy against python's own native floating point division.
After successfully doing the above, we then attempted to implement an IQ29 arctan function in our frankenstein python/ARM, but everything came to a screeching halt when we realized that the actual ARM assembly code we were using as a reference used looked up tables and stored values that we could not find. Without these values we couldn't know if our arctan function was even implemented correctly, so the original ambitious goals of machine learning were scrapped, probably much to Eric's amusement.
Alternatively, we could have bootstrapped with optimization from just randomized initial guesses in the look up table, and used some of MATLAB’s non-gradient based optimizers. However, we were unsure of the correctness of our implementation (given that it was impossible to run without the table) and in the final hours focused on making the work we have done as meaningful as possible as an introduction to ARM.
What we hoped to replicate: http://upload.wikimedia.org/wikipedia/commons/f/f7/Atan2.png
Why did you do it?
The impetus for doing numerical optimization and function approximation was Eric’s statement that genetic algorithms worked well enough as a non-gradient based optimizer for many function approximations that relied on look up table values, or parameterizations of truncated Taylor series. We wanted to see if we could demonstrate the feasibility of using other optimization approaches in the context of function approximation. For this reason, we wanted to recreate Eric’s code to get fidelity on the execution and comparison. Because the original source was lost to the sands of time, we used a disassembler to recreate the source from the compiled version (this meant we lost any comments available). The source was in the unfamiliar ARM (Little Endian, v7) architecture, which meant we had to familiarize ourselves with an instruction set with additional features (Persistent condition flags! Conditional Execution! Variable operand length!). This was an educational task in itself, and in doing so, we produced something of (questionable) educational value for students looking to understand more about different assembly systems, or hacky ways to do integer math with fidelity in python.
How did you do it? Our github repo: https://github.com/dennis-chen/arm-assembly-simulator
First, we obtained Eric’s original ARM Stellaris IQ Math Standalone package from Texas Instruments: http://www.ti.com/tool/sw-iqmath&DCMP=STELLARIS&
Then, we installed a disassembler to recreate the source from this version. We installed the demo version of the hexrays disassembler. https://www.hex-rays.com/products/ida/support/download_demo.shtml (To enable auto comments, go to options→ general → auto comments. Disregard the strange icon.) However, this does not seem to include the table of constants used for function approximation, or in our foolishness we did not see them, so this approach is currently not complete.
The first function that we wanted to implement for introduction was the signed fractional division function. Integer division (UDIV) is already implemented as an instruction in ARM, but that cannot be used for fractional division. Instead, an algorithm that shifts the dividend to significance and iteratively evaluates the significance of the remainder of the dividend is used to achieve a “screaming nasty fast” implementation of fractional division. We translated the ARM into pseudo code, using the ARM documentation (http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0041c/Babbfdih.html#), the ARM wiki (http://www.heyrick.co.uk/armwiki/Main_Page), and some notes pages that were marginally helpful for the main ideas of division, but were not identical to this implementation (http://opencourseware.kfupm.edu.sa/colleges/ccse/coe/coe308/files%5C2-Lecture_Notes_05-MultiplyDivide.pdf, http://www.ann.ece.ufl.edu/courses/eel4713_12spr/slides/Lec8-division.pdf). From this pseudo code, we wrote a python implementation using fixed-point arithmetic. We then tested its accuracy against Python’s floating point division, and it was within reasonable accuracy (below 1e-8 of abs((actual-expected)/expected), so our implementation was confirmed in its correctness.
Then, we did the same procedure for the atan2 function. atan2 differs from atan of one argument in that it is able to return a value between -pi and pi instead of just -pi/2 and pi/2 because it evaluates both arguments, rather than just their ratio. atan2 would be a suitable function for attempting optimization as an introduction to it. We then translated the ARM code to our set of python functions as we had before, but we soon realized that we were missing a piece - the lookup tables originally used. Then, we decided to transition to making our code do something like emulation of a program, or translation of ARM assembly into python-executable functions that can form into a program. This would become what is essentially a really bad approximation of what MARS is for ARM. Some such program probably already exists, however, it was useful to create it.
Because the theme of this project is meant to be communication, and an introduction ARM has some educational value, we chose to write a short introduction to the ARM assembly instruction set for those familiar with MIPS, which could become a useful reference for the course, or a time savings for similar projects in the future.
Our program translator is contained in ‘simulator.py’, and you can simply run python simulator.py my_arm_file to see the outputs of the program. We had to stop working on it given the time crunch, so all it currently does is parse most of the assembly into a series of python functions that will be called, but haven’t been implemented yet. The jump is additionally not yet working, but we hope to extend this into a program simulator (with a basic and very limited instruction set) that actually works for expo because it seems feasible.
How can someone else build on it?
To build on this code, someone could extend the functionality of our translator/emulator/simulator, or make contributions to the introduction to ARM. Because our work became the foundation we needed for what we wanted to do with optimization, tackling the original optimization problem would be a good way to build on what we have done so far.
A Short Introduction to ARM From the perspective of those familiar with MIPS
Conditional Execution The biggest difference that is immediately apparent is the conditional execution flags and blocks. The ARM architecture retains the values of four flags that allow for a variety of comparisons to be made, and have much more compression than other assembly languages that require many conditional branching instructions. The “IT” or If-Then blocks are much more intuitive to programmers of high level languages, because not every condition requires a branch, which seems like it would accumulate to progressively messier systems of GOTO statements. Instead, an the conditional execution instruction starts the block as “IT”, “ITT” or “ITTT”, depending on the number of following instructions in the block.
There are four base conditional flags: Z = Zero, C = Carryout, V = Overflow, and N = Negative. Many instructions are able to set these base flags directly, such as “TEQ” (test if equal) or “CMP” (compare), but in many instructions, an “S” flag may be appended to set the flags if necessary. From the 4 base flags, there are 14 flags that are constructed that the IT instruction can call upon. The include all comparisons: >, <, >=, ⇐, with and without signed representation. Some of them, such as CC (Carry Clear: C == 0) and MI (Minus, negative, N == 1) encode the same arithmetic meaning but are used in different situations based on the instruction that set them. HI, unsigned higher, means greater than or unordered, C == 1 and Z == 1. LT: Signed Less than (Less than, or unordered) checks if N != V. So the flags are somewhat hidden from the programmer, but the conditional branching is just as flexible as the many possible branching conditions seen in MIPS. The setting instruction could be “CMP R0, R1” then the leading instruction of the block “ITTT HI”, with “MOVHI R2, R0”, “MOVHI R0, R1”, “MOVHI R1, R2” following. This translates to: If R0 > R1, swap the values of R0 and R1, which is done through the intermediate of R2. It is possible that the ITTT instruction increments the program counter by the appropriate number of instructions past the conditional block if the condition is not met, in which case it would save a few instructions on each miss. Conditional branches, such as BEQ, do not require a preceding IT instruction, and simply examine the condition flag. Often, the IT blocks and conditional flags can be nicely simplified away in a Python translation by creating an If block. Sometimes, however, the flag to be evaluated is dependent on a value that changes before the conditional execution, in which case it is necessary to reorder the instructions, or store the value temporarily as a placeholder.
Word Length You may notice funny .W or .N after each instruction. W means “Wide” and N means “Narrow”. These hard code the word lengths at 32 and 16 bits respectively.
Exchange of Instruction Set “An ARM for a THUMB makes the whole world’s computers execute faster(?)”. You may notice BX, BXEQ, or similar. Branch-Exchange means the the instruction sets used switch on this instruction between ARM and THUMB. From cursory research, THUMB is a more compressed instruction set that can execute faster if done correctly (probably involving a lot of hand-optimization of algorithms) but if done incorrectly can be slower because sometimes multiple instructions are required to replicate what can be done in ARM. We don’t have enough knowledge to comment on the differences beyond that.
Shifting and Rotation Instructions to shift (logical or arithmetic) can be placed in the arguments of a instruction as in “ADD.W R3, R3, R2, LSL#2” which means R3 = R3 + R2«2. In addition logical or arithmetic shifting left and right, there is also Rotation, which allows variable bits to be shifted in. RRX - Rotate Right with Extend - shifts in the value of the carry flag to the MSB and shifts right.
Interesting Arithmetic There are a few interesting arithmetic operators not in MIPS. Some of these are RSB: Reverse subtract, which amounts to negating a variable and adding it to an immediate. “RSB.W R0, R0, #0x20000000” is R0 = #0x20000000-R0. There is a very useful operator for division, MLS, the multiply-subtract. Of the form: MLS Rd, Rn, Rm, Ra; operates: Rd = Ra - Rn*Rm. When doing fractional division, this allows you to find the expected remainder given the current estimate of the quotient, as Remainder = Dividend - Quotient*Divisor. This is the key instruction for understanding the division algorithm (that we’ve translated, at least. There are probably others).
Conclusion: Using the structure of conditional execution, one can readily convert pseudocode into a structure that is suitable for coding into ARM. The If-Then blocks translate directly, and if condition execution requires more than three instructions, or has multiple levels of flow control, a branch is probably appropriate. However, this allows most simple programs to reserve branching for actual function calls rather than flow-control blocks. There are many instructions not included in MIPS, such as shifting and arithmetic functions, that allow for better hand-optimization of programs, and can even make code more readable. (It looks like) ARM is actually used in industry, while MIPS is mostly educational, so familiarity with ARM could be valuable and enriching.