************************ register setup

Ucore has:
  8 general purpose registers with 16 bit width (r0-r7)
  1 t bit
  1 stack pointer with 16 bit width
  1 external memory register with 32 bit width
  1 program counter with 16 bit width
  1 multiply result high register with 16 bit width

************************ memory map

In the standard configuration ucore has an internal memory with a size of
16384 x 16 bit words. This internal memory is used for code and data.
The control 'memory' is mapped after the internal memory. The control memory
can only be used for data (rqld/ld/st) and not for code.
External memory has the full 32 bit range (depends on implementation).

$0000   code / data memory
  *     16384 * 16 bit
  *
  *
$3fff
$4000   code / data memory MIRROR
  *
  *
  *
$7fff
$8000   ctrl memory              StackPointer starts at $8000 on reset
  *     XXXX * 16 bit
  *
  *
$XXXX

************************ register access

The ucore has a 6 stage pipeline:
  1. prepare fetch
  2. fetch
  3. decode
  4. read reg
  5. execute
  6. write reg

There is no hazard (dirty register) detection and no stall between register
accesses. So you should read a register no earlier than 2 instructions after
the instruction that writes it. Otherwise you have a register hazard (the
assembler will give a warning).

Example:
        movei   r0,$ae          ;
        nop                     ;write r0 = $ae
        addi    r0,1            ;r0 = r0 ($ae) + 1

Normally you should put a nop or another instruction in between in this case.

Example (not optimized - 6 cycles):
        movei   r0,$ae
        nop
        add     r0,r5
        movei   r1,$ef
        nop
        mul     r1,r6

Example (optimized - 4 cycles):
        movei   r0,$ae
        movei   r1,$ef
        add     r0,r5
        mul     r1,r6

************************ Branch

A (conditional) branch on ucore always takes 1 cycle, whether it is taken or
not. The disadvantage is that the branch latency must either be covered in
hardware or exposed as delay slots. Ucore uses four delay slots after each
branch instruction.

Note: The delay slot instructions are placed after the branch instruction,
but they are executed before the branch takes effect!
So the shortest possible loop has 5 instructions.

Example:
loop    br      loop
        nop                     ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot

In most cases you can fill the delay slots with real code and don't have to
use nop.

Example 1 (not optimized - 8 cycles):
        movei   r0,1
        movei   r1,7
        gpci    r7,2            ;next 4 + X
        br      drawPixel
        nop                     ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot

Example 1 (optimized - 6 cycles):
        gpci    r7,2
        br      drawPixel
        movei   r0,1            ;delay slot
        movei   r1,7            ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot

Example 2 (not optimized - 8 * n + 3 cycles -> for n = 128 -> total 1027 cycles):
        movei   r0,127
        movei   r1,0
        nop
loop    st      r1,r6
        addi    r1,1
        subi    r0,1
        brts    loop
        nop                     ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot

Example 2 (optimized - 6 * n + 2 cycles -> for n = 128 -> total 770 cycles):
        movei   r0,127
        movei   r1,0
loop    subi    r0,1            ;not used in delay slot (irq)
        brts    loop
        st      r1,r6           ;delay slot
        addi    r1,1            ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot

************************ Interrupts

Ucore does not handle interrupts in the classic way, because that would
increase the hardware cost (pipeline save, ...). Instead, ucore injects a
branch instruction when an interrupt occurs and interrupts are enabled.
The advantage of this method is that no special hardware design is needed.
The disadvantage is that the latency between the interrupt request and the
execution of the interrupt handler is not easy to calculate.

Example:
        sei                     ;irq enable
        br      anySubroutine   ;if irq is enabled and an interrupt occurs, a branch
                                ;to the irq vector is taken; after the irq handler
                                ;has finished (rte), the given branch is taken
        nop                     ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot

Enabled interrupts also affect the delay slots of a branch instruction, so you
should not use load/store or conditions across delay slots.

Example:
        sei
        br      anySubroutine
        addt    r0,r1           ;delay slot (t is not safe in a delay slot if irq is
                                ;enabled, because t is only saved at the br instruction)
        nop                     ;delay slot
        nop                     ;delay slot
        nop                     ;delay slot

************************ Request load and load

rqld/ld can only access internal memory.
There are two cycles of delay after rqld before the value is available.

Example:
        rqld    r0,0            ;send load to address (r0+0)
        nop                     ;address is sent to memory
        nop                     ;data is read
        ld      r1              ;read data goes to r1

It is possible to issue several rqld in a row (pipelined), so that each load
effectively costs two cycles.

Example:
        rqld    r0,0
        rqld    r0,1
        rqld    r0,2
        ld      r1              ;r1 <- internal_mem[r0+0]
        ld      r2              ;r2 <- internal_mem[r0+1]
        ld      r3              ;r3 <- internal_mem[r0+2]

Important Note: Request load combinations only work correctly within the same
memory block (code/data or ctrl memory). Mixing rqld between memory blocks
does not work (dynamic chip select, last one wins).

Example:
        movei   r2,$0
        movei   r1,$0           ;code/data memory
        moveih  r2,$20          ;ctrl memory
        rqld    r1,0            ;request from code/data memory
        rqld    r2,0            ;request from ctrl memory
        nop
        ld      r3              ;FAILS to read, because rqld r2,0 switched to ctrl
                                ;memory (here ctrl memory is read at address 0)
        ld      r4

************************ Request pop and pop

The behavior is the same as for request load and load.

Example (1 pop):
        push    r4
        ....
        rqpop
        nop
        nop
        pop     r4

Example (3 pop):
        push    r1
        push    r2
        push    r3
        ....
        rqpop
        rqpop
        rqpop
        pop     r3
        pop     r2
        pop     r1

Important Note: Same issue as with rqld/ld, but I think nobody will use
rqpop/pop on ctrl memory.

************************ External memory access

External memory access uses a 32 bit address. This address is shared by load
and store. The external address is set with the esadr instruction.

Example:
        esadr   r1,r0           ;external address is r1:r0

External memory access also uses the request load and load model. The
difference to internal memory access is that there is no defined number of
wait cycles. So if the data is not yet available at an external load, the
pipeline is stalled.

Example:
        esadr   r1,r0           ;external address is r1:r0
        erqld   0               ;load from [r1:r0 + 0]
        eld     r7              ;get external data (stalls while data is not available)

To increase performance you should insert some instructions (ones that are
needed anyway and do not depend on the loaded data) between erqld and eld to
prevent or reduce stalls.

Storing data to external memory only stalls if the outgoing FIFO is full.
So if the receiver (the store target) is fast enough, est will not stall.

Example (write 8 words to external memory; if the receiver can handle one word
per cycle, no stall occurs):
        movei   r0,0
        esadr   r7,r6           ;external address is r7:r6
        est     r0,0
        est     r0,1
        est     r0,2
        est     r0,3
        est     r0,4
        est     r0,5
        est     r0,6
        est     r0,7

Example (hide the latency between erqld and eld):
        esadr   r7,r6
        erqld   0
        erqld   1
        erqld   2
        erqld   3               ;request 4 words
        ....                    ;do something to let the data arrive in the receive fifo
        eld     r0
        eld     r1
        eld     r2
        eld     r3              ;get data (no stall if data is already received)
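
The effect of putting work between erqld and eld can be illustrated with a
small cycle model. This is only a sketch in Python and not part of ucore or
its tools. It assumes that every instruction takes one cycle unless it stalls,
that data requested at cycle c arrives at cycle c + latency, and that data
comes back in request order. The function name simulate, the op names and the
latency value of 10 are hypothetical; the real latency is not defined and
depends on the implementation.

```python
from collections import deque


def simulate(program, latency):
    """Return (total_cycles, stall_cycles) for a list of ops.

    Ops are "erqld" (request a word), "eld" (fetch the oldest requested
    word, stalling until it has arrived) and "work" (any other instruction).
    The timing rules are assumptions made for this sketch only.
    """
    cycle = 0
    stalls = 0
    arrivals = deque()  # arrival cycle of each outstanding request, in order
    for op in program:
        if op == "erqld":
            arrivals.append(cycle + latency)
        elif op == "eld":
            arrival = arrivals.popleft()
            if arrival > cycle:
                # Data not there yet: the pipeline waits until it arrives.
                stalls += arrival - cycle
                cycle = arrival
        elif op != "work":
            raise ValueError("unknown op: " + op)
        cycle += 1  # the instruction itself takes one cycle
    return cycle, stalls


LATENCY = 10  # hypothetical external memory latency in cycles

paired = ["erqld", "eld"] * 4                             # request, then load at once
batched = ["erqld"] * 4 + ["eld"] * 4                     # 4 requests, then 4 loads
overlapped = ["erqld"] * 4 + ["work"] * 6 + ["eld"] * 4   # useful work in between

for name, prog in (("paired", paired), ("batched", batched),
                   ("overlapped", overlapped)):
    total, stalled = simulate(prog, LATENCY)
    print(name, "total cycles:", total, "stall cycles:", stalled)
```

With these assumed numbers the paired version needs 44 cycles (36 of them
stalls), the batched version 14 cycles (6 stalls), and the overlapped version
also 14 cycles but without any stall, because the 6 waiting cycles are spent
on useful instructions.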