************************ register setup

Ucore has:

8 general purpose registers with 16 bit width (r0-r7)
1 t bit
1 stack pointer with 16 bit width
1 external memory register with 32 bit width 
1 program counter with 16 bit width
1 multiply result high register with 16 bit width


************************ memory map

In the standard configuration ucore has an internal memory with a size of 16384 x 16 bit words.
This internal memory is used for code and data.

The control 'memory' is mapped after the internal memory.
The control memory can only be used for data (rqld/ld/st), not for code.

External memory has the full 32 bit address range (depends on implementation).


$0000 code / data memory
*     16384 * 16 bit
*
*
$3fff
$4000 code / data memory MIRROR
*     
*
*
$7fff
$8000 ctrl memory			stack pointer starts at $8000 on reset
*     XXXX * 16 bit
*
*
$XXXX


************************ register access 

The ucore has a 6 stage pipeline:

	1. prepare fetch
	2. fetch
	3. decode
	4. read reg
	5. execute
	6. write reg

There is no dirty-register (hazard) detection and no stall between register accesses.
So a written register should be read no earlier than 2 instructions after the write. Otherwise you get a register hazard (the assembler will give a warning).

Example:

	movei	r0,$ae	; 
	nop		;write r0 = ae
	addi	r0,1	;r0 = r0 ($ae) + 1
	
Normally you should put a nop or another independent instruction in between.
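
Example (register hazard, a sketch of what to avoid; that the old value of r0 is read is an assumption based on the missing hazard detection):

	movei	r0,$ae
	addi	r0,1	;hazard: r0 is read before $ae is written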

Example (not optimized - 6 cycles):

	movei	r0,$ae
	nop
	add	r0,r5
	movei	r1,$ef
	nop
	mul	r1,r6
	
Example (optimized - 4 cycles):

	movei	r0,$ae
	movei	r1,$ef
	add	r0,r5
	mul	r1,r6
	
************************ Branch

A (conditional) branch on ucore is always resolved in 1 cycle, taken or not taken. The price for this is that the branch latency must either be hidden in hardware or exposed as delay slots. Ucore uses four
delay slots after each branch instruction. Note: these instructions are placed after the branch instruction, but they execute before the branch takes effect!

So the shortest possible loop has 5 instructions.

Example:

loop	br	loop
	nop	;delay slot
	nop	;delay slot
	nop	;delay slot
	nop	;delay slot
	
In most cases you can fill the delay slots with real code and do not have to use nop.

Example 1 (not optimized - 8 cycles):

	movei	r0,1
	movei	r1,7
	gpci	r7,2	;next 4 + X
	br	drawPixel
	nop	;delay slot
	nop	;delay slot
	nop	;delay slot
	nop	;delay slot
	
Example 1 (optimized - 6 cycles):

	gpci	r7,2	
	br	drawPixel
	movei	r0,1	;delay slot
	movei	r1,7	;delay slot
	nop		;delay slot
	nop		;delay slot

Example 2 (not optimized - 8 * n + 3 cycles -> for n = 128: 1027 cycles in total):

	movei	r0,127
	movei	r1,0
	nop
	
loop	st	r1,r6
	addi	r1,1
	subi	r0,1
	brts	loop
	nop	;delay slot
	nop	;delay slot
	nop	;delay slot
	nop	;delay slot
	
Example 2 (optimized - 6 * n + 2 cycles -> for n = 128: 770 cycles in total):

	movei	r0,127
	movei	r1,0
	
loop	subi	r0,1	;not placed in a delay slot (irq)
	brts	loop
	st	r1,r6	;delay slot
	addi	r1,1	;delay slot	
	nop		;delay slot
	nop		;delay slot	
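
The cycle counts in the two Example 2 headings can be reproduced by counting instructions. This assumes one cycle per instruction, which matches the counts given for Example 1:

	not optimized:	3 setup + n * (4 loop + 4 delay slot)	= 8 * n + 3	-> n = 128: 1027
	optimized:	2 setup + n * (2 loop + 4 delay slot)	= 6 * n + 2	-> n = 128: 770
	saved:		(8 * n + 3) - (6 * n + 2)		= 2 * n + 1	-> n = 128: 257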
	
************************ Interrupts 	

Ucore does not handle interrupts in the classic way, because that would increase the hardware cost (pipeline save, ...).
Instead, ucore injects a branch instruction when an interrupt occurs and interrupts are enabled.
The advantage of this method is that no special hardware design is needed.
The disadvantage is that the latency between an interrupt and the execution of its handler is not easy to calculate.

Example:
	
	sei			;irq enable
	br	anySubroutine	;if an irq is enabled and occurs, a branch to the irq vector is taken first; after the irq has finished (rte) the given branch is taken
	nop	;delay slot
	nop	;delay slot
	nop	;delay slot
	nop	;delay slot
	

Enabled interrupts also affect the delay slots of a branch instruction, so you should not rely on load/store sequences or on the condition bit (t) across delay slots.
	
Example:
	
	sei
	br	anySubroutine
	addt	r0,r1	;delay slot	(t is not safe in a delay slot if irq is enabled, because t is only stored at the br instruction)
	nop		;delay slot
	nop		;delay slot
	nop		;delay slot
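
Example (a possible irq-safe variant, shown only as a sketch; it assumes that moving addt in front of the branch does not change the intended result, and it follows the same pattern as subi/brts in Example 2 of the Branch section):

	sei
	addt	r0,r1	;t is updated before the br instruction, where t is stored
	br	anySubroutine
	nop		;delay slot
	nop		;delay slot
	nop		;delay slot
	nop		;delay slot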
	
	
************************ Request load and load

rqld/ld can only access internal memory.
There is a delay of two cycles after rqld before the value can be fetched with ld.

Example:

	rqld	r0,0	;request load from address (r0+0)
	nop		;request is sent to memory
	nop		;data is read
	ld	r1	;read data goes to r1
	
It is possible to issue several rqld back to back (pipelined), so that one load effectively costs two cycles (one rqld and one ld).

Example:

	rqld	r0,0
	rqld	r0,1
	rqld	r0,2
	ld	r1	;r1 <- internal_mem[r0+0]
	ld	r2	;r2 <- internal_mem[r0+1]
	ld	r3	;r3 <- internal_mem[r0+2]
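
Example (two pipelined loads, a sketch: the second rqld fills one of the two wait cycles; the other is filled with addi r4,1, which stands for any instruction that is assumed to be independent of r0-r2 and of memory):

	rqld	r0,0
	rqld	r0,1
	addi	r4,1	;independent work instead of nop (assumption)
	ld	r1	;r1 <- internal_mem[r0+0]
	ld	r2	;r2 <- internal_mem[r0+1]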
	
Important Note:

Combinations of request loads only work correctly within the same memory block (code/data or ctrl memory).
Mixing rqld between memory blocks does not work (dynamic chip select, the last one wins).

Example:

	movei	r2,$0
	movei	r1,$0	;code/data memory
	moveih	r2,$20	;ctrl memory
	
	rqld	r1,0	;request from code/data memory
	rqld	r2,0	;request ctrl memory
	nop
	ld	r3	;FAILS to read, because rqld r2,0 has switched to ctrl memory (here ctrl memory is read at address 0)
	ld	r4
	
************************ Request pop and pop

The behavior is the same as for request load and load.

Example (1 pop):

	push	r4
	
	....
	
	rqpop
	nop
	nop
	pop	r4
	
Example (3 pop):

	push	r1
	push	r2
	push	r3
	
	....
	
	rqpop
	rqpop
	rqpop
	pop	r3
	pop	r2
	pop	r1
	
Important Note:

Same issue as with rqld/ld, but I think nobody uses rqpop/pop on ctrl memory.
	
************************ External memory access

External memory access uses a 32 bit address. This address is shared by load and store.
The external address is set with the esadr instruction.

Example:

	esadr	r1,r0	;external address is r1:r0 
	
External memory access also uses the request load and load model. The difference to internal memory access is that there is no defined number of wait cycles.
So if the data is not yet available at an external load (eld), the pipeline is stalled.

Example:

	esadr	r1,r0	;external address is r1:r0
	erqld	0	;load from [r1:r0 + 0]
	eld	r7	;get external data (stalls while data is not available)
	
To increase performance you should insert some instructions (ones that are needed anyway and do not depend on the loaded data) between erqld and eld to prevent or reduce stalls.
Stores to external memory only stall if the outgoing fifo is full. So if the receiver (the store target) is fast enough, est will not stall.

Example (write 8 words to external memory; if the receiver can handle one word per cycle, no stall occurs):

	movei	r0,0
	esadr	r7,r6	;external address is r7:r6
	est	r0,0
	est	r0,1
	est	r0,2
	est	r0,3
	est	r0,4
	est	r0,5
	est	r0,6
	est	r0,7
	
Example (hide the latency between erqld and eld):

	esadr	r7,r6
	erqld	0
	erqld	1
	erqld	2
	erqld	3	;request 4 words
	....		;do something to let the data arrive in the receive fifo
	eld	r0	
	eld	r1
	eld	r2
	eld	r3	;get data (no stall if data already received)