This page covers some basics of assembler optimization, and the syntax to use for inline assembler in Visual C++.
It walks through a basic routine in C++ and shows how it is converted to MMX. The example is from an actual filter in AviSynth, with only minor changes.
Consider the following C routine:
void Limiter::c_limiter(BYTE* p, int row_size, int height, int modulo, int cmin, int cmax) {
The parameters of this routine are:
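Judging from the parameter names and from how the assembler version further down uses them, a scalar limiter with this signature could look roughly like the sketch below. This is only an illustration under stated assumptions, not the filter's actual source: c_limiter_sketch is a hypothetical free function, row_size is assumed to be the number of bytes to process per row, and modulo is assumed to be the number of bytes to skip at the end of each row to reach the next one.

```cpp
#include <cstdint>

typedef std::uint8_t BYTE;  // assumption: BYTE is an unsigned 8-bit type

// Hypothetical scalar limiter: clamps every byte to the range [cmin, cmax].
// Assumptions: row_size = bytes processed per row,
//              modulo   = bytes skipped after each row to reach the next row.
void c_limiter_sketch(BYTE* p, int row_size, int height, int modulo, int cmin, int cmax) {
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < row_size; ++x) {
            if (p[x] < cmin) {
                p[x] = static_cast<BYTE>(cmin);
            } else if (p[x] > cmax) {
                p[x] = static_cast<BYTE>(cmax);
            }
        }
        p += row_size + modulo;  // step over the row and its padding
    }
}
```

Calling c_limiter_sketch(buf, 4, 2, 1, 16, 235) on a buffer of two 4-byte rows with one padding byte each would clamp the 8 picture bytes and leave the 2 padding bytes alone.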
When converting to MMX and IntegerSSE, it is a good idea to first look at which instructions are available for the task at hand. In this case we choose to focus on IntegerSSE, because it contains pminub and pmaxub, which select the minimum and maximum of each pair of bytes in two packed registers. It is still a good idea to also support plain MMX, since there are many machines out there that only support those instructions.
An important aspect of MMX is parallel processing, which means processing several bytes at once. The MMX instructions all work on 8 bytes at a time, but in many cases you have to unpack these bytes to words (8 to 16 bits) to be able to do things like additions.
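Why the unpacking is needed can be illustrated in portable C++, without any assembler. The sketch below only models the behaviour of the MMX instructions punpcklbw (with a zeroed second operand), paddb and paddw on small arrays. The type and function names are made up for this illustration.

```cpp
#include <array>
#include <cstdint>

using Bytes8 = std::array<std::uint8_t, 8>;   // models one 64-bit register as 8 bytes
using Words4 = std::array<std::uint16_t, 4>;  // models the same register as 4 words

// Models "punpcklbw reg, zero": widens the 4 low bytes to 16-bit words.
Words4 unpack_low(const Bytes8& v) {
    Words4 r{};
    for (int i = 0; i < 4; ++i) r[i] = v[i];
    return r;
}

// Models "paddb": 8 byte-wise additions, each wrapping around at 256.
Bytes8 add_bytes(const Bytes8& a, const Bytes8& b) {
    Bytes8 r{};
    for (int i = 0; i < 8; ++i) r[i] = static_cast<std::uint8_t>(a[i] + b[i]);
    return r;
}

// Models "paddw": 4 word-wise additions, each wrapping around at 65536.
Words4 add_words(const Words4& a, const Words4& b) {
    Words4 r{};
    for (int i = 0; i < 4; ++i) r[i] = static_cast<std::uint16_t>(a[i] + b[i]);
    return r;
}
```

Adding the bytes 200 and 100 with add_bytes wraps around to 44, while unpacking both to words first and using add_words gives the expected 300. The price is that only 4 values fit in a register after unpacking, so the work has to be done in two halves.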
The equivalent of the routine above in IntegerSSE looks like this:
void Limiter::isse_limiter_mod8(BYTE* p, int row_size, int height, int modulo, int cmin, int cmax) {
This routine performs the same task as the routine above. It requires row_size to be a multiple of 8 (mod8), because it processes 8 bytes in parallel.
Let's go through the code, line by line.
cmax|=(cmax<<8);
cmin|=(cmin<<8);
__asm {
mov eax, [height]
mov ebx, p
mov ecx, [modulo]
movd mm7,[cmax]
movd mm6,[cmin]
mm7 now contains "0x0000|0000|0000|cmcm" (the | is only inserted for readability), where cm is the cmax byte. Remember that we duplicated the max and min values in the C part.
pshufw mm7,mm7,0
pshufw mm6,mm6,0
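With a shuffle value of 0, pshufw copies the lowest word of the source into all four words of the destination. In this example it results in mm7 containing "0xcmcm|cmcm|cmcm|cmcm". So basically cmax and cmin are now placed in all 8 bytes of the mm7 and mm6 registers, respectively.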
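The three steps used to fill a register with one byte value (duplicating the byte in C, loading it with movd, and spreading it with pshufw) can be modelled in portable C++ as a rough sketch. The function names are hypothetical; a 64-bit integer stands in for an MMX register, and the value is assumed to fit in one byte.

```cpp
#include <cstdint>

// Models "v |= (v << 8)": copies the low byte into the second byte.
// Assumption: v is in the range 0..255.
std::uint32_t duplicate_byte(std::uint32_t v) {
    return v | (v << 8);
}

// Models "movd mm, [mem32]": the low 32 bits are loaded, the upper 32 bits are cleared.
std::uint64_t movd_model(std::uint32_t v) {
    return v;
}

// Models "pshufw dst, src, imm8": each 2-bit field of imm8 selects
// which source word is copied into the corresponding destination word.
std::uint64_t pshufw_model(std::uint64_t src, unsigned imm8) {
    std::uint64_t dst = 0;
    for (int i = 0; i < 4; ++i) {
        unsigned sel = (imm8 >> (2 * i)) & 3u;
        std::uint64_t word = (src >> (16 * sel)) & 0xFFFFu;
        dst |= word << (16 * i);
    }
    return dst;
}

// All three steps together: one byte value ends up in all 8 bytes.
std::uint64_t broadcast_byte(std::uint32_t v) {
    return pshufw_model(movd_model(duplicate_byte(v)), 0);
}
```

For a value of 0xEB, the intermediate result after the movd step is 0x000000000000EBEB, and the final result is 0xEBEBEBEBEBEBEBEB.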
yloop:
This is a label, that is, the destination of a jump instruction. It marks the start of the loop over the rows.
mov edx,[row_size]
align 16
xloop:
An "align 16" should be placed before any loop destination that will be jumped to frequently. It inserts instructions that don't do anything, which ensures that the xloop destination is aligned on a 16 byte boundary.
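How much padding such an alignment costs depends on where the label would otherwise have landed. As a small sketch of the arithmetic only (padding_for is a hypothetical helper, not something the assembler exposes, and the alignment is assumed to be a power of two):

```cpp
#include <cstdint>

// Number of padding bytes needed to move 'address' up to the next
// multiple of 'alignment'. Assumption: alignment is a power of two.
std::uint32_t padding_for(std::uint32_t address, std::uint32_t alignment) {
    return (alignment - (address & (alignment - 1))) & (alignment - 1);
}
```

A label that would have ended up at address 0x1003 needs 13 bytes of padding to reach 0x1010, while a label already at 0x1010 needs none. In the worst case 15 bytes are inserted.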
movq mm0,[ebx]
mm0 now contains 0xp8p7|p6p5|p4p3|p2p1, where p1 is the leftmost pixel onscreen. This may look a bit backwards at first, but you'll get used to it.
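The "backwards" order comes from x86 being little-endian: the byte at the lowest address ends up in the lowest 8 bits of the register. A portable model of the load, with a 64-bit integer standing in for mm0 and a hypothetical function name, could look like this:

```cpp
#include <cstdint>

// Models "movq mm0, [ebx]" on x86: p[0] (the leftmost pixel) goes into
// bits 0..7, p[1] into bits 8..15, and so on up to p[7] in bits 56..63.
std::uint64_t movq_load_model(const std::uint8_t* p) {
    std::uint64_t r = 0;
    for (int i = 0; i < 8; ++i) {
        r |= static_cast<std::uint64_t>(p[i]) << (8 * i);
    }
    return r;
}
```

Loading the bytes 0x01, 0x02, ..., 0x08 this way gives 0x0807060504030201 when the register is written out as a hexadecimal number.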
pminub mm0,mm7
pmaxub mm0,mm6
movq [ebx],mm0
add ebx,8
sub edx,8
jnz xloop
add ebx,ecx
dec eax
jnz yloop
emms
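To check one's understanding of the loop structure, the routine can be mirrored in portable C++. The sketch below is a model for testing and reading, not the filter's code: limiter_mod8_model is a hypothetical name, the innermost loop stands in for one group of movq, pminub, pmaxub and movq, and height and row_size are assumed to be positive.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

typedef std::uint8_t BYTE;  // assumption: BYTE is an unsigned 8-bit type

// Portable model of the mod8 limiter loop. Comments name the register
// or instruction that each step corresponds to in the assembler version.
void limiter_mod8_model(BYTE* p, int row_size, int height, int modulo, int cmin, int cmax) {
    assert(height > 0);
    assert(row_size > 0 && row_size % 8 == 0);
    const BYTE lo = static_cast<BYTE>(cmin);      // every byte of mm6
    const BYTE hi = static_cast<BYTE>(cmax);      // every byte of mm7
    for (int y = height; y != 0; --y) {           // row counter in eax
        for (int x = row_size; x != 0; x -= 8) {  // byte counter in edx
            for (int i = 0; i < 8; ++i) {         // 8 bytes handled at once in mm0
                BYTE v = p[i];
                v = std::min(v, hi);              // pminub mm0,mm7
                v = std::max(v, lo);              // pmaxub mm0,mm6
                p[i] = v;
            }
            p += 8;                               // add ebx,8
        }
        p += modulo;                              // add ebx,ecx
    }
}
```

Taking the minimum against cmax first and the maximum against cmin afterwards clamps each byte to the range from cmin to cmax, which is why two instructions are enough for the whole limiter.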
$Date: 2006/11/24 18:21:26 $
Original version of this document http://www.avisynth.org/SimpleMmxOptimization