(Nov 23, 2007) (Thanksgiving Night)
Today, I tried using vector instructions (SSE2, to be specific) in Visual C++ 2003. A few things are worth noting:
1. We should generally use _mm_malloc( data_size, 16 ) in place of the new operator to make sure that a dynamically allocated array is aligned to 16 bytes (many SSE2 operations require this; otherwise we get a memory access exception). More details can be found at http://www.x86.org/articles/sse_pt3/simd3.htm and http://www.tacc.utexas.edu/resources/user_guides/intel/c_ug/linux117.htm
Remember that, to free the allocated memory, use _mm_free.
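To make point 1 concrete, here is a minimal sketch of the allocate / use / free cycle. The helper names (alloc_aligned_floats, is_aligned16, sum_aligned) are my own and purely illustrative; only _mm_malloc, _mm_free, and the _mm_*_ps intrinsics are real. I also assume the element count is a multiple of 4 to keep the loop simple.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics, also provides _mm_malloc / _mm_free

// Hypothetical helper: allocate `count` floats on a 16-byte boundary.
// The caller must release the memory with _mm_free, not delete or free.
float* alloc_aligned_floats(std::size_t count) {
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 16));
}

// Hypothetical helper: true if the pointer is 16-byte aligned.
bool is_aligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Sum `count` floats using aligned loads.
// Assumption: data is 16-byte aligned and count is a multiple of 4.
float sum_aligned(const float* data, std::size_t count) {
    assert(is_aligned16(data) && count % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < count; i += 4) {
        // _mm_load_ps faults if data + i is not 16-byte aligned.
        acc = _mm_add_ps(acc, _mm_load_ps(data + i));
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);  // unaligned store: lanes is a plain stack array
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Typical use would be: p = alloc_aligned_floats(n), fill p, call sum_aligned(p, n), then _mm_free(p).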
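2. Those using Visual C++ have alternatives for aligning data. For example,
__declspec(align(16)) float m_fArray[ARRAY_SIZE];
for a fixed-size array, or, when m_fArray is declared as a float* instead,
m_fArray = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);
Memory obtained from _aligned_malloc must be released with _aligned_free.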
Please see more info at http://www.codeproject.com/cpp/sseintro.asp
3. As can be seen from 2, it is desirable to align our floating-point arrays as well, since this allows us to cast a float* to __m128* and use the array directly as packed vectors.
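A small sketch of the cast-and-use idea from point 3. The function name scale_in_place is hypothetical, and I assume the buffer is 16-byte aligned (for example, from _mm_malloc) and that the length is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// Multiply every element by `factor`, treating the float array as packed vectors.
// Assumption: data is 16-byte aligned and count is a multiple of 4;
// dereferencing a misaligned __m128* would fault.
void scale_in_place(float* data, std::size_t count, float factor) {
    assert(count % 4 == 0);
    __m128* v = reinterpret_cast<__m128*>(data);  // the "cast" step
    const __m128 f = _mm_set1_ps(factor);         // factor replicated in all 4 lanes
    for (std::size_t i = 0; i < count / 4; ++i) {
        v[i] = _mm_mul_ps(v[i], f);               // the "use" step: 4 floats per iteration
    }
}
```

The cast saves the explicit load/store calls, but it only works because the alignment is guaranteed up front.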
4. There are interesting classes which are suitable for vector instructions: vector3D and 4x4 Matrix. Please see http://www.x86.org/articles/sse_pt3/simd3.htm (near the page bottom).
More may be available at http://www.codeproject.com/useritems/SSE_optimized_2D_vector.asp (see the source code package).
5. Examples of using add, mul, and sqrt via SSE: http://www.codeproject.com/cpp/sseintro.asp
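For a quick feel of those three operations together, here is my own minimal sketch (not taken from the linked article). It computes sqrt(a*a + b*b) four elements at a time; the function name magnitude_sse is hypothetical, and I assume all three arrays are 16-byte aligned with a length that is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// out[i] = sqrt(a[i]*a[i] + b[i]*b[i]) for i in [0, count).
// Assumption: a, b, out are 16-byte aligned and count is a multiple of 4.
void magnitude_sse(const float* a, const float* b, float* out, std::size_t count) {
    assert(count % 4 == 0);
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 va = _mm_load_ps(a + i);              // aligned load of 4 floats
        __m128 vb = _mm_load_ps(b + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(va, va),  // mul: a*a
                                _mm_mul_ps(vb, vb)); // mul: b*b, then add
        _mm_store_ps(out + i, _mm_sqrt_ps(sum));     // sqrt, aligned store
    }
}
```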
6. memcpy and _mm_loadu_ps will play an important role in speeding up convolution: SSE wants data aligned to 16 bytes, but we want to move the convolution window one element (4 bytes) at a time. However, if we process the data in an interleaved fashion, we need to perform the memory move / copy only four times. This should be good for performance.
There is a page about fast factor-2 resampling using SSE at http://mail.gnome.org/archives/beast/2006-March/msg00001.html. I don't know whether it covers anything related to my convolution problem, but I will take a serious look at it one day.
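As a starting point for point 6, here is a rough sketch of the simpler of the two options: letting _mm_loadu_ps absorb the misalignment instead of making shifted copies with memcpy. It is not the interleaved scheme described above, just a baseline. The name sliding_sum_sse is hypothetical, the kernel is applied without flipping (so strictly speaking this is a correlation), and I assume in_len >= klen and that out has room for in_len - klen + 1 floats.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>

// out[i] = sum over k of in[i + k] * kernel[k], for i in [0, in_len - klen].
// Four outputs are computed per iteration; a scalar loop handles the remainder.
void sliding_sum_sse(const float* in, std::size_t in_len,
                     const float* kernel, std::size_t klen,
                     float* out) {
    assert(klen > 0 && in_len >= klen);
    const std::size_t out_len = in_len - klen + 1;
    std::size_t i = 0;
    for (; i + 4 <= out_len; i += 4) {
        __m128 acc = _mm_setzero_ps();
        for (std::size_t k = 0; k < klen; ++k) {
            // in + i + k is usually not 16-byte aligned, hence the unaligned load.
            __m128 x = _mm_loadu_ps(in + i + k);
            acc = _mm_add_ps(acc, _mm_mul_ps(x, _mm_set1_ps(kernel[k])));
        }
        _mm_storeu_ps(out + i, acc);
    }
    for (; i < out_len; ++i) {  // scalar tail for the last out_len % 4 outputs
        float s = 0.0f;
        for (std::size_t k = 0; k < klen; ++k) s += in[i + k] * kernel[k];
        out[i] = s;
    }
}
```

Whether the four-copy approach beats this would have to be measured; unaligned loads are slower than aligned ones, but the copies cost memory bandwidth too.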
7. Very simple example of SSE in GCC: http://www.tuleriit.ee/progs/rexample.php