GPU Overdrive
GPUs are designed for high-performance graphics, but what if we could harness that power for general-purpose computation? CUDA is a parallel computing architecture developed by NVIDIA that enables exactly that. Developers can write programs in C for CUDA (a C-like language), OpenCL, or DirectCompute, and the code is compiled to run on GPUs that support CUDA. There is an entire website dedicated to this field: GPGPU.org.
Let's see how it affects the way code is written. Consider matrix multiplication: on a normal CPU the complexity is O(N^3). Strassen's algorithm reduces that to O(N^(log2 7)) ≈ O(N^2.807), but for now let's stick with the basic algorithm. On a GPU the process can be reduced to a single loop. Here is how it's done.
On a CPU

1: // A(N:P), B(P:M) and C(N:M)
2: for(int i=0;i<N;i++)
3:   for(int j=0;j<M;j++)
4:     for(int k=0;k<P;k++)
5:       C[i][j] += A[i][k] * B[k][j];
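For reference, the triple loop can be wrapped into a small self-contained C function. The fixed dimensions and the integer test data are arbitrary choices for illustration:

```c
#include <stddef.h>

/* Self-contained version of the triple loop above.
   N, P and M are fixed small values purely for illustration. */
enum { N = 2, P = 3, M = 2 };

void multiply(int A[N][P], int B[P][M], int C[N][M])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++) {
            C[i][j] = 0;                      /* clear the accumulator */
            for (int k = 0; k < P; k++)
                C[i][j] += A[i][k] * B[k][j]; /* inner product of row i and column j */
        }
}
```

For example, multiplying {{1,2,3},{4,5,6}} by {{7,8},{9,10},{11,12}} yields {{58,64},{139,154}}.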

Of the three loops, the only one the GPU version needs is the innermost k loop on line 4; the i and j loops are replaced by parallel threads.
On a GPU (using C for CUDA)

1: // A(N:P), B(P:M) and C(N:M); N, M, P are compile-time constants
2: __global__ void fnMultiply(float A[N][P], float B[P][M], float C[N][M])
3: {
4:   // the index pair i,j is obtained from the thread ID
5:   int i = threadIdx.x;
6:   int j = threadIdx.y;
7:   for(int k=0;k<P;k++)
8:     C[i][j] += A[i][k] * B[k][j];
9: }
10: main()
11: {
12:   // call by creating N*M threads
13:   fnMultiply<<<1,dim3(N,M)>>>(A,B,C);
14: }
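The launch above packs all N*M threads into a single block, which only works while that product stays within the hardware's per-block thread limit. A sketch of the usual workaround is to derive i and j from the block index as well as the thread index; the flat pointer layout and the 16×16 block size below are assumptions chosen for illustration:

```cuda
// Sketch only: same computation, but each thread's i,j come from both
// blockIdx and threadIdx, so the matrix can be larger than one block.
__global__ void fnMultiply(const float *A, const float *B, float *C,
                           int N, int M, int P)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (i < N && j < M) {                           // guard the ragged edge
        float sum = 0.0f;
        for (int k = 0; k < P; k++)
            sum += A[i * P + k] * B[k * M + j];
        C[i * M + j] = sum;
    }
}

// Launch with a grid of 16x16 blocks covering the whole of C:
//   dim3 block(16, 16);
//   dim3 grid((M + 15) / 16, (N + 15) / 16);
//   fnMultiply<<<grid, block>>>(dA, dB, dC, N, M, P);
```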

In the GPU case there are N×M threads computing in parallel, assuming the hardware supports N×M threads in a single block. By using multiple thread blocks and shared memory the overall performance can be improved further; the example above assumes a single thread block of size N×M. See the samples in the CUDA SDK for an optimized version.
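To give a flavour of what the optimized SDK samples do, here is a sketch of the classic shared-memory tiling idea. Each block computes one TILE×TILE patch of C, staging tiles of A and B through on-chip shared memory so that each global-memory element is read once per tile rather than once per thread. The kernel name, the flat pointer layout, and the tile size are assumptions for illustration:

```cuda
#define TILE 16

__global__ void fnMultiplyTiled(const float *A, const float *B, float *C,
                                int N, int M, int P)
{
    __shared__ float sA[TILE][TILE];   // tile of A staged on chip
    __shared__ float sB[TILE][TILE];   // tile of B staged on chip

    int i = blockIdx.y * TILE + threadIdx.y;   // row of C
    int j = blockIdx.x * TILE + threadIdx.x;   // column of C
    float sum = 0.0f;

    // Walk along the k dimension one tile at a time.
    for (int t = 0; t < (P + TILE - 1) / TILE; t++) {
        int k = t * TILE;
        // Each thread loads one element of each tile; pad with zeros
        // when the tile hangs over the edge of the matrix.
        sA[threadIdx.y][threadIdx.x] =
            (i < N && k + threadIdx.x < P) ? A[i * P + k + threadIdx.x] : 0.0f;
        sB[threadIdx.y][threadIdx.x] =
            (k + threadIdx.y < P && j < M) ? B[(k + threadIdx.y) * M + j] : 0.0f;
        __syncthreads();   // wait until both tiles are fully loaded

        for (int kk = 0; kk < TILE; kk++)
            sum += sA[threadIdx.y][kk] * sB[kk][threadIdx.x];
        __syncthreads();   // wait before the tiles are overwritten
    }

    if (i < N && j < M)
        C[i * M + j] = sum;
}
```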