CUDA Optimization Notes

Hardware Model

  1. Shared memory belongs to the SM at the hardware level and is allocated per block at the logical level.

  2. The warp is the GPU's unit of parallel execution at the hardware level; a warp generally consists of 32 threads. When an SM processes a kernel's blocks, it issues as many warps as possible, each containing 32 threads.

  3. Multiple blocks can be resident and executing on each SM at the same time (active blocks); how many depends mainly on whether the SM still has enough hardware resources, such as registers and shared memory.¹

  4. On the GPU in the 200 machine, each SM has 256*256 32-bit registers (i.e., on average 256 32-bit registers per thread) and 49152 bytes of shared memory; the device provides 65536 bytes of constant memory.
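
These per-SM limits differ from GPU to GPU; as a minimal sketch, they can be read at runtime through the CUDA runtime's cudaGetDeviceProperties (all printed fields are standard cudaDeviceProp members):

```
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("SM count:                 %d\n",        prop.multiProcessorCount);
    printf("Warp size:                %d\n",        prop.warpSize);
    printf("Registers per block:      %d (32-bit)\n", prop.regsPerBlock);
    printf("Shared memory per block:  %zu bytes\n", prop.sharedMemPerBlock);
    printf("Constant memory (device): %zu bytes\n", prop.totalConstMem);
    printf("Max threads per SM:       %d\n",        prop.maxThreadsPerMultiProcessor);
    return 0;
}
```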

Optimization Principles

  1. Moving some data that was not actually shared between threads out of shared memory into global memory did not improve speed; presumably the reduction in shared-memory usage was not large enough to let more blocks run concurrently on the SM. (Verified.)

  2. Some constants are used frequently and occupy very little space. They were previously placed in constant memory; they are now passed as kernel parameters so they end up in registers, which is faster (see the first sketch after this list).

  3. Hoist common computations that do not depend on threadIdx to the host, then pass the results to the kernel function as parameters (also shown in the first sketch after this list).

  4. Do not stage variables that are only written back in shared memory.

  5. There is a tradeoff between shared-memory usage and tile size: the more shared memory a block uses, the fewer blocks can be resident on the same SM, but its memory-access performance improves (see the occupancy sketch after this list).

  6. Division is very expensive; if precision allows, replace division with multiplication by the reciprocal of the divisor (also shown in the first sketch after this list).

  7. GPUs are suited to many small operations; for complex work (e.g., many divisions, large working sets) the CPU can actually hold the performance advantage.

  8. Choosing the block size: the total number of threads in a block should exceed the number of hardware threads per SM, so that all hardware threads are kept busy.
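
A minimal sketch combining items 2, 3, and 6 (the kernel scale_shift, its parameters, and the launch helper are hypothetical names): the host computes the threadIdx-independent product and the reciprocal of the divisor once, then passes both as kernel parameters instead of relying on __constant__ memory or per-thread division.

```
#include <cuda_runtime.h>

// Hypothetical kernel: 'scale' and 'inv_div' arrive as parameters,
// so threads read them from fast parameter/register storage instead
// of issuing explicit __constant__ memory loads.
__global__ void scale_shift(float* out, const float* in, int n,
                            float scale, float inv_div) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Multiplying by the precomputed reciprocal replaces a
        // per-thread division (acceptable when precision allows).
        out[i] = in[i] * scale * inv_div;
    }
}

void launch(float* d_out, const float* d_in, int n,
            float a, float b, float div) {
    // Hoisted host-side computation: neither value depends on
    // threadIdx, so each is computed once here instead of n times
    // on the device.
    float scale   = a * b;
    float inv_div = 1.0f / div;

    int block = 256;
    int grid  = (n + block - 1) / block;
    scale_shift<<<grid, block>>>(d_out, d_in, n, scale, inv_div);
}
```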
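
The tradeoff in item 5 can be measured directly with the runtime occupancy API; a sketch (the dummy kernel is hypothetical) that shows resident blocks per SM falling as the dynamic shared-memory request grows:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummy(float* p) {
    extern __shared__ float buf[];   // dynamic shared memory
    buf[threadIdx.x] = (float)threadIdx.x;
    p[threadIdx.x]   = buf[threadIdx.x];
}

int main() {
    const int blockSize = 256;
    // Ask how many blocks of 'dummy' can be resident on one SM as
    // the per-block dynamic shared-memory request grows.
    for (size_t smem = 0; smem <= 48 * 1024; smem += 8 * 1024) {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, dummy, blockSize, smem);
        printf("%5zu bytes shared -> %d resident blocks per SM\n",
               smem, blocksPerSM);
    }
    return 0;
}
```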

CUDA Manual: Performance Optimization

Performance optimization revolves around three basic strategies:

  • Maximize parallel execution to achieve maximum utilization;
  • Optimize memory usage to achieve maximum memory throughput;
  • Optimize instruction usage to achieve maximum instruction throughput.

Maximize Utilization

XXXX

Application Level

XXXX

Device Level

  1. For devices of compute capability 1.x, only one kernel can execute on a device at one time, so the kernel should be launched with at least as many thread blocks as there are multiprocessors in the device.

  2. For devices of compute capability 2.x and higher, multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using streams to enable enough kernels to execute concurrently, as described in Asynchronous Concurrent Execution.
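
A minimal sketch of point 2 (the work kernel is hypothetical): independent kernels launched into distinct non-default streams have no ordering between them, so the device is free to overlap their execution on compute capability 2.x+ hardware.

```
#include <cuda_runtime.h>

__global__ void work(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;  // placeholder work
}

int main() {
    const int n = 1 << 20;
    const int kStreams = 4;
    cudaStream_t streams[kStreams];
    float* buf[kStreams];

    for (int s = 0; s < kStreams; ++s) {
        cudaStreamCreate(&streams[s]);
        cudaMalloc(&buf[s], n * sizeof(float));
        // Kernels in different streams are unordered with respect
        // to each other and may execute concurrently.
        work<<<(n + 255) / 256, 256, 0, streams[s]>>>(buf[s], n);
    }
    for (int s = 0; s < kStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
        cudaFree(buf[s]);
    }
    return 0;
}
```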

Multiprocessor Level

XXXX

Maximize Memory Throughput

XXXX

Device Memory Accesses

XXXX

Shared Memory

xxxx

Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory. To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.

  • However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts.
  • To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts. This is described in Compute Capability 1.x, Compute Capability 2.x, Compute Capability 3.x, and Compute Capability 5.0 for devices of compute capability 1.x, 2.x, 3.x, and 5.0, respectively.
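
A common application is padding a shared-memory tile so that column accesses do not all land in the same bank. A sketch of the classic 32x32 transpose tile (assumes a 32x32 thread block, a square matrix whose width is a multiple of 32, and 32 banks of 4-byte words); without the +1 column, the column-wise reads below would be 32-way bank conflicts:

```
// Launch with dim3 block(32, 32) and a grid covering a width x width matrix.
__global__ void transpose32(float* out, const float* in, int width) {
    // +1 padding column: column accesses stride 33 words, which maps
    // consecutive rows to different banks and avoids 32-way conflicts
    // on the column-wise read below.
    __shared__ float tile[32][32 + 1];

    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];
    __syncthreads();

    // Transposed write: swap block coordinates and read the tile
    // column-wise (conflict-free thanks to the padding).
    x = blockIdx.y * 32 + threadIdx.x;
    y = blockIdx.x * 32 + threadIdx.y;
    out[y * width + x] = tile[threadIdx.x][threadIdx.y];
}
```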

  1. For details, see http://stackoverflow.com/questions/12212003/how-concurrent-blocks-can-run-a-single-gpu-streaming-multiprocessor/12213137#12213137