Stencil Benchmark Sets

Candidate benchmarks: PolyBench, cell_cuda, Rodinia, FDTD

for (i = 1; i < imax-1; i++)
  for (j = 1; j < jmax-1; j++)
    for (k = 1; k < kmax-1; k++)
    {
      s0 = a0[i][j][k]*p[i+1][j][k] + a1[i][j][k]*p[i][j+1][k] + a2[i][j][k]*p[i][j][k+1]
         + b0[i][j][k]*(p[i+1][j+1][k] - p[i+1][j-1][k] - p[i-1][j+1][k] + p[i-1][j-1][k])
         + b1[i][j][k]*(p[i][j+1][k+1] - p[i][j+1][k-1] - p[i][j-1][k+1] + p[i][j-1][k-1])
         + b2[i][j][k]*(p[i+1][j][k+1] - p[i+1][j][k-1] - p[i-1][j][k+1] + p[i-1][j][k-1])
         + c0[i][j][k]*p[i-1][j][k] + c1[i][j][k]*p[i][j-1][k] + c2[i][j][k]*p[i][j][k-1]
         + wrk1[i][j][k];
      ss = (s0*a3[i][j][k] - p[i][j][k])*bnd[i][j][k];  // ss = delta P
      wrk2[i][j][k] = p[i][j][k] + omega*ss;            // over-relaxation
      gosa += ss*ss;                                    // residual, measure of convergence
    }
  • 3D Poisson 19-point stencil: test
// 19 points: center + 6 face neighbors + 12 edge neighbors
a1[k][j][i] = c0*(b[k][j][i] +
              c1*(a0[k][j][i+1] + a0[k][j][i-1] +    // face neighbors
                  a0[k][j+1][i] + a0[k][j-1][i] +
                  a0[k+1][j][i] + a0[k-1][j][i]) +
              a0[k][j][i] +                          // center
              a0[k+1][j+1][i] + a0[k+1][j-1][i] + a0[k-1][j+1][i] + a0[k-1][j-1][i] +  // edge neighbors
              a0[k+1][j][i+1] + a0[k+1][j][i-1] + a0[k-1][j][i+1] + a0[k-1][j][i-1] +
              a0[k][j+1][i+1] + a0[k][j+1][i-1] + a0[k][j-1][i+1] + a0[k][j-1][i-1]);
  • 3D heat equation: test
a(t, i, j, k) = 0.125 * (a(t-1, i+1, j, k) - 2.0 * a(t-1, i, j, k) + a(t-1, i-1, j, k))
              + 0.125 * (a(t-1, i, j+1, k) - 2.0 * a(t-1, i, j, k) + a(t-1, i, j-1, k))
              + 0.125 * (a(t-1, i, j, k+1) - 2.0 * a(t-1, i, j, k) + a(t-1, i, j, k-1))
              + a(t-1, i, j, k);
  • Wave equation: test
/* An order-4, 3D 25-point stencil (center plus 4 neighbors per side in each
 * dimension), to match up with Matteo Frigo's hand-optimized wave equation.
 */
float c0 = coef[0], c1 = coef[1], c2 = coef[2], c3 = coef[3], c4 = coef[4];
float div = c0 * pa(t, i, j, k) +
                c1 * ((pa(t, i, j, k+1) + pa(t, i, j, k-1))
                    + (pa(t, i, j+1, k) + pa(t, i, j-1, k))
                    + (pa(t, i+1, j, k) + pa(t, i-1, j, k)))
              + c2 * ((pa(t, i, j, k+2) + pa(t, i, j, k-2))
                    + (pa(t, i, j+2, k) + pa(t, i, j-2, k))
                    + (pa(t, i+2, j, k) + pa(t, i-2, j, k)))
              + c3 * ((pa(t, i, j, k+3) + pa(t, i, j, k-3))
                    + (pa(t, i, j+3, k) + pa(t, i, j-3, k))
                    + (pa(t, i+3, j, k) + pa(t, i-3, j, k)))
              + c4 * ((pa(t, i, j, k+4) + pa(t, i, j, k-4))
                    + (pa(t, i, j+4, k) + pa(t, i, j-4, k))
                    + (pa(t, i+4, j, k) + pa(t, i-4, j, k)));
// pa(t+1, ...) still holds the value from time t-1 before being overwritten
pa(t+1, i, j, k) = 2 * pa(t, i, j, k) - pa(t+1, i, j, k) + vsq[i * Nxy + j * Nx + k] * div;

R: Loading global variables with load() inside a function

Problem

In earlier code, the trained model was saved to a file and later restored with load(). This time the load() call was wrapped inside a function, and the model could not be loaded no matter what: the variable was valid inside the current function, but after switching to other functions the loaded global variable global.model was empty. Investigation showed that load() places the variables from the file into the local environment of the current function.

Solution

After calling load(), assign the value of the function-local variable to the global variable with <<-.

 load("globalmodels.saved")
 global.model <<- global.model

R: Setting the working directory to the directory of the current file

Problem:

In R, including another file is done with the source command. For an R script that sources other files located in its own directory, the interpreter's working directory must be set to the directory containing the current file; otherwise every source call needs an absolute path, which hurts the portability of the project.

Solution:

The following snippet sets R's working directory to the directory containing the current file:

# Collect the file path (ofile) recorded in each frame on the call stack;
# only frames created by source() carry a non-NULL ofile
frame_files <- lapply(sys.frames(), function(x) x$ofile)
frame_files <- Filter(Negate(is.null), frame_files)
# The last such frame corresponds to the file currently being sourced
TOPDIR <- dirname(frame_files[[length(frame_files)]])
setwd(TOPDIR)

CUDA Optimization Notes

Hardware model

  1. Shared memory is organized per SM at the hardware level and per block at the logical level.

  2. The warp is the GPU's hardware-level unit of parallel execution; a warp is generally 32 threads. When an SM processes a block of a kernel, it issues as many warps as possible, each consisting of 32 threads.

  3. Multiple blocks can be resident (active) on one SM at the same time; this depends mainly on whether the SM has enough hardware resources, such as registers and shared memory [1].

  4. On the GPU in machine 200, each SM has 256*256 32-bit registers (i.e., on average 256 32-bit registers per thread), 49152 bytes of shared memory, and 65536 bytes of constant memory (these limits can also be queried at runtime; see the sketch below).
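
A minimal sketch that queries these per-SM/per-device resources through the standard CUDA runtime API (all fields below are documented cudaDeviceProp members):

#include <cstdio>
#include <cuda_runtime.h>

// Query the resources discussed above for device 0.
int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("SM count:             %d\n",        prop.multiProcessorCount);
    printf("warp size:            %d\n",        prop.warpSize);
    printf("registers per block:  %d\n",        prop.regsPerBlock);
    printf("shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("constant memory:      %zu bytes\n", prop.totalConstMem);
    return 0;
}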

Optimization principles

  1. Moving some shared-memory arrays with no data sharing out to global memory did not improve speed; presumably the reduction in shared-memory usage was not large enough to let more blocks run on the SM concurrently. (verified)

  2. Some constants are used frequently and occupy very little space. They were previously placed in constant memory; now they are passed as kernel parameters so they end up in registers, which is faster.

  3. Hoist common computations that do not depend on threadIdx to the host, and pass the results to the kernel as parameters.

  4. Do not stage write-back variables in shared memory.

  5. There is a tradeoff between shared-memory usage and tile size: using more shared memory reduces the number of blocks resident on one SM, but improves memory-access performance.

  6. Division is very expensive; if precision allows, replace division with multiplication by the reciprocal of the divisor (see the sketch after this list).

  7. GPUs are suited to many small operations; for complex operations (e.g., many divisions, large working sets) the CPU can actually hold the performance advantage.

  8. Block size: the total number of threads in a block should be larger than the number of hardware threads per SM, so that all hardware threads are kept busy.
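
A minimal sketch of principles 2, 3, and 6 (the kernel, its arguments, and the name inv_dx are illustrative, not from the original code):

#include <cuda_runtime.h>

// Frequently used constants are passed as kernel arguments (so they land in
// registers), and a division is replaced by multiplication with a reciprocal
// that is precomputed on the host.
__global__ void scale_kernel(float *out, const float *in, float inv_dx, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        out[idx] = in[idx] * inv_dx;   // instead of in[idx] / dx
}

// Host side: the threadIdx-independent computation 1.0f / dx is hoisted out
// of the kernel and its result passed as a parameter:
//   float inv_dx = 1.0f / dx;
//   scale_kernel<<<grid, block>>>(d_out, d_in, inv_dx, n);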

CUDA manual: performance optimization

Performance optimization revolves around three basic strategies: maximize parallel execution to achieve maximum utilization; optimize memory usage to achieve maximum memory throughput; optimize instruction usage to achieve maximum instruction throughput.

Maximize Utilization

XXXX

Application Level

XXXX

Device Level

  1. For devices of compute capability 1.x, only one kernel can execute on a device at one time, so the kernel should be launched with at least as many thread blocks as there are multiprocessors in the device.
  2. For devices of compute capability 2.x and higher, multiple kernels can execute concurrently on a device, so maximum utilization can also be achieved by using streams to enable enough kernels to execute concurrently, as described in Asynchronous Concurrent Execution.
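
A minimal sketch of point 2, launching independent kernels in separate streams so they may overlap on devices that support concurrent kernel execution (the kernels and sizes are illustrative):

#include <cuda_runtime.h>

// Two trivial independent kernels (illustrative only).
__global__ void kernelA(float *x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float *y) { y[threadIdx.x] *= 2.0f; }

int main()
{
    float *a, *b;
    cudaMalloc(&a, 32 * sizeof(float));
    cudaMalloc(&b, 32 * sizeof(float));

    // Each kernel goes into its own stream; on compute capability >= 2.x
    // the two launches may execute concurrently on the device.
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    kernelA<<<1, 32, 0, s0>>>(a);
    kernelB<<<1, 32, 0, s1>>>(b);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(a);
    cudaFree(b);
    return 0;
}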

Multiprocessor Level

XXXX

Maximize Memory Throughput

XXXX

Device Memory Accesses

XXXX

Shared Memory

xxxx

Because it is on-chip, shared memory has much higher bandwidth and much lower latency than local or global memory. To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. Any memory read or write request made of n addresses that fall in n distinct memory banks can therefore be serviced simultaneously, yielding an overall bandwidth that is n times as high as the bandwidth of a single module.

  • However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict and the access has to be serialized. The hardware splits a memory request with bank conflicts into as many separate conflict-free requests as necessary, decreasing throughput by a factor equal to the number of separate memory requests. If the number of separate memory requests is n, the initial memory request is said to cause n-way bank conflicts (see the sketch after this list).
  • To get maximum performance, it is therefore important to understand how memory addresses map to memory banks in order to schedule the memory requests so as to minimize bank conflicts. This is described in Compute Capability 1.x, Compute Capability 2.x, Compute Capability 3.x, and Compute Capability 5.0 for devices of compute capability 1.x, 2.x, 3.x, and 5.0, respectively.
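
A small illustration of avoiding bank conflicts, not from the manual: the classic +1 padding trick for a shared-memory transpose tile, assuming a 32-bank layout (kernel name and TILE are illustrative):

#include <cuda_runtime.h>

#define TILE 32

// Without the +1 padding, the column read tile[threadIdx.x][threadIdx.y]
// would make all 32 threads of a warp hit the same bank (a 32-way conflict);
// padding each row by one element shifts successive rows across banks.
__global__ void transpose_tile(float *out, const float *in, int n)
{
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Transposed write: a warp reads a column of the tile, which is
    // conflict-free only because of the padding above.
    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    if (tx < n && ty < n)
        out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];
}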

  1. See http://stackoverflow.com/questions/12212003/how-concurrent-blocks-can-run-a-single-gpu-streaming-multiprocessor/12213137#12213137

Notes on paper writing

Misconceptions about reviewers

Reviewers will carefully read my paper? Totally wrong!

  • A reviewer often has a pile of papers to review
  • A reviewer typically makes up his/her mind after a 5-minute browse of title/authors -> abstract -> conclusions -> references -> introduction, then spends < 1 hour justifying that decision (moving on to the main body of the paper)

Your action:

  • make your points EARLY
  • bring up your results QUICKLY
  • highlight your contributions FAST
  • make your paper easy to read, clear to follow, and good to learn

Components of a paper

English writing

Sentences

  • xxxx has, to the best of our knowledge, not been previously studied and is now more compelling than ever
  • While a large body of work exists on DRAM in laboratory conditions, little has been reported on real DRAM failures in large production clusters.
  • with extremely low energy footprints, all within acceptable area bounds
  • Failures are costly in terms of …
  • To our knowledge, these are the first performance results of a shortest path problem on realistic graph instances in the order of billions of vertices and edges.[1]
  • In this work, our focus is to attack the dual problems of increasing power consumption and latency for DRAM devices
  • These observations have the potential to open up new avenues for architecture research.
  • In terms of xxx: with regard to xxx
  • To the best of our knowledge this is the first proposed scheme that …
  • reads have no option but to wait at the memory controller.
  • Guard against forgetting: to prevent XXXX
  • reason about: to infer, to draw conclusions about
  • At 16 KB per multiprocessor, the shared memory is a scarce resource and must be used sparingly.

CUDA: timing methods

Problem

Measure the running time of a CUDA application.

Solution

Quoted from Stack Overflow:

You could do something along the lines of:

#include <sys/time.h>

struct timeval t1, t2;

gettimeofday(&t1, 0);
kernel_call<<<dimGrid, dimBlock, 0>>>();
gpuErrchk(cudaThreadSynchronize());  // deprecated; prefer cudaDeviceSynchronize()
gettimeofday(&t2, 0);

// divide the elapsed microseconds by 1000.0 to report milliseconds
double time = (1000000.0*(t2.tv_sec-t1.tv_sec) + t2.tv_usec-t1.tv_usec)/1000.0;
printf("Time to generate:  %3.1f ms \n", time);

or:

float time;
cudaEvent_t start, stop;

gpuErrchk( cudaEventCreate(&start) );
gpuErrchk( cudaEventCreate(&stop) );
gpuErrchk( cudaEventRecord(start, 0) );

kernel_call<<<dimGrid, dimBlock, 0>>>();

gpuErrchk( cudaEventRecord(stop, 0) );
gpuErrchk( cudaEventSynchronize(stop) );
gpuErrchk( cudaEventElapsedTime(&time, start, stop) );

printf("Time to generate:  %3.1f ms \n", time);

// release the events when done
gpuErrchk( cudaEventDestroy(start) );
gpuErrchk( cudaEventDestroy(stop) );

Note on host-side timing: since GPU kernels execute asynchronously with respect to the CPU, you must call cudaThreadSynchronize (now replaced by cudaDeviceSynchronize) before taking the end time of the whole CUDA application; otherwise you only measure the time to launch the last kernel.
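
Both snippets assume a gpuErrchk error-checking macro defined elsewhere; a common definition (the widely used Stack Overflow idiom) looks like this:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print the error string and abort on failure.
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort = true)
{
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}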

ROSE: inserting statements after the include directives of a source file

Problem

Today, while using ROSE to generate CUDA code automatically, I ran into a problem. The program uses texture memory to speed up GPU memory accesses, so declarations of texture variables have to be generated. The texture initialization uses a macro defined at the top of the file, which means the initialization must be placed after the macro definition. After searching the whole manual, I could not find a way to insert a statement after a preprocessing-info node (#include, #define), so as a fallback I tried to find the first statement in the current scope and insert after it. Calling the firstStatement method raised an error, and besides, the statements in the scope are not all from the source file; some nodes come from header files.

Solution

Find the first SgStatement belonging to the current source file by comparing the file name of the scope with the file name of each statement in it:

// Find the first statement of the current source file
SgStatement* firststatement = NULL;
SgStatementPtrList statelist = currentscope->generateStatementList();
for (size_t i = 0; i < statelist.size(); i++)
{
    SgStatement* tmp = statelist[i];
    // Skip nodes pulled in from header files: keep only statements whose
    // file name matches the file name of the current scope
    if (tmp->get_file_info()->get_filenameString() == currentscope->get_file_info()->get_filenameString())
    {
        firststatement = tmp;
        break;
    }
}

Checking whether ODBC is configured correctly on Linux

Problem

Test whether the ODBC installation succeeded.

Solution

Use isql to verify that the data source is configured correctly:

# dbname: the username for the target database
# dbpsw:  the password for the target database
isql -v datasourcename dbname dbpsw

If the data source is configured correctly, isql connects and drops you into the database prompt.

ROSE programming notes

  • Print an SgName with the getString method, not unparseToCompleteString.
  • When creating AST nodes, deep-copy child nodes with deepCopy to prevent one node from being referenced multiple times in the AST. If this does happen accidentally, print the dot graph to inspect the dependencies and locate the node that is referenced more than once.