GPU Puzzles Walkthrough (Part 1)

The GPU-Puzzles project teaches GPU programming and the core concepts of CUDA parallel programming. A series of small puzzles walks you through writing and launching CUDA kernels, allocating shared memory, and implementing convolution, matrix multiplication, and more. After each puzzle you are even rewarded with a short dog video 😁. The repository is here: https://github.com/srush/GPU-Puzzles

I have completed all of the puzzles myself and put together this walkthrough for reference.

Puzzle 1: Map

Implement a kernel (GPU function) that adds 10 to each position of vector a and stores it in vector out. You have 1 thread per position.

The goal is for out to hold every element of a plus 10.

```python
def map_spec(a):
    return a + 10


def map_test(cuda):
    def call(out, a) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
        out[local_i] = a[local_i] + 10

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Map", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec
)
problem.show()
```

You cannot read this with a purely sequential Python mindset: the body of call runs once per thread, and cuda.threadIdx.x gives each thread its own index, which is what produces the parallelism. The visualization output:

```
# Map

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            1 |             1 |            0 |             0 |
```

Puzzle 2 - Zip

Implement a kernel that adds together each position of a and b and stores it in out. You have 1 thread per position.

Each element of out is the sum of the elements at the same position in a and b. No extra work is needed here: local_i serves directly as that position's index.

```python
def zip_spec(a, b):
    return a + b


def zip_test(cuda):
    def call(out, a, b) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
        out[local_i] = a[local_i] + b[local_i]

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Zip", zip_test, [a, b], out, threadsperblock=Coord(SIZE, 1), spec=zip_spec
)
problem.show()
```

Visualization output:

```
# Zip

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            2 |             1 |            0 |             0 |
```

Puzzle 3 - Guards

Implement a kernel that adds 10 to each position of a and stores it in out. You have more threads than positions.

This is Map with a twist: we now launch more CUDA threads than there are positions. That is harmless as long as an if guard keeps local_i within bounds.

```python
def map_guard_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
        if local_i < size:
            out[local_i] = a[local_i] + 10

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Guard",
    map_guard_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(8, 1),
    spec=map_spec,
)
problem.show()
```

Visualization output:

```
# Guard

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            1 |             1 |            0 |             0 |
```

Puzzle 4 - Map 2D

Implement a kernel that adds 10 to each position of a and stores it in out. Input a is 2D and square. You have more threads than positions.

This goes one step further: CUDA threads can be laid out in two dimensions. Because we launch more threads than there are positions, the bounds check now has to cover both dimensions.

```python
def map_2D_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, local_j] + 10

    return call


SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
problem = CudaProblem(
    "Map 2D", map_2D_test, [a], out, [SIZE], threadsperblock=Coord(3, 3), spec=map_spec
)
problem.show()
```

Visualization output:

```
# Map 2D

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            1 |             1 |            0 |             0 |
```
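A side note before moving on: the puzzles all run inside the repo's CudaProblem harness, which simulates the GPU, but the same ideas transfer directly to real numba CUDA kernels. Below is a minimal standalone sketch of Puzzles 1-3 (a map with a guard), assuming numba with CUDA support and a GPU are available; the kernel name add_ten and the sizes are illustrative only, not part of the puzzles.

```python
import numpy as np
from numba import cuda


@cuda.jit
def add_ten(out, a):
    local_i = cuda.threadIdx.x   # each thread runs this body with its own index
    if local_i < a.shape[0]:     # guard: we launch more threads than elements
        out[local_i] = a[local_i] + 10


a = np.arange(4, dtype=np.float32)
out = np.zeros_like(a)
add_ten[1, 8](out, a)  # 1 block of 8 threads; the last 4 fail the guard and do nothing
print(out)             # [10. 11. 12. 13.]
```

Numba copies host arrays to the device and back automatically here, which is convenient for experiments even if explicit transfers are preferred in real code.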
Puzzle 5 - Broadcast

Implement a kernel that adds a and b and stores it in out. Inputs a and b are vectors. You have more threads than positions.

This is really the 2D version of Zip: a column vector and a row vector are added with broadcasting. The thread grid is larger than the output matrix, so mind the bounds.

```python
def broadcast_test(cuda):
    def call(out, a, b, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, 0] + b[0, local_j]

    return call


SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE).reshape(SIZE, 1)
b = np.arange(SIZE).reshape(1, SIZE)
problem = CudaProblem(
    "Broadcast",
    broadcast_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    spec=zip_spec,
)
problem.show()
```

Visualization output:

```
# Broadcast

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            2 |             1 |            0 |             0 |
```

Puzzle 6 - Blocks

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.

Unlike before, a single block now has fewer threads than the array. That is still fine: blocks partition the threads into groups of a fixed size, and together the blocks of the grid cover the whole array. Compute the global index from the block and thread indices and guard it against the boundary.

```python
def map_block_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
        if i < size:
            out[i] = a[i] + 10

    return call


SIZE = 9
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Blocks",
    map_block_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(4, 1),
    blockspergrid=Coord(3, 1),
    spec=map_spec,
)
problem.show()
```

Visualization output:

```
# Blocks

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            1 |             1 |            0 |             0 |
```

Puzzle 7 - Blocks 2D

Implement the same kernel in 2D. You have fewer threads per block than the size of a in both directions.

Here each block's thread grid is smaller than the matrix in both directions, but with global indices computed for both i and j and the right boundary condition, the blocks tile the whole matrix.

```python
def map_block2D_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 4 lines)
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        if i < size and j < size:
            out[i, j] = a[i, j] + 10

    return call


SIZE = 5
out = np.zeros((SIZE, SIZE))
a = np.ones((SIZE, SIZE))

problem = CudaProblem(
    "Blocks 2D",
    map_block2D_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    blockspergrid=Coord(2, 2),
    spec=map_spec,
)
problem.show()
```

Visualization output:

```
# Blocks 2D

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            1 |             1 |            0 |             0 |
```

Puzzle 8 - Shared

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.

This puzzle introduces shared memory. Note that in CUDA the size of a shared array must be a compile-time constant, hence the module-level TPB rather than a runtime value.

```python
TPB = 4


def shared_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        # FILL ME IN (roughly 2 lines)
        if i < size:
            out[i] = shared[local_i] + 10

    return call


SIZE = 8
out = np.zeros(SIZE)
a = np.ones(SIZE)
problem = CudaProblem(
    "Shared",
    shared_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(2, 1),
    spec=map_spec,
)
problem.show()
```

Visualization output:

```
# Shared

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            1 |             1 |            1 |             1 |
```
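For comparison, here is a minimal standalone version of the same shared-memory pattern in real numba.cuda, again assuming a CUDA device is available; the kernel name add_ten_shared and the array sizes are illustrative only.

```python
import numpy as np
from numba import cuda, float32

TPB = 4  # threads per block; the shared-array size must be a compile-time constant


@cuda.jit
def add_ten_shared(out, a):
    shared = cuda.shared.array(TPB, float32)
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    local_i = cuda.threadIdx.x
    if i < a.shape[0]:
        shared[local_i] = a[i]  # one global read per thread
    cuda.syncthreads()          # wait until the whole block has finished loading
    if i < a.shape[0]:
        out[i] = shared[local_i] + 10


a = np.ones(8, dtype=np.float32)
out = np.zeros_like(a)
add_ten_shared[2, TPB](out, a)  # 2 blocks x 4 threads cover all 8 elements
print(out)                      # [11. 11. 11. 11. 11. 11. 11. 11.]
```

Shared memory buys nothing for a pure map like this, since each value is read only once anyway; the payoff comes in the next puzzles, where neighboring threads reuse the same values.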
Puzzle 9 - Pooling

Implement a kernel that sums together the last 3 positions of a and stores it in out. You have 1 thread per position. You only need 1 global read and 1 global write per thread.

The lesson here is to control the number of global reads and writes: moving values that will be read repeatedly into shared memory improves efficiency.

```python
def pool_spec(a):
    out = np.zeros(*a.shape)
    for i in range(a.shape[0]):
        out[i] = a[max(i - 2, 0) : i + 1].sum()
    return out


TPB = 8


def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 8 lines)
        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()
            if i == 0:
                out[i] = shared[local_i]
            elif i == 1:
                out[i] = shared[local_i] + shared[local_i - 1]
            else:
                out[i] = shared[local_i] + shared[local_i - 1] + shared[local_i - 2]

    return call


SIZE = 8
out = np.zeros(SIZE)
a = np.arange(SIZE)
problem = CudaProblem(
    "Pooling",
    pool_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(1, 1),
    spec=pool_spec,
)
problem.show()
```

Visualization output:

```
# Pooling

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            1 |             1 |            3 |             1 |
```

Puzzle 10 - Dot Product

Implement a kernel that computes the dot-product of a and b and stores it in out. You have 1 thread per position. You only need 2 global reads and 1 global write per thread.

Here we implement the vector dot product by hand. Each thread stashes its elementwise product in shared memory; after a sync, a single thread sums the shared buffer and writes the result. This keeps the global traffic at 2 reads and 1 write per thread.

```python
def dot_spec(a, b):
    return a @ b


TPB = 8


def dot_test(cuda):
    def call(out, a, b, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 9 lines)
        if i < size:
            shared[local_i] = a[i] * b[i]
            cuda.syncthreads()
            if local_i == 0:
                total = 0.0
                for j in range(TPB):
                    total += shared[j]
                out[0] = total

    return call


SIZE = 8
out = np.zeros(1)
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Dot",
    dot_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(SIZE, 1),
    blockspergrid=Coord(1, 1),
    spec=dot_spec,
)
problem.show()
```

Visualization output:

```
# Dot

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            2 |             1 |            8 |             1 |
```
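One remark on the solution above: having thread 0 sum all TPB products serializes the final step. A standard refinement, not required by the puzzle, is a tree reduction in shared memory, which finishes in log2(TPB) rounds. Below is a minimal standalone sketch in numba.cuda, assuming a single block whose size is a power of two; the kernel name dot_block is illustrative.

```python
import numpy as np
from numba import cuda, float32

TPB = 8  # power-of-two block size, assumed by the halving loop below


@cuda.jit
def dot_block(out, a, b):
    shared = cuda.shared.array(TPB, float32)
    local_i = cuda.threadIdx.x
    shared[local_i] = a[local_i] * b[local_i]
    cuda.syncthreads()
    # Tree reduction: halve the number of active threads each round.
    stride = TPB // 2
    while stride > 0:
        if local_i < stride:
            shared[local_i] += shared[local_i + stride]
        cuda.syncthreads()
        stride //= 2
    if local_i == 0:
        out[0] = shared[0]


a = np.arange(TPB, dtype=np.float32)
b = np.arange(TPB, dtype=np.float32)
out = np.zeros(1, dtype=np.float32)
dot_block[1, TPB](out, a, b)
print(out[0])  # 140.0 == sum(i * i for i in range(8))
```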
Puzzles 11-14 remain; they are noticeably harder and their solutions more involved, so they are covered in the next article: GPU Puzzles讲解二 - CSDN博客
