|My GPU Projects|
Miraflores, 25 Jan 2020
Here is my self-guided tour of programming the GPU:
- Hello World
- NNs and Time Series
- Object recognition
- Speeding up my Micro photography pipeline (not)
Miraflores, 25 Jan 2020
To get my feet wet, I made up a simple benchmark: adding two long vectors on the GPU.
- I needed to refresh my memory on running Visual Studio. I used it quite a bit in the 90's working on a webserver. Fortunately, it hasn't changed too much.
- Also, I wanted to see what was available to monitor the GPU. For that, Task Manager has some nice features. You do need to do a couple things to turn on the monitoring feature, described here.
- I call Nvidia's deviceQuery code at the start to report on the GPU.
- It is easy to change between data types of the numbers being added - e.g. float32, int, etc. Presently, this is done by changing a "define" in the code. I'll make this more automatic in a future version.
My to-dos:
1. Automatically run through a variety of data types and thread and block counts.
2. Organize the results from #1 in a table and plots - either in my report or a spreadsheet.
3. Check the load on the CPU and GPU without the measured benchmark load running, to make sure it doesn't affect the results too much.
Starting bench1   last updated 28 Jan 2020

deviceQuery Starting...
 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1650 with Max-Q Design"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 4096 MBytes (4294967296 bytes)
  (16) Multiprocessors, ( 64) CUDA Cores/MP:     1024 CUDA Cores
  GPU Max Clock rate:                            1245 MHz (1.25 GHz)
  Memory Clock rate:                             3501 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size   (x,y,z):   (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 87 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

Common setup took 0.077523 sec
CPU add.. 5.00 Gops took 9.448 secs for 0.529 TFLOPS
GPU setup (cudaMallocs & copy over) took 0.241 secs
GPU add.. 50.00 Gops took 6.996 secs for 7.147 TFLOPS, speedup = 13.504655
Copy back took 0.017 secs
PASSED

C:\Users\svbre\Source\Repos\bench1\x64\Debug\bench1.exe (process 8340) exited with code 0.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .
7.147e12 FLOPS / 1024 / 1.25e9 = 5.6 Floating Point Operations per CUDA core per machine cycle

Pretty amazing. But equally amazing is that the CPU does so well, here turning out 529 GFLOPS. The CPU is a "Quad-Core 10th Gen Intel® Core™ i7-1065G7 Processor with Hyper-Threading 1.3 GHz / 3.9 GHz (Base/Turbo)" per Razer. I see approx. 45% CPU utilization as the CPU portion of my benchmark runs. I'm guessing that's 45% of 4 cores (as I do see a spike of nearly 100% at the start), but I assume the vector addition itself is not multi-threaded (so why 45% and not 25%? Something else is going on). I'll give it credit for 2 cores. Then, if I do the same calculation as above,
529e9 FLOPS / 2 / 3.9e9 = 68 Floating Point Operations per i7 core per machine cycle

I get 68 Floating Point Operations per machine cycle. Is this possible? As I google around looking at i7 floating-point performance numbers, I do see results in the hundreds of GFLOPS. So, maybe. But I also need to check my code.
You can download my current code (a Visual Studio C++ project file) here.
GPU from JavaScript
Barranco, 3 Feb 2020
When I tried the gpu.js example, it appears to run on the Intel graphics on my laptop. I don't see a means to address the Nvidia GPU - nothing obvious in the API, and the couple of hits from Google searches are not encouraging. So, I will take a pass on gpu.js at this time.
turbo.js looks to be a little easier to sort out. Its kernels are written in GLSL (a C-style shading language) and it uses WebGL (an established browser interface to the GPU). I don't see options to enumerate and strictly select a particular GPU in a multi-GPU system such as mine, but there is an option in the API that lets an app ask for either the high-performance GPU (such as the Nvidia GPU) or the power-saver GPU (such as the Intel GPU) in a system like my laptop. It doesn't guarantee the app will get that GPU, but it's worth a try to find out.
I see now that gpu.js also uses WebGL, but the code is minified and I don't see an un-minified version. The turbo.js GLSL looks a little more approachable anyway, so I'll work with that.
The buttons aren't hooked up yet, but the way it will work is...
You click on one of the buttons under the Mandelbrot plot to start a benchmark run: either on your CPU, on the GPU that WebGL selects when asked for the low-performance/low-power GPU, or on the high-performance/high-power GPU. (The latter two options apply if your system is so configured - for example, my laptop, a 13" Razer intended for "gamers", has both an Intel GPU and a higher-performance Nvidia GPU.) Clicking "Benchmark all three" cycles through all three, as you might expect. The plot of the Mandelbrot Set will automatically run through a series of load points.
The load is determined by the number of iterations in the Mandelbrot evaluation loop for each frame. This has the effect of increasing/decreasing the resolution of the plot for that frame. Presently, you can see the effect by left-clicking on the plot: it zooms in with increasing resolution (number of iterations) as you hold down the mouse button.
Note: I presently don't have a way to test this on a system without a GPU.
My to-dos: calculate TFLOPS and plot the results for each of CPU/slowGPU/fastGPU vs load on the plots on the right (presently displaying eye candy - a demo sine wave, courtesy of flotr2).
plot goes here
Time per frame vs Load
plot goes here
GFLOPS vs Load
Ref: Core Language (GLSL)
Ref: WebGL Browser Report
Ref: WebGL MAX parameters support
Ref: Managing multi-GPU systems
Ref: How to run WebGL on discrete Nvidia GPU for notebooks with Nvidia Optimus
Ref: WebGL: Context creation parameters
Artificial Neural Networks
Miraflores, 25 Jan 2020
Here are a couple of what appear to be good sample codes for Neural Networks on the GPU:
- A Neural Network in 10 lines of CUDA C++ Code - catchy title, huh?
- CUDA Neural Network Implementation - looks to be a better tutorial
Neural Networks and Time Series
Miraflores, 28 Jan 2020
Back in the 80's, I was working in the IBM mainframe architecture and performance analysis group in Poughkeepsie, NY. Something I was curious about was using NNs to analyze computer performance. Time Series are commonly used to characterize activity in a system. I thought it would be interesting to feed time series and summary data (utilizations, number of users, response time, time of day, etc) to a NN to model the system performance, and predict and analyze problems. To start to explore this space, I thought I'd start with this course:
Modeling Time Series Data with Recurrent Neural Networks
Tensor Flow
Miraflores, 25 Jan 2020
Running Tensor Flow on an Nvidia GPU on Windows looks like it may involve some fiddly setup. An alternative is to run it on the Raspberry Pi or on Google's cloud. I suspect this is a useful step to my next exercise, Object Recognition. And to exercises with Google's Coral and Nvidia's Jetson.
Ref: Nvidia Jetson Nano vs. Google Coral Dev board
Object Recognition
Miraflores, 25 Jan 2020
I would very much like to get into the sort of object recognition shown in this video. I'll see how much I can do on my laptop here, then carry on with it once I'm back in the states.
Speeding up my Micro photography pipeline (not)
Miraflores, 29 Jan 2020
One of the reasons I wanted to look at programming the GPU was to see about speeding up my Micro photography pipeline. The notion was to implement the conversion of the raw Sony image files as they stream off the camera (in ARW format), composing the 4 (or 16) pixel-shifted raw images into 16-bit TIFFs for input to Zerene Stacker. I knew there was an open source program, RawTherapee, available for this. I see a couple of problems with this plan:
- RawTherapee comes with a significant GUI. A clean separation may exist, but if not, stripping the source code down to a "headless" core to then port to the GPU doesn't appeal to me: without that separation built into the common source, I would have to redo the split with every RawTherapee update. Ugh.
- Moving the image data from CPU memory to GPU memory, and the result back to save it to my NAS, would take CPU time. Whether the benefit of having the GPU crunch on the image outweighs the overhead of the memory moves is an open question.
[3 Feb 2020: CUDA 6 added Unified Memory to avoid memory moves.]
- My Intel i7 is doing a pretty good job in my simple vector-add benchmark. Yes, there are higher-performing GPUs, but they offer maybe 5 or 6 times the performance at a much higher price.
- $ per MIPS
- Wattage per MIPS
- Cu ft per MIPS
Barranco, 5 Feb 2020
|  | Cost | Watts | Cu In | MIPS(*) | MIPS/$ | MIPS/Watt | MIPS/Cu In |
| Pi 3B+ - 1GB | $45 |  |  | 225, 209, 527 | 5, 5, 12 |  |  |
| Pi 4B - 1GB | $42 |  |  | 925, 749, 2037 | 22, 18, 49 |  |  |
| Pi 4B - 2GB | $46 |  |  |  | 20, 16, 44 |  |  |
| Pi 4B - 4GB | $62 |  |  |  | 15, 12, 33 |  |  |
| 2 x Six-Core X5650 Xeon | $140 |  |  |  |  |  |  |
* - Linpack SP, DP, SP NEON; Pi
Barranco, 6 Feb 2020
Finding performance numbers that let me compare the Pi with the Xeon is proving difficult. These refurbished rack-mount Xeon servers are dirt cheap, albeit old. I think I will order 1 or 2 and just try them with my ARW-to-TIFF app. When I get the Beowulf cluster running on the Pi's, it should be easy to add these cheap servers to my cluster and simply measure them.
- It should be relatively easy to port the RawTherapee code to the Pi. They offer a Linux version for download, and it already runs on Windows.
- RawTherapee already supports running in a "headless" mode - i.e. from the command line, without the GUI. It should be relatively easy to call RawTherapee from a program (that I will write) that monitors images arriving in a NAS folder from the camera, kicks off RawTherapee to write the TIFF to a NAS folder that Zerene Stacker monitors, and meanwhile checks for problems (errors, timeouts, bottlenecks, etc). Streaming images from the camera to a NAS folder works reliably (it's built into Sony Imaging Edge - I've used it a lot without problems), and consuming images for stacking is reportedly built into Zerene.
- I think it may be easier to determine and mitigate bottlenecks in the MP setup.
- It should be fun and instructive.
--- FIN ---