My GPU Projects
Miraflores, 25 Jan 2020
Here is my self-guided tour of programming the GPU:
  1. Hello World
  2. GPU from javascript (latest update)
  3. ANN
  4. NNs and Time Series
  5. TensorFlow
  6. Object recognition
  7. Speeding up my Micro photography pipeline (not)
 Hello World
Miraflores, 25 Jan 2020
To get my feet wet, I made up a simple benchmark adding two long vectors on the GPU. 
  1. I needed to refresh my memory on running Visual Studio.  I used it quite a bit in the 90's working on a webserver.  Fortunately, it hasn't changed too much.
  2. Also, I wanted to see what was available to monitor the GPU.  For that, Task Manager has some nice features.  You do need to do a couple things to turn on the monitoring feature, described here.
A couple features of my code:
  1. I call Nvidia's deviceQuery code at the start to report on the GPU.
  2. It is easy to change between data types of the numbers being added - e.g. float32, int, etc.  Presently, this is by changing a "define" in the code.  I'll make this more automatic in a future version.
My to-dos:
  1. Automatically run through a variety of data types and numbers of thread and block counts.
  2. Organize the results from #1 in a table and plots - either in my report or a spreadsheet.
  3. Check the load on the CPU and GPU without the measured benchmark load running to make sure it doesn't affect the results too much.
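The idea of running through a variety of data types automatically can be sketched on the CPU side alone. This is a minimal JavaScript sketch using typed arrays, not the CUDA C++ of my project; the names `addVectors` and `benchAdd`, the vector length and the repeat count are all made up for the example.

```javascript
// CPU-only sketch: time c[i] = a[i] + b[i] for several element types.
// All names and sizes here are illustrative assumptions, not my project's code.
function addVectors(a, b, c) {
  for (let i = 0; i < c.length; i++) {
    c[i] = a[i] + b[i];
  }
}

function benchAdd(ArrayType, n, repeats) {
  const a = new ArrayType(n).fill(1);
  const b = new ArrayType(n).fill(2);
  const c = new ArrayType(n);
  const start = Date.now();
  for (let r = 0; r < repeats; r++) {
    addVectors(a, b, c);
  }
  const secs = (Date.now() - start) / 1000;
  const ops = n * repeats; // one add per element per pass
  return {
    type: ArrayType.name,
    ops,
    secs,
    opsPerSec: secs > 0 ? ops / secs : Infinity,
    last: c[n - 1], // kept so the result can be sanity-checked
  };
}

// Loop over the data types instead of editing a define and rebuilding.
const results = [Float32Array, Float64Array, Int32Array].map(
  (T) => benchAdd(T, 1 << 20, 20)
);
for (const r of results) {
  console.log(r.type, (r.opsPerSec / 1e9).toFixed(3), "Gops/sec");
}
```

The same loop could be extended with thread and block counts once the GPU side is wired in.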
Here is a sample output from my laptop GPU:
Starting bench1 last updated 28 Jan 2020

deviceQuery Starting...

 CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce GTX 1650 with Max-Q Design"
  CUDA Driver Version / Runtime Version          10.2 / 10.2
  CUDA Capability Major/Minor version number:    7.5
  Total amount of global memory:                 4096 MBytes (4294967296 bytes)
  (16) Multiprocessors, ( 64) CUDA Cores/MP:     1024 CUDA Cores
  GPU Max Clock rate:                            1245 MHz (1.25 GHz)
  Memory Clock rate:                             3501 Mhz
  Memory Bus Width:                              128-bit
  L2 Cache Size:                                 1048576 bytes
  Maximum Texture Dimension Size (x,y,z)         1D=(131072), 2D=(131072, 65536), 3D=(16384, 16384, 16384)
  Maximum Layered 1D Texture Size, (num) layers  1D=(32768), 2048 layers
  Maximum Layered 2D Texture Size, (num) layers  2D=(32768, 32768), 2048 layers
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  1024
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z): (1024, 1024, 64)
  Max dimension size of a grid size    (x,y,z): (2147483647, 65535, 65535)
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 6 copy engine(s)
  Run time limit on kernels:                     Yes
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  CUDA Device Driver Mode (TCC or WDDM):         WDDM (Windows Display Driver Model)
  Device supports Unified Addressing (UVA):      Yes
  Device supports Compute Preemption:            Yes
  Supports Cooperative Kernel Launch:            No
  Supports MultiDevice Co-op Kernel Launch:      No
  Device PCI Domain ID / Bus ID / location ID:   0 / 87 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 10.2, CUDA Runtime Version = 10.2, NumDevs = 1
Result = PASS

Commom setup took 0.077523 sec
CPU add.. 5.00 Gops took 9.448 secs for 0.529 TFLOPS
GPU setup (cudaMallocs & copy over) took 0.241 secs
GPU add.. 50.00 Gops took 6.996 secs for 7.147 TFLOPS, speedup = 13.504655
Copy back took 0.017 secs

C:\Users\svbre\Source\Repos\bench1\x64\Debug\bench1.exe (process 8340) exited with code 0.
To automatically close the console when debugging stops, enable Tools->Options->Debugging->Automatically close the console when debugging stops.
Press any key to close this window . . .
This shows the GPU turning out 7.147 TFLOPS or 13.5 times faster than doing the same calculation on the CPU.  Amazing.  deviceQuery tells me there are 1024 CUDA cores running at 1.25 GHz.  That looks like:
7.147e12 FLOPS / 1024 / 1.25e9 = 5.6 Floating Point Operations per CUDA core per machine cycle
Pretty amazing.  But equally amazing is that the CPU does so well, here turning out 529 GFLOPS.  The CPU is a "Quad-Core 10th Gen Intel® Core™ i7-1065G7 Processor with Hyper-Threading 1.3 GHz / 3.9 GHz (Base/Turbo)" per Razer.  I see approx. 45% CPU utilization as the CPU portion of my benchmark runs.  I'm guessing that's 45% of 4 cores, as I do see a spike of nearly 100% at the start.  I assume the vector addition itself is not multi-threaded, but then why 45% and not 25%?  Something else is going on.  I'll give it credit for 2 cores.  Then, if I do the same calculation as above,
529e9 FLOPS / 2 / 3.9e9 = 68 Floating Point Operations per i7 core per machine cycle
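I get 68 Floating Point Operations per machine cycle.  Is this possible?  As I google around, looking at i7 floating point performance numbers, I do see results in the hundreds of GFLOPS.  So, maybe.  But I also need to check my code, starting with the units: the log shows 5.00 Gops taking 9.448 secs, which is 0.529 Gops per second, so the TFLOPS label in my output may be off by a factor of 1000.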
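The arithmetic above is easy to recheck from the log's own numbers. This minimal sketch recomputes the two rates from the op counts and times in the log, then the per-core, per-cycle figures using the core counts and clock rates I assumed in the text. The function names are made up for the example.

```javascript
// Recompute rates from the figures in the benchmark log above.
function rate(ops, secs) {
  return ops / secs; // operations per second
}

function opsPerCorePerCycle(opsPerSec, cores, clockHz) {
  return opsPerSec / cores / clockHz;
}

const cpuRate = rate(5.0e9, 9.448);  // "CPU add.. 5.00 Gops took 9.448 secs"
const gpuRate = rate(50.0e9, 6.996); // "GPU add.. 50.00 Gops took 6.996 secs"

console.log("CPU ops/sec:", cpuRate.toExponential(3));
console.log("GPU ops/sec:", gpuRate.toExponential(3));
console.log("speedup:", (gpuRate / cpuRate).toFixed(2));

// Per core per cycle, with the core counts and clocks assumed in the text:
// 1024 CUDA cores at 1.25 GHz, and 2 i7 cores at 3.9 GHz.
console.log("GPU:", opsPerCorePerCycle(gpuRate, 1024, 1.25e9).toExponential(2));
console.log("CPU:", opsPerCorePerCycle(cpuRate, 2, 3.9e9).toExponential(2));
```

The rates are in operations per second, so dividing by 1e9 gives GFLOPS and dividing by 1e12 gives TFLOPS.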
You can download my current code (a Visual Studio C++ project file) here.
 GPU from javascript
Barranco, 2 Feb 2020
I'd like to invoke the GPU from javascript (i.e. a webpage).  Here are a couple of projects enabling that: gpu.js and turbo.js.
My plan would be to start with my benchmark (above), add plotting the results, and add varying the thread and block counts.  Then jump ahead to doing Neural Networks using javascript on the GPU.  Ambitious, I know.
Barranco, 3 Feb 2020
When I tried the gpu.js example, it appeared to run on the Intel GPU in my laptop.  I don't see a means to address the Nvidia GPU: there is nothing obvious in the API, and the couple of hits in google searches are not positive.  So, I will take a pass on gpu.js at this time.
turbo.js looks to be a little easier to sort out.  I see its GPU code is written in GLSL (a C-style language) and it uses WebGL (apparently an established interface to VR, GPU, etc. devices).  I don't see options to enumerate and strictly select a particular GPU in a multi-GPU system such as mine, but there is an option in the API that allows an app to ask for either the high-performance GPU (such as the Nvidia GPU) or the power saver GPU (such as the Intel GPU) in a system such as my laptop.  It doesn't guarantee the app will get that GPU, but it's worth a try to find out.
I see now that gpu.js also uses WebGL but the code is minified and I don't see an un-minified version.  The turbo.js GLSL looks a little more approachable anyway so I'll work with that.
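Here is a minimal sketch of asking for one GPU or the other. I take the option in question to be WebGL's `powerPreference` context creation attribute; the function name `createContext` is made up. Since the browser treats the preference as a hint, the sketch also reports which renderer it actually got, using the optional `WEBGL_debug_renderer_info` extension where the browser exposes it.

```javascript
// Ask WebGL for a context on the preferred GPU. powerPreference is only a
// hint, so report what was actually obtained.
function createContext(canvas, preference) {
  // preference: "high-performance", "low-power" or "default"
  const gl = canvas.getContext("webgl", { powerPreference: preference });
  if (!gl) {
    return { gl: null, renderer: "WebGL not available" };
  }
  // Optional extension that exposes the GPU name; it may be missing or masked.
  const info = gl.getExtension("WEBGL_debug_renderer_info");
  const renderer = info
    ? gl.getParameter(info.UNMASKED_RENDERER_WEBGL)
    : gl.getParameter(gl.RENDERER);
  return { gl, renderer };
}

// Browser-only usage; skipped when there is no DOM (e.g. under Node).
if (typeof document !== "undefined") {
  for (const pref of ["low-power", "high-performance"]) {
    const { renderer } = createContext(document.createElement("canvas"), pref);
    console.log(pref, "->", renderer);
  }
}
```

If both preferences report the same renderer, the browser or the driver is overriding the hint.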
Barranco, 7 Feb 2020
This will be my exercise in coding javascript on the GPU, a work in progress.  It is (or will be) a benchmark comparing the calculation of the Mandelbrot Set on your CPU vs one or two GPUs on your system (if you have them, of course).  It is based on the GPU-enabled, WebGL-based Mandelbrot demo described here and found here (do a "View page source" in your browser to, well, view the source code).  It is not clear who the original author is.  I thank and credit him nonetheless.  The bit that I have added is to wrap it in the benchmark.
The buttons aren't hooked up yet, but the way it will work is...
You click on one of the buttons under the Mandelbrot plot to start a benchmark run: on your CPU, on your GPU using WebGL's attempt to run on the low-performance/low-power GPU, or on the high-performance/high-power GPU.  The latter two options apply only if your system is so configured.  For example, my laptop, a 13" Razer intended for "gamers", is configured with both an Intel GPU and a higher performance Nvidia GPU.  Clicking on "Benchmark all three" will cycle through all three, as you might expect.  The plot of the Mandelbrot Set will automatically run through a series of load points.
The load is determined by the number of iterations in the Mandelbrot evaluation loop for each frame.  This has the effect of increasing/decreasing the resolution of the plot for that frame.  Presently, you can see the effect of this by left-clicking on the plot and seeing it zoom in with increasing resolution (number of iterations) as you hold down your mouse.
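To make the notion of load concrete, here is a minimal CPU-side sketch of the standard escape-time loop and of a frame evaluated at increasing iteration caps. It is not the demo's GLSL shader; the function names, the frame size and the region of the plane are assumptions for the example.

```javascript
// Escape-time count for one point c = (cx, cy): iterate z = z*z + c from z = 0
// until |z| > 2 or the iteration cap is hit.
function escapeTime(cx, cy, maxIter) {
  let x = 0, y = 0, i = 0;
  while (i < maxIter && x * x + y * y <= 4) {
    const xNext = x * x - y * y + cx;
    y = 2 * x * y + cy;
    x = xNext;
    i++;
  }
  return i;
}

// Evaluate one frame on the CPU and count the loop iterations actually run;
// that count is the work done for the frame.
function frameWork(width, height, maxIter) {
  let total = 0;
  for (let py = 0; py < height; py++) {
    for (let px = 0; px < width; px++) {
      const cx = -2.5 + (3.5 * px) / width;   // real axis: -2.5 .. 1.0
      const cy = -1.25 + (2.5 * py) / height; // imaginary axis: -1.25 .. 1.25
      total += escapeTime(cx, cy, maxIter);
    }
  }
  return total;
}

// A series of load points: the same frame at increasing iteration caps.
for (const maxIter of [16, 64, 256]) {
  const start = Date.now();
  const work = frameWork(160, 120, maxIter);
  console.log(maxIter, work, "iterations in", Date.now() - start, "ms");
}
```

Points inside the set always run to the cap, so raising the cap raises the work per frame even when the picture barely changes.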
Note: I presently don't have a way to test this on a system without a GPU. 
My to-dos: calculate TFLOPS and plot the results for each of CPU/slowGPU/fastGPU vs load on the plots on the right (presently displaying eye candy - a demo sine wave, courtesy of flotr2).
[Interactive benchmark page: Mandelbrot canvas with an FPS readout, and two plots, "Time per frame vs Load" and "GFLOPS vs Load", keyed by CPU, Power Saving GPU and Performance GPU.]

Ref: Core Language (GLSL)
Ref: WebGL Browser Report
Ref: WebGL MAX parameters support
Ref: Managing multi-GPU systems
Ref: How to run WebGL on discrete Nvidia GPU for notebooks with Nvidia Optimus
Ref: WebGL: Context creation parameters
 Artificial Neural Networks
Miraflores, 25 Jan 2020
Here are a couple of what appear to be good sample codes for Neural Networks on the GPU: My plan would be to start with one or both of those, then add to the code, say by generalizing the topology or inputs, or possibly by adding graphics.
 Neural Networks and Time Series
Miraflores, 28 Jan 2020
Back in the 80's, I was working in the IBM mainframe architecture and performance analysis group in Poughkeepsie, NY.  Something I was curious about was using NNs to analyze computer performance.  Time Series are commonly used to characterize activity in a system.  I thought it would be interesting to feed time series and summary data (utilizations, number of users, response time, time of day, etc) to a NN to model the system performance, and predict and analyze problems.  To start to explore this space, I thought I'd start with this course:
Modeling Time Series Data with Recurrent Neural Networks
 TensorFlow
Miraflores, 25 Jan 2020
Running TensorFlow on an Nvidia GPU on Windows looks like it may involve some fiddly setup.  An alternative is to run it on the Raspberry Pi or on Google's cloud.  I suspect this is a useful step toward my next exercise, Object Recognition, and toward exercises with Google's Coral and Nvidia's Jetson.
Ref: Nvidia Jetson Nano vs. Google Coral Dev board
 Object Recognition
Miraflores, 25 Jan 2020
I would very much like to get into the sort of object recognition shown in this video.  I'll see how much I can do on my laptop here, then carry on with it once I'm back in the States.
 Speeding up my Micro photography pipeline (not)
Miraflores, 29 Jan 2020
One of the reasons I wanted to look at programming the GPU was to see about speeding up my Micro photography pipeline.  The notion was to implement the conversion of the raw Sony image files as they stream off the camera (in ARW format), composing the 4 (or 16) pixel shifted raw images into 16-bit TIFFs for input to Zerene Stacker.  I knew there was an open source program, RawTherapee, available for this.  I see a couple of problems with this plan:
  1. RawTherapee comes with a significant GUI.  A "headless" core may already exist, but if not, stripping the source code down to one and then porting it to the GPU doesn't appeal to me.  Unless that separation is built into the common source, I would have to redo the split with every RawTherapee update.  Ugh.
  2. Moving the image data from CPU memory to GPU memory and the result back to save it in my NAS would take CPU time. The potential benefit by having the GPU crunch on the image vs the overhead in the memory moves is a question. 
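    [3 Feb 2020: CUDA 6 added Unified Memory, which removes the explicit copies from the code.  On a discrete GPU like mine the data still has to migrate between CPU and GPU memory.]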
  3. My Intel i7 is doing a pretty good job in my simple vector add benchmark.  Yes, there are higher performing GPUs but maybe 5 or 6 times the performance at a much higher price.
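The question in item 2 can at least be framed with the timings from my vector-add log. This is a minimal sketch; the function name is made up, and note that the "GPU setup" figure includes the cudaMallocs as well as the copy over.

```javascript
// Timings from the vector-add log, in seconds: GPU setup (cudaMallocs & copy
// over), the GPU add itself, and the copy back.
const setupSecs = 0.241;
const kernelSecs = 6.996;
const copyBackSecs = 0.017;

// Fraction of the GPU wall time spent outside the kernel.
function transferShare(setup, kernel, copyBack) {
  const moves = setup + copyBack;
  return moves / (moves + kernel);
}

const share = transferShare(setupSecs, kernelSecs, copyBackSecs);
console.log(
  "Share of GPU wall time spent outside the kernel:",
  (100 * share).toFixed(1) + "%"
);
```

In the vector-add run the moves are cheap relative to the compute.  A raw conversion that touches each image once would presumably have far less work per byte moved; how much less is something I would have to measure.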
All in all, I suspect that doing this using a small number of Linux-based CPUs in an MP cluster might be the best solution.  That is, a rack-mount server cluster or the Pi MP cluster that I've already started.
To-do: plot
  • $ per MIPS
  • Wattage per MIPS
  • Cu ft per MIPS
for the Raspberry Pi and rack-mounted server.
  Barranco, 5 Feb 2020
                            Cost    Watts    Cu In    MIPS(*)           MIPS/$        MIPS/Watt    MIPS/Cu In
  Pi 3B+ - 1GB              $45                       225, 209, 527     5, 5, 12
  Pi 4B - 1GB               $42                       925, 749, 2037    22, 18, 49
  Pi 4B - 2GB               $46                                         20, 16, 44
  Pi 4B - 4GB               $62                                         15, 12, 33
  2 x Six-Core X5650 Xeon   $140
  * - Linpack SP,DP,SP NEON; Pi
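The MIPS/$ column is just each Linpack figure divided by the board cost and rounded. A minimal sketch of that arithmetic follows; the 2GB and 4GB Pi 4B rows reuse the 1GB board's Linpack figures, which is my reading of the table (its ratios are consistent with that), and the names are made up for the example.

```javascript
// Linpack figures from the table: [SP, DP, SP NEON].
const pi3 = [225, 209, 527];
const pi4 = [925, 749, 2037]; // assumed to apply to all Pi 4B memory sizes

const boards = [
  { name: "Pi 3B+ - 1GB", cost: 45, mips: pi3 },
  { name: "Pi 4B - 1GB", cost: 42, mips: pi4 },
  { name: "Pi 4B - 2GB", cost: 46, mips: pi4 },
  { name: "Pi 4B - 4GB", cost: 62, mips: pi4 },
];

// Performance per dollar for each of the three Linpack figures.
function perDollar(board) {
  return board.mips.map((m) => Math.round(m / board.cost));
}

for (const b of boards) {
  console.log(b.name, perDollar(b).join(","));
}
```

The Watts and Cu In columns could be filled in the same way once I have measurements to divide by.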
  Barranco, 6 Feb 2020
Finding performance numbers that allow a fair comparison of Pi with Xeon is proving difficult.  These refurbished rack-mount Xeon servers are dirt cheap, albeit old.  I think I will order 1 or 2 and just try them with my ARW-to-TIFF app.  When I get the Beowulf cluster running on the Pis, it should be easy to add these cheap servers to my cluster and simply measure them.
For my Micro photography speedup problem, I'm thinking I'll try using the Raspberry Pi MP (or Linux-based, Intel CPU cluster) because:
  1. It should be relatively easy to port the RawTherapee code to the Pi.  They offer a Linux version for download, and it already runs on Windows.
  2. RawTherapee already has support for running in a "headless" mode, i.e. from the command line without the GUI.  It should be relatively easy to call it from a program (that I will write) that monitors images arriving in a NAS folder from the camera and kicks off RawTherapee to write the TIFF to a NAS folder that Zerene Stacker monitors, meanwhile checking for problems (errors, timeouts, bottlenecks, etc).  Streaming images from the camera to a NAS folder works reliably (it's built into Sony Imaging Edge and I've used it a lot without problems), and consuming images for stacking is reportedly built into Zerene.
  3. I think it may be easier to determine and mitigate bottlenecks in the MP setup.
  4. It should be fun and instructive.
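The monitoring program in item 2 could start from something like this. It is a sketch of the decision logic only, with no file system or process calls, and the folder names and file listing are hypothetical. The `rawtherapee-cli` flags are my reading of its command-line help and are an assumption to check against the installed version.

```javascript
// Pick out raw files that have arrived but have not been converted yet.
function pendingRawFiles(listing, alreadyDone) {
  return listing.filter(
    (name) => name.toLowerCase().endsWith(".arw") && !alreadyDone.has(name)
  );
}

// Build the argument list for one conversion.
// Assumed flags: -o output dir, -t TIFF, -b16 16 bits per channel,
// -c input file(s), which must come last.
function rawTherapeeArgs(inputPath, outputDir) {
  return ["-o", outputDir, "-t", "-b16", "-c", inputPath];
}

// Hypothetical folder names and listing, for illustration only.
const done = new Set(["DSC0001.ARW"]);
const listing = ["DSC0001.ARW", "DSC0002.ARW", "notes.txt"];
for (const name of pendingRawFiles(listing, done)) {
  const args = rawTherapeeArgs("/nas/incoming/" + name, "/nas/tiff");
  console.log("rawtherapee-cli " + args.join(" "));
  done.add(name); // a real program would mark this only after a clean exit
}
```

Keeping the selection and command building separate from the process launching should make it easier to add the error, timeout and bottleneck checks later.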

--- FIN ---