Found 98 repositories (showing 30)
srush
Solve puzzles. Learn CUDA.
modular
Learn GPU Programming in Mojo🔥 by Solving Puzzles
molyswu
using Neural Networks (SSD) on Tensorflow.

This repo documents the steps and scripts used to train a hand detector using Tensorflow (Object Detection API). As with any DNN-based task, the most expensive (and riskiest) part of the process is finding or creating the right (annotated) dataset. I was interested mainly in detecting hands on a table (egocentric viewpoint). I experimented first with the [Oxford Hands Dataset](http://www.robots.ox.ac.uk/~vgg/data/hands/) (the results were not good). I then tried the [Egohands Dataset](http://vision.soic.indiana.edu/projects/egohands/), which was a much better fit for my requirements.

The goal of this repo/post is to demonstrate how neural networks can be applied to the (hard) problem of tracking hands (egocentric and other views) and, better still, to provide code that can be adapted to other use cases. If you use this tutorial or models in your research or project, please cite [this](#citing-this-tutorial).

Here is the detector in action.

<img src="images/hand1.gif" width="33.3%"><img src="images/hand2.gif" width="33.3%"><img src="images/hand3.gif" width="33.3%">

Realtime detection on a video stream from a webcam.

<img src="images/chess1.gif" width="33.3%"><img src="images/chess2.gif" width="33.3%"><img src="images/chess3.gif" width="33.3%">

Detection on a Youtube video.

Both examples above were run on a Macbook Pro **CPU** (i7, 2.5GHz, 16GB). Some FPS numbers are:

| FPS | Image Size | Device | Comments |
| --- | --- | --- | --- |
| 21 | 320 * 240 | Macbook Pro (i7, 2.5GHz, 16GB) | Run without visualizing results |
| 16 | 320 * 240 | Macbook Pro (i7, 2.5GHz, 16GB) | Run while visualizing results (image above) |
| 11 | 640 * 480 | Macbook Pro (i7, 2.5GHz, 16GB) | Run while visualizing results (image above) |

> Note: The code in this repo is written and tested with Tensorflow `1.4.0-rc0`.
> Using a different version may result in [some errors](https://github.com/tensorflow/models/issues/1581). You may need to [generate your own frozen model](https://pythonprogramming.net/testing-custom-object-detector-tensorflow-object-detection-api-tutorial/?completed=/training-custom-objects-tensorflow-object-detection-api-tutorial/) graph using the [model checkpoints](model-checkpoint) in the repo to fit your TF version.

**Content of this document**

- Motivation - Why Track/Detect hands with Neural Networks
- Data preparation and network training in Tensorflow (Dataset, Import, Training)
- Training the hand detection Model
- Using the Detector to Detect/Track hands
- Thoughts on Optimizations

> P.S. If you are using or have used the models provided here, feel free to reach out on twitter ([@vykthur](https://twitter.com/vykthur)) and share your work!

## Motivation - Why Track/Detect hands with Neural Networks?

There are several existing approaches to tracking hands in the computer vision domain. Many of these approaches are rule based (e.g. extracting background based on texture and boundary features, or distinguishing between hands and background using color histograms and HOG classifiers), which makes them not very robust. For example, these algorithms might get confused if the background is unusual, if sharp changes in lighting conditions cause sharp changes in skin color, or if the tracked object becomes occluded. (See [here for a review paper](https://www.cse.unr.edu/~bebis/handposerev.pdf) on hand pose estimation from the HCI perspective.)

With sufficiently large datasets, neural networks provide an opportunity to train models that perform well and address the challenges of existing object tracking/detection algorithms - varied/poor lighting, noisy environments, diverse viewpoints and even occlusion.
The main drawbacks to using them for real-time tracking/detection are that they can be complex, are relatively slow compared to tracking-only algorithms, and it can be quite expensive to assemble a good dataset. But things are changing with advances in fast neural networks.

Furthermore, this entire area of work has been made more approachable by deep learning frameworks (such as the tensorflow object detection api) that simplify the process of training a model for custom object detection. More importantly, the advent of fast neural network models like ssd, faster r-cnn, and rfcn (see [here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md#coco-trained-models-coco-models)) makes neural networks an attractive candidate for real-time detection (and tracking) applications. Hopefully, this repo demonstrates this.

> If you are not interested in the process of training the detector, you can skip straight to applying the [pretrained model I provide in detecting hands](#detecting-hands).

Training a model is a multi-stage process (assembling the dataset, cleaning it, splitting it into training/test partitions and generating an inference graph). While I lightly touch on the details of these parts, a few other tutorials cover training a custom object detector using the tensorflow object detection api in more detail [see [here](https://pythonprogramming.net/training-custom-objects-tensorflow-object-detection-api-tutorial/) and [here](https://towardsdatascience.com/how-to-train-your-own-object-detector-with-tensorflows-object-detector-api-bec72ecfe1d9)]. I recommend you walk through those if you are interested in training a custom object detector from scratch.

## Data preparation and network training in Tensorflow (Dataset, Import, Training)

**The Egohands Dataset**

The hand detector model is built using data from the [Egohands Dataset](http://vision.soic.indiana.edu/projects/egohands/).
This dataset works well for several reasons. It contains high quality, pixel level annotations (>15000 ground truth labels) of where hands are located across 4800 images. All images are captured from an egocentric view (Google Glass) across 48 different environments (indoor, outdoor) and activities (playing cards, chess, jenga, solving puzzles etc).

<img src="images/egohandstrain.jpg" width="100%">

If you will be using the Egohands dataset, you can cite them as follows:

> Bambach, Sven, et al. "Lending a hand: Detecting hands and recognizing activities in complex egocentric interactions." Proceedings of the IEEE International Conference on Computer Vision. 2015.

The Egohands dataset (zip file with labelled data) contains 48 folders of locations where video data was collected (100 images per folder).

```
-- LOCATION_X
  -- frame_1.jpg
  -- frame_2.jpg
  ...
  -- frame_100.jpg
  -- polygons.mat  // contains annotations for all 100 images in current folder
-- LOCATION_Y
  -- frame_1.jpg
  -- frame_2.jpg
  ...
  -- frame_100.jpg
  -- polygons.mat  // contains annotations for all 100 images in current folder
```

**Converting data to Tensorflow Format**

Some initial work needs to be done on the Egohands dataset to transform it into the format (`tfrecord`) which Tensorflow needs to train a model. This repo contains `egohands_dataset_clean.py`, a script that will help you generate these csv files. It:

- Downloads the egohands dataset
- Renames all files to include their directory names, to ensure each filename is unique
- Splits the dataset into train (80%), test (10%) and eval (10%) folders
- Reads in `polygons.mat` for each folder, generates bounding boxes and visualizes them to ensure correctness (see image above)

Once the script is done running, you should have an images folder containing three folders - train, test and eval.
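The 80/10/10 split performed by the data prep script can be sketched in a few lines (a minimal sketch only; the function name `split_dataset` is illustrative, and the real `egohands_dataset_clean.py` also downloads, renames, and visualizes the data):

```python
import random

def split_dataset(filenames, seed=0):
    """Shuffle filenames and split into train (80%), test (10%), eval (10%).

    Illustrative sketch only; not the actual egohands_dataset_clean.py code.
    """
    files = list(filenames)
    random.Random(seed).shuffle(files)
    n = len(files)
    n_train = int(n * 0.8)
    n_test = int(n * 0.1)
    return {
        "train": files[:n_train],
        "test": files[n_train:n_train + n_test],
        "eval": files[n_train + n_test:],
    }

# 4800 images, as in the Egohands dataset
splits = split_dataset([f"frame_{i}.jpg" for i in range(4800)])
print(len(splits["train"]), len(splits["test"]), len(splits["eval"]))  # → 3840 480 480
```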
Each of these folders should also contain a csv label file - `train_labels.csv`, `test_labels.csv` - that can be used to generate `tfrecords`.

Note: While the egohands dataset provides four separate labels for hands (own left, own right, other left, and other right), for my purposes I am only interested in the general `hand` class, so I label all training data as `hand`. You can modify the data prep script to generate `tfrecords` that support 4 labels.

Next: convert your dataset + csv files to tfrecords. A helpful guide on this can be found [here](https://pythonprogramming.net/creating-tfrecord-files-tensorflow-object-detection-api-tutorial/). For each folder, you should be able to generate the `train.record` and `test.record` files required in the training process.

## Training the hand detection Model

Now that the dataset (and your tfrecords) has been assembled, the next task is to train a model based on it. With neural networks, it is possible to use a process called [transfer learning](https://www.tensorflow.org/tutorials/image_retraining) to shorten the amount of time needed to train the entire model. This means we can take an existing model (that has been trained well on a related domain, here image classification) and retrain its final layer(s) to detect hands for us. Sweet! Given that neural networks sometimes have thousands or millions of parameters that can take weeks or months to train, transfer learning helps shorten training time to possibly hours.

Tensorflow does offer a few models (in the tensorflow [model zoo](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/detection_model_zoo.md#coco-trained-models-coco-models)), and I chose to use the `ssd_mobilenet_v1_coco` model as my starting point given that it is currently (one of) the fastest models (read the SSD research [paper here](https://arxiv.org/pdf/1512.02325.pdf)).
The training process can be done locally on your CPU machine, which may take a while, or better on a (cloud) GPU machine (which is what I did). For reference, training on my macbook pro (tensorflow compiled from source to take advantage of the mac's cpu architecture), the maximum speed I got was 5 seconds per step, as opposed to the ~0.5 seconds per step I got with a GPU. For reference, it would take about 12 days to run 200k steps on my mac (i7, 2.5GHz, 16GB) compared to ~5hrs on a GPU.

> **Training on your own images**: Please use the [guide provided by Harrison from pythonprogramming](https://pythonprogramming.net/training-custom-objects-tensorflow-object-detection-api-tutorial/) on how to generate tfrecords given your label csv files and your images. The guide also covers how to start the training process if training locally. If training in the cloud using a service like GCP, see the [guide here](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/running_on_cloud.md).

As the training process progresses, the expectation is that total loss (error) gets reduced to its possible minimum (about a value of 1 or thereabouts). By observing the tensorboard graphs for total loss (see image below), it should be possible to get an idea of when the training process is complete (total loss does not decrease with further iterations/steps). I ran my training job for 200k steps (took about 5 hours) and stopped at a total loss value of 2.575. (In retrospect, I could have stopped the training at about 50k steps and gotten a similar total loss value.) With tensorflow, you can also run an evaluation concurrently that assesses your model to see how well it performs on the test data. A commonly used metric for performance is mean average precision (mAP), a single number used to summarize the area under the precision-recall curve.
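mAP here is reported at a 0.5 intersection-over-union (IoU) threshold. The IoU computation between a predicted and a ground-truth box can be sketched as follows (a minimal sketch; the `(xmin, ymin, xmax, ymax)` box format and the `iou` helper name are assumptions for illustration, not code from this repo):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (xmin, ymin, xmax, ymax)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # overlap extents are clamped at zero when the boxes do not intersect
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 10, 10), (0, 0, 10, 10)))  # identical boxes → 1.0
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))  # half-overlap: 50/150 → 0.3333333333333333
```

A predicted box counts as a true positive at 0.5 IoU when this value is at least 0.5 against some ground-truth box.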
mAP is a measure of how well the model generates a bounding box that has at least a 50% overlap with the ground truth bounding box in our test dataset. For the hand detector trained here, the mAP value was **0.9686@0.5IOU**. mAP values range from 0-1, the higher the better.

<img src="images/accuracy.jpg" width="100%">

Once training is completed, the trained inference graph (`frozen_inference_graph.pb`) is exported (see the earlier referenced guides for how to do this) and saved in the `hand_inference_graph` folder. Now it's time to do some interesting detection.

## Using the Detector to Detect/Track hands

If you have not done this yet, please follow the guide on installing [Tensorflow and the Tensorflow object detection api](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/installation.md). This will walk you through setting up the tensorflow framework and cloning the tensorflow github repo. The detection steps are:

- Load the `frozen_inference_graph.pb` trained on the hands dataset, as well as the corresponding label map. In this repo, this is done in the `utils/detector_utils.py` script by the `load_inference_graph` method.

```python
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')
    sess = tf.Session(graph=detection_graph)
print("> ====== Hand Inference graph loaded.")
```

- Detect hands. In this repo, this is done in the `utils/detector_utils.py` script by the `detect_objects` method.

```python
(boxes, scores, classes, num) = sess.run(
    [detection_boxes, detection_scores, detection_classes, num_detections],
    feed_dict={image_tensor: image_np_expanded})
```

- Visualize the detected bounding boxes. In this repo, this is done in the `utils/detector_utils.py` script by the `draw_box_on_image` method.
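Between detection and visualization, detections are typically filtered by a confidence threshold before any boxes are drawn. A minimal sketch of that filtering step (the `filter_detections` helper, the threshold value, and the tuple formats are illustrative assumptions, not code from `detector_utils.py`):

```python
def filter_detections(boxes, scores, threshold=0.2):
    """Keep only detections whose confidence score exceeds the threshold.

    Sketch only: `boxes` is a list of (ymin, xmin, ymax, xmax) tuples in
    normalized coordinates; `scores` holds the matching confidence values.
    """
    return [(box, score)
            for box, score in zip(boxes, scores)
            if score > threshold]

kept = filter_detections(
    boxes=[(0.1, 0.1, 0.4, 0.4), (0.5, 0.5, 0.9, 0.9)],
    scores=[0.95, 0.05],
)
print(len(kept))  # → 1 (the low-confidence box is dropped)
```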
This repo contains two scripts that tie all these steps together.

- `detect_multi_threaded.py`: A threaded implementation for reading camera video input and running detection. Takes a set of command line flags to set parameters such as `--display` (visualize detections), image parameters `--width` and `--height`, and video `--source` (0 for camera) etc.
- `detect_single_threaded.py`: Same as above, but single threaded. This script also works for video files by setting the video `--source` parameter to the path of a video file.

```cmd
# load and run detection on video at path "videos/chess.mov"
python detect_single_threaded.py --source videos/chess.mov
```

> Update: If you do have errors loading the frozen inference graph in this repo, feel free to generate a new graph that fits your TF version from the model-checkpoint in this repo. Use the [export_inference_graph.py](https://github.com/tensorflow/models/blob/master/research/object_detection/export_inference_graph.py) script provided in the tensorflow object detection api repo. More guidance on this [here](https://pythonprogramming.net/testing-custom-object-detector-tensorflow-object-detection-api-tutorial/?completed=/training-custom-objects-tensorflow-object-detection-api-tutorial/).

## Thoughts on Optimization

A few things led to noticeable performance increases.

- Threading: It turns out that reading images from a webcam is a heavy I/O operation, and if run on the main application thread it can slow down the program. I implemented some good ideas from [Adrian Rosebrock](https://www.pyimagesearch.com/2017/02/06/faster-video-file-fps-with-cv2-videocapture-and-opencv/) on parallelizing image capture across multiple worker threads. This mostly led to an FPS increase of about 5 points.
- For those new to Opencv, images from the `cv2.read()` method are returned in [BGR format](https://www.learnopencv.com/why-does-opencv-use-bgr-color-format/).
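The threaded capture idea above can be sketched as a producer thread that reads frames into a queue while the main thread consumes them (a generic sketch under assumptions: the fake `read_frame` source below stands in for `cv2.VideoCapture().read()`, which the repo actually uses):

```python
import queue
import threading

def capture_worker(read_frame, frame_queue, stop_event):
    """Producer: keep reading frames off the (slow) I/O source."""
    while not stop_event.is_set():
        frame = read_frame()
        if frame is None:  # source exhausted
            break
        frame_queue.put(frame)

# Stand-in frame source yielding 5 fake "frames"; a real app would
# wrap a webcam or video file here.
frames = iter(range(5))
read_frame = lambda: next(frames, None)

frame_queue = queue.Queue(maxsize=8)
stop_event = threading.Event()
worker = threading.Thread(
    target=capture_worker, args=(read_frame, frame_queue, stop_event)
)
worker.start()
worker.join()

# The main thread would run detection on each queued frame.
results = []
while not frame_queue.empty():
    results.append(frame_queue.get())
print(results)  # → [0, 1, 2, 3, 4]
```

Decoupling capture from detection this way is what lets the heavy I/O overlap with inference instead of blocking it.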
Ensure you convert to RGB before detection (accuracy will be much reduced if you don't).

```python
cv2.cvtColor(image_np, cv2.COLOR_BGR2RGB)
```

- Keeping your input image small will increase fps without any significant accuracy drop (I used about 320 x 240 compared to the 1280 x 720 which my webcam provides).
- Model quantization: moving from the current 32 bit to 8 bit can achieve up to a 4x reduction in the memory required to load and store models. One way to further speed up this model is to explore the use of [8-bit fixed point quantization](https://heartbeat.fritz.ai/8-bit-quantization-and-tensorflow-lite-speeding-up-mobile-inference-with-low-precision-a882dfcafbbd).

Performance can also be increased by a clever combination of tracking algorithms with the already decent detection, and this is something I am still experimenting with. If you have ideas for better optimizations, please share!

<img src="images/general.jpg" width="100%">

Note: The detector does reflect some limitations associated with the training set. These include non-egocentric viewpoints, very noisy backgrounds (e.g. in a sea of hands) and sometimes skin tone. There is opportunity to improve these with additional data.

## Integrating Multiple DNNs

One way to make things more interesting is to integrate our new knowledge of where "hands" are with other detectors trained to recognize other objects. Unfortunately, while our hand detector can in fact detect hands, it cannot detect other objects (a factor of how it is trained). Creating a detector that classifies multiple different objects would mean a long, involved process of assembling datasets for each class and a lengthy training process.

> Given the above, a potential strategy is to explore structures that allow us to **efficiently** interleave output from multiple pretrained models for various object classes and have them detect multiple objects on a single image.
An example of this is my primary use case, where I am interested in understanding the position of objects on a table with respect to hands on the same table. I am currently doing some work on a threaded application that loads multiple detectors and outputs bounding boxes on a single image. More on this soon.
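The interleaving strategy described above can be sketched as a simple merge of per-model outputs into one detection list for an image (a sketch only: `merge_detections` and its `(box, score)` format are hypothetical names for illustration, and each model is assumed to detect exactly one class):

```python
def merge_detections(per_model_outputs):
    """Combine (box, score) outputs from several single-class detectors
    into one labeled detection list for a single image.
    """
    merged = []
    for label, detections in per_model_outputs.items():
        for box, score in detections:
            merged.append({"label": label, "box": box, "score": score})
    # highest-confidence detections first
    return sorted(merged, key=lambda d: d["score"], reverse=True)

combined = merge_detections({
    "hand": [((0.1, 0.1, 0.3, 0.3), 0.96)],
    "chess_piece": [((0.4, 0.4, 0.5, 0.5), 0.88)],
})
print([d["label"] for d in combined])  # → ['hand', 'chess_piece']
```

A threaded version would run each detector in its own worker and merge results per frame.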
nightwalker89
Windows GPU Cuda BSGS Collider search Puzzles or Public keys
cryptoaivo
🚀 CrackBit - World's Fastest GPU-Accelerated Bitcoin Puzzle Solver 42+ Billion Keys/Sec on RTX 4090 | Multi-GPU Support | Real-Time Monitoring Solve Bitcoin puzzles 5-15x faster than existing tools with our CUDA-optimized engine. Features automatic resume, cluster mode, and military-grade encryption.
kaakkkaaaa
CuTeDSL GPU Puzzles 🧩 + Bonus 📦
neoblizz
Sudoku -- Puzzle Solver on GPU using CUDA.
shreyansh26
Solutions to the GPU Puzzles by Sasha Rush - https://github.com/srush/GPU-Puzzles
xoxo121
Solutions for GPU programming puzzles
hevnsnt
GPU-accelerated Bitcoin Puzzle solver using Pollard's Kangaroo algorithm. K=1.15 efficiency. CUDA + Metal.
kvatz
Solving N-Puzzle problem using AI Algorithms on GPU CUDA
lggurgel
Silikangaroo is a high-performance Bitcoin public key recovery tool (Pollard's Kangaroo algorithm) optimized specifically for Apple Silicon (M1/M2/M3) chips. It leverages the Metal API for GPU acceleration to solve Bitcoin puzzles efficiently on macOS.
pscamillo
GPU-accelerated Pollard's Kangaroo for secp256k1 ECDLP. Fork of RCKangaroo by RetiredCoder with bug fixes (endomorphism constants, sign-extension, Bloom filter), ALL-TAME mode, XDP 8x, ultra-compact 16-byte DPs, async BSGS resolver, and table freeze. Validated on Puzzle 79.
vincent-terpstra
a Sudoku solver using CUDA to solve a 25x25 puzzle on the GPU
Qalander
KeyKiller is designed to achieve extreme performance on modern NVIDIA GPUs, solving the Satoshi puzzle. It leverages CUDA, warp-level parallelism, and batched EC operations to push the boundaries of cryptographic key discovery. It is commonly used in research projects such as Secp256k1-CUDA, Brainflayer, BitCrack, Keyhunt CUDA, RCKangaroo.
Jigsaw puzzle reassembly through the parallelization of Genetic Algorithms (GAs) using Graphics Processing Units (GPUs) within a CUDA environment, with a specific focus on edge matching.
benrayfield
A decentralized wiki style interactive math book, based on a combinator (that is both a universal lambda function and a pattern calculus function of 6 parameters) in which it is extremely easier to say true things than to say false things, based on a logic similar to godel-number where one must commit to statements about lambda called on lambda returns lambda before one can verify which lambdas they are, and in theory scaleable enough for graphics, musical instruments, GPU number crunching, etc, but lets start simple, so everyone can understand and fit the pieces of the puzzle together.
AlexanderKud
bitcoin puzzle transformation GPU
isaccanedo
Solve puzzles. Learn CUDA
bekli23
🧠 F4lc0nPool - GPU Bitcoin Puzzle Hunting
d4mr
Learn GPU compute programming by solving puzzles in WGSL. 14 puzzles with a Rust + wgpu test harness.
Rjected
Solving a large number of timelock puzzles in parallel using GPU acceleration
nameera231
A program that uses the GPU to solve sudoku puzzles. These puzzles are an interesting example of a constraint-satisfaction problem, which is a general approach to solving a wide variety of problems. This computation also lends itself well to the GPU: a lot of available parallelism within each puzzle, and can solve many puzzles simultaneously.
eztam-
A high-performance macOS application for Bitcoin puzzle solving, leveraging Apple Silicon GPUs for accelerated private-key search.
achraf-azize
Training a double duel Q-learning agent to achieve the 2048 tile in the Puzzle Game 2048, within 3 hours constraint using google Colab GPU.
guih2
by Sasha Rush - srush_nlp

GPU architectures are critical to machine learning, and they seem to be becoming even more important every day. However, you can be an expert in machine learning without ever touching GPU code. It is a bit odd to always work through the abstraction. This notebook is an attempt to teach beginner GPU programming in a completely interactive fashion. Instead of providing text with concepts, it throws you right into coding and building GPU kernels. The exercises use NUMBA, which maps Python code directly to CUDA kernels. It looks like Python, but it is basically identical to writing low-level CUDA code. In a few hours, I think you can go from the basics to understanding the real algorithms that power 99% of deep learning today. If you do want to read the manual, it is here: NUMBA CUDA Guide.

I recommend doing these in Colab, as it is easy to get started. Be sure to make your own copy, turn on GPU mode in the settings (Runtime / Change runtime type, set Hardware accelerator to GPU), and then get to coding. Open in Colab.

(If you are into this style of puzzle, also check out my Tensor Puzzles for PyTorch.)

```python
!pip install -qqq git+https://github.com/danoneata/chalk@srush-patch-1
!wget -q https://github.com/srush/GPU-Puzzles/raw/main/robot.png https://github.com/srush/GPU-Puzzles/raw/main/lib.py

import numba
import numpy as np
import warnings
from lib import CudaProblem, Coord

warnings.filterwarnings(
    action="ignore", category=numba.NumbaPerformanceWarning, module="numba"
)
```

**Puzzle 1: Map**

Implement a "kernel" (GPU function) that adds 10 to each position of vector `a` and stores it in vector `out`. You have 1 thread per position.

Warning: This code looks like Python, but it is actually CUDA!
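The one-invocation-per-thread model behind these puzzles can be illustrated in plain Python (this is a simulation, not the notebook's NUMBA code; `run_kernel` is a stand-in for the CUDA launch, and `map_kernel` is one possible Puzzle 1 body):

```python
def run_kernel(kernel, out, a, threads):
    # The GPU conceptually runs `kernel` once per thread; only the
    # thread index differs between invocations.
    for thread_idx in range(threads):
        kernel(out, a, thread_idx)

def map_kernel(out, a, thread_idx):
    # Each thread handles exactly one position, mirroring
    # local_i = cuda.threadIdx.x in the real kernel.
    out[thread_idx] = a[thread_idx] + 10

SIZE = 4
a = list(range(SIZE))
out = [0] * SIZE
run_kernel(map_kernel, out, a, threads=SIZE)
print(out)  # → [10, 11, 12, 13]
```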
You cannot use standard Python tools like list comprehensions, or ask for Numpy properties like shape or size (if you need the size, it is given as an argument). The puzzles only require simple operations, basically +, *, simple array indexing, for loops, and if statements. If you get an error, it is probably because you did something fancy :).

Tip: Think of the function `call` as being run once for each thread. The only difference is that `cuda.threadIdx.x` changes each time.

```python
def map_spec(a):
    return a + 10

def map_test(cuda):
    def call(out, a) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Map", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec
)
problem.show()
problem.check()
```

```
# Map
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [10 11 12 13]
```

**Puzzle 2 - Zip**

Implement a kernel that adds together each position of `a` and `b` and stores it in `out`. You have 1 thread per position.

```python
def zip_spec(a, b):
    return a + b

def zip_test(cuda):
    def call(out, a, b) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Zip", zip_test, [a, b], out, threadsperblock=Coord(SIZE, 1), spec=zip_spec
)
problem.show()
problem.check()
```

```
# Zip
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [0 2 4 6]
```

**Puzzle 3 - Guards**

Implement a kernel that adds 10 to each position of `a` and stores it in `out`. You have more threads than positions.
```python
def map_guard_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Guard",
    map_guard_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(8, 1),
    spec=map_spec,
)
problem.show()
problem.check()
```

```
# Guard
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [10 11 12 13]
```

**Puzzle 4 - Map 2D**

Implement a kernel that adds 10 to each position of `a` and stores it in `out`. The input `a` is 2D and square. You have more threads than positions.

```python
def map_2D_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
    return call

SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
problem = CudaProblem(
    "Map 2D", map_2D_test, [a], out, [SIZE], threadsperblock=Coord(3, 3), spec=map_spec
)
problem.show()
problem.check()
```

```
# Map 2D
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
Failed Tests.
Yours: [[0. 0.] [0. 0.]]
Spec : [[10 11] [12 13]]
```

**Puzzle 5 - Broadcast**

Implement a kernel that adds `a` and `b` and stores it in `out`. Inputs `a` and `b` are vectors. You have more threads than positions.

```python
def broadcast_test(cuda):
    def call(out, a, b, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
    return call

SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE).reshape(SIZE, 1)
b = np.arange(SIZE).reshape(1, SIZE)
problem = CudaProblem(
    "Broadcast",
    broadcast_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    spec=zip_spec,
)
problem.show()
problem.check()
```

```
# Broadcast
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
Failed Tests.
Yours: [[0. 0.] [0. 0.]]
Spec : [[0 1] [1 2]]
```

**Puzzle 6 - Blocks**

Implement a kernel that adds 10 to each position of `a` and stores it in `out`. You have fewer threads per block than the size of `a`.

Tip: A block is a group of threads. The number of threads per block is limited, but we can have many different blocks. The variable `cuda.blockIdx` tells us what block we are in.

```python
def map_block_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
    return call

SIZE = 9
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Blocks",
    map_block_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(4, 1),
    blockspergrid=Coord(3, 1),
    spec=map_spec,
)
problem.show()
problem.check()
```

```
# Blocks
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [10 11 12 13 14 15 16 17 18]
```

**Puzzle 7 - Blocks 2D**

Implement the same kernel in 2D. You have fewer threads per block than the size of `a` in both directions.

```python
def map_block2D_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 4 lines)
    return call

SIZE = 5
out = np.zeros((SIZE, SIZE))
a = np.ones((SIZE, SIZE))
problem = CudaProblem(
    "Blocks 2D",
    map_block2D_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    blockspergrid=Coord(2, 2),
    spec=map_spec,
)
problem.show()
problem.check()
```

```
# Blocks 2D
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
Failed Tests.
Yours: [[0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.]]
Spec : [[11. 11. 11. 11. 11.] [11. 11. 11. 11. 11.] [11. 11. 11. 11. 11.] [11. 11. 11. 11. 11.] [11. 11. 11. 11. 11.]]
```

**Puzzle 8 - Shared**

Implement a kernel that adds 10 to each position of `a` and stores it in `out`. You have fewer threads per block than the size of `a`.
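The `blockIdx * blockDim + threadIdx` indexing used from the Blocks puzzle onward can be checked in plain Python (a simulation of the launch shape, not NUMBA code; `global_indices` is an illustrative helper):

```python
def global_indices(blocks, threads_per_block):
    """Enumerate the global index each (block, thread) pair computes,
    mirroring i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x."""
    return [block * threads_per_block + thread
            for block in range(blocks)
            for thread in range(threads_per_block)]

# The Blocks puzzle launch shape: 3 blocks of 4 threads over a size-9 array.
SIZE = 9
idxs = global_indices(blocks=3, threads_per_block=4)
print(idxs)  # → [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# The guard `if i < size:` is what keeps threads 9-11 from writing.
in_bounds = [i for i in idxs if i < SIZE]
print(len(in_bounds))  # → 9
```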
Tip: Each block can have a constant amount of shared memory that only the threads of that block can read and write. After writing to it, you must use `cuda.syncthreads` to ensure threads do not cross. (This example does not really need shared memory or syncthreads, but it is a demo.)

```python
TPB = 4

def shared_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x

        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        # FILL ME IN (roughly 2 lines)
    return call

SIZE = 8
out = np.zeros(SIZE)
a = np.ones(SIZE)
problem = CudaProblem(
    "Shared",
    shared_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(2, 1),
    spec=map_spec,
)
problem.show()
problem.check()
```

```
# Shared
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            1 |             0 |            0 |             1 |
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [11. 11. 11. 11. 11. 11. 11. 11.]
```

**Puzzle 9 - Pooling**

Implement a kernel that sums together the last 3 positions of `a` and stores them in `out`. You have 1 thread per position. You only need 1 global read and 1 global write per thread.

Tip: Remember to be careful about syncing.
```python
def pool_spec(a):
    out = np.zeros(*a.shape)
    for i in range(a.shape[0]):
        out[i] = a[max(i - 2, 0) : i + 1].sum()
    return out

TPB = 8

def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 8 lines)
    return call

SIZE = 8
out = np.zeros(SIZE)
a = np.arange(SIZE)
problem = CudaProblem(
    "Pooling",
    pool_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(1, 1),
    spec=pool_spec,
)
problem.show()
problem.check()
```

```
# Pooling
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [ 0.  1.  3.  6.  9. 12. 15. 18.]
```

**Puzzle 10 - Dot Product**

Implement a kernel that computes the dot product of `a` and `b` and stores it in `out`. You have 1 thread per position. You only need 1 global read and 1 global write per thread.

Note: For this problem, you do not need to worry about the number of shared reads. We will handle that challenge later.

```python
def dot_spec(a, b):
    return a @ b

TPB = 8

def dot_test(cuda):
    def call(out, a, b, size) -> None:
        a_shared = cuda.shared.array(TPB, numba.float32)
        b_shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 9 lines)
    return call

SIZE = 8
out = np.zeros(1)
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Dot",
    dot_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(SIZE, 1),
    blockspergrid=Coord(1, 1),
    spec=dot_spec,
)
problem.show()
problem.check()
```

```
# Dot
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |            0 |             0 |
Failed Tests.
Yours: [0.]
Spec : 140
```

**Puzzle 11 - 1D Convolution**

Implement a kernel that computes a 1D convolution between `a` and `b` and stores it in `out`.
You need to handle the general case. You only need 2 global reads and 1 global write per thread.

```python
def conv_spec(a, b):
    out = np.zeros(*a.shape)
    len = b.shape[0]
    for i in range(a.shape[0]):
        out[i] = sum([a[i + j] * b[j] for j in range(len) if i + j < a.shape[0]])
    return out


MAX_CONV = 5
TPB = 8
TPB_MAX_CONV = TPB + MAX_CONV
def conv_test(cuda):
    def call(out, a, b, a_size, b_size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 17 lines)

    return call
```

### Test 1

```python
SIZE = 6
CONV = 3
out = np.zeros(SIZE)
a = np.arange(SIZE)
b = np.arange(CONV)
problem = CudaProblem(
    "1D Conv (Simple)",
    conv_test,
    [a, b],
    out,
    [SIZE, CONV],
    Coord(1, 1),
    Coord(TPB, 1),
    spec=conv_spec,
)
problem.show()
```

```
# 1D Conv (Simple)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |             0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0.]
Spec : [ 5.  8. 11. 14.  5.  0.]
```

### Test 2

```python
out = np.zeros(15)
a = np.arange(15)
b = np.arange(4)
problem = CudaProblem(
    "1D Conv (Full)",
    conv_test,
    [a, b],
    out,
    [15, 4],
    Coord(2, 1),
    Coord(TPB, 1),
    spec=conv_spec,
)
problem.show()
```

```
# 1D Conv (Full)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |             0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14.  0.]
```

## Puzzle 12 - Prefix Sum

Implement a kernel that computes a sum over `a` and stores it in `out`. If the size of `a` is greater than the block size, only store the sum of each block.

We will do this using the parallel prefix sum algorithm in shared memory. That is, each step of the algorithm should sum together half of the remaining numbers.
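The "sum together half of the remaining numbers at each step" idea can be sketched in plain NumPy. The function below simulates the tree reduction one block performs in its shared memory (a sketch of the technique, not the kernel itself):

```python
import numpy as np

TPB = 8  # block size; assumed to be a power of two here


def block_sum_sketch(block):
    """Simulate a parallel tree reduction over one block's shared memory.

    At each step, simulated thread local_i adds in the value `stride`
    slots away, halving the number of active threads until slot 0 holds
    the block's sum.
    """
    cache = np.zeros(TPB)                    # shared memory, zero-padded
    cache[: block.shape[0]] = block          # each thread loads one element
    stride = TPB // 2
    while stride > 0:
        # (A real kernel needs cuda.syncthreads() before each step.)
        for local_i in range(stride):        # only the first `stride` threads act
            cache[local_i] += cache[local_i + stride]
        stride //= 2
    return cache[0]


print(block_sum_sketch(np.arange(8)))   # 28.0
print(block_sum_sketch(np.arange(8, 15)))  # 77.0 (a partially filled block)
```

With `TPB = 8` the loop runs 3 steps (strides 4, 2, 1), so each thread does O(log TPB) work instead of one thread doing O(TPB).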
Follow this diagram:

```python
TPB = 8
def sum_spec(a):
    out = np.zeros((a.shape[0] + TPB - 1) // TPB)
    for j, i in enumerate(range(0, a.shape[-1], TPB)):
        out[j] = a[i : i + TPB].sum()
    return out


def sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 12 lines)

    return call
```

### Test 1

```python
SIZE = 8
out = np.zeros(1)
inp = np.arange(SIZE)
problem = CudaProblem(
    "Sum (Simple)",
    sum_test,
    [inp],
    out,
    [SIZE],
    Coord(1, 1),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()
```

```
# Sum (Simple)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |             0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0.]
Spec : [28.]
```

### Test 2

```python
SIZE = 15
out = np.zeros(2)
inp = np.arange(SIZE)
problem = CudaProblem(
    "Sum (Full)",
    sum_test,
    [inp],
    out,
    [SIZE],
    Coord(2, 1),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()
```

```
# Sum (Full)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |             0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [0. 0.]
Spec : [28. 77.]
```

## Puzzle 13 - Axis Sum

Implement a kernel that computes a sum over each row of `a` and stores it in `out`.
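Each row can be reduced exactly like the block sum above; the new ingredient is a second grid dimension, where `cuda.blockIdx.y` selects the row (batch). A plain-NumPy simulation of that layout (a sketch, not the kernel):

```python
import numpy as np

TPB = 8


def axis_sum_sketch(a):
    """Simulate one block per row: block (0, batch) reduces row `batch`."""
    batch_count, size = a.shape
    out = np.zeros((batch_count, 1))
    for batch in range(batch_count):         # one simulated block per row
        cache = np.zeros(TPB)                # that block's shared memory
        cache[:size] = a[batch]              # each thread loads one element
        stride = TPB // 2
        while stride > 0:                    # tree reduction, as before
            for local_i in range(stride):
                cache[local_i] += cache[local_i + stride]
            stride //= 2
        out[batch, 0] = cache[0]             # thread 0 writes the row's sum
    return out


print(axis_sum_sketch(np.arange(24).reshape(4, 6)).ravel())  # [ 15.  51.  87. 123.]
```

In the real launch this outer loop disappears: the grid is `Coord(1, BATCH)`, so all rows reduce in parallel.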
```python
TPB = 8
def sum_spec(a):
    out = np.zeros((a.shape[0], (a.shape[1] + TPB - 1) // TPB))
    for j, i in enumerate(range(0, a.shape[-1], TPB)):
        out[..., j] = a[..., i : i + TPB].sum(-1)
    return out


def axis_sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        batch = cuda.blockIdx.y
        # FILL ME IN (roughly 12 lines)

    return call


BATCH = 4
SIZE = 6
out = np.zeros((BATCH, 1))
inp = np.arange(BATCH * SIZE).reshape((BATCH, SIZE))
problem = CudaProblem(
    "Axis Sum",
    axis_sum_test,
    [inp],
    out,
    [SIZE],
    Coord(1, BATCH),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()
```

```
# Axis Sum

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |             0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [[0.]
 [0.]
 [0.]
 [0.]]
Spec : [[ 15.]
 [ 51.]
 [ 87.]
 [123.]]
```

## Puzzle 14 - Matrix Multiply!

Implement a kernel that multiplies square matrices `a` and `b` and stores the result in `out`.

Hint: The most efficient algorithm here will copy a block into shared memory before computing each of the individual row-column dot products. This is easy to do if the matrix fits in shared memory. Do that case first. Then update your code to compute a partial dot product and iteratively move the part you copied into shared memory. You should be able to do the hard case in 6 global reads.
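The tiling strategy from the hint can be simulated in NumPy before writing the kernel: copy one `TPB × TPB` tile of `a` and of `b` into "shared memory", accumulate partial dot products, then slide to the next pair of tiles along the shared dimension. A sketch (assuming square inputs; not the kernel itself):

```python
import numpy as np

TPB = 3  # tile (block) width


def matmul_tiled_sketch(a, b):
    """Simulate shared-memory tiling for out = a @ b on square matrices."""
    size = a.shape[0]
    out = np.zeros((size, size))
    for bi in range(0, size, TPB):           # block row
        for bj in range(0, size, TPB):       # block column
            acc = np.zeros((TPB, TPB))       # per-thread partial dot products
            for k in range(0, size, TPB):    # slide tiles along the shared dim
                a_shared = np.zeros((TPB, TPB))
                b_shared = np.zeros((TPB, TPB))
                a_tile = a[bi : bi + TPB, k : k + TPB]
                b_tile = b[k : k + TPB, bj : bj + TPB]
                a_shared[: a_tile.shape[0], : a_tile.shape[1]] = a_tile
                b_shared[: b_tile.shape[0], : b_tile.shape[1]] = b_tile
                # (A real kernel syncs here; then each thread (i, j) adds
                #  sum_k a_shared[i, k] * b_shared[k, j] to its accumulator.)
                acc += a_shared @ b_shared
            tile = out[bi : bi + TPB, bj : bj + TPB]  # view into out
            tile[...] = acc[: tile.shape[0], : tile.shape[1]]
    return out


SIZE = 8
a = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
b = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T
print(np.allclose(matmul_tiled_sketch(a, b), a @ b))  # True
```

Zero-padding the edge tiles is safe because padded entries contribute 0 to every product; in the kernel, each tile move costs one `a` read plus one `b` read per thread, which is where the "6 global reads" budget comes from.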
```python
def matmul_spec(a, b):
    return a @ b


TPB = 3
def mm_oneblock_test(cuda):
    def call(out, a, b, size: int) -> None:
        a_shared = cuda.shared.array((TPB, TPB), numba.float32)
        b_shared = cuda.shared.array((TPB, TPB), numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 14 lines)

    return call
```

### Test 1

```python
SIZE = 2
out = np.zeros((SIZE, SIZE))
inp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
inp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T
problem = CudaProblem(
    "Matmul (Simple)",
    mm_oneblock_test,
    [inp1, inp2],
    out,
    [SIZE],
    Coord(1, 1),
    Coord(TPB, TPB),
    spec=matmul_spec,
)
problem.show(sparse=True)
```

```
# Matmul (Simple)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |             0 |             0 |
```

```python
problem.check()
```

```
Failed Tests.
Yours: [[0. 0.]
 [0. 0.]]
Spec : [[ 1  3]
 [ 3 13]]
```

### Test 2

```python
SIZE = 8
out = np.zeros((SIZE, SIZE))
inp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
inp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T
problem = CudaProblem(
    "Matmul (Full)",
    mm_oneblock_test,
    [inp1, inp2],
    out,
    [SIZE],
    Coord(3, 3),
    Coord(TPB, TPB),
    spec=matmul_spec,
)
problem.show(sparse=True)
```

```
# Matmul (Full)

Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
|            0 |             0 |             0 |             0 |
```