Found 22 repositories (showing 22)
himanshub1007
# AD-Prediction: Convolutional Neural Networks for Alzheimer's Disease Prediction Using Brain MRI Images

## Abstract
Alzheimer's disease (AD) is characterized by severe memory loss and cognitive impairment. It is associated with significant structural brain changes, which can be measured by magnetic resonance imaging (MRI). These observable preclinical structural changes provide an opportunity for early AD detection using image classification tools, such as convolutional neural networks (CNNs). However, most AD-related studies to date have been limited by small sample sizes, so finding an efficient way to train an image classifier on limited data is critical. In our project, we explored different CNN-based transfer-learning methods for AD prediction from structural brain MRI images. We found that both a pretrained 2D AlexNet with a 2D-representation method and a simple neural network with a pretrained 3D autoencoder improved prediction performance compared to a deep CNN trained from scratch. The pretrained 2D AlexNet performed even better (**86%**) than the 3D CNN with autoencoder (**77%**).

## Method
#### 1. Data
In this project, we used public brain MRI data from the **Alzheimer's Disease Neuroimaging Initiative (ADNI)** study. ADNI is an ongoing, multicenter cohort study that started in 2004 and focuses on understanding the diagnostic and predictive value of Alzheimer's disease-specific biomarkers. The ADNI study has three phases: ADNI1, ADNI-GO, and ADNI2. Both ADNI1 and ADNI2 recruited new AD patients and normal controls as research participants. Our data included a total of 686 structural MRI scans from the ADNI1 and ADNI2 phases, with 310 AD cases and 376 normal controls. We randomly divided the total sample into a training dataset (n = 519), a validation dataset (n = 100), and a testing dataset (n = 67).

#### 2. Image preprocessing
Image preprocessing was conducted using Statistical Parametric Mapping (SPM) software, version 12.
The original MRI scans were first skull-stripped and segmented using a segmentation algorithm based on 6-tissue probability mapping, and then normalized to the International Consortium for Brain Mapping template of European brains using affine registration. Other configuration included bias, noise, and global intensity normalization. The standard preprocessing pipeline output 3D image files with a uniform size of 121x145x121. Skull-stripping and normalization ensured comparability between images by transforming each original brain image into a standard image space, so that the same brain substructures are aligned at the same image coordinates across participants. Diluted or enhanced intensity was used to compensate for the structural changes. In our project, we used both the whole brain (including both grey matter and white matter) and grey matter only.

#### 3. AlexNet and Transfer Learning
Convolutional Neural Networks (CNNs) are very similar to ordinary neural networks. A CNN consists of an input and an output layer, as well as multiple hidden layers. The hidden layers are either convolutional, pooling, or fully connected. ConvNet architectures make the explicit assumption that the inputs are images, which allows certain properties to be encoded into the architecture. These make the forward function more efficient to implement and vastly reduce the number of parameters in the network.

#### 3.1. AlexNet
The network contains eight layers with weights; the first five are convolutional and the remaining three are fully connected. The overall architecture is shown in Figure 1. The output of the last fully-connected layer is fed to a 1000-way softmax which produces a distribution over the 1000 class labels. AlexNet maximizes the multinomial logistic regression objective, which is equivalent to maximizing the average across training cases of the log-probability of the correct label under the prediction distribution.
The kernels of the second, fourth, and fifth convolutional layers are connected only to those kernel maps in the previous layer which reside on the same GPU (as shown in Figure 1). The kernels of the third convolutional layer are connected to all kernel maps in the second layer. The neurons in the fully-connected layers are connected to all neurons in the previous layer. Response-normalization layers follow the first and second convolutional layers. Max-pooling layers follow both response-normalization layers as well as the fifth convolutional layer. The ReLU non-linearity is applied to the output of every convolutional and fully-connected layer. The first convolutional layer filters the 224x224x3 input image with 96 kernels of size 11x11x3 with a stride of 4 pixels (this is the distance between the receptive field centers of neighboring neurons in a kernel map). The second convolutional layer takes as input the (response-normalized and pooled) output of the first convolutional layer and filters it with 256 kernels of size 5x5x48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3x3x256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3x3x192, and the fifth convolutional layer has 256 kernels of size 3x3x192. The fully-connected layers have 4096 neurons each.

#### 3.2. Transfer Learning
Training an entire convolutional network from scratch (with random initialization) is often impractical [14] because it is relatively rare to have a dataset of sufficient size. An alternative is to pretrain a ConvNet on a very large dataset (e.g. ImageNet), and then use the ConvNet either as an initialization or as a fixed feature extractor for the task of interest.
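The fixed-feature-extractor route can be sketched in a few lines of numpy. This is only an illustration, not the project's code: random 4096-D vectors stand in for the AlexNet "CNN codes" that a real pipeline would extract, with a small synthetic class shift, and a binary softmax (logistic) classifier is trained on top of the frozen features by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for AlexNet "CNN codes": in a real pipeline these
# would be the 4096-D activations of the last hidden layer per image.
n, d = 200, 4096
X = rng.normal(size=(n, d))
y = (rng.random(n) < 0.5).astype(int)   # 0 = normal control, 1 = AD
X[y == 1] += 0.05                       # small synthetic class separation

# Train a binary softmax (logistic) classifier on the frozen features.
w, b, lr = np.zeros(d), 0.0, 0.1
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(class 1)
    grad_w = X.T @ (p - y) / n              # cross-entropy gradient wrt w
    grad_b = (p - y).mean()
    w -= lr * grad_w
    b -= lr * grad_b

acc = (((X @ w + b) > 0).astype(int) == y).mean()  # training accuracy
```

Only the linear classifier's weights `w`, `b` are trained; the feature extractor itself stays fixed, which is what makes this approach workable on small datasets.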
Typically, there are three major transfer learning scenarios:

**ConvNet as fixed feature extractor:** We can take a ConvNet pretrained on ImageNet, remove the last fully-connected layer, and treat the rest of the network as a fixed feature extractor for the target dataset. In AlexNet, this yields a 4096-D vector for each image. These features are usually called CNN codes. Once we have these features, we can train a linear classifier (e.g. a linear SVM or softmax classifier) for our target dataset.

**Fine-tuning the ConvNet:** Another idea is to not only replace the last fully-connected layer of the classifier, but also fine-tune the weights of the pretrained network. Due to overfitting concerns, we may fine-tune only the higher-level part of the network. This is motivated by the observation that the earlier layers of a ConvNet contain more generic features (e.g. edge detectors or color blob detectors) that are useful for many kinds of tasks, while the later layers become progressively more specific to the details of the classes contained in the original dataset.

**Pretrained models:** The released pretrained model is usually the final ConvNet checkpoint, so it is common to see people use such a network for fine-tuning.

#### 4. 3D Autoencoder and Convolutional Neural Network
We take a two-stage approach: we first train a 3D sparse autoencoder to learn filters for convolution operations, and then build a convolutional neural network whose first layer uses the filters learned with the autoencoder.

#### 4.1. Sparse Autoencoder
An autoencoder is a 3-layer neural network that is used to extract features from an input such as an image. Sparse representations can provide a simple interpretation of the input data in terms of a small number of "parts" by extracting the structure hidden in the data.
The autoencoder has an input layer, a hidden layer, and an output layer; the input and output layers have the same number of units, while the hidden layer contains more units, giving a sparse and overcomplete representation. The encoder function maps the input x to a representation h, and the decoder function aims to reconstruct the input from the hidden representation h. In our problem, we extract 3D patches from the scans as the input to the network.

#### 4.2. 3D Convolutional Neural Network
Training the 3D convolutional neural network (CNN) is the second stage. The CNN we use in this project has one convolutional layer, one pooling layer, two linear layers, and finally a log-softmax layer. After training the sparse autoencoder, we take the weights and biases of the encoder from the trained model and use them as the 3D filters of the convolutional layer of this 1-layer convolutional neural network. Figure 2 shows the architecture of the network.

#### 5. Tools
In this project, we used Nibabel for MRI image processing and PyTorch for the neural network implementations.
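The core of the two-stage idea in Section 4 — reusing trained encoder weights as a 3D convolution filter — can be sketched in plain numpy. This is an illustration under stated assumptions, not the project's PyTorch code: `W` and `b` are random placeholders for the weights and bias a trained sparse autoencoder would provide for one hidden unit, and the volume is a toy stand-in for a 121x145x121 scan.

```python
import numpy as np

rng = np.random.default_rng(1)

k = 3                                # 3x3x3 patches fed to the encoder
W = rng.normal(size=(k, k, k))       # placeholder encoder weights (one unit)
b = 0.1                              # placeholder encoder bias

volume = rng.normal(size=(8, 8, 8))  # toy stand-in for a full MRI volume

# Valid 3D convolution: slide the encoder filter over the volume, so each
# output voxel is that hidden unit's pre-activation for one 3D patch.
out_shape = tuple(s - k + 1 for s in volume.shape)
out = np.zeros(out_shape)
for x in range(out_shape[0]):
    for y in range(out_shape[1]):
        for z in range(out_shape[2]):
            patch = volume[x:x+k, y:y+k, z:z+k]
            out[x, y, z] = np.sum(patch * W) + b

print(out.shape)  # (6, 6, 6)
```

Stacking one such output map per hidden unit gives the feature maps that the rest of the 1-layer CNN (pooling, linear layers, log-softmax) would then consume.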
bellos1203
STPN - Weakly Supervised Action Localization by Sparse Temporal Pooling Network
ZiningWang
SHPL: Fusing Bird's Eye View LIDAR Point Cloud and Front View Camera Image for Deep Object Detection
No description available
Crowdsourced judgement data for the "Shallow Pooling for Sparse Labels" paper
m0hssn
This repository provides a smooth max pooling implementation using the LogSumExp (LSE) function. Unlike traditional max pooling, which can result in sparse gradients, our approach approximates the maximum operation to ensure more effective gradient distribution.
silent567
Graph Structured Sparse Attentional Pooling-v2
HaruoHosoya
Hierarchical sparse coding / pooling based on PCA dimension reduction
SarithRavI
SkipPool: Improved Sparse Hierarchical Graph Pooling with Differentiable Exploration.
guih2
by Sasha Rush - srush_nlp

GPU architectures are critical for machine learning, and they seem to be becoming even more important every day. However, you can be an expert in machine learning without ever touching GPU code. It is a bit odd to always work through the abstraction. This notebook is an attempt to teach beginner GPU programming in a completely interactive fashion. Instead of providing text with concepts, it throws you right into coding and building GPU kernels. The exercises use NUMBA, which maps Python code directly to CUDA kernels. It looks like Python, but it is basically identical to writing low-level CUDA code. In a few hours, I think you can go from basics to understanding the real algorithms that power 99% of deep learning today. If you do want to read the manual, it is here: NUMBA CUDA Guide. I recommend doing these in Colab, as it is easy to get started. Be sure to make your own copy, turn on GPU mode in the settings (Runtime / Change runtime type, set Hardware accelerator to GPU), and then get to coding.

Open in Colab (If you are into this style of puzzle, also check out my Tensor Puzzles for PyTorch.)

!pip install -qqq git+https://github.com/danoneata/chalk@srush-patch-1
!wget -q https://github.com/srush/GPU-Puzzles/raw/main/robot.png https://github.com/srush/GPU-Puzzles/raw/main/lib.py

import numba
import numpy as np
import warnings
from lib import CudaProblem, Coord

warnings.filterwarnings(
    action="ignore", category=numba.NumbaPerformanceWarning, module="numba"
)

Puzzle 1: Map

Implement a "kernel" (GPU function) that adds 10 to each position of vector a and stores it in vector out. You have 1 thread per position.

Warning: This code looks like Python, but it is actually CUDA!
You cannot use standard Python tools like list comprehensions, or ask for Numpy properties like shape or size (if you need the size, it is given as an argument). The puzzles only require simple operations, basically +, *, simple array indexing, for loops, and if statements. If you get an error, it is probably because you did something fancy :).

Hint: Think of the function call as being run 1 time for each thread. The only difference is that cuda.threadIdx.x changes each time.

def map_spec(a):
    return a + 10

def map_test(cuda):
    def call(out, a) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)

    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Map", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec
)
problem.show()

# Map
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [10 11 12 13]

Puzzle 2 - Zip

Implement a kernel that adds together each position of a and b and stores it in out. You have 1 thread per position.

def zip_spec(a, b):
    return a + b

def zip_test(cuda):
    def call(out, a, b) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)

    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Zip", zip_test, [a, b], out, threadsperblock=Coord(SIZE, 1), spec=zip_spec
)
problem.show()

# Zip
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [0 2 4 6]

Puzzle 3 - Guards

Implement a kernel that adds 10 to each position of a and stores it in out. You have more threads than positions.
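The hint above can be made concrete with a plain-Python analogy (no CUDA and no notebook machinery, so this is a mental model rather than a solution you can paste in): the kernel body runs once per thread, only the thread index changes between runs, and a guard keeps the extra threads from writing out of bounds.

```python
import numpy as np

def run_kernel(call, threads, *args):
    # CPU analogy of the CUDA model: the same body runs once per thread;
    # only the thread index differs between invocations.
    for t in range(threads):
        call(t, *args)

SIZE = 4
a = np.arange(SIZE)
out = np.zeros(SIZE)

def call(local_i, out, a, size):
    if local_i < size:          # the "guard": extra threads fall through
        out[local_i] = a[local_i] + 10

run_kernel(call, 8, out, a, SIZE)   # 8 threads, only 4 positions
print(out)  # [10. 11. 12. 13.]
```

On a real GPU the per-thread calls run in parallel rather than in a loop, which is why later puzzles need synchronization; the indexing logic, however, is exactly this.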
def map_guard_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)

    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Guard",
    map_guard_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(8, 1),
    spec=map_spec,
)
problem.show()

# Guard
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0. 0. 0. 0.]
Spec : [10 11 12 13]

Puzzle 4 - Map 2D

Implement a kernel that adds 10 to each position of a and stores it in out. The input a is 2D and square. You have more threads than positions.

def map_2D_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)

    return call

SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
problem = CudaProblem(
    "Map 2D", map_2D_test, [a], out, [SIZE], threadsperblock=Coord(3, 3), spec=map_spec
)
problem.show()

# Map 2D
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [[0. 0.] [0. 0.]]
Spec : [[10 11] [12 13]]

Puzzle 5 - Broadcast

Implement a kernel that adds a and b and stores it in out. The inputs a and b are vectors. You have more threads than positions.

def broadcast_test(cuda):
    def call(out, a, b, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)

    return call

SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE).reshape(SIZE, 1)
b = np.arange(SIZE).reshape(1, SIZE)
problem = CudaProblem(
    "Broadcast",
    broadcast_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    spec=zip_spec,
)
problem.show()

# Broadcast
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [[0. 0.] [0. 0.]]
Spec : [[0 1] [1 2]]

Puzzle 6 - Blocks

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.

Hint: A block is a group of threads. The number of threads per block is limited, but we can have many different blocks. The variable cuda.blockIdx tells us which block we are in.

def map_block_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)

    return call

SIZE = 9
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Blocks",
    map_block_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(4, 1),
    blockspergrid=Coord(3, 1),
    spec=map_spec,
)
problem.show()

# Blocks
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [10 11 12 13 14 15 16 17 18]

Puzzle 7 - Blocks 2D

Implement the same kernel in 2D. You have fewer threads per block than the size of a in both directions.

def map_block2D_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 4 lines)

    return call

SIZE = 5
out = np.zeros((SIZE, SIZE))
a = np.ones((SIZE, SIZE))
problem = CudaProblem(
    "Blocks 2D",
    map_block2D_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    blockspergrid=Coord(2, 2),
    spec=map_spec,
)
problem.show()

# Blocks 2D
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [[0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.] [0. 0. 0. 0. 0.]]
Spec : [[11. 11. 11. 11. 11.] [11. 11. 11. 11. 11.] [11. 11. 11. 11. 11.] [11. 11. 11. 11. 11.] [11. 11. 11. 11. 11.]]

Puzzle 8 - Shared

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.
Hint: Each block can have a constant amount of shared memory that only the threads in that block can read and write to. After writing to shared memory, you need to use cuda.syncthreads to make sure the threads do not cross over each other. (This example does not really need shared memory or syncthreads, but it is a demo.)

TPB = 4

def shared_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x

        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        # FILL ME IN (roughly 2 lines)

    return call

SIZE = 8
out = np.zeros(SIZE)
a = np.ones(SIZE)
problem = CudaProblem(
    "Shared",
    shared_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(2, 1),
    spec=map_spec,
)
problem.show()

# Shared
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 1 | 0 | 0 | 1 |

problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [11. 11. 11. 11. 11. 11. 11. 11.]

Puzzle 9 - Pooling

Implement a kernel that sums together the last 3 positions of a and stores it in out. You have 1 thread per position. You only need 1 global read and 1 global write per thread.

Hint: Remember to be careful about syncing.
def pool_spec(a):
    out = np.zeros(*a.shape)
    for i in range(a.shape[0]):
        out[i] = a[max(i - 2, 0) : i + 1].sum()
    return out

TPB = 8

def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 8 lines)

    return call

SIZE = 8
out = np.zeros(SIZE)
a = np.arange(SIZE)
problem = CudaProblem(
    "Pooling",
    pool_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(1, 1),
    spec=pool_spec,
)
problem.show()

# Pooling
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [ 0. 1. 3. 6. 9. 12. 15. 18.]

Puzzle 10 - Dot Product

Implement a kernel that computes the dot product of a and b and stores it in out. You have 1 thread per position. You only need 1 global read and 1 global write per thread.

Note: For this problem you do not need to worry about the number of shared reads. We will handle that challenge later.

def dot_spec(a, b):
    return a @ b

TPB = 8

def dot_test(cuda):
    def call(out, a, b, size) -> None:
        a_shared = cuda.shared.array(TPB, numba.float32)
        b_shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 9 lines)

    return call

SIZE = 8
out = np.zeros(1)
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Dot",
    dot_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(SIZE, 1),
    blockspergrid=Coord(1, 1),
    spec=dot_spec,
)
problem.show()

# Dot
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0.]
Spec : 140

Puzzle 11 - 1D Convolution

Implement a kernel that computes a 1D convolution between a and b and stores it in out.
You need to handle the general case. You only need 2 global reads and 1 global write per thread.

def conv_spec(a, b):
    out = np.zeros(*a.shape)
    len = b.shape[0]
    for i in range(a.shape[0]):
        out[i] = sum([a[i + j] * b[j] for j in range(len) if i + j < a.shape[0]])
    return out

MAX_CONV = 5
TPB = 8
TPB_MAX_CONV = TPB + MAX_CONV

def conv_test(cuda):
    def call(out, a, b, a_size, b_size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 17 lines)

    return call

Test 1

SIZE = 6
CONV = 3
out = np.zeros(SIZE)
a = np.arange(SIZE)
b = np.arange(CONV)
problem = CudaProblem(
    "1D Conv (Simple)",
    conv_test,
    [a, b],
    out,
    [SIZE, CONV],
    Coord(1, 1),
    Coord(TPB, 1),
    spec=conv_spec,
)
problem.show()

# 1D Conv (Simple)
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0.]
Spec : [ 5. 8. 11. 14. 5. 0.]

Test 2

out = np.zeros(15)
a = np.arange(15)
b = np.arange(4)
problem = CudaProblem(
    "1D Conv (Full)",
    conv_test,
    [a, b],
    out,
    [15, 4],
    Coord(2, 1),
    Coord(TPB, 1),
    spec=conv_spec,
)
problem.show()

# 1D Conv (Full)
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
Spec : [14. 20. 26. 32. 38. 44. 50. 56. 62. 68. 74. 80. 41. 14. 0.]

Puzzle 12 - Prefix Sum

Implement a kernel that computes a sum over a and stores it in out. If the size of a is greater than the block size, only store the sum of each block. We will do this using the parallel prefix sum algorithm in shared memory. That is, each step of the algorithm should sum together half of the remaining numbers.
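The halving strategy just described can be sketched sequentially on the CPU (a mental model only, not a CUDA solution): at each step the active "threads" each add the element one stride away, so the number of partial sums halves until the block's total sits in slot 0.

```python
import numpy as np

TPB = 8
a = np.arange(TPB).astype(float)    # one block's worth of data
cache = a.copy()                    # stands in for shared memory

# Log-step reduction: each active "thread" local_i adds the element one
# stride away, halving the number of partial sums every step.
stride = TPB // 2
while stride > 0:
    for local_i in range(stride):   # threads with local_i < stride are active
        cache[local_i] += cache[local_i + stride]
    stride //= 2                    # (on a GPU, cuda.syncthreads() goes here)

print(cache[0])  # 28.0 == 0+1+...+7
```

The GPU version does each inner loop in parallel across threads, which is why a syncthreads barrier is needed between strides: every write at one stride must land before any read at the next.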
Follow this diagram:

TPB = 8

def sum_spec(a):
    out = np.zeros((a.shape[0] + TPB - 1) // TPB)
    for j, i in enumerate(range(0, a.shape[-1], TPB)):
        out[j] = a[i : i + TPB].sum()
    return out

def sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 12 lines)

    return call

Test 1

SIZE = 8
out = np.zeros(1)
inp = np.arange(SIZE)
problem = CudaProblem(
    "Sum (Simple)",
    sum_test,
    [inp],
    out,
    [SIZE],
    Coord(1, 1),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()

# Sum (Simple)
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0.]
Spec : [28.]

Test 2

SIZE = 15
out = np.zeros(2)
inp = np.arange(SIZE)
problem = CudaProblem(
    "Sum (Full)",
    sum_test,
    [inp],
    out,
    [SIZE],
    Coord(2, 1),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()

# Sum (Full)
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [0. 0.]
Spec : [28. 77.]

Puzzle 13 - Axis Sum

Implement a kernel that computes a sum over each row of a and stores it in out.
TPB = 8

def sum_spec(a):
    out = np.zeros((a.shape[0], (a.shape[1] + TPB - 1) // TPB))
    for j, i in enumerate(range(0, a.shape[-1], TPB)):
        out[..., j] = a[..., i : i + TPB].sum(-1)
    return out

def axis_sum_test(cuda):
    def call(out, a, size: int) -> None:
        cache = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        batch = cuda.blockIdx.y
        # FILL ME IN (roughly 12 lines)

    return call

BATCH = 4
SIZE = 6
out = np.zeros((BATCH, 1))
inp = np.arange(BATCH * SIZE).reshape((BATCH, SIZE))
problem = CudaProblem(
    "Axis Sum",
    axis_sum_test,
    [inp],
    out,
    [SIZE],
    Coord(1, BATCH),
    Coord(TPB, 1),
    spec=sum_spec,
)
problem.show()

# Axis Sum
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [[0.] [0.] [0.] [0.]]
Spec : [[ 15.] [ 51.] [ 87.] [123.]]

Puzzle 14 - Matrix Multiply!

Implement a kernel that multiplies square matrices a and b and stores the result in out.

Hint: The most efficient algorithm here will copy a block into shared memory before computing each of the individual row-column dot products. This is easy to do if the matrix fits in shared memory. Do that case first. Then update your code to compute a partial dot product and iteratively move the part you copied into shared memory. You should be able to do the hard case in 6 global reads.
def matmul_spec(a, b):
    return a @ b

TPB = 3

def mm_oneblock_test(cuda):
    def call(out, a, b, size: int) -> None:
        a_shared = cuda.shared.array((TPB, TPB), numba.float32)
        b_shared = cuda.shared.array((TPB, TPB), numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 14 lines)

    return call

Test 1

SIZE = 2
out = np.zeros((SIZE, SIZE))
inp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
inp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T
problem = CudaProblem(
    "Matmul (Simple)",
    mm_oneblock_test,
    [inp1, inp2],
    out,
    [SIZE],
    Coord(1, 1),
    Coord(TPB, TPB),
    spec=matmul_spec,
)
problem.show(sparse=True)

# Matmul (Simple)
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |

problem.check()
Failed Tests.
Yours: [[0. 0.] [0. 0.]]
Spec : [[ 1 3] [ 3 13]]

Test 2

SIZE = 8
out = np.zeros((SIZE, SIZE))
inp1 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
inp2 = np.arange(SIZE * SIZE).reshape((SIZE, SIZE)).T
problem = CudaProblem(
    "Matmul (Full)",
    mm_oneblock_test,
    [inp1, inp2],
    out,
    [SIZE],
    Coord(3, 3),
    Coord(TPB, TPB),
    spec=matmul_spec,
)
problem.show(sparse=True)

# Matmul (Full)
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 0 | 0 | 0 | 0 |
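The shared-memory tiling strategy from the matrix-multiply hint can be sketched on the CPU with numpy. This shows the tiling idea only, not a CUDA solution: partial row-column dot products are accumulated one TPB-wide tile at a time, mirroring how each block would stage tiles of a and b in shared memory before advancing along the k axis.

```python
import numpy as np

SIZE, TPB = 8, 3                          # matrix size, tile (block) width
a = np.arange(SIZE * SIZE).reshape(SIZE, SIZE).astype(float)
b = a.T.copy()

# Block-tiled matmul: accumulate partial dot products tile by tile, the way
# the CUDA kernel would after each shared-memory copy + syncthreads.
out = np.zeros((SIZE, SIZE))
for k0 in range(0, SIZE, TPB):            # slide the tile along the k axis
    a_shared = a[:, k0:k0 + TPB]          # tile of a (shared-memory stand-in)
    b_shared = b[k0:k0 + TPB, :]          # matching tile of b
    out += a_shared @ b_shared            # partial row-column dot products
```

Each element of a and b is read from "global memory" only once per tile pass, which is the point of the 6-global-reads bound in the hard case.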
berndilling
No description available
bhneo
No description available
yongkyunlee
Caltech Spring 2021 CS179 Sparse ROI Pooling Project
Resources for the paper, Foundations for Robust yet Simple Sparse Hierarchical Pooling
leobxpan
Sparse ConvNet with rectangular pooling regions, implemented in TensorFlow
jahz2323
A Unified Sparse Fourier Framework for Convolutional and Pooling Operations
Nadinehijazi
Research project on SPLADE sparse retrieval with softmax pooling and p-norm regularization
agustindiazcano
Volumetric Logic (VL) CUDA Engine. A dynamic, parameter efficient topological replacement for MLP layers in neural networks, implementing sparse spatial routing and native VRAM Memory Pooling.
LEEYJ1021
Full replication package for "Bayesian Hierarchical Inventory Optimization for Korean Agricultural Products Under Sparse Data Conditions". Demonstrates that cross-sectional information pooling systematically substitutes for temporal depth, enabling reliable inventory decisions with as few as 3 observations per product.
Sandesh612
Created a CNN model using a rescaling layer and 3 convolution layers with ReLU activation, each followed by a max pooling layer. Used the Adam optimizer and sparse categorical crossentropy as the loss function. Applied data augmentation, as the given data was insufficient for training and testing the model.
mushroom-matthew
Modular MIL framework for predicting colorectal cancer MSI/MMR status from precomputed UNI foundation model embeddings (SurGen dataset). Evaluates mean pooling, attention MIL, and transformer MIL under fair, reproducible conditions. Includes attention visualization, error analysis, and ablations on loss function and sparse evidence selection.
1. Normalize the inputs: normalize the images between -1 and 1, and use Tanh as the last layer of the generator output.
2. BatchNorm: construct different mini-batches for real and fake data, i.e. each mini-batch needs to contain only all real images or all generated images. When batchnorm is not an option, use instance normalization (for each sample, subtract the mean and divide by the standard deviation).
3. Avoid sparse gradients: the stability of the GAN game suffers if you have sparse gradients. LeakyReLU is good (in both G and D). For downsampling, use average pooling or Conv2d + stride.
4. Use the Adam optimizer: perhaps the real papers have a different opinion, but this literally makes a remarkable improvement. Use SGD for the discriminator and Adam for the generator. To decrease instability issues, decrease the learning rate to 0.0002 (from 0.001) and the momentum/beta1 to 0.5 (from 0.9) for Adam.
Apart from these, I also came across a few blogs that would also help:
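Tip 1 can be sketched in a few lines of numpy, assuming 8-bit images in [0, 255] (a minimal illustration, not tied to any particular GAN codebase):

```python
import numpy as np

# Map uint8 pixel values [0, 255] into [-1, 1], matching the generator's
# tanh output range, so real and generated images live on the same scale.
def normalize(images):
    return images.astype(np.float32) / 127.5 - 1.0

# Inverse map, for viewing generator samples as 8-bit images again.
def denormalize(images):
    return ((images + 1.0) * 127.5).round().clip(0, 255).astype(np.uint8)

batch = np.array([[0, 127, 255]], dtype=np.uint8)
scaled = normalize(batch)       # endpoints land exactly on -1 and 1
restored = denormalize(scaled)  # round-trips back to the original pixels
```

Feeding the discriminator real images in [-1, 1] while the generator ends in tanh keeps both input distributions on the same scale, which is the point of the tip.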
All 22 repositories loaded