Can't convert from CUDA 1-Channel texture to DPCT 4-Channel image_wrapper

mcostanzo · 1 October 2021 03:15

Hi everyone !!

I want to migrate this CUDA code to oneAPI. The problem is that when using DPCT, I have an incompatibility with CUDA 1-Channel, because SYCL supports only 4-Channels, but I don’t know how to modify the reading part to obtain the same result.

This is my CUDA example code:

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include "cuda_texture_types.h"

#include "texture_fetch_functions.h"

#include "texture_types.h"

#include <stdio.h>

texture < int, 2 > textureD;

__global__ void kernel(int * dOutput, int width, int height)

{

  int row =

    blockIdx.y * blockDim.y + threadIdx.y;

  int col =

    blockIdx.x * blockDim.x + threadIdx.x;

  dOutput[row * width + col] =

    tex2D(textureD, col, row);

}

int main()

{

  int * h;

  int width =

    10;

  int height =

    10;

  int size =

    width * height;

  cudaHostAlloc < int > ( & h, size * sizeof(int), cudaHostAllocDefault);

  int i =

    0;

  for (int row = 0; row < height; row++)

  {

    for (int col = 0; col < width; col++)

    {

      h[row * width + col] =

        i;

      i++;

    }

  }

  int * d;

  size_t pitch;

  cudaMallocPitch < int > ( & d, & pitch, width * sizeof(int), height);

  cudaMemcpy2D(d, pitch, h, width * sizeof(int), width * sizeof(int), height, cudaMemcpyHostToDevice);

  cudaChannelFormatDesc channel =

    cudaCreateChannelDesc < int > ();

  cudaBindTexture2D(NULL, & textureD, d, & channel, width, height, pitch);

  int * hOutput;

  cudaHostAlloc < int > ( & hOutput, size * sizeof(int), cudaHostAllocDefault);

  int * dOutput;

  cudaMalloc < int > ( & dOutput, size * sizeof(int));

  kernel << < 1, width * height >>> (dOutput, width, height);

  cudaMemcpy(hOutput, dOutput, size * sizeof(int), cudaMemcpyDeviceToHost);

  for (int row = 0; row < height; row++)

  {

    for (int col = 0; col < width; col++)

    {

      printf("%d ", h[row * width + col]);

    }

    printf("\n");

  }

  printf("\n");

  for (int row = 0; row < height; row++)

  {

    for (int col = 0; col < width; col++)

    {

      printf("%d ", hOutput[row * width + col]);

    }

    printf("\n");

  }

  getchar();

  cudaFreeHost(h);

  cudaFree(d);

  cudaFree(dOutput);

  cudaFreeHost(hOutput);

  return 0;

}

Anyone can migrate this code to oneAPI ? I’ve been with this for 2 weeks. Thank you so much !

rod · 1 October 2021 09:16

Hello there, welcome. Can you post the SYCL output from DPCT along with the origina CUDA version you have already posted please? We can take a look and figure out what it should look like.

I think that slide 21 in this deck might help you ro understand the migration as it mentions a kernel dimension in a sycl::range cannot be equal to 0. The full talk is also available here.

mcostanzo · 1 October 2021 10:05

Hi @rod ! Thank you for reply.

The SYCL output is the following:

#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>

#include <stdio.h>

/*
DPCT1059:2: SYCL only supports 4-channel image format. Adjust the code.
*/
dpct::image_wrapper<int, 2> textureD;

void kernel(int *dOutput, int width, int height, sycl::nd_item<3> item_ct1,
            dpct::image_accessor_ext<int, 2> textureD)

{

   int row =

       item_ct1.get_group(1) * item_ct1.get_local_range().get(1) +
       item_ct1.get_local_id(1);

int col =

    item_ct1.get_group(2) * item_ct1.get_local_range().get(2) +
    item_ct1.get_local_id(2);

dOutput[row * width + col] =

    textureD.read(col, row);
}

int main()

{
   dpct::device_ext &dev_ct1 = dpct::get_current_device();
   sycl::queue &q_ct1 = dev_ct1.default_queue();

        int *h;

	int width =

		10;

	int height =

		10;

	int size =

		width * height;

        /*
        DPCT1048:0: The original value cudaHostAllocDefault is not meaningful in
        the migrated code and was removed or replaced with 0. You may need to
        check the migrated code.
        */
        h = sycl::malloc_host<int>(size, q_ct1);

        int i =

		0;

	for(int row = 0; row < height; row++)

	{

		for(int col = 0; col < width; col++)

		{

			h[row * width + col] =

				i;

			i++;

		}

	}

	int *d;

	size_t pitch;

        d = (int *)dpct::dpct_malloc<int>(pitch, width * sizeof(int), height);

        dpct::dpct_memcpy(d, pitch, h, width * sizeof(int), width * sizeof(int),
                          height, dpct::host_to_device);

        dpct::image_channel channel =

            /*
            DPCT1059:3: SYCL only supports 4-channel image format. Adjust the
            code.
            */
            dpct::image_channel::create<int>();

        textureD.attach(d, width, height, pitch, channel);

        int *hOutput;

        /*
        DPCT1048:1: The original value cudaHostAllocDefault is not meaningful in
        the migrated code and was removed or replaced with 0. You may need to
        check the migrated code.
        */
        hOutput = sycl::malloc_host<int>(size, q_ct1);

        int *dOutput;

        dOutput = sycl::malloc_device<int>(size, q_ct1);

        /*
        DPCT1049:4: The workgroup size passed to the SYCL kernel may exceed the
        limit. To get the device limit, query info::device::max_work_group_size.
        Adjust the workgroup size if needed.
        */
        q_ct1.submit([&](sycl::handler &cgh) {
                auto textureD_acc = textureD.get_access(cgh);

                auto textureD_smpl = textureD.get_sampler();

                cgh.parallel_for(
                    sycl::nd_range<3>(sycl::range<3>(1, 1, width * height),
                                      sycl::range<3>(1, 1, width * height)),
                    [=](sycl::nd_item<3> item_ct1) {
                            kernel(dOutput, width, height, item_ct1,
                                   dpct::image_accessor_ext<int, 2>(
                                       textureD_smpl, textureD_acc));
                    });
        });

        q_ct1.memcpy(hOutput, dOutput, size * sizeof(int)).wait();

        for(int row = 0; row < height; row++)

	{

		for(int col = 0; col < width; col++)

		{

			printf("%d ", h[row * width + col]);

		}

		printf("\n");

	}

	printf("\n");

	for(int row = 0; row < height; row++)

	{

		for(int col = 0; col < width; col++)

		{

			printf("%d ", hOutput[row * width + col]);

		}

		printf("\n");

	}

	getchar();

        sycl::free(h, q_ct1);

        sycl::free(d, q_ct1);

        sycl::free(dOutput, q_ct1);

        sycl::free(hOutput, q_ct1);

return 0;

}

Thank you again.

mcostanzo · 1 October 2021 10:07

My Problem is with the error:

/*
DPCT1059:2: SYCL only supports 4-channel image format. Adjust the code.
*/

If I modify the int type by adding int4, my program compiles and executes but the results are differents from the original CUDA code

rod · 1 October 2021 10:43

Ah sorry I realise now you are talking about images and textures. Let us take a look at this.

mcostanzo · 1 October 2021 10:44

Excellent! I will wait. Thank you.

rod · 4 October 2021 09:23

Unfortunately there’s no standard way in SYCL to do this currently. Images were refactored in SYCL 2020 there are now only 4-channel image formats. Extensions can be added but this is yet to be done for the scenario you are looking at.

You could use a 4-channel image format and just ignore the other channels, but that’s going to be pretty inefficient.

Another option could maybe be to have a 4-channel image format and have each pixel in the image represent 4 elements of the input data, so each work item would operate on four pixels at a time, this would be the more efficient way to do it.

mcostanzo · 6 October 2021 23:03

Hi @rod,

I was trying to adapt my official code, because the one that I attached here, it’s only an example. My code is so much longer and complicated. I think that the solution is the last one you mention (compute 4 pixels at a time). But the problem I think it’s that I have to modify the logic of the code, that wouldn’t that be too complex? Do you know in this the example that we are working, how to adapt it to compute 4 pixels at a time?

Thank you so much.

mcostanzo · 6 October 2021 23:48

Maybe image_wrapper could be reaplaced by something different?

Topic		Replies	Views
How to convert this CUDA code to oneAPI (sycl) SYCL development	1	1081	8 November 2021
PI CUDA kernels only support images with channel types int32, uint32, float, and half	6	1307	10 November 2021
Question about max group size SYCL development	8	1480	2 February 2022
Question on NVIDIA/CUDA Support	1	1055	15 March 2021
Sample Gaussian-Blur.cpp fails in CUDA Device SYCL development	1	955	29 November 2021

Can't convert from CUDA 1-Channel texture to DPCT 4-Channel image_wrapper

Related topics