Can't convert from CUDA 1-Channel texture to DPCT 4-Channel image_wrapper

Hi everyone !!

I want to migrate this CUDA code to oneAPI. The problem is that when using DPCT, I have an incompatibility with CUDA 1-Channel, because SYCL supports only 4-Channels, but I don’t know how to modify the reading part to obtain the same result.

This is my CUDA example code:

#include "cuda_runtime.h"

#include "device_launch_parameters.h"

#include "cuda_texture_types.h"

#include "texture_fetch_functions.h"

#include "texture_types.h"

#include <stdio.h>

texture < int, 2 > textureD;

__global__ void kernel(int * dOutput, int width, int height)

{

  int row =

    blockIdx.y * blockDim.y + threadIdx.y;

  int col =

    blockIdx.x * blockDim.x + threadIdx.x;

  dOutput[row * width + col] =

    tex2D(textureD, col, row);

}

int main()

{

  int * h;

  int width =

    10;

  int height =

    10;

  int size =

    width * height;

  cudaHostAlloc < int > ( & h, size * sizeof(int), cudaHostAllocDefault);

  int i =

    0;

  for (int row = 0; row < height; row++)

  {

    for (int col = 0; col < width; col++)

    {

      h[row * width + col] =

        i;

      i++;

    }

  }

  int * d;

  size_t pitch;

  cudaMallocPitch < int > ( & d, & pitch, width * sizeof(int), height);

  cudaMemcpy2D(d, pitch, h, width * sizeof(int), width * sizeof(int), height, cudaMemcpyHostToDevice);

  cudaChannelFormatDesc channel =

    cudaCreateChannelDesc < int > ();

  cudaBindTexture2D(NULL, & textureD, d, & channel, width, height, pitch);

  int * hOutput;

  cudaHostAlloc < int > ( & hOutput, size * sizeof(int), cudaHostAllocDefault);

  int * dOutput;

  cudaMalloc < int > ( & dOutput, size * sizeof(int));

  kernel << < 1, width * height >>> (dOutput, width, height);

  cudaMemcpy(hOutput, dOutput, size * sizeof(int), cudaMemcpyDeviceToHost);

  for (int row = 0; row < height; row++)

  {

    for (int col = 0; col < width; col++)

    {

      printf("%d ", h[row * width + col]);

    }

    printf("\n");

  }

  printf("\n");

  for (int row = 0; row < height; row++)

  {

    for (int col = 0; col < width; col++)

    {

      printf("%d ", hOutput[row * width + col]);

    }

    printf("\n");

  }

  getchar();

  cudaFreeHost(h);

  cudaFree(d);

  cudaFree(dOutput);

  cudaFreeHost(hOutput);

  return 0;

}

Anyone can migrate this code to oneAPI ? I’ve been with this for 2 weeks. Thank you so much !

Hello there, welcome. Can you post the SYCL output from DPCT along with the origina CUDA version you have already posted please? We can take a look and figure out what it should look like.

I think that slide 21 in this deck might help you ro understand the migration as it mentions a kernel dimension in a sycl::range cannot be equal to 0. The full talk is also available here.

Hi @rod ! Thank you for reply.

The SYCL output is the following:

#include <CL/sycl.hpp>
#include <dpct/dpct.hpp>

#include <stdio.h>

/*
DPCT1059:2: SYCL only supports 4-channel image format. Adjust the code.
*/
dpct::image_wrapper<int, 2> textureD;

void kernel(int *dOutput, int width, int height, sycl::nd_item<3> item_ct1,
            dpct::image_accessor_ext<int, 2> textureD)

{

   int row =

       item_ct1.get_group(1) * item_ct1.get_local_range().get(1) +
       item_ct1.get_local_id(1);

int col =

    item_ct1.get_group(2) * item_ct1.get_local_range().get(2) +
    item_ct1.get_local_id(2);

dOutput[row * width + col] =

    textureD.read(col, row);
}

int main()

{
   dpct::device_ext &dev_ct1 = dpct::get_current_device();
   sycl::queue &q_ct1 = dev_ct1.default_queue();

        int *h;

	int width =

		10;

	int height =

		10;

	int size =

		width * height;

        /*
        DPCT1048:0: The original value cudaHostAllocDefault is not meaningful in
        the migrated code and was removed or replaced with 0. You may need to
        check the migrated code.
        */
        h = sycl::malloc_host<int>(size, q_ct1);

        int i =

		0;

	for(int row = 0; row < height; row++)

	{

		for(int col = 0; col < width; col++)

		{

			h[row * width + col] =

				i;

			i++;

		}

	}

	int *d;

	size_t pitch;

        d = (int *)dpct::dpct_malloc<int>(pitch, width * sizeof(int), height);

        dpct::dpct_memcpy(d, pitch, h, width * sizeof(int), width * sizeof(int),
                          height, dpct::host_to_device);

        dpct::image_channel channel =

            /*
            DPCT1059:3: SYCL only supports 4-channel image format. Adjust the
            code.
            */
            dpct::image_channel::create<int>();

        textureD.attach(d, width, height, pitch, channel);

        int *hOutput;

        /*
        DPCT1048:1: The original value cudaHostAllocDefault is not meaningful in
        the migrated code and was removed or replaced with 0. You may need to
        check the migrated code.
        */
        hOutput = sycl::malloc_host<int>(size, q_ct1);

        int *dOutput;

        dOutput = sycl::malloc_device<int>(size, q_ct1);

        /*
        DPCT1049:4: The workgroup size passed to the SYCL kernel may exceed the
        limit. To get the device limit, query info::device::max_work_group_size.
        Adjust the workgroup size if needed.
        */
        q_ct1.submit([&](sycl::handler &cgh) {
                auto textureD_acc = textureD.get_access(cgh);

                auto textureD_smpl = textureD.get_sampler();

                cgh.parallel_for(
                    sycl::nd_range<3>(sycl::range<3>(1, 1, width * height),
                                      sycl::range<3>(1, 1, width * height)),
                    [=](sycl::nd_item<3> item_ct1) {
                            kernel(dOutput, width, height, item_ct1,
                                   dpct::image_accessor_ext<int, 2>(
                                       textureD_smpl, textureD_acc));
                    });
        });

        q_ct1.memcpy(hOutput, dOutput, size * sizeof(int)).wait();

        for(int row = 0; row < height; row++)

	{

		for(int col = 0; col < width; col++)

		{

			printf("%d ", h[row * width + col]);

		}

		printf("\n");

	}

	printf("\n");

	for(int row = 0; row < height; row++)

	{

		for(int col = 0; col < width; col++)

		{

			printf("%d ", hOutput[row * width + col]);

		}

		printf("\n");

	}

	getchar();

        sycl::free(h, q_ct1);

        sycl::free(d, q_ct1);

        sycl::free(dOutput, q_ct1);

        sycl::free(hOutput, q_ct1);

return 0;

}

Thank you again.

My Problem is with the error:

/*
DPCT1059:2: SYCL only supports 4-channel image format. Adjust the code.
*/

If I modify the int type by adding int4, my program compiles and executes but the results are differents from the original CUDA code

Ah sorry I realise now you are talking about images and textures. Let us take a look at this.

Excellent! I will wait. Thank you.

Unfortunately there’s no standard way in SYCL to do this currently. Images were refactored in SYCL 2020 there are now only 4-channel image formats. Extensions can be added but this is yet to be done for the scenario you are looking at.

You could use a 4-channel image format and just ignore the other channels, but that’s going to be pretty inefficient.

Another option could maybe be to have a 4-channel image format and have each pixel in the image represent 4 elements of the input data, so each work item would operate on four pixels at a time, this would be the more efficient way to do it.

Hi @rod,

I was trying to adapt my official code, because the one that I attached here, it’s only an example. My code is so much longer and complicated. I think that the solution is the last one you mention (compute 4 pixels at a time). But the problem I think it’s that I have to modify the logic of the code, that wouldn’t that be too complex? Do you know in this the example that we are working, how to adapt it to compute 4 pixels at a time?

Thank you so much.

Maybe image_wrapper could be reaplaced by something different?