Downsampling sycl::images

Hey all,

I am trying to down-sample an image using SYCL. I must be doing something wrong though, as I get an exception while reading the pixel values. I am probably doing something wrong with my sampler. I am using

sycl::sampler sampler(sycl::coordinate_normalization_mode::normalized,
                        sycl::addressing_mode::none,
                        sycl::filtering_mode::linear);

And then, I am reading pixels with normalized values in [0,1]. Is that correct, or should it be in [-1,1]?
Also, it is valid to query the input image, which is based on a range that does not match the range that was given to the command group handler?

You can see the full code below:

sycl::range<2> image_in_range(height, width);
sycl::image<2> image_in(inputData, sycl::image_channel_order::rgba,
                        sycl::image_channel_type::unorm_int8,
                        image_in_range);

constexpr std::size_t out_dim = 256;
sycl::range<2> image_out_range(out_dim, out_dim);
sycl::image<2> image_out(outputData.data(), sycl::image_channel_order::rgba,
                         sycl::image_channel_type::unorm_int8,
                         image_out_range);

myQueue.submit([&](sycl::handler& cgh) {
  auto r = get_optimal_local_range(image_out_range, myQueue.get_device());
  auto myRange = sycl::nd_range<2>(image_out_range, r);

  auto in =
      image_in.get_access<sycl::float4, sycl::access::mode::read>(cgh);
  auto out =
      image_out.get_access<sycl::float4, sycl::access::mode::write>(cgh);

  sycl::sampler sampler(sycl::coordinate_normalization_mode::normalized,
                        sycl::addressing_mode::none,
                        sycl::filtering_mode::linear);

  cgh.parallel_for<DownSample>(myRange, [=](sycl::nd_item<2> itemID) {
    auto coords =
        sycl::int2(itemID.get_global_id(1), itemID.get_global_id(0));

    auto coords_norm =
        sycl::float2(itemID.get_global_id(1) / static_cast<float>(out_dim),
                     itemID.get_global_id(0) / static_cast<float>(out_dim));

    auto newPixel = in.read(coords_norm, sampler);
    newPixel.w() = 1.f;
    out.write(coords, newPixel);
  });
});
myQueue.wait_and_throw();

Thank you for your help!

Hi @joachim.hodara to answer your question on normalized coordinates, you are correct, pixels read and written as normalized values are in the range of 0.0 to 1.0.

Looking at the code sample, I don’t see anything immediately obvious. Could you tell me which device you are targeting and also what error code you get from the exception (this will be an asynchronous exception so you will need to use an async handler), this will help us reproduce the error you are seeing.

Thanks for helping out @Aerialmantis. I am running on the Intel UHD Graphics 620 that comes with my i7-8550U. Querying the SYCL device name returns

Running on Intel(R) Gen9 HD Graphics NEO

I am using an async handler but the behavior is very erratic. Sometimes I get the following error:

SYCLtest: malloc.c:3852: _int_malloc: Assertion `chunk_main_arena (fwd)' failed.
Aborted (core dumped)

but most of the time, I just get

Bus error (core dumped)

Here is the full code in case I forgot to mention something important. I am using stb to read/write the images.

#include <cmath>
#include <iostream>
#include <string>
#include <vector>

#define STB_IMAGE_IMPLEMENTATION
#include <stb_image.h>
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include <stb_image_write.h>

#include <CL/sycl.hpp>

namespace sycl = cl::sycl;

class DownSample;

sycl::range<2> get_optimal_local_range(sycl::range<2> globalSize,
                                       sycl::device d) {
  auto optimalLocalSize =
      d.is_gpu() ? sycl::range<2>(64, 1) : sycl::range<2>(4, 1);

  for (std::size_t i = 0; i < 2; ++i) {
    while (globalSize[i] % optimalLocalSize[i])
      optimalLocalSize[i] /= 2;
  }

  return optimalLocalSize;
}

int main() {
  std::string inputFile = "nature.jpg";
  std::string outputFile = "nature_out.png";

  int widthInt, heightInt, channelsInt;
  unsigned char* inputData =
      stbi_load(inputFile.c_str(), &widthInt, &heightInt, &channelsInt, 4);
  std::size_t width = static_cast<std::size_t>(widthInt),
              height = static_cast<std::size_t>(heightInt),
              channels = static_cast<std::size_t>(channelsInt);
  auto size = width * height * channels;

  constexpr std::size_t out_dim = 256;
  std::vector<char> outputData(out_dim * out_dim * channels);

  sycl::queue myQueue([](sycl::exception_list exception_list) {
    for (auto exception : exception_list) {
      try {
        std::rethrow_exception(exception);
      } catch (const sycl::exception& e) {
        std::cerr << "Async exception caught: " << e.what() << std::endl;
        throw;
      }
    }
  });

  auto device = myQueue.get_device();
  std::cout << "Running on " << device.get_info<sycl::info::device::name>()
            << std::endl;

  {
    sycl::range<2> image_in_range(height, width);
    sycl::image<2> image_in(inputData, sycl::image_channel_order::rgba,
                            sycl::image_channel_type::unorm_int8,
                            image_in_range);

    sycl::range<2> image_out_range(out_dim, out_dim);
    sycl::image<2> image_out(outputData.data(), sycl::image_channel_order::rgba,
                             sycl::image_channel_type::unorm_int8,
                             image_out_range);

    myQueue.submit([&](sycl::handler& cgh) {
      auto r = get_optimal_local_range(image_out_range, myQueue.get_device());
      auto myRange = sycl::nd_range<2>(image_out_range, r);

      auto in =
          image_in.get_access<sycl::float4, sycl::access::mode::read>(cgh);
      auto out =
          image_out.get_access<sycl::float4, sycl::access::mode::write>(cgh);

      sycl::sampler sampler(sycl::coordinate_normalization_mode::normalized,
                            sycl::addressing_mode::none,
                            sycl::filtering_mode::linear);

      cgh.parallel_for<DownSample>(myRange, [=](sycl::nd_item<2> itemID) {
        auto coords =
            sycl::int2(itemID.get_global_id(1), itemID.get_global_id(0));

        auto coords_norm =
            sycl::float2(itemID.get_global_id(1) / static_cast<float>(out_dim),
                         itemID.get_global_id(0) / static_cast<float>(out_dim));

        auto newPixel = in.read(coords_norm, sampler);
        newPixel.w() = 1.f;
        out.write(coords, newPixel);
      });
    });
    myQueue.wait_and_throw();
  }

  stbi_write_png(outputFile.c_str(), static_cast<int>(width),
                 static_cast<int>(height), static_cast<int>(channels),
                 outputData.data(), 0);

  stbi_image_free(inputData);
}

I solved part of the issue. I was assuming that the picture I was importing had 4 channels (RGBA), while in fact it only had 3 (RGB). My sycl::image layout was incorrect, which was likely causing memory issues and the like.

However, the sampling is still not giving me what I would expect.

  1. everything works fine if out_height == out_width.

  2. if out_width is a multiple of out_height, then I see multiple images side by side (rather than one distorted image). (128 by 256 below)

out2

  1. If out_width is not a multiple of out_height, then all bets are off.

Looking at these results, it seems to me that in the code below:

void down_sample(sycl::queue queue, sycl::image<2> image_in,
                 sycl::image<2> image_out) {

  auto out_range = image_out.get_range();

  queue.submit([&](sycl::handler& cgh) {
    auto r = get_optimal_local_range(out_range, queue.get_device());
    auto myRange = sycl::nd_range<2>(out_range, r);

    auto in = image_in.get_access<sycl::float4, sycl::access::mode::read>(cgh);
    auto out =
        image_out.get_access<sycl::float4, sycl::access::mode::write>(cgh);

    sycl::sampler sampler(sycl::coordinate_normalization_mode::normalized,
                          sycl::addressing_mode::none,
                          sycl::filtering_mode::linear);

    cgh.parallel_for<DownSample>(myRange, [=](sycl::nd_item<2> itemID) {
      auto i = itemID.get_global_id(0);
      auto j = itemID.get_global_id(1);

      auto coords = sycl::int2(i, j);

      auto coords_norm =
          sycl::float2((i + 0.5) / static_cast<float>(out_range[0]),
                       (j + 0.5) / static_cast<float>(out_range[1]));

      auto newPixel = in.read(coords_norm, sampler);
      newPixel.w() = 1.f;
      out.write(coords, newPixel);
    });
  });
}

my coords variable is ok, as the entire image gets filled. However, each pixel does not get read properly. I thought I messed up something, but I used sycl::stream to print some values and they always looked reasonable (out_range = [128,256], i in [0, 127], j in [0, 255], coords_norm in [0,1], etc.

I also tried using other sycl::addressing_mode and sycl::filtering_mode, but that didn’t help.

Am I missing something here?

Hi @joachim.hodara, thanks for the information.

Ah I see, that’s a fairly common error when using SYCL images, as there’s no way for the SYCL runtime to validate the data layout matches the pointer that is passed in.

I’ve investigated the issue you described and attempted to reproduce it, and I can confirm that the coordinates appear to be calculated correctly. The coordinates also appear to scale appropriately to fit changing sizes in the input and output images, so I don’t see anything in the kernel itself that would attribute to the issue that you are seeing.

However, I did notice in the earlier code sample is that the call to stbi_write_png takes width and height, which I believe are the dimensions of the input image, should this be out_width and out_height ?

Unfortunately, these must have been a typo when I copied the function. The actual code uses out_width and out_height. You can see the full code below:

#include <iostream>
#include <vector>

#include <CL/sycl.hpp>

#define STB_IMAGE_IMPLEMENTATION
#include <stb_image.h>
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include <stb_image_write.h>

namespace sycl = cl::sycl;

sycl::range<2> get_optimal_local_range(sycl::range<2> globalSize,
                                       sycl::device d) {
  auto optimalLocalSize =
      d.is_gpu() ? sycl::range<2>(64, 1) : sycl::range<2>(4, 1);

  for (std::size_t i = 0; i < 2; ++i) {
    while (globalSize[i] % optimalLocalSize[i])
      optimalLocalSize[i] /= 2;
  }

  return optimalLocalSize;
}

class DownSample;

void down_sample(sycl::queue queue, sycl::image<2> image_in,
                 sycl::image<2> image_out) {

  auto out_range = image_out.get_range();

  queue.submit([&](sycl::handler& cgh) {
    auto r = get_optimal_local_range(out_range, queue.get_device());
    auto myRange = sycl::nd_range<2>(out_range, r);

    auto in = image_in.get_access<sycl::float4, sycl::access::mode::read>(cgh);
    auto out =
        image_out.get_access<sycl::float4, sycl::access::mode::write>(cgh);

    sycl::sampler sampler(sycl::coordinate_normalization_mode::normalized,
                          sycl::addressing_mode::none,
                          sycl::filtering_mode::linear);

    cgh.parallel_for<DownSample>(myRange, [=](sycl::nd_item<2> itemID) {
      auto i = itemID.get_global_id(0);
      auto j = itemID.get_global_id(1);

      auto coords = sycl::int2(i, j);

      auto coords_norm =
          sycl::float2((i + 0.5) / static_cast<float>(out_range[0]),
                       (j + 0.5) / static_cast<float>(out_range[1]));

      auto newPixel = in.read(coords_norm, sampler);
      newPixel.w() = 1.f;
      out.write(coords, newPixel);
    });
  });
}

int main() {
  int widthInt, heightInt, channelsInt;
  unsigned char* data_in =
      stbi_load("cat.png", &widthInt, &heightInt, &channelsInt, 4);
  std::size_t width = static_cast<std::size_t>(widthInt);
  std::size_t height = static_cast<std::size_t>(heightInt);
  std::size_t n_channels = static_cast<std::size_t>(channelsInt);

  constexpr std::size_t out_dim_height = 128;
  constexpr std::size_t out_dim_width = 256;
  std::vector<char> data_out(out_dim_height * out_dim_width * n_channels);

  sycl::queue myQueue([](sycl::exception_list exception_list) {
    for (auto exception : exception_list) {
      try {
        std::rethrow_exception(exception);
      } catch (const sycl::exception& e) {
        std::cerr << "Async exception caught: " << e.what() << std::endl;
        throw;
      }
    }
  });

  {
    sycl::range<2> image_in_range(height, width);
    sycl::image<2> image_in(data_in, sycl::image_channel_order::rgba,
                            sycl::image_channel_type::unorm_int8,
                            image_in_range);

    sycl::range<2> image_out_range(out_dim_height, out_dim_width);
    sycl::image<2> image_out(data_out.data(), sycl::image_channel_order::rgba,
                             sycl::image_channel_type::unorm_int8,
                             image_out_range);

    down_sample(myQueue, image_in, image_out);

    myQueue.wait_and_throw();
  }

  stbi_write_png("out2.png", static_cast<int>(out_dim_width),
                 static_cast<int>(out_dim_height), static_cast<int>(n_channels),
                 data_out.data(), 0);
}

Thank you very much for helping me out. Much appreciated!

EDIT: I just realized I was passing the sycl::images and sycl::queue by value to the kernel. Is that the right thing to do? From what I understand so far, these sycl::images are lightweight constructs that should be cheap to copy around. Correct? If I pass them by value, will the SYCL runtime understand it needs to copy the results back to the host once the kernel is complete? What would be the idiomatic SYCL way of defining that kernel be?

void down_sample(sycl::queue& queue, 
                 const sycl::image<2>& image_in,
                 sycl::image<2>& image_out);

?

It was painful but I got it. I was using the sycl::range order as defined in the ComputeCpp example for Gaussian blur: range(height, width). However, unlike them, in my kernel, I was using

      auto i = itemID.get_global_id(0);
      auto j = itemID.get_global_id(1);
      auto coord = sycl::int2(x, y);

instead of auto coord = sycl::int2(y, x); so that my x and y were all mixed up. Just switching my ranges to range(width, height) fixed the issue.

All is well! However, I would still be curious to know what is the ‘right’ way to pass sycl::images around. By value?

To be purely pedantic, in SYCL you don’t pass queues and images directly to kernels - though I think I understand that you are asking how to handle them generally in your program. Most SYCL types are designed such that copying them is relatively cheap. It will always do the right thing. You can take references to buffers (and in fact, when creating a command group the normal pattern is to capture by reference) just as well as copying them. In SYCL-DNN, I believe the style is to keep references to queues to avoid copying the queue. There’s not much overhead to it but there were some implications of some deprecated queue behaviours that meant that copies could cause some unexpected waits - now removed, I believe.

So in short, if you’re passing to functions, references are absolutely fine; in objects, I think value makes sense, since you then don’t have to worry about lifetimes, though in practice either will work short of keeping references to deleted objects around (i.e. normal C++ rules). I’ve mentioned buffers in this answer, buffers and images will behave the same way.

Hi @joachim.hodara, that’s great, I’m glad you were able to resolve this.