🎉 Celebrating 25 Years of GameDev.net! 🎉

Not many can claim 25 years on the Internet! Join us in celebrating this milestone. Learn more about our history, and thank you for being a part of our community!

Replacing glCopyImageSubData

Graphics and GPU Programming Programming OpenGL C++ texture

Started by taby November 23, 2022 11:24 PM

22 comments, last by taby 1 year, 6 months ago

taby

1,509

Author

November 23, 2022 11:24 PM

I'm currently using glCopyImageSubData to copy from texture to texture. It works fine, but I'm trying to replace it with the following code. It doesn't work, failing on the copy back to the GPU.

	glCopyImageSubData(glowmap_tex, GL_TEXTURE_2D, 0, 0, 0, 0,
		last_frame_glowmap_tex, GL_TEXTURE_2D, 0, 0, 0, 0,
		win_x, win_y, 1);

…

	vector<float> output_pixels(win_x* win_y * 4, 1.0f);
	glActiveTexture(GL_TEXTURE4);
	glBindTexture(GL_TEXTURE_2D, glowmap_tex);
	glBindImageTexture(GL_TEXTURE4, glowmap_tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
	glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_FLOAT, &output_pixels[0]);

	vector<float> last_frame_output_pixels(win_x* win_y * 4, 1.0f);
	glActiveTexture(GL_TEXTURE4);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glBindImageTexture(GL_TEXTURE4, last_frame_glowmap_tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);
	glGetTexImage(GL_TEXTURE_2D, 0, GL_RGBA, GL_FLOAT, &last_frame_output_pixels[0]);

	vector<float> combined_output_pixels(win_x* win_y * 4, 1.0f);

	for (int x = 0; x < win_x; x++)
	{
		for (int y = 0; y < win_y; y++)
		{
			size_t index = 4 * ((y * win_x) + x);

			combined_output_pixels[index + 0] = output_pixels[index + 0];// +last_frame_output_pixels[index + 0];
			combined_output_pixels[index + 1] = output_pixels[index + 1];// +last_frame_output_pixels[index + 1];
			combined_output_pixels[index + 2] = output_pixels[index + 2];// +last_frame_output_pixels[index + 2];
			combined_output_pixels[index + 3] = output_pixels[index + 3];// +last_frame_output_pixels[index + 3];
		}
	}
	
	// The following doesn't work, and I don't know why
	glActiveTexture(GL_TEXTURE4);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glBindImageTexture(4, last_frame_glowmap_tex, 0, GL_FALSE, 0, GL_READ_ONLY, GL_RGBA32F);
	glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, win_x, win_y, 0, GL_RGBA, GL_FLOAT, &combined_output_pixels[0]);

Any ideas?
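
Two details in the snippet above may be worth checking, offered as guesses rather than a confirmed diagnosis. First, glBindImageTexture takes a zero-based image unit index as its first argument, whereas glActiveTexture takes an enum of the form GL_TEXTURE0 + n, so passing GL_TEXTURE4 as the unit is out of range. (glGetTexImage and glTexImage2D act on the texture bound with glBindTexture, so they do not need an image binding at all.) Second, if the texture was allocated with glTexStorage2D, which is an assumption because the thread does not show the allocation, glTexImage2D would fail with GL_INVALID_OPERATION and glTexSubImage2D would be the call to use. The helper below is hypothetical and only illustrates the enum-versus-index distinction; the constant is the GL_TEXTURE0 value from the OpenGL headers, repeated so the sketch compiles on its own.

```cpp
// Value of GL_TEXTURE0 in the OpenGL headers, repeated here only so the
// sketch builds without a GL loader.
constexpr unsigned int kGlTexture0 = 0x84C0;

// Hypothetical helper: converts a texture-unit enum (GL_TEXTURE0 + n) into the
// zero-based index n that glBindImageTexture expects as its first argument.
// Returns -1 when the value is not a usable unit. max_units is an assumed
// limit; query GL_MAX_IMAGE_UNITS for the real one.
inline int image_unit_index(unsigned int texture_enum, unsigned int max_units = 8)
{
	if (texture_enum < kGlTexture0)
		return -1;

	const unsigned int index = texture_enum - kGlTexture0;

	if (index >= max_units)
		return -1;

	return static_cast<int>(index);
}
```

With this, image_unit_index(GL_TEXTURE4) yields 4, the value the third block of the snippet already passes directly.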

JoeJ

4,260

November 24, 2022 12:12 AM

taby said:
Any ideas?

No, but you could work on the proper solution instead of spending time on a workaround.
Uploading to and downloading from the GPU is very expensive, and single-threaded image processing on the CPU is slow too. You want to do the whole thing on the GPU alone.

taby

1,509

Author

November 24, 2022 12:50 AM

I've tried all of the other solutions, without luck. This is, of course, only a temporary test.

OK, I put the shader at https://github.com/sjhalayka/obj_ogl4/blob/aed9d040cb09b2db67a07ebae17a7dafb5a68846/ortho_reflectance.fs.glsl#L44 and the C++ code at https://github.com/sjhalayka/obj_ogl4/blob/aed9d040cb09b2db67a07ebae17a7dafb5a68846/main.cpp#L472

All I need is a little guidance. Do I need to render to a texture?

JoeJ

4,260

November 24, 2022 10:49 AM

taby said:
Do I need to render to a texture?

No. The easiest and fastest way should be to use a compute shader (CS), reading and writing texels directly.
A CS has advantages if you do complex image processing. Say we want depth-aware DOF: you can load a tile of the texture into LDS (on-chip shared memory).
Then all threads can access this cached image data without having to access VRAM at all.
After your complex processing is done, you write back to VRAM. So you access VRAM only twice.
In contrast, a pixel shader would need to sample the texture from VRAM constantly in its inner loop. Those reads are likely cached as well, but in many cases the LDS approach will win.
A downside is that you cannot use the texture filtering hardware on LDS memory, so if you need filtering, the LDS approach becomes less attractive.

On the API side there are some caveats:
The texture needs to be uncompressed.
You likely need two textures: one to read from and another to write the results to. (Otherwise threads would randomly read either original values or values already changed by other threads.)
The driver must know which textures are readonly or writable for which shaders, so barriers, synchronization, resource transitions can be handled.

Maybe this tutorial gives all the details needed: https://learnopengl.com/Guest-Articles/2022/Compute-Shaders/Introduction

taby

1,509

Author

November 24, 2022 03:09 PM

OK, first off: you're a certifiable genius at graphics programming!

I see your solution now:

Pass two textures into the compute shader
Accumulate them in the shader, or whatever
Write to temporary texture at the end of the shader
Copy finalized temporary texture to last frame's texture using glCopyImageSubData

Does this sound reasonable?
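
A variation on the last of those steps, sketched under the assumption that all three textures share the same size and format: instead of copying the temporary texture into last frame's texture, the two texture names can simply be exchanged after the compute pass, so no texel data moves. The GlowTextures struct and promote_scratch function are hypothetical names for illustration; in real code the members would be GLuint.

```cpp
#include <utility>

// Hypothetical bundle of texture names (GLuint in real code).
struct GlowTextures
{
	unsigned int current;   // this frame's glow map (shader input)
	unsigned int previous;  // glow accumulated over earlier frames (shader input)
	unsigned int scratch;   // shader output for this frame
};

// After the compute pass has written current + previous into scratch,
// exchange the two names so that "previous" refers to the fresh result.
inline void promote_scratch(GlowTextures& t)
{
	std::swap(t.previous, t.scratch);
}
```

Because the names change every frame, the image bindings would have to be set up again before each dispatch rather than once at startup.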

And here I was concocting a spell using the arcane FBO and all that stuff. LOL

P.S. I have a compute shader example on my GitHub too: https://github.com/sjhalayka/qjs_compute_shader

JoeJ

4,260

November 24, 2022 03:44 PM

taby said:
I see your solution now:
Pass two textures into the compute shader
Accumulate them in the shader, or whatever
Write to temporary texture at the end of the shader
Copy finalized temporary texture to last frame's texture using glCopyImageSubData
Does this sound reasonable?

I think you don't need a temporary texture. You have no kernel reading adjacent pixels, so each pixel is accessed by exactly one thread. (For the same reason, there is no point in caching to LDS either. A pixel shader could do it too, but it requires drawing a cumbersome triangle that covers the texture rectangle just to map threads to all texels.)

So this should already work: prevTex[i] = prevTex[i] + curTex[i];

But I'm not sure if we can read and write to the same texture. I guess so, but I've rarely worked with textures, so I don't know.

Btw, just saw ‘slow’ code here:

	for (int x = 0; x < win_x; x++)
	{
		for (int y = 0; y < win_y; y++)
		{
			size_t index = 4 * ((y * win_x) + x);

			combined_output_pixels[index + 0] = output_pixels[index + 0];// +last_frame_output_pixels[index + 0];
			// ... channels 1 to 3 follow the same pattern
		}
	}

Notice this processes the image ‘vertically’: consecutive iterations of the inner loop are a whole row (4 * win_x floats) apart in memory, so the stride of memory access is big.
It should be faster if you swap the loops:

	for (int y = 0; y < win_y; y++)
	{
		for (int x = 0; x < win_x; x++)
		{
			size_t index = 4 * ((y * win_x) + x);

			combined_output_pixels[index + 0] = output_pixels[index + 0];// +last_frame_output_pixels[index + 0];
			// ... channels 1 to 3 follow the same pattern
		}
	}

Now we process horizontally: consecutive iterations touch adjacent pixels in memory, so the access pattern is ideal.
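
A self-contained version of the row-by-row loop, as a minimal sketch that assumes tightly packed RGBA float buffers like the ones read back with glGetTexImage in the first post. The function name add_rgba_images and the size check are additions for illustration, and the per-channel sum stands in for the addition that is commented out in the original.

```cpp
#include <cstddef>
#include <vector>

// Adds two tightly packed RGBA float images of the same size, row by row.
std::vector<float> add_rgba_images(const std::vector<float>& a,
                                   const std::vector<float>& b,
                                   std::size_t width, std::size_t height)
{
	std::vector<float> out(width * height * 4, 0.0f);

	// Size mismatch: hand back zeros rather than read out of bounds.
	if (a.size() != out.size() || b.size() != out.size())
		return out;

	for (std::size_t y = 0; y < height; ++y)      // outer loop walks rows
	{
		for (std::size_t x = 0; x < width; ++x)   // inner loop walks adjacent pixels
		{
			const std::size_t index = 4 * (y * width + x);

			for (std::size_t c = 0; c < 4; ++c)
				out[index + c] = a[index + c] + b[index + c];
		}
	}

	return out;
}
```

Since the sum touches every element exactly once and in storage order, a single flat loop over the vector would give the same result; the nested form is kept to match the code being discussed.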

taby

1,509

Author

November 24, 2022 04:22 PM

Thank you again for all of the advice. I tried avoiding a temporary texture, but it's not working. I also tried using a temporary texture, with the same result: temp_tex ends up with all 0s.

Here is the C++ code now! So simple, thanks to you.


	glowmap_copier.use_program();

	glActiveTexture(GL_TEXTURE0);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "output_image"), 0);

	// activate glow and last frame glow input textures
	glActiveTexture(GL_TEXTURE1);
	glBindTexture(GL_TEXTURE_2D, glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "inputa_image"), 1);

	glActiveTexture(GL_TEXTURE2);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "inputb_image"), 2);

	// call compute shader
	glDispatchCompute((GLuint)win_x, (GLuint)win_y, 1);

	// Make the shader's image stores visible to subsequent shader image accesses
	glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

The shader is:

// OpenGL 4.3 introduces compute shaders
#version 430

layout(local_size_x = 1, local_size_y = 1) in;

// Four-channel (RGBA32F) output
layout(binding = 0, rgba32f) writeonly uniform image2D output_image;
layout(binding = 1, rgba32f) readonly uniform image2D inputa_image;
layout(binding = 2, rgba32f) readonly uniform image2D inputb_image;


void main()
{
	// Get global coordinates
	const ivec2 pixel_coords = ivec2(gl_GlobalInvocationID.xy);
	const vec4 output_pixel = imageLoad(inputa_image, pixel_coords) + imageLoad(inputb_image, pixel_coords);

	imageStore(output_image, pixel_coords, output_pixel);
}
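
One thing that stands out, offered as a likely rather than a confirmed cause of the all-zero output: image2D uniforms read from image units, which are populated with glBindImageTexture, while glActiveTexture and glBindTexture populate texture units, which only samplers read. The C++ code above never calls glBindImageTexture. The sketch below lays out the three bindings the shader's layout(binding = N) qualifiers expect. ImageAccess, ImageBinding, glow_bindings and apply_bindings are hypothetical names, and the GL call is passed in as a callable so the block compiles without GL headers.

```cpp
#include <vector>

// Access modes mirroring GL_READ_ONLY / GL_WRITE_ONLY (hypothetical enum).
enum class ImageAccess { ReadOnly, WriteOnly };

// One entry per image2D uniform in the shader.
struct ImageBinding
{
	unsigned int unit;     // matches layout(binding = N)
	unsigned int texture;  // GL texture name
	ImageAccess  access;
};

// Bindings the shader above expects: unit 0 is written, units 1 and 2 are read.
inline std::vector<ImageBinding> glow_bindings(unsigned int output_tex,
                                               unsigned int glow_tex,
                                               unsigned int last_frame_tex)
{
	return {
		{0, output_tex,     ImageAccess::WriteOnly},
		{1, glow_tex,       ImageAccess::ReadOnly},
		{2, last_frame_tex, ImageAccess::ReadOnly},
	};
}

// Applies each binding through a caller-supplied function. In a real program
// the callable would wrap
//   glBindImageTexture(unit, texture, 0, GL_FALSE, 0, access, GL_RGBA32F);
template <typename BindFn>
void apply_bindings(const std::vector<ImageBinding>& bindings, BindFn bind)
{
	for (const ImageBinding& b : bindings)
		bind(b);
}
```

Because the shader already fixes the units with layout(binding = N), the glUniform1i calls would not be needed in this arrangement.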

JoeJ

4,260

November 24, 2022 04:39 PM

Maybe it fails because you map units 0 and 2 to the same texture.
But I guess you already tried using a real additional temporary texture as well?

So yeah, this is what sucks with GPU programming. We never know why it doesn't work.

Edit: Checking for GL errors on the API side might help.
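
A minimal sketch of such a check, assuming nothing about the project's own helpers. glGetError returns one queued error per call, so it has to be called in a loop until it reports GL_NO_ERROR (0). The names gl_error_name and drain_gl_errors are hypothetical; the numeric codes are the ones from the OpenGL headers, repeated here so the block builds without them, and the error source is passed in as a callable for the same reason.

```cpp
#include <string>
#include <vector>

// Names for common glGetError codes (values as defined in the OpenGL headers).
inline std::string gl_error_name(unsigned int code)
{
	switch (code)
	{
	case 0x0500: return "GL_INVALID_ENUM";
	case 0x0501: return "GL_INVALID_VALUE";
	case 0x0502: return "GL_INVALID_OPERATION";
	case 0x0505: return "GL_OUT_OF_MEMORY";
	case 0x0506: return "GL_INVALID_FRAMEBUFFER_OPERATION";
	default:     return "unknown GL error";
	}
}

// Drains the error queue. get_error is any callable that returns the next
// error code, or 0 (GL_NO_ERROR) once the queue is empty; a real program
// would pass glGetError.
template <typename GetErrorFn>
std::vector<std::string> drain_gl_errors(GetErrorFn get_error)
{
	std::vector<std::string> names;

	for (unsigned int code = get_error(); code != 0; code = get_error())
		names.push_back(gl_error_name(code));

	return names;
}
```

Calling drain_gl_errors(glGetError) after each suspicious call narrows down which one fails. On OpenGL 4.3 and later, glDebugMessageCallback is an alternative that reports errors with descriptive messages.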

taby

1,509

Author

November 24, 2022 05:14 PM

Yeppers, this is what I'm doing now:


	// create temporary output texture (storage allocated with glTexImage2D)
	GLuint temp_tex;

	glGenTextures(1, &temp_tex);
	glActiveTexture(GL_TEXTURE0);
	glBindTexture(GL_TEXTURE_2D, temp_tex);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_S, GL_CLAMP_TO_EDGE);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_WRAP_T, GL_CLAMP_TO_EDGE);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
	glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR);
	glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, win_x, win_y, 0, GL_RGBA, GL_FLOAT, NULL);
	glBindImageTexture(0, temp_tex, 0, GL_FALSE, 0, GL_WRITE_ONLY, GL_RGBA32F);


	glActiveTexture(GL_TEXTURE0);
	glBindTexture(GL_TEXTURE_2D, temp_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "output_image"), 0);

	// activate glow and last frame glow input textures
	glActiveTexture(GL_TEXTURE1);
	glBindTexture(GL_TEXTURE_2D, glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "inputa_image"), 1);

	glActiveTexture(GL_TEXTURE2);
	glBindTexture(GL_TEXTURE_2D, last_frame_glowmap_tex);
	glUniform1i(glGetUniformLocation(glowmap_copier.get_program(), "inputb_image"), 2);

	// call compute shader
	glowmap_copier.use_program();
	glDispatchCompute((GLuint)win_x, (GLuint)win_y, 1);

	// Make the shader's image stores visible to subsequent shader image accesses
	glMemoryBarrier(GL_SHADER_IMAGE_ACCESS_BARRIER_BIT);

	// copy from temp to last frame using glCopyImageSubData
	glCopyImageSubData(temp_tex, GL_TEXTURE_2D, 0, 0, 0, 0,
		last_frame_glowmap_tex, GL_TEXTURE_2D, 0, 0, 0, 0,
		win_x, win_y, 1);

	glDeleteTextures(1, &temp_tex);

and

// OpenGL 4.3 introduces compute shaders
#version 430

layout(local_size_x = 1, local_size_y = 1) in;

layout(binding = 0, rgba32f) writeonly uniform image2D output_image;
layout(binding = 1, rgba32f) readonly uniform image2D inputa_image;
layout(binding = 2, rgba32f) readonly uniform image2D inputb_image;


void main()
{
	// Get global coordinates
	const ivec2 pixel_coords = ivec2(gl_GlobalInvocationID.xy);
	const vec3 output_pixel = imageLoad(inputa_image, pixel_coords).rgb + imageLoad(inputb_image, pixel_coords).rgb;

	imageStore(output_image, pixel_coords, vec4(output_pixel, 1.0));
}

JoeJ

4,260

November 24, 2022 05:24 PM

Great.
But you know what i have to say about this:

taby said:
layout(local_size_x = 1, local_size_y = 1) in;
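
Presumably the concern is the work-group size: with a 1x1 local size every work group contains a single invocation, which leaves most lanes of the GPU's SIMD units idle. A sketch of the host-side arithmetic for a larger group, assuming a hypothetical 8x8 local size (the thread does not say which size was eventually chosen):

```cpp
// Number of work groups needed to cover `pixels` along one axis when each
// group holds `local_size` invocations on that axis (ceiling division).
constexpr unsigned int group_count(unsigned int pixels, unsigned int local_size)
{
	return local_size == 0 ? 0 : (pixels + local_size - 1) / local_size;
}
```

The dispatch would then read glDispatchCompute(group_count(win_x, 8), group_count(win_y, 8), 1), paired with layout(local_size_x = 8, local_size_y = 8) in the shader. Since the rounded-up grid can extend past the image edge, the shader would also need to return early when pixel_coords falls outside imageSize(output_image).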


This topic is closed to new replies.
