Most of the examples on this site are written to be as understandable as
possible. That means they work, and they’re correct, but they don’t necessarily
show the most efficient way to do something in WebGPU. Further, depending on
what you need to do, there are a myriad of possible optimizations.
In this article we’ll cover some of the most basic optimizations and discuss a
few others. To be clear, IMO, you don’t usually need to go this far. Most of
the examples around the net using WebGPU draw a couple of hundred things and so
really wouldn’t benefit from these optimizations. Still, it’s always good to
know how to make things go faster.
The basics: The less work you do, and the less work you ask WebGPU to do, the
faster things will go.
In pretty much all of the examples to date, if we draw multiple shapes we’ve
done the following steps:
At Init time:
for each thing we want to draw
create a uniform buffer
create a bindGroup that references that buffer
At Render time:
start an encoder and render pass
for each thing we want to draw
update a typed array with our uniform values for this object
copy the typed array to the uniform buffer for this object
set any pipeline, vertex and index buffers if needed
encode a command(s) to bind the bindGroup(s) for this object
encode a command to draw
end the render pass, finish the encoder, submit the command buffer
Let’s make an example that follows the steps above so we can
then optimize it.
Note, this is a fake example. We are only going to draw a bunch of cubes and as
such we could certainly optimize things by using instancing which we covered
in the articles on storage buffers
and vertex buffers. I didn’t want to
clutter the code by handling tons of different kinds of objects. Instancing is
certainly a great way to optimize if your project uses lots of the same model.
Plants, trees, rocks, trash, etc. are often optimized by using instancing. For
other models, it’s arguably less common.
For example, a table might have 4, 6, or 8 chairs around it, and it would probably
be faster to use instancing to draw those chairs. But, in a list of 500+ things
to draw, if the chairs are the only exception, it’s probably not worth the effort
to figure out some data organization that somehow draws the chairs with
instancing but finds no other opportunities to use instancing.
The point of the paragraph above is, use instancing when it’s appropriate. If
you are going to draw hundreds or more of the same thing then instancing is
probably appropriate. If you are going to only draw a few of the same thing then
it’s probably not worth the effort to special case those few things.
In any case, here’s our code. We’ve got the initialization code we’ve been using
in general.
          pow(specular, uni.shininess),  // value if condition is true
          specular > 0.0);               // condition

      let diffuse = uni.color * textureSample(diffuseTexture, diffuseSampler, vsOut.texcoord);
      // Lets multiply just the color portion (not the alpha)
      // by the light
      let color = diffuse.rgb * light + specular;
      return vec4f(color, diffuse.a);
    }
  `,
});
This shader module uses lighting similar to
the point light with specular highlights covered elsewhere.
It uses a texture because most 3d models use textures so I thought it best to include one.
It multiplies the texture by a color so we can adjust the colors of each cube.
And it has all of the uniform values we need to do the lighting and
project the cube in 3d.
We need data for a cube and to put that data in buffers.
function createBufferWithData(device, data, usage) {
  const buffer = device.createBuffer({
    size: data.byteLength,
    usage: usage | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(buffer, 0, data);
  return buffer;
}

const pipeline = device.createRenderPipeline({
  label: 'textured model with point light w/specular highlight',
  layout: 'auto',
  vertex: {
    module,
    buffers: [
      // position
      {
        arrayStride: 3 * 4,  // 3 floats
        attributes: [
          { shaderLocation: 0, offset: 0, format: 'float32x3' },
        ],
      },
      // normal
      {
        arrayStride: 3 * 4,  // 3 floats
        attributes: [
          { shaderLocation: 1, offset: 0, format: 'float32x3' },
        ],
      },
      // uvs
      {
        arrayStride: 2 * 4,  // 2 floats
        attributes: [
          { shaderLocation: 2, offset: 0, format: 'float32x2' },
        ],
      },
    ],
  },
  fragment: {
    module,
    targets: [{ format: presentationFormat }],
  },
  primitive: {
    cullMode: 'back',
  },
  depthStencil: {
    depthWriteEnabled: true,
    depthCompare: 'less',
    format: 'depth24plus',
  },
});
The pipeline above uses 1 buffer per attribute. One for position data, one for
normal data, and one for texture coordinates (UVs). It culls back facing
triangles, and it expects a depth texture for depth testing. All things we’ve
covered in other articles.
Let’s insert a few utilities for making colors and random numbers.
/** Given a css color string, return an array of 4 values from 0 to 255 */
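That comment heads the first of several small helpers. As a rough sketch, and purely as an assumption about what they look like based on how they’re called below (cssColorToRGBA, hslToRGBA, rand, randomArrayElement), they might be something like this:
const cssColorToRGBA8 = (() => {
  const canvas = new OffscreenCanvas(1, 1);
  const ctx = canvas.getContext('2d', { willReadFrequently: true });
  return cssColor => {
    ctx.clearRect(0, 0, 1, 1);
    ctx.fillStyle = cssColor;
    ctx.fillRect(0, 0, 1, 1);
    return Array.from(ctx.getImageData(0, 0, 1, 1).data);
  };
})();

/** Given a css color string, return an array of 4 values from 0 to 1 */
const cssColorToRGBA = cssColor => cssColorToRGBA8(cssColor).map(v => v / 255);

/** Given hue, saturation, and luminance values from 0 to 1, return an RGBA color */
const hslToRGBA = (h, s, l) => cssColorToRGBA(`hsl(${h * 360 | 0}, ${s * 100}%, ${l * 100}%)`);

/** Return a random number between min and max.
 *  With 1 argument it's 0 to min. With no arguments it's 0 to 1.
 */
const rand = (min, max) => {
  if (min === undefined) {
    min = 0;
    max = 1;
  } else if (max === undefined) {
    max = min;
    min = 0;
  }
  return Math.random() * (max - min) + min;
};

/** Pick a random element from an array */
const randomArrayElement = arr => arr[Math.random() * arr.length | 0];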
Now let’s make some textures and a sampler. We’ll use
a canvas, draw an emoji on it, and then use our function
createTextureFromSource that we wrote in
the article on importing textures
to create a texture from it.
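Here’s a rough sketch of what that could look like. The emoji list, canvas size, and sampler settings are assumptions; createTextureFromSource is the helper from the importing-textures article.
const createTextureForEmoji = (emoji) => {
  const size = 128;
  const canvas = new OffscreenCanvas(size, size);
  const ctx = canvas.getContext('2d');
  ctx.fillStyle = '#fff';
  ctx.fillRect(0, 0, size, size);
  ctx.font = `${size * 0.9}px sans-serif`;
  ctx.textAlign = 'center';
  ctx.textBaseline = 'middle';
  ctx.fillText(emoji, size / 2, size / 2);
  return createTextureFromSource(device, canvas, { mips: true });
};

const textures = ['😂', '👾', '👍', '👀', '🌞', '🛟'].map(createTextureForEmoji);

const sampler = device.createSampler({
  magFilter: 'linear',
  minFilter: 'linear',
  mipmapFilter: 'linear',
});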
Let’s create a set of material info. We haven’t done this anywhere else but it’s
a common setup. Unity, Unreal, Blender, Three.js, Babylon.js all have a concept
of a material. Generally, a material holds things like the color of the
material, how shiny it is, as well as which texture to use, etc…
We’ll make 20 “materials” and then pick a material at random for each cube.
const numMaterials = 20;
const materials = [];
for (let i = 0; i < numMaterials; ++i) {
  const color = hslToRGBA(rand(), rand(0.5, 0.8), rand(0.5, 0.7));
  const shininess = rand(10, 120);
  materials.push({
    color,
    shininess,
    texture: randomArrayElement(textures),
    sampler,
  });
}
Now let’s make data for each thing (cube) we want to draw. We’ll support a
maximum of 30000. Like we have in the past, we’ll make a uniform buffer for each
object as well as a typed array we can update with uniform values. We’ll also
make a bind group for each object. And we’ll pick some random values we can use
to position and animate each object.
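A sketch of that setup loop follows. The uniform buffer size, the bind group binding numbers, and the vec3 helper are assumptions; the real values depend on the Uniforms struct and bind group layout.
const maxObjects = 30000;
const objectInfos = [];

for (let i = 0; i < maxObjects; ++i) {
  const uniformBuffer = device.createBuffer({
    label: `uniforms for object ${i}`,
    size: uniformBufferSize,  // sized to hold the Uniforms struct
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
  });

  // a typed array we can update with uniform values
  // (plus, not shown, views into it for each field of the struct)
  const uniformValues = new Float32Array(uniformBufferSize / 4);

  const material = randomArrayElement(materials);
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: material.sampler },
      { binding: 1, resource: material.texture.createView() },
      { binding: 2, resource: { buffer: uniformBuffer } },
    ],
  });

  objectInfos.push({
    uniformBuffer,
    uniformValues,
    bindGroup,
    material,
    // random values used to position and animate this cube
    axis: vec3.normalize([rand(-1, 1), rand(-1, 1), rand(-1, 1)]),
    radius: rand(10, 100),
    speed: rand(0.1, 0.4),
    rotationSpeed: rand(-1, 1),
    scale: rand(2, 10),
  });
}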
Inside the render loop we’ll update our render pass descriptor. We’ll also
create a depth texture if one doesn’t exist or if the one
we have has a different size than our canvas texture. We did this in
the article on 3d.
// Get the current texture from the canvas context and set it as the texture to render to.
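A sketch of that part of the render loop, assuming renderPassDescriptor and depthTexture are declared as in the earlier articles:
const canvasTexture = context.getCurrentTexture();
renderPassDescriptor.colorAttachments[0].view = canvasTexture.createView();

// If we don't have a depth texture OR if its size is different
// from the canvasTexture, make a new depth texture
if (!depthTexture ||
    depthTexture.width !== canvasTexture.width ||
    depthTexture.height !== canvasTexture.height) {
  if (depthTexture) {
    depthTexture.destroy();
  }
  depthTexture = device.createTexture({
    size: [canvasTexture.width, canvasTexture.height],
    format: 'depth24plus',
    usage: GPUTextureUsage.RENDER_ATTACHMENT,
  });
}
renderPassDescriptor.depthStencilAttachment.view = depthTexture.createView();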
Now we can loop over all the objects and draw them. For each one we need
to update all of its uniform values, copy the uniform values to its uniform buffer,
bind the bind group for this object, and draw.
for (let i = 0; i < settings.numObjects; ++i) {
  const {
    bindGroup,
    uniformBuffer,
    uniformValues,
    normalMatrixValue,
    worldValue,
    viewProjectionValue,
    colorValue,
    lightWorldPositionValue,
    viewWorldPositionValue,
    shininessValue,
    axis,
    material,
    radius,
    speed,
    rotationSpeed,
    scale,
  } = objectInfos[i];

  // Copy the viewProjectionMatrix into the uniform values for this object
  viewProjectionValue.set(viewProjectionMatrix);

  // Compute a world matrix
  mat4.identity(worldValue);
  mat4.axisRotate(worldValue, axis, i + time * speed, worldValue);
  mat4.translate(worldValue, [0, 0, Math.sin(i * 3.721 + time * speed) * radius], worldValue);
  mat4.translate(worldValue, [0, 0, Math.sin(i * 9.721 + time * 0.1) * radius], worldValue);
  mat4.rotateX(worldValue, time * rotationSpeed + i, worldValue);
Note that the portion of the code labeled “Compute a world matrix” is not so common. It would
be more common to have a scene graph but that would have cluttered
the example even more. We needed something showing animation so I threw something together.
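For completeness, here’s a sketch of how each iteration of the loop might finish: the remaining uniform values, the upload, the bind group, and the draw. The mat3/mat4 helper names, eyePosition, numVertices, and the light position are assumptions.
  mat4.scale(worldValue, [scale, scale, scale], worldValue);

  // inverse-transpose of the world matrix, for transforming normals
  mat3.fromMat4(mat4.transpose(mat4.inverse(worldValue)), normalMatrixValue);

  colorValue.set(material.color);
  lightWorldPositionValue.set([-10, 30, 300]);  // arbitrary light position
  viewWorldPositionValue.set(eyePosition);
  shininessValue[0] = material.shininess;

  // upload this object's uniform values, then bind and draw
  device.queue.writeBuffer(uniformBuffer, 0, uniformValues);
  pass.setBindGroup(0, bindGroup);
  pass.drawIndexed(numVertices);
}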
Then we can end the pass, finish the command buffer, and submit it.
pass.end();
const commandBuffer = encoder.finish();
device.queue.submit([commandBuffer]);
requestAnimationFrame(render);
}
requestAnimationFrame(render);
A few more things left to do. Let’s add in resizing.
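Here’s a sketch using a ResizeObserver, as in the article on resizing the canvas:
const observer = new ResizeObserver(entries => {
  for (const entry of entries) {
    const canvas = entry.target;
    const width = entry.contentBoxSize[0].inlineSize;
    const height = entry.contentBoxSize[0].blockSize;
    canvas.width = Math.max(1, Math.min(width, device.limits.maxTextureDimension2D));
    canvas.height = Math.max(1, Math.min(height, device.limits.maxTextureDimension2D));
  }
});
observer.observe(canvas);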
One more thing, just to help with better comparisons. An issue we have now is that
every visible cube has every pixel rendered, or at least checked to see if it needs to
be rendered. Since we’re not optimizing the rendering of pixels but rather
optimizing the usage of WebGPU itself, it can be useful to be able to draw to a
1x1 pixel canvas. This effectively removes nearly all of the time spent
rasterizing triangles and instead leaves only the part of our code that is doing
math and communicating with WebGPU.
Increase the number of objects and see when the framerate drops for you. For me,
on my 75hz monitor on an M1 Mac I got ~8000 cubes before the framerate dropped.
In the example above, and in most of the examples on this site, we’ve used
writeBuffer to copy data into a vertex or index buffer. As a very minor
optimization, for this particular case, when you create a buffer you can pass in
mappedAtCreation: true. This has 2 benefits.
It’s slightly faster to put the data into the new buffer
You don’t have to add GPUBufferUsage.COPY_DST to the buffer’s usage.
This assumes you’re not going to change the data later via writeBuffer or
one of the copy-to-buffer functions.
function createBufferWithData(device, data, usage) {
  const buffer = device.createBuffer({
    size: data.byteLength,
    usage,
    mappedAtCreation: true,
  });
  const dst = new data.constructor(buffer.getMappedRange());
  dst.set(data);
  buffer.unmap();
  return buffer;
}
In the example above we have 3 attributes, one for position, one for normals,
and one for texture coordinates. It’s common to have 4 to 6 attributes where
we’d have tangents for normal mapping and, if
we had a skinned model, we’d add in weights and joints.
In the example above, each attribute is using its own buffer. This is slower both
on the CPU and GPU. It’s slower on the CPU in JavaScript because we need to call
setVertexBuffer once for each buffer for each model we want to draw.
Imagine instead of just a cube we had 100s of models. Each time we switched
which model to draw we’d have to call setVertexBuffer up to 6 times. 100 models * 6
calls per model = 600 calls.
Following the rule “less work = go faster”, if we merged the data for the
attributes into a single buffer then we’d only need to call
setVertexBuffer once per model. 100 calls. That’s 6x fewer calls!
On the GPU, loading things that are together in memory is usually faster than
loading from different places in memory so on top of just putting the vertex
data for a single model into a single buffer, it’s better to interleave the
data.
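Here’s a sketch of that change. The names positions, normals, texcoords, and numVertices are assumed to be the cube data from earlier (as typed arrays), and createBufferWithData is the helper from above.
// pack position(3) + normal(3) + uv(2) per vertex into one Float32Array
const vertexData = new Float32Array(numVertices * 8);
for (let i = 0; i < numVertices; ++i) {
  vertexData.set(positions.subarray(i * 3, i * 3 + 3), i * 8 + 0);
  vertexData.set(normals.subarray(i * 3, i * 3 + 3), i * 8 + 3);
  vertexData.set(texcoords.subarray(i * 2, i * 2 + 2), i * 8 + 6);
}
const vertexBuffer = createBufferWithData(device, vertexData, GPUBufferUsage.VERTEX);

// used as vertex.buffers in the pipeline, replacing the 3 separate layouts
const vertexBufferLayouts = [
  {
    arrayStride: 8 * 4,  // 8 floats, 4 bytes each
    attributes: [
      { shaderLocation: 0, offset: 0 * 4, format: 'float32x3' },  // position
      { shaderLocation: 1, offset: 3 * 4, format: 'float32x3' },  // normal
      { shaderLocation: 2, offset: 6 * 4, format: 'float32x2' },  // texcoord
    ],
  },
];

// at render time: one setVertexBuffer call per model instead of three
// pass.setVertexBuffer(0, vertexBuffer);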
Above we put the data for all 3 attributes into a single buffer and then changed
our render pipeline so it expects the data interleaved in a single buffer.
Note: if you’re loading glTF files, it’s arguably good to either pre-process
them so their vertex data is interleaved into a single buffer (best) or else
interleave the data at load time.
Optimization: Split uniform buffers (shared, material, per model)
Our example right now has one uniform buffer per object.
struct Uniforms {
  normalMatrix: mat3x3f,
  viewProjection: mat4x4f,
  world: mat4x4f,
  color: vec4f,
  lightWorldPosition: vec3f,
  viewWorldPosition: vec3f,
  shininess: f32,
};
Some of those uniform values like viewProjection, lightWorldPosition
and viewWorldPosition can be shared.
We can split these in the shader to use 2 uniform buffers. One for the shared
values and one for per object values.
struct GlobalUniforms {
  viewProjection: mat4x4f,
  lightWorldPosition: vec3f,
  viewWorldPosition: vec3f,
};

struct PerObjectUniforms {
  normalMatrix: mat3x3f,
  world: mat4x4f,
  color: vec4f,
  shininess: f32,
};
With this change, we’ll save having to copy the
viewProjection, lightWorldPosition and viewWorldPosition
to every uniform buffer. We’ll also copy less data per object
with device.queue.writeBuffer.
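A sketch of what that looks like on the JavaScript side. The buffer size and the globalUniformValues / perObjectUniformValues typed arrays are assumptions.
// one buffer for the shared GlobalUniforms, written once per frame
const globalUniformBuffer = device.createBuffer({
  label: 'global uniforms',
  size: globalUniformBufferSize,
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});

// once per frame
device.queue.writeBuffer(globalUniformBuffer, 0, globalUniformValues);

// per object, we now only write the smaller PerObjectUniforms values
device.queue.writeBuffer(uniformBuffer, 0, perObjectUniformValues);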
On my machine, with that change, our math portion dropped ~16%.
Optimization: Separate more uniforms
A common organization in a 3D library is to have “models” (the vertex data),
“materials” (the colors, shininess, and textures), “lights” (which lights to
use), “viewInfo” (the view and projection matrix). In particular, in our
example, color and shininess never change so it’s a waste to keep copying
them to the uniform buffer every frame.
Let’s make a uniform buffer per material. We’ll copy the material settings into
them at init time and then just add them to our bind group.
First let’s change the shaders to use another uniform buffer.
When we set up the per object info we no longer need to pass on the material
settings. Instead we just need to add the material’s uniform buffer to the
object’s bind group.
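As a sketch, where the MaterialUniforms layout (a vec4f color and an f32 shininess, padded to 32 bytes) and the binding numbers are assumptions:
// fill one uniform buffer per material, once, at init time
materials.forEach(material => {
  const materialValues = new Float32Array(8);  // 32 bytes
  materialValues.set(material.color, 0);       // color: vec4f
  materialValues[4] = material.shininess;      // shininess: f32
  material.uniformBuffer = createBufferWithData(
      device, materialValues, GPUBufferUsage.UNIFORM);
});

// per object, just reference the buffers in the bind group
const bindGroup = device.createBindGroup({
  layout: pipeline.getBindGroupLayout(0),
  entries: [
    { binding: 0, resource: material.sampler },
    { binding: 1, resource: material.texture.createView() },
    { binding: 2, resource: { buffer: uniformBuffer } },           // per object
    { binding: 3, resource: { buffer: material.uniformBuffer } },  // per material
    { binding: 4, resource: { buffer: globalUniformBuffer } },     // shared
  ],
});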
Optimization: Use One large Uniform Buffer with buffer offsets
Right now, each object has its own uniform buffer. At render time, for each
object, we update a typed array with the uniform values for that object and then
call device.queue.writeBuffer to update that single uniform buffer’s values.
If we’re rendering 8000 objects that’s 8000 calls to device.queue.writeBuffer.
Instead, we could make one larger uniform buffer. We can then set up the bind
group for each object to use its own portion of the larger buffer. At render
time, we can update all the values for all of the objects in one large typed
array and make just one call to device.queue.writeBuffer which should be
faster.
First let’s allocate a large uniform buffer and large typed array. Uniform
buffer offsets have a minimum alignment which defaults to 256 bytes so we’ll
round up the size we need per object to 256 bytes.
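A sketch of the allocation, where roundUp is a small assumed helper and uniformBufferSize is the per-object size from before:
const roundUp = (v, alignment) => Math.ceil(v / alignment) * alignment;

const uniformBufferSpace = roundUp(
    uniformBufferSize, device.limits.minUniformBufferOffsetAlignment);
const uniformBuffer = device.createBuffer({
  label: 'uniforms for all objects',
  size: uniformBufferSpace * maxObjects,
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});
const uniformValues = new Float32Array((uniformBufferSpace * maxObjects) / 4);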
Now we can change the per object views to view into that large typedarray. We
can also set the bind group to use the correct portion of the large uniform
buffer.
for (let i = 0; i < maxObjects; ++i) {
  const uniformBufferOffset = i * uniformBufferSpace;
  const f32Offset = uniformBufferOffset / 4;

  // offsets to the various uniform values in float32 indices
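Continuing that loop, here’s a sketch of one of the views and of the bind group entry that references this object’s slice of the big buffer. kWorldOffset and the binding numbers are assumptions. After the loop, a single writeBuffer per frame replaces the thousands of per-object calls.
  // each view now points into the shared typed array at this object's offset
  const worldValue = uniformValues.subarray(
      f32Offset + kWorldOffset, f32Offset + kWorldOffset + 16);

  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: material.sampler },
      { binding: 1, resource: material.texture.createView() },
      {
        binding: 2,
        resource: {
          buffer: uniformBuffer,
          offset: uniformBufferOffset,  // this object's slice
          size: uniformBufferSize,
        },
      },
    ],
  });
}

// at render time, after updating every object's values in the one typed array
device.queue.writeBuffer(uniformBuffer, 0, uniformValues);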
On my machine that shaved off 40% of the JavaScript time!
Optimization: Use Mapped Buffers
When we call device.queue.writeBuffer, what happens is, WebGPU makes a copy of
the data in the typed array. It copies that data to the GPU process (a separate
process that talks to the GPU for security). In the GPU process that data is
then copied to the GPU Buffer.
We can skip one of those copies by using mapped buffers instead. We’ll map a
buffer, update the uniform values directly into that mapped buffer. Then we’ll
unmap the buffer and issue a copyBufferToBuffer command to copy to the uniform
buffer. This will save a copy.
WebGPU mapping happens asynchronously so rather than map a buffer and wait for
it to be ready, we’ll keep an array of already mapped buffers. Each frame, we
either get an already mapped buffer or create a new one that is already mapped.
After we render, we’ll setup a callback to map the buffer when it’s available
and put it back on the list of already mapped buffers. This way, we’ll never
have to wait for a mapped buffer.
First we’ll make an array of mapped buffers and a function to either get a
pre-mapped buffer or make a new one.
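A sketch of that pool. The size matches the big uniform buffer, and the usage flags are the COPY_SRC and MAP_WRITE mentioned later in this article.
const mappedTransferBuffers = [];
const getMappedTransferBuffer = () => {
  return mappedTransferBuffers.pop() || device.createBuffer({
    label: 'transfer buffer',
    size: uniformBufferSpace * maxObjects,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
    mappedAtCreation: true,
  });
};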
We can’t pre-create typedarray views anymore because mapping
a buffer gives us a new ArrayBuffer. So, we’ll have to
make new typedarray views after mapping.
// offsets to the various uniform values in float32 indices
const kNormalMatrixOffset = 0;
const kWorldOffset = 12;

for (let i = 0; i < maxObjects; ++i) {
  const uniformBufferOffset = i * uniformBufferSpace;
  const f32Offset = uniformBufferOffset / 4;
At render time we encode a command to copy the transfer buffer
to the uniform buffer before we start looping through the
objects. This is because copyBufferToBuffer is
a command on the GPUCommandEncoder, and we need it to run before
the objects are rendered, but as we loop over the objects we’re
encoding render pass commands to render them. Before, we called
device.queue.writeBuffer after updating the typed arrays, which,
of course, executes first because we had not yet called submit
on our commands. In this case though, our copy actually is a command,
so we have to encode it before the draw commands. This is fine because,
remember, it’s just a command. It will not be executed until we
submit the command buffer, which means we can still update the transfer
buffer since the copy has not yet happened.
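Here’s a sketch of that ordering:
const transferBuffer = getMappedTransferBuffer();
const uniformValues = new Float32Array(transferBuffer.getMappedRange());

const encoder = device.createCommandEncoder();
// encode the copy first, before the render pass
encoder.copyBufferToBuffer(transferBuffer, 0, uniformBuffer, 0, uniformValues.byteLength);

const pass = encoder.beginRenderPass(renderPassDescriptor);
// ... loop over the objects, writing their values into uniformValues
// (via fresh typed array views) and encoding their draw calls ...
pass.end();

// unmap before submitting so the copy is allowed to read the buffer
transferBuffer.unmap();
device.queue.submit([encoder.finish()]);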
Finally, as soon as we’ve submitted the command buffer we map the buffer again.
Mapping is asynchronous so when it’s finally ready we’ll add it back to the list
of already mapped buffers.
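Something like this:
transferBuffer.mapAsync(GPUMapMode.WRITE).then(() => {
  mappedTransferBuffers.push(transferBuffer);
});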
With rendering unchecked, the difference is even bigger. For me I get 9000 at
75fps with the original non-optimized example and 18000 at 75fps in this last
version. That’s a 2x speed up!
Other things that might help
Double buffer the large uniform buffer
This comes up as a possible optimization because WebGPU cannot update a
buffer that is currently in use.
So, imagine you start rendering (you call device.queue.submit). The GPU
starts rendering using our large uniform buffer. You immediately try to update
that buffer. In this case, WebGPU would have to pause and wait for the GPU to
finish using the buffer for rendering.
This is unlikely to happen in our example above. We don’t directly update the
uniform buffer. Instead we update a transfer buffer and then later, ask the
GPU to copy it to the uniform buffer.
This issue would be more likely to come up if we update a buffer directly on
the GPU using a compute shader.
Pass offsets to the math functions
In our loop where we update our per object uniform values, for each object we
have to create 2 Float32Array views into our mapped buffer, because our math
functions expect a typed array they can write into. For 20000 objects that’s
creating 40000 of these temporary views.
Adding offsets to every input would make the math functions burdensome to use
in my opinion but, just as a test, I wrote a modified version of the math
functions that take an offset. In other words:
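For example, a hypothetical offset-taking function might look like this, writing directly into the big mapped array instead of into a per-object view:
// transpose a 4x4 matrix, reading from src at srcOffset and
// writing into dst at dstOffset (both are Float32Arrays)
function transpose4x4(src, srcOffset, dst, dstOffset) {
  for (let row = 0; row < 4; ++row) {
    for (let col = 0; col < 4; ++col) {
      dst[dstOffset + col * 4 + row] = src[srcOffset + row * 4 + col];
    }
  }
}

// usage: no temporary Float32Array view needed per object
// transpose4x4(someMatrix, 0, uniformValues, f32Offset + kWorldOffset);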
It’s up to you if you feel that’s worth it. For me personally, like I
mentioned at the top of the article, I’d prefer to keep it simple to use. I’m
rarely trying to draw 10000 things. But, it’s good to know, if I wanted to
squeeze out more performance, this is one place I might find some. More likely
I’d look into WebAssembly if I needed to go that far.
Directly map the uniform buffer
In our example above we map a transfer buffer, a buffer that only has
COPY_SRC and MAP_WRITE usage flags. We then have to call
encoder.copyBufferToBuffer to copy the contents of that buffer into the
actual uniform buffer.
It would be much nicer if we could directly map the uniform buffer and avoid
the copy. Unfortunately, that ability is not available in WebGPU version 1, but
it is being considered as an optional feature sometime in the future,
especially for unified memory architectures like some ARM based devices.
Indirect Drawing
Indirect drawing refers to draw commands that take their parameters from a GPU buffer.
pass.draw(vertexCount, instanceCount, firstVertex, firstInstance);  // direct
pass.drawIndirect(someBuffer, offsetIntoSomeBuffer);                // indirect
In the indirect case above, someBuffer is a 16 byte portion of a GPU buffer that holds
[vertexCount, instanceCount, firstVertex, firstInstance].
The advantage to indirect draw is that you can have the GPU itself fill out the values.
You can even have the GPU set vertexCount and/or instanceCount to zero when you
don’t want that thing to be drawn.
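A sketch of setting one up. Here the values are written from JavaScript just to show the layout; the point of the technique is that a compute shader could write them instead. numVertices is an assumption.
// 16 bytes: [vertexCount, instanceCount, firstVertex, firstInstance]
const indirectBuffer = device.createBuffer({
  size: 16,
  usage: GPUBufferUsage.INDIRECT | GPUBufferUsage.COPY_DST,
});
device.queue.writeBuffer(indirectBuffer, 0, new Uint32Array([
  numVertices,  // vertexCount
  1,            // instanceCount
  0,            // firstVertex
  0,            // firstInstance
]));

// at render time
pass.drawIndirect(indirectBuffer, 0);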
Using indirect drawing, you could, for example, pass all of the
objects’ bounding boxes or bounding spheres to the GPU and then have the GPU do
frustum culling. If an object is inside the frustum, the GPU updates that
object’s indirect drawing parameters so it gets drawn, otherwise it updates them
so it is not drawn. “Frustum culling” is a fancy way to say “check if the object
is possibly inside the frustum of the camera”. We talked about frustums in
the article on perspective projection.
Render Bundles
Render bundles let you pre-record a bunch of command buffer commands and then
request them to be executed later. This can be useful, especially if your
scene is relatively static, meaning you don’t need to add or remove objects
later.
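A minimal sketch of the API, with formats, buffers, and commands assumed to match the example above:
const bundleEncoder = device.createRenderBundleEncoder({
  colorFormats: [presentationFormat],
  depthStencilFormat: 'depth24plus',
});
// record the same kinds of commands you'd record on a render pass
bundleEncoder.setPipeline(pipeline);
bundleEncoder.setVertexBuffer(0, vertexBuffer);
bundleEncoder.setIndexBuffer(indicesBuffer, 'uint16');
bundleEncoder.setBindGroup(0, bindGroup);
bundleEncoder.drawIndexed(numVertices);
const bundle = bundleEncoder.finish();

// later, each frame, replay it inside a render pass
pass.executeBundles([bundle]);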
There’s a great article here
that combines render bundles, indirect draws, and GPU frustum culling to show
some ideas for getting more speed in specialized situations.