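We usually use HxWxC (height, width, channels): each pixel is addressed by the first two dimensions of the input, and the last dimension holds its 3 channels. Of course, you can transpose the tensor to CxHxW or CxWxH. The orderings differ in memory locality: in a contiguous row-major HxWxC tensor, the channel values of one pixel sit next to each other in memory, whereas in CxHxW each channel is stored as its own contiguous plane.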
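To make the locality point concrete, here is a minimal sketch using NumPy. NumPy, the tiny 4x5 image size, and the variable names are assumptions chosen for illustration; other frameworks may have different defaults. The strides show how many bytes you skip when stepping along each dimension.

```python
import numpy as np

# Hypothetical tiny image: 4 rows, 5 columns, 3 channels, 1 byte per value.
H, W, C = 4, 5, 3
hwc = np.arange(H * W * C, dtype=np.uint8).reshape(H, W, C)

# Row-major HxWxC: the channels of one pixel are adjacent (stride 1 byte).
print(hwc.strides)       # bytes to step along H, W, C

# transpose() only returns a view: the shape becomes CxHxW, but the bytes
# in memory are still laid out in HxWxC order.
chw_view = hwc.transpose(2, 0, 1)
print(chw_view.strides)

# Copying into a contiguous array actually reorders memory, so each
# channel becomes one contiguous H*W plane.
chw = np.ascontiguousarray(chw_view)
print(chw.strides)

# Same pixel, same values, whichever layout is used.
print(hwc[2, 3, :], chw[:, 2, 3])
```

Which layout is faster depends on the access pattern: per-pixel operations that touch all channels tend to favor HxWxC, while per-channel operations tend to favor CxHxW.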