The Tensor (Data Structure)

In machine-learning software, a tensor is the workhorse data structure: an n-dimensional array of numbers, all of the same type, that holds everything from a single weight to an entire batch of images. A scalar is a zero-dimensional tensor, a vector is one-dimensional, a matrix is two-dimensional, and a batch of color images is four-dimensional. The word borrows from mathematics, but as a software object a tensor is simpler than the mathematical notion - it is essentially a multi-dimensional array with a fixed element type and some metadata describing how its numbers are laid out in memory.
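As a rough illustration of the dimension counts above, the sketch below uses NumPy arrays as stand-ins for tensors. The variable names and the (batch, channels, height, width) ordering of the image batch are assumptions chosen for the example; frameworks differ on that ordering.

```python
import numpy as np

# NumPy arrays stand in for tensors here; all names are illustrative.
scalar = np.array(3.14, dtype=np.float32)             # 0-D: a single number
vector = np.zeros(5, dtype=np.float32)                # 1-D
matrix = np.zeros((3, 4), dtype=np.float32)           # 2-D
# Assumed layout: (batch, channels, height, width).
images = np.zeros((8, 3, 32, 32), dtype=np.float32)   # 4-D

for name, t in [("scalar", scalar), ("vector", vector),
                ("matrix", matrix), ("images", images)]:
    # ndim is the number of dimensions; shape gives the size of each one.
    print(name, t.ndim, t.shape, t.dtype)
```

Every object in the loop is the same kind of array; only the shape metadata differs.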

PyTorch’s documentation states the core definition plainly: “A torch.Tensor is a multi-dimensional matrix containing elements of a single data type.” That single-type constraint is a large part of what makes tensors fast - the whole array can be processed by one tight loop of compiled code, with no per-element type checks. A tensor carries a dtype (such as 32-bit float or 8-bit integer) and a device (CPU or a particular GPU), so the same logical array can live in main memory or in accelerator memory with little or no change to the code that operates on it.
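The single-type rule can be seen in a small sketch, shown here with NumPy rather than PyTorch so that it runs without a GPU framework installed. NumPy arrays have no device attribute, so only the dtype half of the description is illustrated; the variable names are made up for the example.

```python
import numpy as np

# Mixing ints and a float does not give a mixed array:
# NumPy promotes every element to one common dtype.
mixed = np.array([1, 2.5, 3])
print(mixed.dtype)                        # float64

# The dtype fixes the size of every element, and so the total buffer size.
weights = np.ones((2, 3), dtype=np.float32)
pixels = np.zeros((2, 3), dtype=np.uint8)
print(weights.itemsize, weights.nbytes)   # 4 24
print(pixels.itemsize, pixels.nbytes)     # 1 6

# Changing the element type is an explicit conversion that builds a new array.
as_float = pixels.astype(np.float32)
print(as_float.dtype, as_float.nbytes)    # float32 24
```

Because every element has the same size, the position of any element can be computed arithmetically, which is what lets a compiled loop skip per-element type checks.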

The lineage runs straight back to NumPy’s ndarray, which established the design that the ML frameworks adopted. The NumPy documentation describes the ndarray as “a (usually fixed-size) multidimensional container of items of the same type and size,” whose dimensions are given by a shape - a tuple of integers - and whose element type is a separate dtype object. Critically, NumPy stores the data as one flat block and uses strides, a tuple giving the number of bytes to step along each dimension, to map a multi-dimensional index onto a position in that flat memory. This strided layout is what lets operations like transposing or slicing be performed by changing the metadata rather than copying the data.
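The index-to-memory mapping can be worked through by hand. The following is a minimal sketch for a C-contiguous array; the helper byte_offset is hypothetical, written only for this illustration, and is not a NumPy function.

```python
import numpy as np

a = np.arange(12, dtype=np.int32).reshape(3, 4)
print(a.shape)     # (3, 4)
print(a.strides)   # (16, 4): 16 bytes to the next row, 4 bytes to the next column

def byte_offset(index, strides):
    """Hypothetical helper: byte position of `index` within the flat buffer."""
    return sum(i * s for i, s in zip(index, strides))

i, j = 2, 1
offset = byte_offset((i, j), a.strides)   # 2*16 + 1*4 = 36 bytes

# For a C-contiguous array, ravel() exposes the flat data in memory order,
# so dividing the byte offset by the element size gives the flat position.
flat = a.ravel()
value = flat[offset // a.itemsize]        # element 36 // 4 = 9
print(value, a[i, j])                     # 9 9
```

The same arithmetic applies to any number of dimensions: one multiply-and-add per axis.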

That strided design is the quiet reason tensors are efficient. Reshaping, transposing, or taking a sub-array can often be done as a “view” - a new shape-and-strides descriptor pointing at the same underlying buffer - so no numbers move. The framework copies only when it must, for example when an operation needs contiguous data and the view it was given is not contiguous. Combined with a single dtype, this gives the predictable, regular (and in the common case contiguous) memory access that compiled kernels, SIMD units, and GPUs need to run at full speed.
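The difference between a view and a copy can be checked directly. This sketch again uses NumPy as a stand-in for a tensor library; the variable names are illustrative.

```python
import numpy as np

a = np.arange(6, dtype=np.int64).reshape(2, 3)

# Transposing swaps the strides; the buffer is untouched.
t = a.T
print(a.strides, t.strides)        # (24, 8) (8, 24)
print(np.shares_memory(a, t))      # True

# A slice is also a view, so writing through it changes the original.
s = a[:, 1:]
s[0, 0] = 100
print(a[0, 1])                     # 100

# The transposed view is not C-contiguous, so asking for contiguous
# data is one of the cases where a copy has to be made.
c = np.ascontiguousarray(t)
print(t.flags["C_CONTIGUOUS"], c.flags["C_CONTIGUOUS"])   # False True
print(np.shares_memory(a, c))      # False
```

Views are cheap but aliased: a write through one is visible through every other view of the same buffer, which is worth keeping in mind when mutating data in place.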

The tensor is the common currency that ties the rest of the ML stack together. Vectorized operations consume and produce tensors, broadcasting reconciles tensors of different shapes, the computational graph records operations on tensors, and automatic differentiation computes gradients that are themselves tensors. Get the tensor right - dtype, shape, strides, device - and everything built on top of it inherits its speed.
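Of the layers listed above, broadcasting is the easiest to relate back to the tensor's metadata. The sketch below, again in NumPy with made-up names, adds a per-column bias to a small batch and then uses np.broadcast_to to show one way the shape mismatch can be reconciled: a stride of zero along the repeated axis, so no data is duplicated.

```python
import numpy as np

batch = np.arange(6, dtype=np.float32).reshape(2, 3)   # shape (2, 3)
bias = np.array([10, 20, 30], dtype=np.float32)        # shape (3,)

# One vectorized operation; bias is applied to every row of batch.
out = batch + bias
print(out.shape)      # (2, 3)
print(out.tolist())   # [[10.0, 21.0, 32.0], [13.0, 24.0, 35.0]]

# The same stretching, made explicit as a read-only view:
# stride 0 on the first axis re-reads the same three numbers for each row.
stretched = np.broadcast_to(bias, batch.shape)
print(stretched.strides)                  # (0, 4)
print(np.shares_memory(bias, stretched))  # True
```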