Normalization

Normalization is the practice of structuring the tables of a relational database so that the same fact is not stored in more than one place. The goal is to eliminate redundancy, which in turn removes the “update anomalies” that arise when one copy of a fact is changed and another is left stale. The technique was introduced by Edgar F. Codd, who in his 1970 paper “A Relational Model of Data for Large Shared Data Banks” already described reducing a collection of relations to what he called a normal form.

Codd developed the idea further in his 1971 IBM research report “Further Normalization of the Data Base Relational Model,” where he defined a progression of normal forms. The stated objectives were to make the collection of relations easier to understand and control, simpler to operate upon, and more informative to the casual user, while reducing the need to restructure relations as new kinds of data are introduced.

The core tool of normalization is the functional dependency: a constraint stating that the values in one set of columns determine the values in another. First normal form requires that every column hold a single, atomic value rather than a repeating group. Second and third normal forms were both defined in the 1971 report. Second normal form removes partial dependencies, in which a non-key column depends on only part of a composite key. Third normal form removes transitive dependencies, in which a non-key column depends on another non-key column. A later refinement, Boyce-Codd normal form, tightens the third-form rule by requiring every determinant in the relation to be a candidate key.
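As a rough illustration of a functional dependency, the sketch below checks whether one set of columns determines another in a small sample of rows. The `orders` data, its column names, and the `fd_holds` helper are all hypothetical and not taken from Codd's papers. A check against sample data can only refute a dependency, never prove it, because a functional dependency is a rule about every valid state of the table.

```python
from typing import Any, Mapping, Sequence


def fd_holds(
    rows: Sequence[Mapping[str, Any]],
    lhs: Sequence[str],
    rhs: Sequence[str],
) -> bool:
    """Return True if, in these rows, equal lhs values always come with equal rhs values."""
    seen: dict[tuple, tuple] = {}
    for row in rows:
        determinant = tuple(row[col] for col in lhs)
        dependent = tuple(row[col] for col in rhs)
        # The first occurrence records the dependent value; later ones must match it.
        if seen.setdefault(determinant, dependent) != dependent:
            return False
    return True


# Hypothetical order-line table whose key is (order_id, product_id).
orders = [
    {"order_id": 1, "product_id": "P1", "product_name": "Widget", "quantity": 2},
    {"order_id": 1, "product_id": "P2", "product_name": "Gadget", "quantity": 1},
    {"order_id": 2, "product_id": "P1", "product_name": "Widget", "quantity": 5},
]

# quantity depends on the whole key.
whole_key = fd_holds(orders, ["order_id", "product_id"], ["quantity"])

# product_name depends on product_id alone, which is only part of the key:
# the kind of partial dependency that second normal form removes.
partial = fd_holds(orders, ["product_id"], ["product_name"])

# order_id alone does not determine quantity (order 1 has two quantities).
order_only = fd_holds(orders, ["order_id"], ["quantity"])
```

In this sample, `product_name` is repeated on every row that mentions the same `product_id`, which is the redundancy that moving product names into their own table would remove.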

In practice, normalization is a design discipline: a well-normalized schema decomposes a wide, redundant table into several narrower ones linked by keys, so that each fact is stored in exactly one place and only key values are repeated as references. Joins then reassemble the pieces when a query needs them. Designers sometimes deliberately denormalize for performance, but only after understanding the consistency guarantees they are trading away.
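The decomposition described above can be sketched with Python's built-in `sqlite3` module. The `product` and `order_line` tables and their contents are hypothetical, chosen only to show a product name stored once and then reattached to each order line by a join.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

conn.executescript(
    """
    -- Each product name is stored exactly once, keyed by product_id.
    CREATE TABLE product (
        product_id   TEXT PRIMARY KEY,
        product_name TEXT NOT NULL
    );

    -- Order lines carry only the key of the product, not its name.
    CREATE TABLE order_line (
        order_id   INTEGER NOT NULL,
        product_id TEXT    NOT NULL REFERENCES product(product_id),
        quantity   INTEGER NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );

    INSERT INTO product VALUES ('P1', 'Widget'), ('P2', 'Gadget');
    INSERT INTO order_line VALUES (1, 'P1', 2), (1, 'P2', 1), (2, 'P1', 5);
    """
)

# Renaming a product touches one row, so no stale copy can be left behind.
conn.execute(
    "UPDATE product SET product_name = 'Widget Pro' WHERE product_id = 'P1'"
)

# A join reassembles the wide view when a query needs it.
rows = conn.execute(
    """
    SELECT o.order_id, o.product_id, p.product_name, o.quantity
    FROM order_line AS o
    JOIN product AS p ON p.product_id = o.product_id
    ORDER BY o.order_id, o.product_id
    """
).fetchall()

for row in rows:
    print(row)

conn.close()
```

Had the product name been kept on every order line instead, the same rename would have needed to update two rows, and missing either one would have produced the kind of update anomaly that normalization is meant to prevent.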