Database normalisation can be described as the process of organising data in a database or table, following a set of standard rules to reduce redundancy, speed up query processing and reduce required disk space. We can also think about database normalisation as a way to organise tables in an anticipated manner for the database users, such as, for example, having only contact details in a contact details table and not invoice number (Microsoft, 2023). The rules applied are called normal forms (NF) and is generally acceptable for data in a database to be organised till the 3rd Normal Form. More rules beyond the 3rd Normal Form do exist, such as the Boyce-Cood Normal Form or, as otherwise called, the 3.5 Normal Form, but are “rarely used outside the academic context” (Rouse, 2014).
In this exercise, we are provided with an initial data table in an un-normalised form. The task is to normalise the table down to the 3rd Normal Form by showing and explaining each subsequent step (1st Normal Form, 2nd Normal Form, 3rd Normal Form).
The table headers look as per the below table and is in 0NF (Un-normalised):
Student Number | Student Name | Exam Score | Support | Date of Birth | Course Name | Exam Boards | Teacher Name |
---|---|---|---|---|---|---|---|
1001 | Bob Baker | 78 | No | 25/08/2001 | Computer Science Maths Physics |
BCS EdExcel OCR |
Mr Jones Ms Parker Mr Peters |
1002 | Sally Davies | 55 | Yes | 02/10/1999 | Maths Biology Music |
AQA WJEC AQA |
Mr Parker Mr Patel Ms Daniels |
1003 | Mark Hanmill | 90 | No | 05/06/1995 | Computer Science Maths Physics |
BCS EdExcel OCR |
Mr Jones Ms Parker Mr Peters |
1004 | Anas Ali | 70 | No | 03/08/1990 | Maths Physics Biology |
AQA OCR WJEC |
Ms Parker Mr Peters Mrs Patel |
1005 | Cheuk Yin | 45 | Yes | 01/05/2022 | Computer Science Maths Music |
BCS EdExcel AQA |
Mr Jones Ms Parker Ms Daniels |
A table is considered to be in 1NF, if it (Hillyer, 2005):
1. Contains only atomic/single values in each and every cell (e.g. a cell cannot have the date of birth and date of school graduation at the same time or the address and the phone number at the same time);
2. The data type/domain attribute needs to be of the same type for each column (e.g. a cell cannot have date values and integer or text values at the same time);
3. Each column name should be unique and not be repeated.
The provided table presents the following issue, which violates the constraint of atomic values:
- Columns ‘Course Name’, ‘Exam Boards’, and ‘Teacher Name’ are not atomic as they contain multiple items. For this exercise, names, we will be considered atomic values as ‘full names’.
None of the other two constraints are violated.
We can also notice that duplicate values in each cell are horizontally distributed (a duplicate value in the Course Name column corresponds to a value in the Exam Board column and the Teacher Name column).
In order to solve the issues, we split our table and reformat it by duplicating each row and keeping only the atomic values. In cells where there are no atomic values, we keep the same data:
Student Number | Student Name | Exam Score | Support | Date of Birth | Course Name | Exam Boards | Teacher Name |
---|---|---|---|---|---|---|---|
1001 | Bob Baker | 78 | No | 25/08/2001 | Computer Science | BCS | Mr Jones |
1001 | Bob Baker | 78 | No | 25/08/2001 | Maths | EdExcel | Ms Parker |
1001 | Bob Baker | 78 | No | 25/08/2001 | Physics | CR | Mr Peters |
1002 | Sally Davies | 55 | Yes | 02/10/1999 | Maths | AQA | Mr Parker |
1002 | Sally Davies | 55 | Yes | 02/10/1999 | Biology | WJEC | Mr Patel |
1002 | Sally Davies | 55 | Yes | 02/10/1999 | Music | AQA | Ms Daniels |
1003 | Mark Hanmill | 90 | No | 05/06/1995 | Computer Science | BCS | Mr Jones |
1003 | Mark Hanmill | 90 | No | 05/06/1995 | Maths | EdExcel | Ms Parker |
1003 | Mark Hanmill | 90 | No | 05/06/1995 | Physics | OCR | Mr Peters |
1004 | Anas Ali | 70 | No | 03/08/1990 | Maths | AQA | Ms Parker |
1004 | Anas Ali | 70 | No | 03/08/1990 | Physics | OCR | Mr Peters |
1004 | Anas Ali | 70 | No | 03/08/1990 | Biology | WJEC | Mrs Patel |
1005 | Cheuk Yin | 45 | Yes | 01/05/2022 | Computer Science | BCS | Mr Jones |
1005 | Cheuk Yin | 45 | Yes | 01/05/2022 | Maths | EdExcel | Ms Parker |
1005 | Cheuk Yin | 45 | Yes | 01/05/2022 | Music | AQA | Ms Daniels |
With a composite primary key in the columns Student Number and Course Name, the above table satisfies all the requirements of the 1st Normal Form; therefore, it is in 1NF. Each observation (column) may have duplicate values, but each instance (row) is unique.
A table is considered to be in 2NF, if it (Hillyer, 2005):
1. Is already in 1NF;
2. Has no partial dependencies, meaning that every non-candidate key column should depend on the whole identified candidate key (potential primary key).
The provided table presents the following issues violating the constraints of partial dependency:
- With identified multi-value candidate key of ‘Student Number’ and ‘Course Name’, we observe that the columns ‘Exam Boards’ and ‘Teacher Name’ depend only on a part of the candidate key which is the ‘Course Name’)
In order to solve the issue, we will need to split our table into two tables by adding a foreign key to connect the two tables:
Student Number | Student Name | Exam Score | Support | Date of Birth |
---|---|---|---|---|
1001 | Bob Baker | 78 | No | 25/08/2001 |
1002 | Sally Davies | 55 | Yes | 02/10/1999 |
1003 | Mark Hanmill | 90 | No | 05/06/1995 |
1004 | Anas Ali | 70 | No | 03/08/1990 |
1005 | Cheuk Yin | 45 | Yes | 01/05/2022 |
Student Number | Course Name | Exam Boards | Teacher Name |
---|---|---|---|
1001 | Computer Science | BCS | Mr Jones |
1001 | Maths | EdExcel | Ms Parker |
1001 | Physics | CR | Mr Peters |
1002 | Maths | AQA | Mr Parker |
1002 | Biology | WJEC | Mr Patel |
1002 | Music | AQA | Ms Daniels |
1003 | Computer Science | BCS | Mr Jones |
1003 | Maths | EdExcel | Ms Parker |
1003 | Physics | OCR | Mr Peters |
1004 | Maths | AQA | Ms Parker |
1004 | Physics | OCR | Mr Peters |
1004 | Biology | WJEC | Mrs Patel |
1005 | Computer Science | BCS | Mr Jones |
1005 | Maths | EdExcel | Ms Parker |
1005 | Music | AQA | Ms Daniels |
The duplicate values have also been removed from the Student table. We can observe that the Course table has the same composite key as before, creating conflict with the 2NF requirements, as there are partial dependencies with the Teacher Name being dependant on the course name only. For this reason we split the table further:
Course Name | Teacher Name |
---|---|
Computer Science | Mr Jones |
Maths | Ms Parker |
Physics | Mr Peters |
Biology | Mr Patel |
Music | Ms Daniels |
Student Number | Course Name | Exam Boards |
---|---|---|
1001 | Computer Science | BCS |
1001 | Maths | EdExcel |
1001 | Physics | CR |
1002 | Maths | AQA |
1002 | Biology | WJEC |
1002 | Music | AQA |
1003 | Computer Science | BCS |
1003 | Maths | EdExcel |
1003 | Physics | OCR |
1004 | Maths | AQA |
1004 | Physics | OCR |
1004 | Biology | WJEC |
1005 | Computer Science | BCS |
1005 | Maths | EdExcel |
1005 | Music | AQA |
The final tables are the below:
Course Name | Teacher Name |
---|---|
Computer Science | Mr Jones |
Maths | Ms Parker |
Physics | Mr Peters |
Biology | Mr Patel |
Music | Ms Daniels |
Student Number | Course Name | Exam Boards |
---|---|---|
1001 | Computer Science | BCS |
1001 | Maths | EdExcel |
1001 | Physics | CR |
1002 | Maths | AQA |
1002 | Biology | WJEC |
1002 | Music | AQA |
1003 | Computer Science | BCS |
1003 | Maths | EdExcel |
1003 | Physics | OCR |
1004 | Maths | AQA |
1004 | Physics | OCR |
1004 | Biology | WJEC |
1005 | Computer Science | BCS |
1005 | Maths | EdExcel |
1005 | Music | AQA |
Student Number | Student Name | Exam Score | Support | Date of Birth |
---|---|---|---|---|
1001 | Bob Baker | 78 | No | 25/08/2001 |
1002 | Sally Davies | 55 | Yes | 02/10/1999 |
1003 | Mark Hanmill | 90 | No | 05/06/1995 |
1004 | Anas Ali | 70 | No | 03/08/1990 |
1005 | Cheuk Yin | 45 | Yes | 01/05/2022 |
The above tables satisfy all the requirements of the 2nd Normal Form therefore they are in 2NF.
For a table to be considered in 3NF it needs to (Hillyer, 2005):
1. Be already in 2NF;
2. Have no transitive dependencies, meaning that every non key column should not depend on non-key columns.
Due to the way that 2NF was achieved, all the tables already satisfy the 3rd Normal Form, and therefore they are in 3NF.
The final tables are:
Course Name | Teacher Name |
---|---|
Computer Science | Mr Jones |
Maths | Ms Parker |
Physics | Mr Peters |
Biology | Mr Patel |
Music | Ms Daniels |
Student Number | Course Name | Exam Boards |
---|---|---|
1001 | Computer Science | BCS |
1001 | Maths | EdExcel |
1001 | Physics | CR |
1002 | Maths | AQA |
1002 | Biology | WJEC |
1002 | Music | AQA |
1003 | Computer Science | BCS |
1003 | Maths | EdExcel |
1003 | Physics | OCR |
1004 | Maths | AQA |
1004 | Physics | OCR |
1004 | Biology | WJEC |
1005 | Computer Science | BCS |
1005 | Maths | EdExcel |
1005 | Music | AQA |
Student Number | Student Name | Exam Score | Support | Date of Birth |
---|---|---|---|---|
1001 | Bob Baker | 78 | No | 25/08/2001 |
1002 | Sally Davies | 55 | Yes | 02/10/1999 |
1003 | Mark Hanmill | 90 | No | 05/06/1995 |
1004 | Anas Ali | 70 | No | 03/08/1990 |
1005 | Cheuk Yin | 45 | Yes | 01/05/2022 |
The data build task requires to create the actual relational database based on the tables of the Normalisation Task. The database should have the proper schema created, with linked tables mapped through the foreign keys, and primary key properly recorded. The database should enforce referential integrity.
The tables that we start with from the Normalisation Task are the below:
Course Name | Teacher Name |
---|---|
Computer Science | Mr Jones |
Maths | Ms Parker |
Physics | Mr Peters |
Biology | Mr Patel |
Music | Ms Daniels |
Student Number | Course Name | Exam Boards |
---|---|---|
1001 | Computer Science | BCS |
1001 | Maths | EdExcel |
1001 | Physics | CR |
1002 | Maths | AQA |
1002 | Biology | WJEC |
1002 | Music | AQA |
1003 | Computer Science | BCS |
1003 | Maths | EdExcel |
1003 | Physics | OCR |
1004 | Maths | AQA |
1004 | Physics | OCR |
1004 | Biology | WJEC |
1005 | Computer Science | BCS |
1005 | Maths | EdExcel |
1005 | Music | AQA |
Student Number | Student Name | Exam Score | Support | Date of Birth |
---|---|---|---|---|
1001 | Bob Baker | 78 | No | 25/08/2001 |
1002 | Sally Davies | 55 | Yes | 02/10/1999 |
1003 | Mark Hanmill | 90 | No | 05/06/1995 |
1004 | Anas Ali | 70 | No | 03/08/1990 |
1005 | Cheuk Yin | 45 | Yes | 01/05/2022 |
The database tables are in the below format:
The following checks have been performed in the created table, in order to ensure that referential integrity is properly enforced:
1. Ensuring the no duplicate teachers can be added in the Teacher table;
2. Ensuring that no duplicate students can be added in the Student table;
3. Ensuring that no duplicate exam boards can be added in the Exam Board table;
4. Ensuring that no duplicate combinations of student number and course name can be added in the courses table;
5. Ensuring the added row in the courses table represents existing students (student table), courses (teachers table) and exam boards (exam board table);
The script used for the testing can be found here.
The final notes and the working theory is the following:
- Additional tables created are to ensure referential integrity (e.g. exam board);
- It is good practice to create global unique identifiers (UUIDs/GUIDs) in order to ensure uniqueness of primary IDs. This would also avoid the presence of composite keys. Though these have not been created as we followed the exact columns provided in the exercise;
- Normalization may be performed in different ways and results between two database designers may differ;
- To ensure referential integrity apart from creating the primary and secondary keys we could also follow a different approach where the respective record in created automatically if it is missing on a table, during the insert queries. For example, in our design if we try to add a new student with a course under the table courses the query will return an error as the new student needs to be created in the student table first. If we create a trigger then we can ask the database to create the new student (providing some additional details), instead of returning the error.