Panagiotis Koilakos

Associate Operational Data Management Officer, UNHCR

Go back

MSc in Data Science - Deciphering Big Data

Normalisation and Data Build Task

Normalization Task

 Database normalisation can be described as the process of organising data in a database or table, following a set of standard rules to reduce redundancy, speed up query processing and reduce required disk space. We can also think about database normalisation as a way to organise tables in an anticipated manner for the database users, such as, for example, having only contact details in a contact details table and not invoice number (Microsoft, 2023). The rules applied are called normal forms (NF) and is generally acceptable for data in a database to be organised till the 3rd Normal Form. More rules beyond the 3rd Normal Form do exist, such as the Boyce-Cood Normal Form or, as otherwise called, the 3.5 Normal Form, but are “rarely used outside the academic context” (Rouse, 2014).

 In this exercise, we are provided with an initial data table in an un-normalised form. The task is to normalise the table down to the 3rd Normal Form by showing and explaining each subsequent step (1st Normal Form, 2nd Normal Form, 3rd Normal Form).

Table

 The table headers look as per the below table and is in 0NF (Un-normalised):

Student Number Student Name Exam Score Support Date of Birth Course Name Exam Boards Teacher Name
1001 Bob Baker 78 No 25/08/2001 Computer Science
Maths
Physics
BCS
EdExcel
OCR
Mr Jones
Ms Parker
Mr Peters
1002 Sally Davies 55 Yes 02/10/1999 Maths
Biology
Music
AQA
WJEC
AQA
Mr Parker
Mr Patel
Ms Daniels
1003 Mark Hanmill 90 No 05/06/1995 Computer Science
Maths
Physics
BCS
EdExcel
OCR
Mr Jones
Ms Parker
Mr Peters
1004 Anas Ali 70 No 03/08/1990 Maths
Physics
Biology
AQA
OCR
WJEC
Ms Parker
Mr Peters
Mrs Patel
1005 Cheuk Yin 45 Yes 01/05/2022 Computer Science
Maths
Music
BCS
EdExcel
AQA
Mr Jones
Ms Parker
Ms Daniels

1NF

 A table is considered to be in 1NF, if it (Hillyer, 2005):
  1. Contains only atomic/single values in each and every cell (e.g. a cell cannot have the date of birth and date of school graduation at the same time or the address and the phone number at the same time);
  2. The data type/domain attribute needs to be of the same type for each column (e.g. a cell cannot have date values and integer or text values at the same time);
  3. Each column name should be unique and not be repeated.

The provided table presents the following issue, which violates the constraint of atomic values:
  - Columns ‘Course Name’, ‘Exam Boards’, and ‘Teacher Name’ are not atomic as they contain multiple items. For this exercise, names, we will be considered atomic values as ‘full names’.

None of the other two constraints are violated.

We can also notice that duplicate values in each cell are horizontally distributed (a duplicate value in the Course Name column corresponds to a value in the Exam Board column and the Teacher Name column).

In order to solve the issues, we split our table and reformat it by duplicating each row and keeping only the atomic values. In cells where there are no atomic values, we keep the same data:

Student Number Student Name Exam Score Support Date of Birth Course Name Exam Boards Teacher Name
1001 Bob Baker 78 No 25/08/2001 Computer Science BCS Mr Jones
1001 Bob Baker 78 No 25/08/2001 Maths EdExcel Ms Parker
1001 Bob Baker 78 No 25/08/2001 Physics CR Mr Peters
1002 Sally Davies 55 Yes 02/10/1999 Maths AQA Mr Parker
1002 Sally Davies 55 Yes 02/10/1999 Biology WJEC Mr Patel
1002 Sally Davies 55 Yes 02/10/1999 Music AQA Ms Daniels
1003 Mark Hanmill 90 No 05/06/1995 Computer Science BCS Mr Jones
1003 Mark Hanmill 90 No 05/06/1995 Maths EdExcel Ms Parker
1003 Mark Hanmill 90 No 05/06/1995 Physics OCR Mr Peters
1004 Anas Ali 70 No 03/08/1990 Maths AQA Ms Parker
1004 Anas Ali 70 No 03/08/1990 Physics OCR Mr Peters
1004 Anas Ali 70 No 03/08/1990 Biology WJEC Mrs Patel
1005 Cheuk Yin 45 Yes 01/05/2022 Computer Science BCS Mr Jones
1005 Cheuk Yin 45 Yes 01/05/2022 Maths EdExcel Ms Parker
1005 Cheuk Yin 45 Yes 01/05/2022 Music AQA Ms Daniels

 With a composite primary key in the columns Student Number and Course Name, the above table satisfies all the requirements of the 1st Normal Form; therefore, it is in 1NF. Each observation (column) may have duplicate values, but each instance (row) is unique.

2NF

 A table is considered to be in 2NF, if it (Hillyer, 2005):
  1. Is already in 1NF;
  2. Has no partial dependencies, meaning that every non-candidate key column should depend on the whole identified candidate key (potential primary key).

The provided table presents the following issues violating the constraints of partial dependency:
   - With identified multi-value candidate key of ‘Student Number’ and ‘Course Name’, we observe that the columns ‘Exam Boards’ and ‘Teacher Name’ depend only on a part of the candidate key which is the ‘Course Name’)

In order to solve the issue, we will need to split our table into two tables by adding a foreign key to connect the two tables:

Student table
With primary key the Student Number

Student Number Student Name Exam Score Support Date of Birth
1001 Bob Baker 78 No 25/08/2001
1002 Sally Davies 55 Yes 02/10/1999
1003 Mark Hanmill 90 No 05/06/1995
1004 Anas Ali 70 No 03/08/1990
1005 Cheuk Yin 45 Yes 01/05/2022

Course table
With composite primary key the Student Number and Course Name

Student Number Course Name Exam Boards Teacher Name
1001 Computer Science BCS Mr Jones
1001 Maths EdExcel Ms Parker
1001 Physics CR Mr Peters
1002 Maths AQA Mr Parker
1002 Biology WJEC Mr Patel
1002 Music AQA Ms Daniels
1003 Computer Science BCS Mr Jones
1003 Maths EdExcel Ms Parker
1003 Physics OCR Mr Peters
1004 Maths AQA Ms Parker
1004 Physics OCR Mr Peters
1004 Biology WJEC Mrs Patel
1005 Computer Science BCS Mr Jones
1005 Maths EdExcel Ms Parker
1005 Music AQA Ms Daniels

 The duplicate values have also been removed from the Student table. We can observe that the Course table has the same composite key as before, creating conflict with the 2NF requirements, as there are partial dependencies with the Teacher Name being dependant on the course name only. For this reason we split the table further:

Teacher table
With primary key the Course Name

Course Name Teacher Name
Computer Science Mr Jones
Maths Ms Parker
Physics Mr Peters
Biology Mr Patel
Music Ms Daniels

Course table
With composite primary key the Course Name and Student Number

Student Number Course Name Exam Boards
1001 Computer Science BCS
1001 Maths EdExcel
1001 Physics CR
1002 Maths AQA
1002 Biology WJEC
1002 Music AQA
1003 Computer Science BCS
1003 Maths EdExcel
1003 Physics OCR
1004 Maths AQA
1004 Physics OCR
1004 Biology WJEC
1005 Computer Science BCS
1005 Maths EdExcel
1005 Music AQA

 The final tables are the below:

Teacher table
With primary key the Course Name

Course Name Teacher Name
Computer Science Mr Jones
Maths Ms Parker
Physics Mr Peters
Biology Mr Patel
Music Ms Daniels

Course table
With composite primary key the Course Name and Student Number

Student Number Course Name Exam Boards
1001 Computer Science BCS
1001 Maths EdExcel
1001 Physics CR
1002 Maths AQA
1002 Biology WJEC
1002 Music AQA
1003 Computer Science BCS
1003 Maths EdExcel
1003 Physics OCR
1004 Maths AQA
1004 Physics OCR
1004 Biology WJEC
1005 Computer Science BCS
1005 Maths EdExcel
1005 Music AQA

Student table
With primary key the Student Number

Student Number Student Name Exam Score Support Date of Birth
1001 Bob Baker 78 No 25/08/2001
1002 Sally Davies 55 Yes 02/10/1999
1003 Mark Hanmill 90 No 05/06/1995
1004 Anas Ali 70 No 03/08/1990
1005 Cheuk Yin 45 Yes 01/05/2022

 The above tables satisfy all the requirements of the 2nd Normal Form therefore they are in 2NF.

3NF

 For a table to be considered in 3NF it needs to (Hillyer, 2005):
  1. Be already in 2NF;
  2. Have no transitive dependencies, meaning that every non key column should not depend on non-key columns.

Due to the way that 2NF was achieved, all the tables already satisfy the 3rd Normal Form, and therefore they are in 3NF.

The final tables are:

Teacher table
With primary key the Course Name

Course Name Teacher Name
Computer Science Mr Jones
Maths Ms Parker
Physics Mr Peters
Biology Mr Patel
Music Ms Daniels

Course table
With composite primary key the Course Name and Student Number

Student Number Course Name Exam Boards
1001 Computer Science BCS
1001 Maths EdExcel
1001 Physics CR
1002 Maths AQA
1002 Biology WJEC
1002 Music AQA
1003 Computer Science BCS
1003 Maths EdExcel
1003 Physics OCR
1004 Maths AQA
1004 Physics OCR
1004 Biology WJEC
1005 Computer Science BCS
1005 Maths EdExcel
1005 Music AQA

Student table
With primary key the Student Number

Student Number Student Name Exam Score Support Date of Birth
1001 Bob Baker 78 No 25/08/2001
1002 Sally Davies 55 Yes 02/10/1999
1003 Mark Hanmill 90 No 05/06/1995
1004 Anas Ali 70 No 03/08/1990
1005 Cheuk Yin 45 Yes 01/05/2022

Back to contents

Babar, Z. (n.d.) What is BCNF?. Available from: https://www.educative.io/answers/what-is-bcnf. [Accessed 28 April 2023].

Hillyer, M. (2005) An Introduction to Database Normalization. Available from: https://users.dcc.uchile.cl/~mnmonsal/cc42a/guias/intronorm.pdf. [Accessed 28 April 2023].

Microsoft. (2023) Description of the database normalization basics. Available from: https://learn.microsoft.com/en-us/office/troubleshoot/access/database-normalization-description. [Accessed 28 Aprik 2023].

Rouse, M. (2014) What Does Fourth Normal Form Mean?. Available from: https://www.techopedia.com/definition/19453/fourth-normal-form-4nf. [Accessed 28 April 2023].


Data Build Task

 The data build task requires to create the actual relational database based on the tables of the Normalisation Task. The database should have the proper schema created, with linked tables mapped through the foreign keys, and primary key properly recorded. The database should enforce referential integrity.

Normalisation

 The tables that we start with from the Normalisation Task are the below:

Teacher table
With primary key the Course Name

Course Name Teacher Name
Computer Science Mr Jones
Maths Ms Parker
Physics Mr Peters
Biology Mr Patel
Music Ms Daniels

Course table
With composite primary key the Course Name and Student Number

Student Number Course Name Exam Boards
1001 Computer Science BCS
1001 Maths EdExcel
1001 Physics CR
1002 Maths AQA
1002 Biology WJEC
1002 Music AQA
1003 Computer Science BCS
1003 Maths EdExcel
1003 Physics OCR
1004 Maths AQA
1004 Physics OCR
1004 Biology WJEC
1005 Computer Science BCS
1005 Maths EdExcel
1005 Music AQA

Student table
With primary key the Student Number

Student Number Student Name Exam Score Support Date of Birth
1001 Bob Baker 78 No 25/08/2001
1002 Sally Davies 55 Yes 02/10/1999
1003 Mark Hanmill 90 No 05/06/1995
1004 Anas Ali 70 No 03/08/1990
1005 Cheuk Yin 45 Yes 01/05/2022

Even though the database was adequately normalized we have also added an additional table name Exam Board, in order to ensure referential integrity. There was no need for an additional course table listing only the courses as the Teacher table fulfils that purpose.

Database Creation

 The database tables are in the below format:

Schema

The tables are created in SSMS (SQL Server). The script can be found here.

Data Insertion

In order to check that our tables work as intended the data were added in the tables through this script.

Referential Integrity

 The following checks have been performed in the created table, in order to ensure that referential integrity is properly enforced:
  1. Ensuring the no duplicate teachers can be added in the Teacher table;
  2. Ensuring that no duplicate students can be added in the Student table;
  3. Ensuring that no duplicate exam boards can be added in the Exam Board table;
  4. Ensuring that no duplicate combinations of student number and course name can be added in the courses table;
  5. Ensuring the added row in the courses table represents existing students (student table), courses (teachers table) and exam boards (exam board table);

The script used for the testing can be found here.

Final Notes and Working Theory

 The final notes and the working theory is the following:
  - Additional tables created are to ensure referential integrity (e.g. exam board);
  - It is good practice to create global unique identifiers (UUIDs/GUIDs) in order to ensure uniqueness of primary IDs. This would also avoid the presence of composite keys. Though these have not been created as we followed the exact columns provided in the exercise;
  - Normalization may be performed in different ways and results between two database designers may differ;
  - To ensure referential integrity apart from creating the primary and secondary keys we could also follow a different approach where the respective record in created automatically if it is missing on a table, during the insert queries. For example, in our design if we try to add a new student with a course under the table courses the query will return an error as the new student needs to be created in the student table first. If we create a trigger then we can ask the database to create the new student (providing some additional details), instead of returning the error.

Back to contents

L. Finžgar and M. Trebar, "Use of NFC and QR code identification in an electronic ticket system for public transport," SoftCOM 2011, 19th International Conference on Software, Telecommunications and Computer Networks, 2011, pp. 1-6.

Transport for Wales. (2022) Transport for Wales Rail Service Quality Report. Available from: https://tfw.wales/about-us/transparency/publications/transport-for-wales-rail-service-quality-report-2021-22. [Accessed 2 December 2022].

Connolly, T., Begg, C., Begg, C., & Connolly, T., (2014), Database Systems: a Practical Approach to Design, Implementation, and Management. Global E ed. Harlow: Pearson Education Limited. Available from: ProQuest Ebook Central. [3 December 2022].

Information Commissioner's Office. (n.d) The principles. Available from: https://ico.org.uk/for-organisations/guide-to-data-protection/guide-to-the-general-data-protection-regulation-gdpr/principles/. [Accessed 3 December 2022].

Transport for Wales. (n.d.) Transport for Wales. Available from: https://tfw.wales/about-us/our-story. [Accessed 3 December 2022].

European Data Protection Supervisor, (n.d.) Data Protection Officer (DPO). Available from: https://edps.europa.eu/data-protection/data-protection/reference-library/data-protection-officer-dpo_en. [Accessed 4 December 2022].

Data Culture: What It Is and How To Make It Work, (Southekal, P), 2022, Available at: https://www.forbes.com/sites/forbestechcouncil/2022/06/27/data-culture-what-it-is-and-how-to-make-it-work/?sh=2307330b2096. [Accessed 4 December 2022].