FIT1004 / FIT2010 - Databases
This page contains additional learning material, tips and examples for unit FIT1004/FIT2010 (Databases) at Monash.
FIT1004/2010 is an introductory unit on databases at Monash University directed primarily towards Bachelor of Business Information Systems students. The FIT1004 and FIT2010 units are largely the same, with FIT1004 being offered for first year students, and FIT2010 for second year students.
If you are one of my students and have any questions, do not hesitate to get in touch. You may also be interested in more relevant student resources.
Course summary
Date
Semester 1, 2009.
Details
(For more details, see unit information on the university website, if still available.)
This unit provides an introduction to the concepts of database management. This includes planning, designing, using and implementing a data model using an enterprise-scale relational database system. Methods and techniques are presented to populate, retrieve, update and implement integrity features on data in the implemented database system.
At the completion of this unit the students will have a basic knowledge and understanding of:
- the major objectives of database technology;
- the relational model for databases and competing models;
- the phases of the database development life cycle and their correspondence to the phases of the system development lifecycle;
- the techniques and tools to design and implement a database suitable for an information system;
- a database retrieval and manipulation language (SQL);
- methods that can be put in place to permit efficient operation of a database;
- the role of a database administrator.
Course material
Most of the course material is available only through the university's online content provision system "Moodle". Unfortunately, I cannot make this material available here, as it is likely that someone would then cause me trouble over copyright issues or something similar (see my disclaimer). Here I am making a few selected materials available and provide some information on issues that frequently arise in the labs.
Represent currency amounts as integers
Good practice: Never represent currency amounts as floating point numbers specifying dollar amounts. Instead, use integer numbers to specify cent amounts.
If you are taking this unit you should know a little bit about how numbers are represented inside of computers using binary notation (if you don't, you can have a look at this Wikipedia article, or at this excellent guide on floating point arithmetic). Now imagine that you pay 0.10 dollars for a song that you downloaded from the internet, but someone deducts 0.11 dollars from your account. Seems like a small loss, but now imagine you are a company that buys millions of separate items at 0.10 dollars each. Then your losses can become quite significant. Ant yet, if software developers do not pay attention, this kind of errors can happen very easily. The decimal number 0.1 cannot be represented exactly in binary notation, but is approximately 1.10011001100110011001101 × 2-4. In fact, a similar problem will occur with any rational number of which the denominator is not a power of 10 Floating point representations were designed with scientific calculations in mind, and errors such as these are not of crucial importance there, particularly in the face of the fact that numerical calculations are inherently approximate. However, for financial calculations, the exact representation of monetary amounts is crucial.
Some databases have a dedicated data type currency that should be used to represent monetary amounts, but many do not. Many databases have a special data type for representing rational numbers in decimal rather than in binary notation (by saving an integer number and a position of the decimal point within that number). Such decimals are suitable for representing monetary amounts, but using this data type may make database software less portable as they are not present on all DBMSs. Besides, a developer must be sure that exact representation is used internally before committing to such data type.
A common good practice used to avoid the rounding problem with monetary amounts is to employ integers to represent cents instead of rational numbers to represent dollars. For instance, instead of saving that something costs 15.10 dollars, we save that it costs 1510 cents.
This approach easily covers the vast majority of cases, but some specific cases require a little extra planning. For example, the price of petrol at a petrol station may be not 115 cents, but 114.9 cents. When designing a database for a petrol station we may choose to save not cent amounts, but 1/10 cent amounts instead. So, instead of 15.10 dollars we would save 15100 "tenths of a cent". When doing business analysis prior to building a database system, we need to determine what are the smallest monetary fractions that the users will be dealing with. We also need to consider issues such as rounding. For instance, consider a business that sells sugar. A ton of sugar may cost any whole sent amount. But if clients can be expected to order sugar by kilograms (1t = 1000kg), it may be necessary to design the database such that monetary amounts represent 10-3-fractions of a cent, or possibly even 10-4-fractions, depending on what approach to rounding is taken.
As a bottom line you should remember: never represent monetary amounts simply as float or double values. Always remember the representation/rounding issue and if a specialised data type is not available or you are not sure, use integer values to save cent amounts or fractions thereof.
Verbose identifiers
I am sure students have been told many times over, but it seems this cannot be repeated too often: Always use verbose identifiers!
If you have an object/table/entity customer quote that is associated with an object/table/entity employee through two different relations, such as processing customer account manager and worker who performs the service, do *not* call the corresponding fields employeeID1 and employeeID2. Call them accManagerEmplID and workerEmplID, or something similar!
Do not call a field number if you can call it projectTypeNumber. Do not call a variable x or s if you can call it salary or (even better) employeeSalary. Do use verbose identifiers!
And as you do so, make sure to follow the naming conventions accepted in the area in which you are working. Such conventions can be quite different. E.g. in modern Java programming you would use something like variableName or ClassName, while depending on the particular database, the convention may be to use field_name and TABLE_NAME. In any case, whichever naming convention you use, make sure to be consistent!
Consistent identifiers
And here is another advice on naming conventions:
Always use the same conventions when naming identifiers!
I decided to add this section when reading the problem statement for the 2nd assignment which had a few very unfortunate violations of this important good practice rule.
Say, you have a variable (field/column/...) that stores the ID number of an employee and you name it EMPLOYEE_NUMB. Say, you also have another variable that stores the ID number of a quote and you name it QUOTE_NUM. Another variable storing, say, a vehicle ID number may be called VEHICLE_NO. Now imagine working with these variables! Every time you refer to something you need to remember whether it was ..._NO, ..._NUM or ..._NUMB. With 3 different things this may seem feasible, but what happens when you work on a larger system?
So, besides using a verbose identifiers and a consistent approach to their syntax (see above), we need to use a consistent approach to using abbreviations. If you always abbreviate the same thing in the same way (e.g. "number" as NUM), it will be easy to remember the correct variable names without having to look things up. This good practice rule should be applied in many different situations. Another example is the ordering of semantic identifier parts. For example, say you have a variable called countEdges and another one called vertexCount. I guarantee you will waste a lot of time and cause a lot of frustration by incorrectly writing edgeCount or countVertices. Don't even go there. Be consistent and use vertexCount and edgeCount (or countEdges and countVertices) from the start on and avoid lots of trouble.
Normal forms
In my opinion, normalisation along with basic SQL are the most important lessons to learn from the entire unit. Whether you are a developer, a consultant, or an analyst, you will not be able to do any good work with databases if you do not know what normalisation is and how to do it. Besides, it is very simple, so make sure to learn it.
There is plenty of information about normalisation in the text book as well as on the web. A simple search will reveal a wealth of guides and tutorials. If you are not sure where to start, Wikipedia is always a good bet. I will not reiterate definitions and guidelines here, but I would like to add a few comments that may be useful.
The unit syllabus does not not teach all of the important normal forms (NFs). The exact number of NFs and their definitions depend on particular sources, but the gist is always the same and it is the gist that is important. For instance, Wikipedia lists 8 NFs (1st NF, 2nd NF, 3rd NF, Boyce-Codd NF, 4th NF, 5th NF, domain/key NF and 6th NF). While only the first three NFs are examinable for the FIT1004 unit, I strongly advise to have a brief look at the other ones. You may not want to spend too much time learning the non-examinable forms in detail (I don't realistically expect many 1st year students would learn non-examinable stuff), but if you get the gist of the higher NFs, it will make it a lot easier for you to learn, understand and use the first 3 NFs. The reason for this is that the all follow the same idea and once you get the idea, normalisation will become very intuitive and easy. Trust me. Give it a go.
Here is the gist in short:
The first NF ensures that the data is properly structured by requiring unique rows with primary keys and atomic attribute values. It also requires that tables describe well defined entities (or relationships) by requiring that everything depends on primary keys.
The next NFs ensure that everything depends on the primary key, the whole primary key, and nothing but the primary key. Simple.
The higher order NFs ensure that tables do not bring together independent relations that should be kept separate, as well as that further semantic constrains are satisfied.
Sample solution for assignment 1
I provide a sample solution for assignment 1, parts A and B as it was set for the FIT1004 students (note that the FIT2010 requirements are slightly different). The solution is available as a PDF document at the bottom of this page. Note that this is an example solution, not a model solution. In particular, there is no claim that this solution would get full marks, or that it is based on the marking scheme. The assignment, in particular part B, has many different correct solutions. Students may consult this example in order to learn how I would approach the assignment, but not to draw any conclusions about what design features are "required" or "wrong".
Here are a few comments that come to my mind after marking quite a large number of assignment 1B papers submitted by students. These comments are by no means comprehensive, but I hope this helps:
- Dimension is not atomic. The field
container_dimensionsis clearly not atomic. What data type would you give to a value like "6.05x2.44m"? How would you determine an answer to a customer question such as "I have a sofa that is 4 metres long and 3 metres wide. Do you have a container that fits it?" Splitdimensionsinto its two atomic components -lengthandwidth. - Descriptive names. Use them! See above.
- Do not use
floatordoubletypes for currency amounts. Useinteger. It is curious how few students got this right considering that I discussed this in the labs. See above. - Do not restrict
varcharfields to be too short. A type specification likechar(255)means that exactly 255 characters are to be reserved for the field. On the other hand,varchar(255)means that the field has a variable length of 0 to 255 characters. I.e. if a field value is shorter, only as much memory as necessary is used.char-strings are quicker to process, whilevarchar-strings are more memory efficient. There is no need to restrict them to a small maximum length. In most cases where students usevarchar(20)(e.g. for address fields),varchar(255)would be more appropriate. - Do not use
varcharvariables for primary keys. As explained in the previous point,varcharvariables are strings of variable length. Often quite a large length. As database rows are often accessed via the primary keys, one needs to make sure that whatever is chosen for primary keys should allow for fast indexing, hashing and searching. Performing these operations on variable-length strings is inefficient. It is thus good practice to provide synthetic integer primary keys in places where avarcharprimary key may otherwise seem appropriate. For instance, instead oftruckMakeandtruckModel(bothvarcharvariables that may be used as a composite primary key in theTruckTypetable) one can introduce a synthetic primary keytruckTypeIDof typeinteger.
Feedback & sample solution for assignment 2
Some general feedback for the assignment 2 is listed below. I also provide a sample solution for that assignment. The solution is available as a PDF document at the bottom of this page. Note that this is an example solution, not a model solution. In particular, there is no claim that this solution would get full marks, or that it is based on the marking scheme. Several questions have many different correct solutions. Students may consult this example in order to learn how I would approach the assignment, but not to draw any conclusions about what answers are "right" or "wrong".
Some selected feedback on the assignment:
- I used to provide some feedback, tips and tricks, and good practice suggestions for selected tasks from the assignment.
According to the feedback of some students, this was a useful resource as it provided some information not otherwise available in the official course materials. Unfortunately I was asked by important people who I do not directly work with to remove this section from on-line access. Maybe future assignments are planned to be too similar, or maybe this is yet another case of prevalence of formalism and bureaucracy over common sense that is so typical for the Faculty of Information Technology at Monash. In any case, this is sad for the interested students who would benefit from the extra materials.
I was initially reluctant to take the info down (I do not agree with the faculty on many things, but this is not the right place to discuss those issues). However, the unit lecturer at Clayton – a person who I greatly respect as a teacher and as a colleague – seconded the request to remove the feedback, and I decided to do it. Sorry guys; all I can say – go back to the source! ;)