Normalization is one of the key concepts in designing a good relational database model. There is quite a lot of theory behind this relatively simple concept, as the Wikipedia article shows.
To make a long story short, you have to evaluate the dependencies between different attributes (for instance, all attributes in a table depend on the primary key, but there can be many more dependencies among the attributes).
These dependencies determine the degree of normalization of the model, going from first normal form up to the sixth. First normal form essentially requires a primary key and atomic (single-valued) attributes, so nearly everything you do in a relational database will be in first normal form.
There are, however, databases that violate first normal form, such as non-relational ones. These are said to be in non-first normal form, and their distinguishing trait is allowing multi-valued attributes. For instance, imagine items that come in several colours:
| Item | Colours |
|------|---------|
| Doll | Pink, Red, White |
| Action Figure | Blue, Black, Brown |
This case would be solved in a relational database by creating an ItemColours table with one row per item-colour pair (or, even better, using their keys). But non-relational databases, such as SimpleDb, can solve this in a single table (or domain, as they call it).
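The relational split can be sketched with plain pairs (a Python sketch of the idea; the table and item names are just illustrative):

```python
# Relational approach: one row per (item, colour) pair in a separate
# ItemColours table, so every column holds a single value.
item_colours = [
    ("Doll", "Pink"), ("Doll", "Red"), ("Doll", "White"),
    ("Action Figure", "Blue"), ("Action Figure", "Black"),
    ("Action Figure", "Brown"),
]

# Recovering all colours of one item means filtering (or joining) rows:
doll_colours = [colour for item, colour in item_colours if item == "Doll"]
print(doll_colours)  # ['Pink', 'Red', 'White']
```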
What's more, SimpleDb doesn't even require you to define the columns for each domain, as each item inserted can have its own attributes. Therefore, the domain ends up being a collection of items, each item being a set of multi-valued attributes.
| Item | Attributes |
|------|------------|
| John Doe | Age=20; Phone=555,556,557; Car=Escort,Mondeo |
| Alice Doe | Age=30; Phone=333; Book=DaVinci |
This leads to some very interesting schemas, which may not seem natural to someone used to a standard SQL database.
What's most interesting about this is the query language. The fact that attributes are multi-valued leads to some unexpected results, such as 'not a = b' having different semantics from 'a != b'. Let's see why by analyzing the SimpleDb query language.
Without diving into the grammar definition, a SimpleDb query can be represented by something like:
```
not? ['att1' comp 'val1' op 'att1' comp 'val2' op ...]
  union|intersection
not? ['att2' comp 'val1' op 'att2' comp 'val2' op ...]
  union|intersection ...
```
That is, the query is composed of predicates (a predicate is anything between square brackets) joined by set operations, union or intersection, with the possibility of negating each predicate.
All comparisons within a single predicate must be made against a single attribute, and the comparisons may be equals, not equals, greater than, and so on. The boolean operators joining them are the usual and|or.
So, if in our example we want all items (well, they are people, but they won't mind if we treat them like items, will they?) aged over 25, we just query ['Age' > '25']. Range query? ['Age' > '25' and 'Age' < '35']. Multi-predicate query? ['Age' > '25' and 'Age' < '35'] intersection ['Book' starts-with 'Da']. So far so good.
Now to the mean cases. In multi-predicate queries, the engine evaluates each predicate against all of the values of the corresponding attribute, and includes the item if any single value satisfies the whole predicate.
Let's suppose we have a numeric attribute called Foo (forgetting for now that SimpleDb does not have typed attributes and treats them all as strings), and the following items (D's exact value is not important, as long as it stays outside the ranges queried below):

| Item | Foo |
|------|-----|
| A | 25 |
| B | 25, 35 |
| C | 10, 40 |
| D | 50 |
| E | (no value) |
The query ['Foo' > '20' and 'Foo' < '30'] will return both A and B: B's value 25 makes the predicate true, and since any matching value suffices, B is included.
Now, if we instead run ['Foo' > '20'] intersection ['Foo' < '30'], it will return A, B and C. The first predicate matches C's 40 and the second matches C's 10, so C is included. Since the conditions live in different predicates (even though they refer to the same attribute), they are not required to hold for the same value.
Therefore ['Foo' > '20' and 'Foo' < '30'] is not the same as ['Foo' > '20'] intersection ['Foo' < '30'].
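The difference can be simulated in a few lines of Python (a toy model of the evaluation rule described above, using the example values of A, B and C; numbers instead of strings for readability):

```python
# Toy model of SimpleDb's evaluation rule: a predicate holds for an
# item if ANY single value of the attribute satisfies ALL of the
# predicate's comparisons.
items = {
    "A": {25},
    "B": {25, 35},
    "C": {10, 40},
}

def predicate(values, *conditions):
    """True if some single value satisfies every condition."""
    return any(all(cond(v) for cond in conditions) for v in values)

# ['Foo' > '20' and 'Foo' < '30'] -- one predicate, so both
# comparisons must hold for the SAME value:
single = {name for name, vals in items.items()
          if predicate(vals, lambda v: v > 20, lambda v: v < 30)}

# ['Foo' > '20'] intersection ['Foo' < '30'] -- each predicate may be
# satisfied by a DIFFERENT value:
inter = ({name for name, vals in items.items()
          if predicate(vals, lambda v: v > 20)}
         & {name for name, vals in items.items()
            if predicate(vals, lambda v: v < 30)})

print(sorted(single))  # ['A', 'B']
print(sorted(inter))   # ['A', 'B', 'C'] -- C sneaks in via 40 and 10
```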
An excellent example of this difference appears in the Query 101 article. Consider ['Foo' = '25' and 'Foo' = '35']. This will never return anything, no matter the dataset, since it asks for items having a single value of Foo that is simultaneously 25 and 35.
On the other hand, ['Foo' = '25'] intersection ['Foo' = '35'], will return B.
Now to the initial case: not equals. Take the query ['Foo' != '25']. It will clearly not return A. However, it will return B, because when the predicate is evaluated against the value 35 it is true, so B makes it into the result.
The query not ['Foo' = '25'] will have the expected behaviour, and return items C and D. It will also return E, because E has no value defined for Foo, so any comparison over Foo will be false, and its negation, true.
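The same toy model explains both behaviours (again a sketch; D's value of 50 is an arbitrary choice, as any value other than 25 behaves the same here):

```python
# '!=' inside a predicate vs. negating the whole predicate.
items = {
    "A": {25},
    "B": {25, 35},
    "C": {10, 40},
    "D": {50},
    "E": set(),   # E has no Foo attribute at all
}

# ['Foo' != '25']: an item matches if ANY of its values differs from 25.
neq = {name for name, vals in items.items()
       if any(v != 25 for v in vals)}

# not ['Foo' = '25']: an item matches if NO value equals 25 --
# including items with no Foo value, where the inner predicate is
# vacuously false and its negation true.
negated = {name for name, vals in items.items()
           if not any(v == 25 for v in vals)}

print(sorted(neq))      # ['B', 'C', 'D'] -- B sneaks in via its 35
print(sorted(negated))  # ['C', 'D', 'E']
```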
Manipulating data is also non-trivial. Whenever you update an item, you must specify whether the new values should be added to the existing ones or replace them.
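A minimal sketch of those two update modes (put_attribute here is an illustrative helper, not SimpleDb's real API):

```python
# 'replace' overwrites the attribute's value set;
# 'add' merges the new values into the existing ones.
item = {"Colour": {"Pink", "Red"}}

def put_attribute(item, name, values, replace):
    if replace:
        item[name] = set(values)
    else:
        item.setdefault(name, set()).update(values)

put_attribute(item, "Colour", {"White"}, replace=False)
print(sorted(item["Colour"]))  # ['Pink', 'Red', 'White']

put_attribute(item, "Colour", {"Blue"}, replace=True)
print(sorted(item["Colour"]))  # ['Blue']
```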
To sum up, working with denormalized data may be appealing during the modeling process thanks to its flexibility, but it requires extra care when querying and manipulating the data.
Nevertheless, bear in mind that dropping normalization enforcement frees up a lot of database resources, which (supposedly) allows for much greater scalability and the handling of huge amounts of data. That trade-off is what you should weigh when choosing between a relational and a completely denormalized database, since it will have the greatest impact on your users.
Remember that unnatural queries are much easier to deal with than a crippled database already deployed to production.