In order to store the country information for a person I did:
CREATE TABLE test
(
id INT IDENTITY(1, 1),
name VARCHAR(100) NOT NULL,
country VARCHAR(100) NOT NULL,
PRIMARY KEY(id)
);
INSERT INTO test
VALUES ('Amy', 'Mexico'),
('Tom', 'US'),
('Mark', 'Morocco'),
('Izzy', 'Mexico');
-- milions of other rows
A lot of the countries will repeat themselves in the country column.
Another option would be to get the country in it's own table and reference the country_id as a FK in the test table:
CREATE TABLE countries
(
id INT IDENTITY(1, 1),
name VARCHAR(100) NOT NULL,
PRIMARY KEY(id)
);
CREATE TABLE test
(
id INT IDENTITY(1, 1),
name VARCHAR(100) NOT NULL,
country_id INT NOT NULL,
PRIMARY KEY(id),
FOREIGN KEY(country_id) REFERENCES countries(id)
);
My question is: is there benefit of doing the second scenario from performance point of view/ indexes point of view or it's just cumbersome to do so? ( I know I am not breaking any normal form with the first scenario)
The second version has an obvious performance benefit, namely that only multiple country IDs need be stored for each person-country relationship. This, in turn, means that your storage requirements for the tables and indices would be reduced.
Because the index of the second version would use the integer country ID rather than a string name, I would expect index performance to improve. Your database doesn't "know" that there are only a fixed number of countries. So, the index for the first version would be a B-tree splitting across text, rather than integers. And the former is more verbose than the latter.