Execution time for a self-join of a 50-million-row table

58 Views Asked by Vitit Kantabutra At 24 December 2022 at 22:34

This concerns the benchmark results on page 11, Chapter 1 of the book entitled, Neo4j in Action. The authors claimed that this self-join takes 0.016 seconds on a "commodity" computer (laptop) with 8GB RAM and an Intel i7 Core. How would that be possible? The size of the join is 2.5e15 rows. Each row has not only numbers but names and other information. However, to be too "optimistic," let's say each row just has three numbers, namely, the persons' identifications. Suppose the laptop has 8 cores running at 4GHz! Suppose further that all cores can and do write the results all at the same time, in just a single clock cycle. So, in 0.016s, you can write 0.016 times 8 times 4e9 = 5.12e8 words, far fewer than the 7.5e15 needed.

I expected the execution time to be a lot greater than what was tabulated.

Here's the code, from Aleksa Vukotic and Nicki Watt, with Tareq Abedrabbo, Dominic Fox, and Jonas Partner, Neo4j in Action, Manning Publications Co., Shelter Island, NY, USA, 2015.

create table t_user (
id bigint not null,
name varchar(255) not null,
primary key (id)
);

create table t_user_friend (
id bigint not null,
user_1 bigint not null,
user_2 bigint not null,
primary key (id)
);

alter table t_user_friend
add index FK416055ABC6132571 (user_1),
add constraint FK416055ABC6132571
foreign key (user_1) references t_user (id);

alter table t_user_friend
add index FK416055ABC6132572 (user_2),
add constraint FK416055ABC6132572
foreign key (user_2) references t_user (id);

Size of t_user_friend table: 50 million rows.

How about finding all friends of a user’s friends? This time you’d typically join the t_user_friend table with itself before querying:

select count(distinct uf2.*) from t_user_friend uf1 ➥ inner join t_user_friend uf2 on uf1.user_1 = uf2.user_2 ➥ where uf1.user_1 = ?

My question: Shouldn't the condition be "on uf1.user_2 = uf2.user_1? Thanks!

Original Q&A

There are 1 best solutions below

Bill Karwin On 24 December 2022 at 22:41

You didn't include in your question the query, or the table definition, or the EXPLAIN report for that query, so I can't answer specifically for the query you are referring to.

But I can answer generally.

The strategy to optimizing SQL queries is: don't read the whole table on every query.

If you look up a phone number in a telephone book, you don't read the whole book, right? You search only for the row for the person whose number you need. The fact that the telephone book is sorted alphabetically makes finding their name pretty quick. You don't have to read the whole book.

Likewise, once you find their name and number, if you needed to find related entries, you would use an index — that is, a pre-sorted list of values — to make that lookup more efficient too.

Execution time for a self-join of a 50-million-row table

There are 1 best solutions below

Related Questions in MYSQL

Related Questions in DATABASE

Related Questions in JOIN

Related Questions in BENCHMARKING

Related Questions in RELATIONAL

Trending Questions

Popular # Hahtags

Popular Questions