What did I misunderstand about inner join meaning for wrds database?

103 Views Asked by At

I learned join methods in sql, and I know that inner join means returning only the intersections of the two different tables that we want to set.

I thought for python the concept is same. But I have problem understanding the certain code.

crsp1=pd.merge(crsp, crsp_maxme, how='inner', on=['jdate','permco','me'])

crsp1=crsp1.drop(['me'], axis=1)

crsp2=pd.merge(crsp1, crsp_summe, how='inner', on=['jdate','permco'])

If I understood correctly, the first line merges table crsp and crsp_maxme with intersection on column 'jdate', 'permco', 'me'. So the table crsp1 would have 3 columns. The second line drops the 'me' column of table crsp1. The last lien would merge newly adjusted table crsp1 and crsp_summe with inner join, with intersection on 'jdate' and 'permco'. Which makes newly merged table crsp2 only having 2 columns.

However, the code explanation from line 2 says that the second and third lines drop 'me' column from crsp1 and then replace it with 'me' from crsp_summe table, which I had problem understanding.

Could anyone clarify these lines for me?

PS: I thought it isn't necessary to explain what the table crsp, crsp_summe, and crsp_maxme since they are all framed by inner join function. So please excuse the lack of background info.

1

There are 1 best solutions below

0
On BEST ANSWER

The merge() functions on parameter specifies on what columns you want to make joins. how specifies what type of join you want to apply (similar to sql joins as outer, inner, left, right etc.).

Ex: suppose there are two tables A and B containing columns as A['x1','x2','x3'] and B['x2','y1'] so joining them based on 'x1' (as it is common column in both table) would produce A_join_B_on_x1['A_B_x1','A_x2','A_x3','B_y1'] and the join will based on how you want to join.

in your current code consider,

A = crsp1
B = crsp_maxme
C = crsp_summe

Now in your program your first line merges your A,B on ['jdate','permco','me'] columns and creates a new dataframe A_B containing ['jdate','permco','me',...'+columns_from_both_tables(A)(B)'] as inner join (i.e rows which are common in both A,B based on ['jdate','permco','me'] columns)

enter image description here

second line drops 'me' column from A_B dataframe. so it will be something like ['jdate','permco',...'+columns_from_both_tables(A)(B)']

enter image description here

third line merges your A_B,C on ['jdate','permco'] and creates ['jdate','permco',...'+columns_from_both_tables(A_B)(C)'] as inner join (i.e rows which are common in both A_B,C based on ['jdate','permco','me'] columns)

enter image description here