How can I visualize word embeddings extracted from pretrained GloVe vectors using t-SNE?


I'm studying word embeddings with GloVe. For my first attempt I used the pretrained GloVe file glove.6B.50d.txt.

(train.csv)

Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
BMW,1 Series M,2011,premium unleaded (required),335,6,MANUAL,rear wheel drive,2,Factory Tuner,Luxury,High-Performance,Compact,Coupe,26,19,3916,46135
BMW,1 Series,2011,premium unleaded (required),300,6,MANUAL,rear wheel drive,2,Luxury,Performance,Compact,Convertible,28,19,3916,40650
BMW,1 Series,2011,premium unleaded (required),300,6,MANUAL,rear wheel drive,2,Luxury,High-Performance,Compact,Coupe,28,20,3916,36350
Audi,100,1992,regular unleaded,172,6,MANUAL,front wheel drive,4,Luxury,Midsize,Sedan,24,17,3105,2000
Audi,100,1992,regular unleaded,172,6,MANUAL,front wheel drive,4,Luxury,Midsize,Sedan,24,17,3105,2000
Audi,100,1992,regular unleaded,172,6,AUTOMATIC,all wheel drive,4,Luxury,Midsize,Wagon,20,16,3105,2000
Audi,100,1992,regular unleaded,172,6,MANUAL,front wheel drive,4,Luxury,Midsize,Sedan,24,17,3105,2000
Audi,100,1992,regular unleaded,172,6,MANUAL,all wheel drive,4,Luxury,Midsize,Sedan,21,16,3105,2000

I combined each row read from the dataset (train.csv) into a single string with the twoone function below.

X = df.iloc[:,1:].values
def twoone(list1):
    list2 = []
    a = ""
    for x in range(len(list1)):
        for y in range(len(list1[x])):
            # Values equal to the first cell of row 0 or row 1 are appended without a leading space.
            if list1[0][0] == list1[x][y]:
                a += str(list1[x][y])
            elif list1[1][0] == list1[x][y]:
                a += str(list1[x][y])
            else:
                a += " " + str(list1[x][y])
        list2.append(a)
        a = ""
    return list2
X = twoone(X)
print(X)

print(len(X))
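A minimal sketch of a simpler way to build one string per row, assuming the goal of twoone is just to join every cell of a row with single spaces. The name rows_to_sentences and the sample rows are hypothetical and only illustrate the idea; this does not reproduce the special-casing in twoone, which drops the space before any value equal to the first cell of row 0 or row 1.

```python
def rows_to_sentences(rows):
    """Join the cells of each row into one space-separated string."""
    return [" ".join(str(value) for value in row) for row in rows]


# Hypothetical sample shaped like df.iloc[:, 1:].values (mixed strings and numbers).
sample = [["1 Series", 2011, "MANUAL"], ["100", 1992, "AUTOMATIC"]]
sentences = rows_to_sentences(sample)
print(sentences)
print(len(sentences))
```

Because str.join only puts separators between items, no check is needed to avoid a leading space.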

I looked up each value in the embedding index and stored the resulting vectors in an embedding array. I got an error because some values are not in glove.6B.50d.txt. To work around this temporarily, I entered the missing values manually. Since I could not do this for every line, I built the embeddings manually for only 2 lines (Audi and BMW). I also took the embedding vectors corresponding to just BMW and Audi and printed them.

Embedding indexed single line (50d):

[-0.79954, 1.32006, -0.058246, 3.9524, 0.83058, -1.4129, 0.51006, -0.90706, -0.103168, -0.8644, 0.14027, 1.14064, 0.26346, -1.41698, -0.22546, 0.041738, -0.51298, 0.156538, 0.89884, 2.7938, -0.54082, -0.0642, 2.0558, -1.21382, 1.16802, -1.2238, 0.078408, -0.140382, -2.644, -2.2578, 1.46556, 0.65876, 1.59616, 2.2354, -1.2485, -0.49032, -1.32034, 0.71436, 0.65634, 0.41044, 0.63574, 0.39114, -0.21028, -0.45792, 0.52182, -2.3596, -0.89312, -0.54108, 1.46664, -0.40282]
bmw_list = list()
for i in range(len(bmw)-1):
  bmw_satir = list()
  for j in range(len(bmw[0])):
    toplam = bmw[i][j] + bmw[i+1][j]
    bmw_satir.append(toplam)
bmw_list.extend(bmw_satir)
print(bmw_list)

a = []
a.append(audi_list)
a.append(bmw_list)
a.append(bmw_orj)
a.append(audi_orj)

Since I cannot do this for all lines, I wrote the code above separately for the two examples.
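One way this could be generalized to every line is sketched below, assuming a line's vector is the sum of the vectors of its words and that words missing from the GloVe file are simply skipped instead of being entered by hand. The names load_glove and line_vector, and the tiny 2-dimensional toy index, are hypothetical stand-ins; the real file would give 50 values per word.

```python
def load_glove(lines):
    """Parse GloVe text lines ('word v1 v2 ...') into a {word: [floats]} dict."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        index[parts[0]] = [float(v) for v in parts[1:]]
    return index


def line_vector(text, index, dim):
    """Sum the vectors of all known words in text; unknown words are skipped."""
    total = [0.0] * dim
    for word in text.lower().split():  # the glove.6B vocabulary is lowercase
        vec = index.get(word)
        if vec is None:
            continue  # out-of-vocabulary word: no KeyError, no manual entry
        total = [t + v for t, v in zip(total, vec)]
    return total


# Toy stand-in for the real file (assumption: 2 dimensions instead of 50).
toy_lines = ["bmw 1.0 2.0", "luxury 0.5 0.5", "audi -1.0 0.0"]
index = load_glove(toy_lines)
vectors = [line_vector(s, index, dim=2) for s in ["BMW Luxury Coupe", "Audi Luxury"]]
print(vectors)
```

With the real data, the lines would come from an open file handle, for example open("glove.6B.50d.txt", encoding="utf-8"), with dim=50. Tokens such as "(required)" will not match any entry unless punctuation is stripped first.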

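import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE

# Build the input matrix before fitting (one row per embedded line).
arr_x = np.array(a)

# Recent scikit-learn versions require perplexity < number of samples (4 here).
model = TSNE(learning_rate=1000, perplexity=3)
transformed = model.fit_transform(arr_x)

xs = transformed[:, 0]
ys = transformed[:, 1]

# One colour label per row of a.
groups = [1, 0, 1, 0]
plt.scatter(xs, ys, c=groups)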
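A hedged sketch of running t-SNE on an arbitrary number of line vectors, assuming each line has already been turned into a fixed-length vector. The helper name tsne_2d is hypothetical, and the random 8 x 50 matrix only stands in for real line embeddings. The perplexity is capped below the number of rows because recent scikit-learn versions reject larger values.

```python
import numpy as np
from sklearn.manifold import TSNE


def tsne_2d(vectors, seed=0):
    """Project row vectors to 2-D with t-SNE; needs at least two rows."""
    data = np.asarray(vectors, dtype=float)
    # Perplexity must be smaller than the number of samples.
    perplexity = min(30, len(data) - 1)
    model = TSNE(n_components=2, perplexity=perplexity,
                 init="random", random_state=seed)
    return model.fit_transform(data)


# Stand-in data: 8 random 50-d vectors instead of real line embeddings.
rng = np.random.default_rng(0)
demo = rng.normal(size=(8, 50))
coords = tsne_2d(demo)
print(coords.shape)
```

The two columns of coords can then be passed to plt.scatter together with one group label per row. With only a handful of points, the layout mostly reflects the random initialization, so a plot of four vectors says little about how similar they are.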

As a result, I got the following plot (image not shown).

You can find all of the code here.

1-) How can I update the function so that it handles all lines?
2-) How can I train word embeddings without using the pretrained glove.6B.50d.txt?
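For question 2, one possible direction is sketched here. It is not GloVe itself but a simpler count-based stand-in: count how often words appear near each other, then factorize the log-scaled co-occurrence matrix with SVD. The function name train_embeddings, the window size, and the two toy sentences are all assumptions made for illustration.

```python
import numpy as np


def train_embeddings(sentences, dim=2, window=2):
    """Count-based embeddings: co-occurrence counts, log scaling, then SVD."""
    tokenized = [s.lower().split() for s in sentences]
    vocab = sorted({w for words in tokenized for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for words in tokenized:
        for i, w in enumerate(words):
            start, stop = max(0, i - window), min(len(words), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    counts[index[w], index[words[j]]] += 1
    # Keep the leading singular directions as the word vectors.
    u, s_vals, _ = np.linalg.svd(np.log1p(counts))
    dim = min(dim, len(vocab))
    return {w: u[index[w], :dim] * s_vals[:dim] for w in vocab}


# Toy corpus standing in for the joined rows of train.csv.
embeddings = train_embeddings(["audi luxury sedan", "bmw luxury coupe"], dim=2)
print(sorted(embeddings))
```

A corpus of a few thousand short rows is small for learning embeddings, so vectors trained this way are likely to be much noisier than the pretrained ones.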

Thanks a lot.
