I'm studying word embeddings with GloVe. As a first attempt, I used the pretrained GloVe vectors in glove.6B.50d.txt.
(train.csv)
Make,Model,Year,Engine Fuel Type,Engine HP,Engine Cylinders,Transmission Type,Driven_Wheels,Number of Doors,Market Category,Vehicle Size,Vehicle Style,highway MPG,city mpg,Popularity,MSRP
BMW,1 Series M,2011,premium unleaded (required),335,6,MANUAL,rear wheel drive,2,Factory Tuner,Luxury,High-Performance,Compact,Coupe,26,19,3916,46135
BMW,1 Series,2011,premium unleaded (required),300,6,MANUAL,rear wheel drive,2,Luxury,Performance,Compact,Convertible,28,19,3916,40650
BMW,1 Series,2011,premium unleaded (required),300,6,MANUAL,rear wheel drive,2,Luxury,High-Performance,Compact,Coupe,28,20,3916,36350
Audi,100,1992,regular unleaded,172,6,MANUAL,front wheel drive,4,Luxury,Midsize,Sedan,24,17,3105,2000
Audi,100,1992,regular unleaded,172,6,MANUAL,front wheel drive,4,Luxury,Midsize,Sedan,24,17,3105,2000
Audi,100,1992,regular unleaded,172,6,AUTOMATIC,all wheel drive,4,Luxury,Midsize,Wagon,20,16,3105,2000
Audi,100,1992,regular unleaded,172,6,MANUAL,front wheel drive,4,Luxury,Midsize,Sedan,24,17,3105,2000
Audi,100,1992,regular unleaded,172,6,MANUAL,all wheel drive,4,Luxury,Midsize,Sedan,21,16,3105,2000
I joined the values of each row read from the dataset (train.csv) into a single string with the twoone function below.
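import pandas as pd

df = pd.read_csv("train.csv")
X = df.iloc[:, 1:].values  # every column except Make

def twoone(list1):
    # Join the cells of each row into one string.
    list2 = []
    a = ""
    for x in range(len(list1)):
        for y in range(len(list1[1])):
            # No leading space when the cell equals the first cell of row 0 or row 1.
            if list1[0][0] == list1[x][y]:
                a += "" + str(list1[x][y])
            elif list1[1][0] == list1[x][y]:
                a += "" + str(list1[x][y])
            else:
                a += " " + str(list1[x][y])
        list2.append(a)
        a = ""
    return list2

X = twoone(X)
print(X)
print(len(X))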
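For comparison, the same joining step can probably be written with `str.join`, which needs no special case for the first cell. This is only a sketch: `rows_to_sentences` is a hypothetical name that is not in my code, and the sample rows are made up to look like `X`.

```python
def rows_to_sentences(rows):
    """Join the cells of each row into one space-separated string."""
    return [" ".join(str(cell) for cell in row) for row in rows]


# Two made-up rows shaped like X = df.iloc[:, 1:].values (Make column dropped).
sample = [
    ["1 Series", 2011, "MANUAL"],
    [100, 1992, "AUTOMATIC"],
]
sentences = rows_to_sentences(sample)
print(sentences)
```

Unlike twoone, this never puts a leading space in front of a row, whatever the first cell contains.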
I looked up each value in the embedding index and stored the resulting vectors in an embedding array. I got an error because some values are not in glove.6B.50d.txt. As a temporary workaround, I entered the missing values manually. Since I could not do this for every row, I created the embeddings manually for only 2 rows (Audi and BMW). In addition, I took the embedding vectors for the words BMW and Audi alone and printed them.
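To make the lookup step concrete, here is a minimal sketch of how the missing-word error might be avoided. It assumes the usual GloVe text layout (a word followed by its space-separated values on each line). The names `load_glove` and `lookup` and the zero-vector fallback are my assumptions for illustration, not part of my actual code, and the three-dimensional vectors are made up so the sketch runs without the real file.

```python
def load_glove(lines):
    """Parse GloVe text lines ("word v1 v2 ...") into a dict of float lists."""
    index = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        index[parts[0]] = [float(v) for v in parts[1:]]
    return index


def lookup(tokens, index, dim):
    """Return one vector per token; unknown tokens get a zero vector."""
    vectors, missing = [], []
    for token in tokens:
        vec = index.get(token.lower())  # glove.6B vocabulary is lowercased
        if vec is None:
            missing.append(token)       # remember what was not found
            vec = [0.0] * dim           # assumed fallback: zero vector
        vectors.append(vec)
    return vectors, missing


# Tiny made-up 3-d "file" standing in for glove.6B.50d.txt.
fake_file = ["bmw 0.1 0.2 0.3", "audi 0.4 0.5 0.6"]
index = load_glove(fake_file)
vectors, missing = lookup(["BMW", "xDrive"], index, dim=3)
print(vectors, missing)
```

With the real file, the same function could be fed an open file handle, for example `load_glove(open("glove.6B.50d.txt", encoding="utf-8"))` with `dim=50`. The `missing` list shows which values would otherwise have raised the error.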
Embedding vector for a single row (50d):
[-0.79954, 1.32006, -0.058246, 3.9524, 0.83058, -1.4129, 0.51006, -0.90706, -0.103168, -0.8644, 0.14027, 1.14064, 0.26346, -1.41698, -0.22546, 0.041738, -0.51298, 0.156538, 0.89884, 2.7938, -0.54082, -0.0642, 2.0558, -1.21382, 1.16802, -1.2238, 0.078408, -0.140382, -2.644, -2.2578, 1.46556, 0.65876, 1.59616, 2.2354, -1.2485, -0.49032, -1.32034, 0.71436, 0.65634, 0.41044, 0.63574, 0.39114, -0.21028, -0.45792, 0.52182, -2.3596, -0.89312, -0.54108, 1.46664, -0.40282]
bmw_list = list()
for i in range(len(bmw) - 1):
    bmw_satir = list()  # satir = row
    for j in range(len(bmw[0])):
        # toplam = sum: add vector i and vector i+1 element by element
        toplam = bmw[i][j] + bmw[i + 1][j]
        bmw_satir.append(toplam)
    bmw_list.extend(bmw_satir)
print(bmw_list)
a = []
a.append(audi_list)
a.append(bmw_list)
a.append(bmw_orj)
a.append(audi_orj)
Since I could not do this for all rows, I wrote the code above separately for each of the two examples.
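What I am after is roughly the following: one function that sums the word vectors of any row, so the same code does not have to be copied per brand. This is a sketch under assumptions: `sum_vectors` and `embed_rows` are hypothetical names, and the three-dimensional vectors are made up (the real ones would be 50d).

```python
def sum_vectors(vectors):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(column) for column in zip(*vectors)]


def embed_rows(rows_of_vectors):
    """Apply sum_vectors to every row and collect one vector per row."""
    return [sum_vectors(vectors) for vectors in rows_of_vectors]


# Made-up 3-d word vectors for two rows.
bmw = [[1.0, 2.0, 3.0], [0.5, 0.5, 0.5]]
audi = [[1.0, 0.0, 1.0], [2.0, 2.0, 2.0], [3.0, 1.0, 0.0]]
all_rows = embed_rows([bmw, audi])
print(all_rows)
```

For a row with exactly two word vectors, this gives the same result as the `bmw[i][j] + bmw[i+1][j]` loop, and it keeps a fixed vector length when a row has more words.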
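from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Collect the four vectors into the t-SNE input.
arr_x = []
for i in range(len(a)):
    arr_x.append(a[i])

model = TSNE(learning_rate=1000)
transformed = model.fit_transform(arr_x)
xs = transformed[:, 0]
ys = transformed[:, 1]
groups = [1, 0, 1, 0]
plt.scatter(xs, ys, c=groups)
plt.show()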
As a result, I got a scatter plot like this.
You can find all of the code here.
1-) How can I update the function so that it prints all rows?
2-) How can I train word embeddings without using the pretrained glove.6B.50d.txt?
Thanks a lot.