I'm trying to understand the differences between the embeddings returned by two R libraries: httr2 (calling the OpenAI API) and text (using Hugging Face models). Specifically: is it possible to exchange or convert these two output embeddings? Why do the outputs look so different? And which one is more useful for cosine similarity comparisons against a corpus? Most importantly, I'm struggling to make use of the OpenAI output, because the value httr2 returns appears to be in hex format, while the text library seems to return much more detail. How do I make use of the hex embedding returned via httr2 (OpenAI) so I can compare it with the Hugging Face embedding from the text package?
# Input code
library(httr2)

url_base <- "https://api.openai.com/v1/"
prompt <- "Please tell me a dad joke"
model_type <- "text-embedding-ada-002"

# api_key is assumed to be set elsewhere, e.g. api_key <- Sys.getenv("OPENAI_API_KEY")
body <- list(input = prompt,
             model = model_type)

response <- request(url_base) |>
  req_url_path_append("embeddings") |>
  req_auth_bearer_token(token = api_key) |>
  req_headers("Content-Type" = "application/json") |>
  req_user_agent("JustADude @justadude") |>
  req_body_json(body) |>
  req_perform()
# Output
response$body
[1] 7b 0a 20 20 22 6f 62 6a 65 63 74 22 3a 20 22 6c 69 73 74 22 2c 0a 20 20 22 64 61 74 61 22 3a 20 5b 0a 20 20 20 20 7b 0a 20
[42] 20 20 20 20 20 22 6f 62 6a 65 63 74 22 3a 20 22 65 6d 62 65 64 64 69 6e 67 22 2c 0a 20 20 20 20 20 20 22 69 6e 64 65 78 22
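For reference, here is how I have been trying to read that raw body (a minimal sketch; I'm assuming the bytes are just UTF-8 JSON text and that httr2's resp_body_json() is the right way to parse them):

# The raw bytes above decode to ordinary JSON text
json_text <- rawToChar(response$body)
substr(json_text, 1, 80)

# Let httr2 parse the JSON body into an R list
parsed <- resp_body_json(response)

# Assuming the usual OpenAI embeddings response shape ($data[[1]]$embedding),
# pull out the numeric vector
openai_embedding <- as.numeric(unlist(parsed$data[[1]]$embedding))
length(openai_embedding)  # text-embedding-ada-002 should give 1536 numbers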
Versus:
library(text)
texts <- c("Tell me a dad joke.")
# Transform the text data to BERT word embeddings
word_embeddings <- textEmbed(texts = texts,
                             model = "bert-base-uncased",
                             layers = -2,
                             aggregation_from_tokens_to_texts = "mean",
                             aggregation_from_tokens_to_word_types = "mean",
                             keep_token_embeddings = FALSE)
# See how word embeddings are structured
word_embeddings
Output:
# A tibble: 8 × 770
words n Dim1 Dim2 Dim3 Dim4 Dim5 Dim6 Dim7 Dim8 Dim9 Dim10 Dim11 Dim12 Dim13 Dim14
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 . 1 0.0300 -0.523 -0.818 0.356 -0.374 -0.273 -3.26e-1 0.688 0.132 -0.358 0.590 0.0501 -0.578 -0.0596
2 [CLS] 1 0.137 -0.383 -0.259 0.216 -0.668 -0.511 7.39e-2 0.789 0.763 -0.155 0.361 -0.320 -0.196 -0.0264
3 [SEP] 1 0.0553 0.0223 -0.0227 0.0169 -0.0463 -0.0391 -3.77e-5 -0.0389 0.0405 0.0284 -0.0154 -0.0473 0.0269 -0.0656
4 a 1 -0.165 0.503 -0.254 0.0613 0.645 -0.630 7.91e-1 1.56 0.0416 0.267 0.604 -1.02 -0.782 0.693
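For context, this is the cosine similarity calculation I plan to run against the corpus once both embeddings are plain numeric vectors (a minimal sketch using made-up toy vectors, not the real embeddings):

# Cosine similarity between two numeric vectors of equal length
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Toy 3-dimensional vectors just to show the call
v1 <- c(0.10, -0.20, 0.30)
v2 <- c(0.05, -0.25, 0.35)
cosine_sim(v1, v2)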
Thank you for any details on how these two embedding types compare and what each format is intended for.