Comparing embedding differences between OpenAI (via httr2) and HuggingFace (via ‘text’)?


For two R libraries, I'm trying to understand the differences between the embeddings returned by httr2 (calling the OpenAI API) and by the text package (Hugging Face models). Is it possible to exchange or convert these two output embeddings? Why are the outputs so different? And which one is more useful for cosine-similarity comparisons against a corpus? Most importantly, I'm struggling to make use of the OpenAI output: why does the httr2 response appear to be in hex format? The text library seems to return far more detail than the OpenAI (httr2) response. How do I make use of the embedding returned by httr2 (OpenAI) so I can compare it with the Hugging Face one from text?

# Input code
library(httr2)

url_base <- "https://api.openai.com/v1/"
prompt <- "Please tell me a dad joke"
model_type <- "text-embedding-ada-002"

body <- list(input = prompt,
             model = model_type)

# api_key is assumed to hold an OpenAI API key set elsewhere
response <- request(url_base) |>
  req_url_path_append("embeddings") |>
  req_auth_bearer_token(token = api_key) |>
  req_headers("Content-Type" = "application/json") |>
  req_user_agent("JustADude @justadude") |>
  req_body_json(body) |>
  req_perform()
# Output: response$body is a raw vector, so R prints its bytes as hex —
# these are the UTF-8 bytes of the JSON reply (7b = "{", 0a = newline, 22 = '"')
response$body
   [1] 7b 0a 20 20 22 6f 62 6a 65 63 74 22 3a 20 22 6c 69 73 74 22 2c 0a 20 20 22 64 61 74 61 22 3a 20 5b 0a 20 20 20 20 7b 0a 20
  [42] 20 20 20 20 20 22 6f 62 6a 65 63 74 22 3a 20 22 65 6d 62 65 64 64 69 6e 67 22 2c 0a 20 20 20 20 20 20 22 69 6e 64 65 78 22
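To be clear about the hex: the embedding itself is not in hex format — `response$body` is just the raw bytes of the JSON reply, and R displays raw vectors as hex. A minimal sketch of decoding it, assuming `response` from the request above (the `$data[[1]]$embedding` path follows OpenAI's documented JSON response shape):

```r
library(httr2)

# Parse the raw JSON bytes of the response into an R list
parsed <- resp_body_json(response)

# Extract the embedding as a plain numeric vector
# (text-embedding-ada-002 returns 1536 dimensions)
embedding <- as.numeric(unlist(parsed$data[[1]]$embedding))
length(embedding)

# Equivalently, decode the raw bytes by hand to see the JSON text
json_text <- rawToChar(resp_body_raw(response))
substr(json_text, 1, 40)
```

Once parsed, `embedding` is the same kind of object as a row of the text package's tibble: a numeric vector you can feed into any similarity calculation.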

Versus:

library(text)

texts <- c("Tell me a dad joke.")

# Transform the text data to BERT word embeddings
word_embeddings <- textEmbed(texts = texts,
                             model = "bert-base-uncased",
                             layers = -2,
                             aggregation_from_tokens_to_texts = "mean",
                             aggregation_from_tokens_to_word_types = "mean",
                             keep_token_embeddings = FALSE)

# See how word embeddings are structured
word_embeddings
Output: 
# A tibble: 8 × 770
  words     n    Dim1     Dim2    Dim3     Dim4    Dim5     Dim6     Dim7    Dim8    Dim9   Dim10   Dim11   Dim12   Dim13   Dim14
  <chr> <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 .         1  0.0300 -0.523   -0.818   0.356   -0.374  -0.273   -3.26e-1  0.688   0.132  -0.358   0.590   0.0501 -0.578  -0.0596
2 [CLS]     1  0.137  -0.383   -0.259   0.216   -0.668  -0.511    7.39e-2  0.789   0.763  -0.155   0.361  -0.320  -0.196  -0.0264
3 [SEP]     1  0.0553  0.0223  -0.0227  0.0169  -0.0463 -0.0391  -3.77e-5 -0.0389  0.0405  0.0284 -0.0154 -0.0473  0.0269 -0.0656
4 a         1 -0.165   0.503   -0.254   0.0613   0.645  -0.630    7.91e-1  1.56    0.0416  0.267   0.604  -1.02   -0.782   0.693
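On the cosine-similarity part of the question: both libraries ultimately give you numeric vectors, but they live in different spaces (1536 dimensions for text-embedding-ada-002 vs. 768 for bert-base-uncased), so one cannot be converted into the other or compared across models. Embed both the query and the corpus with the same model, then compare within that space. A minimal sketch with made-up 4-d vectors standing in for two same-model embeddings (`cosine_sim` is my own helper, not from either package):

```r
# Cosine similarity: dot product divided by the product of the norms.
# Only meaningful for vectors from the SAME embedding model.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy vectors standing in for two embeddings of corpus texts
v1 <- c(0.1, 0.3, -0.2, 0.5)
v2 <- c(0.2, 0.1, -0.1, 0.4)
cosine_sim(v1, v2)
```

If the two vectors came from different models, `sum(a * b)` would fail (or silently recycle) on the length mismatch — a useful reminder that cross-model comparison is not meaningful.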

Thank you for any details on how these two embedding outputs differ and what each is best suited for.
