Comparing embedding differences between OpenAI (via httr2) and HuggingFace (via ‘text’)?


For two R libraries, I'm trying to understand the differences between the embeddings returned by httr2 (calling the OpenAI API) and by the text package (Hugging Face models). Is it possible to exchange or convert these two output embeddings? Why are the outputs so different? And which one is more useful for cosine-similarity comparisons against a corpus? Most importantly, I'm struggling to make use of the OpenAI output: why does the httr2 response appear to be in hex format? The text library seems to return far more detail than the OpenAI (httr2) response. How do I make use of the embedding returned by httr2 (OpenAI) so I can compare it with the Hugging Face one from text?

# Input code
library(httr2)

url_base <- "https://api.openai.com/v1/"
prompt <- "Please tell me a dad joke"
model_type <- "text-embedding-ada-002"

body <- list(input = prompt,
             model = model_type)

# api_key is assumed to hold an OpenAI API key set elsewhere
response <- request(url_base) |>
  req_url_path_append("embeddings") |>
  req_auth_bearer_token(token = api_key) |>
  req_headers("Content-Type" = "application/json") |>
  req_user_agent("JustADude @justadude") |>
  req_body_json(body) |>
  req_perform()
# Output: response$body is a raw vector, so R prints its bytes as hex —
# these are the UTF-8 bytes of the JSON reply (7b = "{", 0a = newline, 22 = '"')
response$body
   [1] 7b 0a 20 20 22 6f 62 6a 65 63 74 22 3a 20 22 6c 69 73 74 22 2c 0a 20 20 22 64 61 74 61 22 3a 20 5b 0a 20 20 20 20 7b 0a 20
  [42] 20 20 20 20 20 22 6f 62 6a 65 63 74 22 3a 20 22 65 6d 62 65 64 64 69 6e 67 22 2c 0a 20 20 20 20 20 20 22 69 6e 64 65 78 22
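To be clear about the hex: the embedding itself is not in hex format — `response$body` is just the raw bytes of the JSON reply, and R displays raw vectors as hex. A minimal sketch of decoding it, assuming `response` from the request above (the `$data[[1]]$embedding` path follows OpenAI's documented JSON response shape):

```r
library(httr2)

# Parse the raw JSON bytes of the response into an R list
parsed <- resp_body_json(response)

# Extract the embedding as a plain numeric vector
# (text-embedding-ada-002 returns 1536 dimensions)
embedding <- as.numeric(unlist(parsed$data[[1]]$embedding))
length(embedding)

# Equivalently, decode the raw bytes by hand to see the JSON text
json_text <- rawToChar(resp_body_raw(response))
substr(json_text, 1, 40)
```

Once parsed, `embedding` is the same kind of object as a row of the text package's tibble: a numeric vector you can feed into any similarity calculation.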

Versus:

library(text)

texts <- c("Tell me a dad joke.")

# Transform the text data to BERT word embeddings
word_embeddings <- textEmbed(texts = texts,
                             model = "bert-base-uncased",
                             layers = -2,
                             aggregation_from_tokens_to_texts = "mean",
                             aggregation_from_tokens_to_word_types = "mean",
                             keep_token_embeddings = FALSE)

# See how word embeddings are structured
word_embeddings
Output: 
# A tibble: 8 × 770
  words     n    Dim1     Dim2    Dim3     Dim4    Dim5     Dim6     Dim7    Dim8    Dim9   Dim10   Dim11   Dim12   Dim13   Dim14
  <chr> <dbl>   <dbl>    <dbl>   <dbl>    <dbl>   <dbl>    <dbl>    <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>   <dbl>
1 .         1  0.0300 -0.523   -0.818   0.356   -0.374  -0.273   -3.26e-1  0.688   0.132  -0.358   0.590   0.0501 -0.578  -0.0596
2 [CLS]     1  0.137  -0.383   -0.259   0.216   -0.668  -0.511    7.39e-2  0.789   0.763  -0.155   0.361  -0.320  -0.196  -0.0264
3 [SEP]     1  0.0553  0.0223  -0.0227  0.0169  -0.0463 -0.0391  -3.77e-5 -0.0389  0.0405  0.0284 -0.0154 -0.0473  0.0269 -0.0656
4 a         1 -0.165   0.503   -0.254   0.0613   0.645  -0.630    7.91e-1  1.56    0.0416  0.267   0.604  -1.02   -0.782   0.693
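On the cosine-similarity part of the question: both libraries ultimately give you numeric vectors, but they live in different spaces (1536 dimensions for text-embedding-ada-002 vs. 768 for bert-base-uncased), so one cannot be converted into the other or compared across models. Embed both the query and the corpus with the same model, then compare within that space. A minimal sketch with made-up 4-d vectors standing in for two same-model embeddings (`cosine_sim` is my own helper, not from either package):

```r
# Cosine similarity: dot product divided by the product of the norms.
# Only meaningful for vectors from the SAME embedding model.
cosine_sim <- function(a, b) sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))

# Toy vectors standing in for two embeddings of corpus texts
v1 <- c(0.1, 0.3, -0.2, 0.5)
v2 <- c(0.2, 0.1, -0.1, 0.4)
cosine_sim(v1, v2)
```

If the two vectors came from different models, `sum(a * b)` would fail (or silently recycle) on the length mismatch — a useful reminder that cross-model comparison is not meaningful.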

Thank you for any details on how these two embedding outputs differ and what each is best suited for.
