Web scraping with RSelenium and a Docker standalone image


Once again I have read many threads on the subject without being able to understand...

I am using RSelenium and Selenium standalone images with Docker on Ubuntu 22.04.

The following code works just fine when using the Docker image selenium/standalone-chrome-debug:

library(RSelenium)

system('docker run -d -p 4445:4444 selenium/standalone-chrome-debug')
remDr <- remoteDriver(remoteServerAddr = "localhost",
                      port = 4445L,
                      browserName = "chrome")
remDr$open()
remDr$navigate("https://fr.distance.to/paris/bayonne-france")
el <- remDr$findElement(using = "css", ".headerRoute > #strck > span:nth-child(1)")
road_distance <- el$getElementText()[[1]]
remDr$close()
system('docker rm -f $(docker ps -aq --filter ancestor=selenium/standalone-chrome-debug)')

However, the exact same code with the selenium/standalone-chrome image instead gets stuck at the first step, remDr$open(), and finally crashes with this output:

remDr$open()
[1] "Connecting to remote server"
$id
[1] NA

Any ideas why, and how to solve this? I don't really mind using the debug version of the selenium/standalone-chrome image, but it seems to have been deprecated, and I am keen to understand what is happening here.


I have encountered the same problem, and your post was very helpful: it allowed me to connect to the standalone-chrome-debug Docker container.

I believe the basic issue is that RSelenium has not been updated in a long time and still speaks the old Selenium 2 (JSON Wire) protocol. As it happens, the selenium/standalone-chrome-debug image has also not been updated in a long time, which is why those two still work together. The newer selenium/standalone-chrome image runs a recent Selenium server that uses the W3C WebDriver protocol, and the RSelenium code fails when it sends old-style requests to the new image.
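To make the mismatch concrete, here is a rough sketch (plain Python dicts, not RSelenium's or Selenium's actual request code) of how the two protocols wrap the new-session capabilities differently, which is the kind of difference that makes an old client fail against a new server:

```python
# Sketch of the two new-session request bodies; assumed shapes based on the
# JSON Wire and W3C WebDriver protocols, not captured from RSelenium itself.

# Legacy Selenium 2 / JSON Wire style, as an old client like RSelenium sends it
legacy_payload = {
    "desiredCapabilities": {"browserName": "chrome"}
}

# W3C WebDriver style, which a modern server such as the one in
# selenium/standalone-chrome expects
w3c_payload = {
    "capabilities": {"alwaysMatch": {"browserName": "chrome"}}
}

# A W3C-only server looks for the "capabilities" key; the legacy request
# does not carry it, so session creation never succeeds.
print("capabilities" in legacy_payload)  # False
print("capabilities" in w3c_payload)     # True
```

That would explain why remDr$open() hangs and then reports an NA session id: the server never returns a session the old client can understand.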

It turns out to be pretty easy to connect to the Docker container using the Python Selenium bindings and the reticulate package in R. Here's an example that worked for me:

library(reticulate)

selenium_conn <- py_run_string("
from selenium import webdriver
from selenium.webdriver.common.by import By

opts = webdriver.ChromeOptions()
# Set Chrome preferences so that PDFs download automatically
opts.add_experimental_option('prefs', {
    'profile.default_content_settings.popups': 0,
    'download.default_directory': '/opt/selenium/assets',
    'download.directory_upgrade': True,
    'download.prompt_for_download': False,
    'plugins.always_open_pdf_externally': True
})

browser = webdriver.Remote('http://localhost:4444', options=opts)
")

web_driver <- selenium_conn$browser

web_driver$get('http://medium.com')
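For what it's worth, here is a minimal sketch (plain Python dicts, not Selenium's actual serialization code) of where the 'prefs' dictionary from the ChromeOptions above ends up: add_experimental_option('prefs', ...) nests the preferences under the vendor-prefixed "goog:chromeOptions" entry in the W3C capabilities that get sent to the server.

```python
# Assumed shape of the W3C capabilities produced from the ChromeOptions above;
# an illustration, not output captured from a running session.
prefs = {
    "profile.default_content_settings.popups": 0,
    "download.default_directory": "/opt/selenium/assets",
    "download.directory_upgrade": True,
    "download.prompt_for_download": False,
    "plugins.always_open_pdf_externally": True,
}

capabilities = {
    "browserName": "chrome",
    # Chrome-specific options live under this vendor-prefixed key
    "goog:chromeOptions": {"prefs": prefs},
}

print(capabilities["goog:chromeOptions"]["prefs"]["download.default_directory"])
# /opt/selenium/assets
```

Downloads therefore land in /opt/selenium/assets inside the container, so you may want to mount that directory as a volume when starting the container if you need the files on the host.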