Nokogiri: Clean up data output


I am trying to scrape player information from MLS sites to create a map of where the players come from, as well as other information. I am as new to this as it gets.

So far I have used this code:

require 'httparty'
require 'nokogiri'
require 'json'
require 'pry'
require 'csv'

page = HTTParty.get('https://www.atlutd.com/players')

parse_page = Nokogiri::HTML(page.body)

players_array = []

parse_page.css('.player_list.list-reset').css('.row').css('.player_info').each do |a|
    player_info = a.text
    players_array.push(player_info)
end

#CSV.open('atlantaplayers.csv', 'w') do |csv|
#   csv << players_array
#end

Pry.start(binding)

The output of the pry function is:

"Miguel Almirón10\nMidfielder\n-\nAsunción, ParaguayAge:\n23\nHT:\n5' 9\"\nWT:\n140\n"

Which when put into the csv creates this in a single cell:

"Miguel Almirón10
Midfielder
-
Asunción, ParaguayAge:
23
HT:
5' 9""
WT:
140
"

I've looked into things and suspect that the newline characters (\n) are what is throwing off the formatting.

My desired outcome here is to figure out how to get the pry output into the array as follows:

Miguel, Almiron, 10, Midfielder, Asuncion, Paraguay, 23, 5'9", 140

Bonus points if you can help with the accent marks on names. Also if there is going to be an issue with height, is there a way to convert it to metric?

Thank you in advance!


There are 2 best solutions below


I've looked into things and suspect that the newline characters (\n) are what is throwing off the formatting.

Yes, that's why it's showing in this odd format. You can strip the rendered text to remove leading and trailing whitespace and newlines:

player_info = a.text.strip

[1] pry(main)> "Miguel Almirón10\n".strip
=> "Miguel Almirón10"

This only removes the \n at the start and end of the string, not the ones in the middle. If you wish to store the values in a CSV in this order (Miguel, Almiron, 10, Midfielder, Asuncion, Paraguay, 23, 5'9", 140), you will want to split the text on the newlines and spaces and build one array per player, so that the row pushed to the CSV file looks like this:

csv << ["Miguel", "Almiron", 10, "Midfielder", "Asuncion", "Paraguay", 23, "5'9\"", 140]
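As a rough sketch of that splitting step, assuming every player's text follows the same layout as the sample in the question (name and jersey number on the first line, position on the second, a hometown line ending in "Age:", then labelled HT: and WT: values). parse_player is a made-up helper name, and the layout may well differ for other players or other team sites:

```ruby
# Sample text copied from the question's pry output.
SAMPLE = "Miguel Almirón10\nMidfielder\n-\nAsunción, ParaguayAge:\n23\nHT:\n5' 9\"\nWT:\n140\n"

# Hypothetical helper: turns one player's text into an array of fields.
# Assumes the layout of SAMPLE; it will raise if a label is missing.
def parse_player(text)
  lines = text.split("\n").map(&:strip).reject(&:empty?)

  # First line is "<name><jersey number>", e.g. "Miguel Almirón10".
  name, number = lines[0].match(/\A(.+?)\s*(\d+)\z/).captures
  first, last = name.split(' ', 2)

  # The hometown line has "Age:" glued to its end.
  hometown_line = lines.find { |l| l.end_with?('Age:') }
  city, country = hometown_line.sub(/Age:\z/, '').split(', ', 2)

  # Each value sits on the line after its label.
  age    = lines[lines.index(hometown_line) + 1]
  height = lines[lines.index('HT:') + 1]
  weight = lines[lines.index('WT:') + 1]

  [first, last, number, lines[1], city, country, age, height, weight]
end

p parse_player(SAMPLE)
```

All fields come back as strings; call to_i on the number, age and weight if you need integers.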

with the accent marks on names

You can use the transliterate method, which replaces accented characters with their ASCII equivalents:

[8] pry(main)> ActiveSupport::Inflector.transliterate("Miguel Almirón10")
=> "Miguel Almiron10"

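See http://api.rubyonrails.org/classes/ActiveSupport/Inflector.html#method-i-transliterate. Outside of a Rails app you should not need all of Rails for this; loading Active Support (require 'active_support' followed by require 'active_support/inflector') should be enough.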
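If you would rather not pull in Active Support at all, a possible alternative (not the transliterate method above) is Ruby's built-in Unicode normalization: decompose each accented letter into a base letter plus a combining mark, then drop the marks. strip_accents is a made-up helper name for this sketch:

```ruby
# Hypothetical helper using only core Ruby (String#unicode_normalize, Ruby 2.2+).
# NFD splits "ó" into "o" + a combining accent; \p{Mn} matches those marks.
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

puts strip_accents("Miguel Almirón")     # prints "Miguel Almiron"
puts strip_accents("Asunción, Paraguay") # prints "Asuncion, Paraguay"
```

This only handles letters that decompose into a base letter plus a mark. Characters such as "ø" or "ß" have no such decomposition and pass through unchanged, which is one reason to prefer a transliteration library for messy data.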

0
On

Here's what I would use: the i18n and people gems.

require 'people'
require "i18n"

I18n.available_locales = [:en]
@np = People::NameParser.new

players_array = []

parse_page.css('.player_info').each do |div|
  name = @np.parse I18n.transliterate(div.at('.name a').text)
  players_array << [
    name[:first],
    name[:last],
    div.at('.jersey').text,
    div.at('.position').text,
  ]
end

# => [["Miguel", "Almiron", "10", "Midfielder"],
# ["Mikey", "Ambrose", "22", "Defender"],
# ["Yamil", "Asad", "11", "Forward"],
# ...
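
Once you have one array per player, a minimal sketch of writing them out is to append each inner array as its own CSV row, rather than pushing the whole players_array as a single row as in the question's commented-out code. The header names and the two sample rows here are only illustrative, and CSV.generate is used so the result can be inspected as a string; CSV.open('atlantaplayers.csv', 'w') takes the same block:

```ruby
require 'csv'

# Illustrative rows in the same shape as the players_array above.
players_array = [
  ["Miguel", "Almiron", "10", "Midfielder"],
  ["Mikey", "Ambrose", "22", "Defender"]
]

csv_text = CSV.generate do |csv|
  csv << %w[first_name last_name number position] # header row (names are assumptions)
  players_array.each { |row| csv << row }         # one CSV row per player
end

puts csv_text
```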

That should get you started.
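
On the metric question: assuming the height always arrives as feet and inches in the form shown in the question (5' 9"), a small sketch of the conversion could look like this. height_to_cm is a made-up helper name, and it returns nil when the text does not match that form:

```ruby
# Hypothetical helper: converts a height such as "5' 9\"" to whole centimetres.
def height_to_cm(height)
  match = height.match(/(\d+)'\s*(\d+)?/)
  return nil unless match

  feet   = match[1].to_i
  inches = match[2].to_i # nil.to_i is 0, so a bare "6'" also works
  ((feet * 12 + inches) * 2.54).round
end

puts height_to_cm("5' 9\"") # prints 175
```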