I have the following code for getting IP information:

    import requests
    import json
    import pandas as pd
    import swifter

    def get_ip(ip):
        response = requests.get("http://ip-api.com/json/" + ip.rstrip())
        geo = response.json()
        location = {'lat': geo.get('lat', ''),
                    'lon': geo.get('lon', ''),
                    'region': geo.get('regionName', ''),
                    'city': geo.get('city', ''),
                    'org': geo.get('org', ''),
                    'country': geo.get('countryCode', ''),
                    'query': geo.get('query', '')
                    }
        return(location)

To apply it to an entire DataFrame of IPs (df), I am using the following:

    df=pd.DataFrame(['85.56.19.4','188.85.165.103','81.61.223.131'])
    for lab,row in df.iterrows():
        dip = get_ip(df.iloc[lab][0])
        try:
            ip.append(dip["query"])
            private.append('no')
            country.append(dip["country"])
            city.append(dip["city"])
            region.append(dip["region"])
            organization.append(dip["org"])
            latitude.append(dip["lat"])
            longitude.append(dip["lon"])
        except:
            ip.append(df.iloc[lab][0])
            private.append("yes")

However, since iterrows is very slow and I need better performance, I want to use swiftapply, which is an extension of the apply function. I have used this:

    def ip(x):
        dip = get_ip(x)
        if (dip['ip']=='private')==True:
            ip.append(x)
            private.append("yes")
        else:
            ip.append(dip["ip"])
            private.append('no')
            country.append(dip["country"])
            city.append(dip["city"])
            region.append(dip["region"])
            organization.append(dip["org"])
            latitude.append(dip["lat"])
            longitude.append(dip["lon"])

    df.swifter.apply(ip)

And I get the following error: AttributeError: ("'Series' object has no attribute 'rstrip'", 'occurred at index 0')
How could I fix it?

rstrip is a string operation. To apply a string operation to a Series, you first have to go through the str accessor on the Series, which allows vectorized string operations to be performed on it. Specifically, changing ip.rstrip() to ip.str.rstrip() in your code should resolve your AttributeError.

After digging around a little, it turns out that the requests.get operation you're trying to perform cannot be vectorized within pandas (see Using Python Requests for several URLS in a dataframe). I hacked up an approach that should be a little more efficient than using iterrows. It uses np.vectorize to run the function that gets the information for each IP address. The location output is saved as new columns in a new DataFrame.

First, I altered your get_ip function to return the location dictionary, not (location).

Next, I created a vectorized function, vec_func, using np.vectorize.

Finally, vec_func is applied to df to create a new DataFrame that merges df with the location output from vec_func, where df[0] is the column with your IP addresses.

This retrieves the API response as a dictionary for each row in your DataFrame, then maps the dictionary to columns in the DataFrame. In the end, your new DataFrame contains the original column plus one column for each key of the location dictionary.
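
The three steps described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the exact code from the original answer: the helper names fetch_geo and add_locations, the injectable fetch parameter, and the timeout value are all made up for illustration. Only get_ip, vec_func, np.vectorize, and the ip-api.com URL come from the text above.

```python
import numpy as np
import pandas as pd


def fetch_geo(ip):
    """Query ip-api.com for one address (hypothetical helper; does a real HTTP call)."""
    import requests  # imported here so the rest of the sketch runs without it
    return requests.get("http://ip-api.com/json/" + ip.strip(), timeout=10).json()


def get_ip(ip, fetch=fetch_geo):
    """Return the location dictionary for a single IP string."""
    geo = fetch(ip)
    return {
        "lat": geo.get("lat", ""),
        "lon": geo.get("lon", ""),
        "region": geo.get("regionName", ""),
        "city": geo.get("city", ""),
        "org": geo.get("org", ""),
        "country": geo.get("countryCode", ""),
        "query": geo.get("query", ""),
    }


def add_locations(df, column=0, fetch=fetch_geo):
    """Return a new DataFrame: df plus one column per location key (assumed helper)."""
    # otypes=[object] keeps each returned dictionary intact and avoids the
    # extra call np.vectorize would otherwise make to infer the output type.
    vec_func = np.vectorize(lambda ip: get_ip(ip, fetch), otypes=[object])
    locations = vec_func(df[column].to_numpy())
    # Expand the array of dictionaries into columns aligned with df's index.
    details = pd.DataFrame(list(locations), index=df.index)
    return df.join(details)


# Example call (performs one HTTP request per row):
# result = add_locations(pd.DataFrame(['85.56.19.4', '188.85.165.103']))
```

Note that np.vectorize is documented as a convenience wrapper around a Python loop rather than a performance tool, so any gain over iterrows comes mainly from skipping per-row Series construction; the network round trips will most likely still dominate the run time.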

Hopefully this resolves the InvalidSchema error and gets you a little better performance than iterrows().
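
For reference, the original AttributeError and the str accessor fix can be reproduced without any network access. The snippet below is a small illustration with made-up trailing whitespace on the sample IPs; it also shows that concatenating the cleaned Series with the URL prefix still yields a Series rather than a single URL string, which is presumably why the request itself has to be made per element.

```python
import pandas as pd

ips = pd.Series(["85.56.19.4 \n", "188.85.165.103\n"])

# A Series has no rstrip method of its own, so this raises AttributeError.
try:
    ips.rstrip()
except AttributeError as exc:
    error_message = str(exc)

# The .str accessor applies the string method to every element.
cleaned = ips.str.rstrip()

# Concatenation is element-wise too, but the result is still a Series,
# not one URL string that could be handed to requests.get directly.
urls = "http://ip-api.com/json/" + cleaned
```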