Web scraping sunrise and sunset data from the National Oceanic and Atmospheric Administration (NOAA)


I want to scrape data from NOAA (https://gml.noaa.gov/grad/solcalc/). The data I want is the sunrise and sunset times for various US counties over the last 3 years. I have the coordinates of those counties. My problem is that I don't know how to pass those coordinates and a 3-year time frame to the site while scraping, so that I don't have to specify them manually each time.

I am using Python for scraping.

**I need data in the following format:**

**latitude | longitude | year | month | day | sunrise | sunset**

I am new to programming. I tried the methods I could find on the web, but none of them served my purpose.


You can use the table.php page to get your data and read it with pandas. This PHP script needs three parameters: year, lat and lon.
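To make the three parameters concrete, here is a minimal sketch of the URL that such a request carries. It uses only the standard library and does not contact the server. The helper name `build_table_url` is made up for illustration; the parameter names are the ones listed above.

```python
from urllib.parse import urlencode

BASE_URL = 'https://gml.noaa.gov/grad/solcalc/table.php'


def build_table_url(lat, lon, year):
    """Return the table.php URL for one location and one year (hypothetical helper)."""
    query = urlencode({'lat': lat, 'lon': lon, 'year': year})
    return f'{BASE_URL}?{query}'


print(build_table_url(40.72, -74.02, 2020))
```

Opening the printed URL in a browser is a quick way to check what the tables look like before scraping them. The full script follows.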

from io import StringIO
import pandas as pd
import requests
import time

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/109.0'
}

# Fill this table with your counties
counties = {
    'NY': {'lat': 40.72, 'lon': -74.02},
    'SF': {'lat': 37.77, 'lon': -122.42}
}

url = 'https://gml.noaa.gov/grad/solcalc/table.php'

dataset = []
for year in range(2020, 2023):  # 2020, 2021 and 2022
    for county, params in counties.items():
        print(year, county)
        payload = params | {'year': year}  # dict union needs Python 3.9+
        r = requests.get(url, headers=headers, params=payload, timeout=30)
        r.raise_for_status()  # stop on HTTP errors instead of parsing an error page
        # Wrap the HTML in StringIO: recent pandas versions deprecate raw HTML strings
        dfs = pd.read_html(StringIO(r.text))

        # Reshape your data: the page holds three tables (one row per day, one column per month)
        dfs = (pd.concat(dfs, keys=['Sunrise', 'Sunset', 'SolarNoon']).droplevel(1)
                 .assign(Year=year, Lat=params['lat'], Lon=params['lon'])
                 .set_index(['Lat', 'Lon', 'Year', 'Day'], append=True)
                 .rename_axis(columns='Month').stack('Month')
                 .unstack(level=0)
                 .dropna(how='all')  # drop non-existent dates such as Feb 30
                 .reset_index())
        dataset.append(dfs)
        time.sleep(10)  # Pause between requests to avoid overloading the server

out = pd.concat(dataset, ignore_index=True)
out.to_csv('solarcalc.csv', index=False)

Output:

        Lat     Lon  Year  Day Month SolarNoon Sunrise Sunset
0     40.72  -74.02  2020    1   Jan  11:59:16   07:20  16:39
1     40.72  -74.02  2020    1   Feb  12:09:33   07:07  17:13
2     40.72  -74.02  2020    1   Mar  12:08:22   06:29  17:48
3     40.72  -74.02  2020    1   Apr  12:59:52   06:39  19:21
4     40.72  -74.02  2020    1   May  12:53:10   05:54  19:53
...     ...     ...   ...  ...   ...       ...     ...    ...
2187  37.77 -122.42  2022   31   May  13:07:22   05:50  20:25
2188  37.77 -122.42  2022   31   Jul  13:16:06   06:12  20:19
2189  37.77 -122.42  2022   31   Aug  13:10:04   06:39  19:40
2190  37.77 -122.42  2022   31   Oct  12:53:15   07:34  18:12
2191  37.77 -122.42  2022   31   Dec  12:12:35   07:25  17:01

[2192 rows x 8 columns]
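Since the question mentions that the county coordinates are already available, the `counties` dict does not have to be typed by hand. A minimal sketch, assuming the coordinates sit in a CSV file with columns named `county`, `lat` and `lon` (these column names are an assumption, and an in-memory string stands in for the real file here):

```python
import io

import pandas as pd

# Stand-in for a real file, e.g. pd.read_csv('counties.csv');
# the file name and column names are assumptions.
csv_text = """county,lat,lon
NY,40.72,-74.02
SF,37.77,-122.42
"""
coords = pd.read_csv(io.StringIO(csv_text))

# Build {'NY': {'lat': ..., 'lon': ...}, ...}, the shape the scraping loop iterates over
counties = coords.set_index('county')[['lat', 'lon']].to_dict(orient='index')
print(counties)
```

Keeping only the `lat` and `lon` columns matters because each inner dict is sent as the request parameters, so any extra column would end up in the query string.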

Note: if you prefer the month as a number, use:

month2num = {'Jan': 1, 'Feb': 2, 'Mar': 3, 'Apr': 4, 'May': 5, 'Jun': 6,
             'Jul': 7, 'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12}
out['Month'] = out['Month'].map(month2num)
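To get the column layout asked for in the question (latitude, longitude, year, month, day, sunrise, sunset), the result can be reordered and sorted chronologically. A minimal sketch, assuming `Month` has already been converted to numbers as in the note above; the small frame below is only a stand-in for `out`, with values copied from the sample output in shuffled order:

```python
import pandas as pd

# Stand-in for the scraped `out` frame (three rows taken from the sample output)
out = pd.DataFrame({
    'Lat': [40.72, 40.72, 40.72],
    'Lon': [-74.02, -74.02, -74.02],
    'Year': [2020, 2020, 2020],
    'Day': [1, 1, 1],
    'Month': [3, 1, 2],
    'SolarNoon': ['12:08:22', '11:59:16', '12:09:33'],
    'Sunrise': ['06:29', '07:20', '07:07'],
    'Sunset': ['17:48', '16:39', '17:13'],
})

# Column order requested in the question; SolarNoon is left out
wanted = ['Lat', 'Lon', 'Year', 'Month', 'Day', 'Sunrise', 'Sunset']

result = (out[wanted]
          .sort_values(['Lat', 'Lon', 'Year', 'Month', 'Day'])
          .reset_index(drop=True))
print(result)
```

Sorting only works chronologically once the month is numeric; with the three-letter names, `Apr` would sort before `Jan`.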