I have a pandas dataframe with the following dtypes:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 579585 entries, 0 to 579613
Data columns (total 7 columns):
 #   Column       Non-Null Count   Dtype
---  ------       --------------   -----
 0   itemName     579585 non-null  object
 1   itemId       579585 non-null  string
 2   Count        579585 non-null  int32
 3   Sales        579585 non-null  float64
 4   Date         579585 non-null  datetime64[ns]
 5   Unit_margin  579585 non-null  float64
 6   GrossProfit  579585 non-null  float64
dtypes: datetime64[ns](1), float64(3), int32(1), object(1), string(1)
memory usage: 33.2+ MB
I upload it to a BigQuery table using:
df_extended_full.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>', chunksize=None, if_exists='append', auth_local_webserver=False, location=None, progress_bar=True)
Everything seems to work well, except that the itemId column, which is a string, becomes a float, so all leading zeros (which I need) are stripped wherever they occur.
I could of course define a schema for my table, but I want to avoid that. What am I missing?
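For reference, a minimal sketch (with hypothetical ids) that shows what I mean:

import pandas as pd

# Two ids with leading zeros, stored with the same 'string' dtype as above.
df = pd.DataFrame({'itemId': pd.array(['00123', '00456'], dtype='string')})
print(df['itemId'].tolist())  # ['00123', '00456'] in pandas

# After df.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>',
# if_exists='append'), the column shows up in BigQuery as a float and the
# values read back as 123.0 and 456.0.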
The problem is in the to_gbq step. For some reason its output omits the quotes around the itemId values, and without quotes BigQuery interprets the column as a number.
BigQuery needs the values quoted as strings, e.g. "000123". You sent them unquoted, e.g. 000123, which BigQuery reads as the number 123, dropping the leading zeros.
One solution in this case is to cast the itemId field to a plain string in pandas with astype before uploading; see the pandas documentation on astype for details.
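A minimal sketch, reusing the DataFrame and table names from the question:

# Cast itemId to plain Python strings so pandas-gbq maps the column to a
# BigQuery STRING field and the leading zeros survive the upload.
df_extended_full['itemId'] = df_extended_full['itemId'].astype(str)

df_extended_full.to_gbq('<MY DATASET>.profit', project_id='<MY PROJECT>',
                        if_exists='append', progress_bar=True)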
Another option is to use the table_schema parameter of the to_gbq method and list the BigQuery table fields to which the corresponding DataFrame columns should conform.
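A sketch of that call, pinning only the problematic column (recent pandas-gbq versions infer the types of any columns you leave out; if yours does not, list every field):

# Declare itemId as STRING explicitly so pandas-gbq does not infer a
# numeric type for it; the remaining columns follow the DataFrame dtypes.
df_extended_full.to_gbq(
    '<MY DATASET>.profit',
    project_id='<MY PROJECT>',
    if_exists='append',
    progress_bar=True,
    table_schema=[{'name': 'itemId', 'type': 'STRING'}],
)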
As a last option, you can switch from pandas-gbq to google-cloud-bigquery; see the comparison between the two libraries in the Google Cloud documentation.
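A sketch of the google-cloud-bigquery route, assuming default credentials and the same table as above (loading a DataFrame this way also requires pyarrow to be installed):

from google.cloud import bigquery

client = bigquery.Client(project='<MY PROJECT>')

# Declare itemId as STRING; the other columns keep their detected types.
job_config = bigquery.LoadJobConfig(
    schema=[bigquery.SchemaField('itemId', 'STRING')],
    write_disposition='WRITE_APPEND',
)

job = client.load_table_from_dataframe(
    df_extended_full, '<MY PROJECT>.<MY DATASET>.profit', job_config=job_config
)
job.result()  # wait for the load job to finish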