I am currently working on a bigger article for Medium, but until I can put it into words so that it’s really worth sharing, I just wanted to write up one of the steps I learned along the way.
I wanted to load some data stored in a CSV file into BigQuery for analysis and filtering.
I thought that you could just add the header to the CSV file and BigQuery would automatically recognize it and load it.
Turns out that it’s a little bit more complicated.
Normally it should work, and since it did with my first CSV, I couldn’t really understand what was wrong.
Now, there are two parts to this story:
- How you can add the table schema manually, which will also reveal the actual issue.
- What you need to be aware of, and why this happens.
How can you add the table schema manually
The data that is written to the CSV is actually a DataFrame, so the column types are available directly in code:
import json

# Map pandas dtypes to BigQuery column types
dtype_mapping = {
    'object': 'STRING',
    'int64': 'FLOAT',
    'float64': 'FLOAT'
}

schema = []
for column, dtype in df.dtypes.items():
    schema.append({
        'name': column,
        'type': dtype_mapping.get(str(dtype), 'STRING')  # Default to STRING if type is not in map
    })

print(json.dumps(schema, indent=2))
Yes, I know, int64 should be mapped to INTEGER, but it turns out that in my case some columns, even though they are marked as int64 in Python, need to be FLOAT in BigQuery. I know this allocates more memory, but the dataset is quite small.
When writing the CSV, you can pass header=False so that the header is excluded:
df.to_csv("df.csv", index=False, header=False, mode='a')
The schema snippet above gives you a BigQuery schema built from the DataFrame’s columns, so the CSV itself no longer needs a header.
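For the case of several DataFrames going into one file, here is a minimal sketch of how the appends might look. The sample frames, the column names id and price, and the temporary file path are made up for illustration; the only assumption about pandas is that to_csv accepts header=False and mode="a".

```python
import os
import tempfile

import pandas as pd

# Two hypothetical DataFrames with the same columns (illustrative data only)
frames = [
    pd.DataFrame({"id": [1, 2], "price": [2.5, 3.5]}),
    pd.DataFrame({"id": [3], "price": [4.0]}),
]

# Append every frame to the same file, never writing a header row
csv_path = os.path.join(tempfile.mkdtemp(), "combined.csv")
for frame in frames:
    frame.to_csv(csv_path, index=False, header=False, mode="a")

# Read the file back to confirm it contains data rows only
with open(csv_path) as handle:
    lines = handle.read().splitlines()
print(lines)
```

Because no header is ever written, there is nothing to filter out afterwards, and the column names live only in the schema definition.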
What do you need to be aware of and why this happens
The actual reason this happened is that I was not aware that a line with the header definition still remained somewhere in my CSV file (yes, I actually wrote multiple DataFrames with headers and tried to filter the headers out with a Linux command, and it did not work).
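To illustrate why a single stray header line matters, here is a toy sketch of per-column type inference. This is not BigQuery's actual autodetect logic, just a simplified assumption: each column gets the most general type needed to hold all of its values. The function name infer_column_types and the sample data are hypothetical.

```python
import csv
import io


def infer_column_types(csv_text):
    """Guess one type per column: INTEGER, FLOAT, or STRING (toy logic)."""

    def guess(value):
        # Try the narrowest type first, then fall back to wider ones
        for caster, name in ((int, "INTEGER"), (float, "FLOAT")):
            try:
                caster(value)
                return name
            except ValueError:
                pass
        return "STRING"

    rank = {"INTEGER": 0, "FLOAT": 1, "STRING": 2}
    column_types = []
    for row in csv.reader(io.StringIO(csv_text)):
        for i, value in enumerate(row):
            guessed = guess(value)
            if i == len(column_types):
                column_types.append(guessed)
            elif rank[guessed] > rank[column_types[i]]:
                # A wider type anywhere in the column wins
                column_types[i] = guessed
    return column_types


clean = "1,2.5\n2,3.5\n"
with_rogue_header = "1,2.5\nid,price\n2,3.5\n"

print(infer_column_types(clean))
print(infer_column_types(with_rogue_header))
```

In this sketch, one header line hidden among the data rows is enough to push every column to STRING, which matches the symptom described here.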
And when the file loaded, I actually saw this:

Normally, if that line had been missing and there was no header, it should have looked like this:

Things to be learned from this exercise:
- If you have to write multiple DataFrames to one CSV file, don’t add the header, and use the above code to generate an explicit schema definition.
- Check the CSV properly for rogue lines that don’t match the structure of the rest of the data; otherwise you will find that everything is converted to STRING and you won’t understand why.
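The check from the second point could be sketched roughly like this, using only the standard library. The helper name find_rogue_header_lines, the sample file, and the column names are assumptions for illustration; in practice you would pass the column names of your own DataFrame.

```python
import csv
import os
import tempfile


def find_rogue_header_lines(path, column_names):
    """Return 1-based row numbers of CSV rows that repeat the header."""
    expected = [str(name) for name in column_names]
    rogue = []
    with open(path, newline="") as handle:
        for row_number, row in enumerate(csv.reader(handle), start=1):
            if row == expected:
                rogue.append(row_number)
    return rogue


# Build a small sample file with a header hidden among the data rows
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="") as tmp:
    tmp.write("1,2.5\nid,price\n2,3.5\n")

rogue_rows = find_rogue_header_lines(tmp.name, ["id", "price"])
print(rogue_rows)
os.remove(tmp.name)
```

An empty list means no row repeats the header. If the file is supposed to start with a header, row 1 will be reported too, so treat only the later hits as rogue lines.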
And now you know.
Sorin