I have created a column family in local cassandra as below with cqlsh.
CREATE TABLE sample.stackoverflow_question12 (
id1 int,
class1 int,
name1 text,
PRIMARY KEY (id1)
)
I have a sample csv file with name "data.csv" and the data in the file is as below.
id1 | name1 |class1
1 | hello | 10
2 | world | 20
Used the below python code to connect db and load data from csv by using Anaconda (After installation of Cassandra driver using pip in anaconda)
#Connecting to local Cassandra server
from Cassandra.Cluster import Cluster
from cassandra.auth import PlainTextAuthProvider
auth_provider = PlainTextAuthProvider(username='cassandra', password='cassandra')
cluster = Cluster(["127.0.0.1"],auth_provider = auth_provider,protocol_version=4)
session = cluster.connect()
session.set_keyspace('sample')
cluster.connect()
#File loading
prepared = session.prepare(' Insert into stackoverflow_question12 (id1,class1,name1)VALUES (?, ?, ?)')
with open('D:/Cassandra/NoSQL/data.csv', 'r') as fares:
for fare in fares:
columns=fare.split(",")
id1=columns[0]
class1=columns[1]
name1=columns[2]
session.execute(prepared, [id1,class1,name1])
#closing the file
fares.close()
when I executed the above code getting below error.
Received an argument of invalid type for column "id1". Expected: <class 'cassandra.cqltypes.Int32Type'>, Got: <class 'str'>; (required argument is not an integer)
When I changed data types to text and ran the above code then it loads data with header fields too.
Can anyone help me to make changes in my code to load data without header content? or your successful code also fine if any.
The reason to make column names as id1 and class1 is id and class are keywords and throwing error in the code when used within "fares" loop.
But in real world column names would be seen as class and id. How to run code when these type of columns came into picture?
The another question I got in mind is Cassandra will store primary key first then remaining keys in ascending order. Can we load csv columns which are not indexed same as Cassandra columns storage?
Based on this, I need to build another solution.
You need to use types accordingly to your schema - for integer columns you need to use
int(columns...)
because split generates strings. If you want to skip header, then you can do something like this:Although it's better to use Python's built-in CSV reader that could be customized to skip header automatically...
P.S. If you just want to load data from CSV, I recommend to use external tools, like DSBulk that are flexible and heavily optimized for that task. See following blog posts for examples: