I have a large data set in which some of the columns are dates and others are categorical data, such as Status, Department Name, and Country Name.
How is this data treated in GraphLab when I call the graphlab.linear_regression.create method? Do I have to pre-process this data and convert it into numbers, or can I provide it to GraphLab directly?
GraphLab is mostly used for computing on tabular and graph-based datasets, and it offers high scalability and performance. In graphlab.linear_regression.create, GraphLab has a built-in ability to inspect the type of the data and choose the most suitable linear regression solver for optimizing results. For example, when both the target and the features are numeric, GraphLab most of the time picks Newton's method. Similarly, for other datasets it works out what is needed and chooses a solver accordingly.

Now, about preprocessing,
GraphLab only takes an SFrame for learning, and the data needs to be parsed correctly before any learning. While creating an SFrame, unprocessed or malformed data surfaces and throws an error. So, in order to run any learning, you need to have clean data. If the SFrame accepts the data, along with the target and features you have chosen, you are good to go, but pre-processing and cleaning the data is always recommended. Also, it's always good practice to do feature engineering before running any learning algorithm, and redefining data types before learning is always recommended for accuracy.

About your point on how data is treated in
GraphLab, I would say it depends! Some datasets are tabular and are treated accordingly, and some have a graph structure. GraphLab performs very well when it comes to regression trees and boosted classifiers, which follow the decision tree concept and are considerably more time- and resource-consuming in libraries other than GraphLab.

For me,
GraphLab performed very well while creating a recommendation engine where I had a dataset of nodes and edges, and a boosted tree classifier with 18 iterations also worked flawlessly in a reasonable amount of time. I must say, even for tree-structured data, GraphLab performs very well. I hope this answer helps.
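
To make the preprocessing advice a bit more concrete for the date and categorical columns in your question, here is a minimal sketch in plain Python. It is deliberately independent of GraphLab: the helper names (split_date, one_hot, preprocess), the date format, and the sample rows are all hypothetical and only illustrate the idea of turning a date into numeric parts and a categorical value into 0/1 indicator columns. As far as I know, GraphLab Create treats string columns as categorical features by itself, so the one-hot step may not be needed there; please check the documentation for your version before relying on that.

```python
from datetime import datetime


def split_date(value, fmt="%Y-%m-%d"):
    """Turn a date string into numeric parts (assumed format: YYYY-MM-DD)."""
    d = datetime.strptime(value, fmt)
    return {"year": d.year, "month": d.month, "day": d.day, "weekday": d.weekday()}


def one_hot(value, categories, prefix):
    """Encode one categorical value as 0/1 indicator columns."""
    return {"%s=%s" % (prefix, c): int(value == c) for c in categories}


def preprocess(rows, date_col, cat_cols):
    """Replace the date and categorical columns of each row with numeric columns."""
    # Collect the distinct values of each categorical column (sorted for a stable order).
    levels = {c: sorted({r[c] for r in rows}) for c in cat_cols}
    out = []
    for r in rows:
        # Keep columns that are already numeric (everything except date/categorical ones).
        new = {k: v for k, v in r.items() if k != date_col and k not in cat_cols}
        for part, v in split_date(r[date_col]).items():
            new["%s_%s" % (date_col, part)] = v
        for c in cat_cols:
            new.update(one_hot(r[c], levels[c], c))
        out.append(new)
    return out


# Hypothetical sample rows, loosely modelled on the columns named in the question.
rows = [
    {"Joined": "2015-03-02", "Status": "active", "Country Name": "India", "Salary": 50000.0},
    {"Joined": "2016-11-20", "Status": "left", "Country Name": "USA", "Salary": 72000.0},
]
features = preprocess(rows, date_col="Joined", cat_cols=["Status", "Country Name"])
```

After this step every value in features is a number, so the rows could be loaded into whatever tabular structure your library expects. Whether you need the step at all depends on how much of it the library already does for you.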