How big of a Dataset can ELKI handle?

I have 100,000 points that I would like to cluster using the OPTICS algorithm in ELKI. I have an upper triangular distance matrix of about 5 billion entries for this point set. In the format that ELKI wants the matrix, it will take about 100 GB in memory. Can ELKI handle that sort of data load? Can anyone confirm that they have made this work before?
I frequently use ELKI with 100k points, up to 10 million.
However, for this to be fast you should use indexes.
For obvious reasons, any dense matrix-based approach will scale at best O(n^2) and needs O(n^2) memory. This is why I cannot process these data sets with R, Weka, or scipy: they usually try to compute the full distance matrix first, and either fail halfway through, run out of memory, or fail with a negative allocation size (Weka does this when the full n*n matrix overflows the 2^31 positive integers, i.e. at around 46k objects).

In the binary format, with float precision, the ELKI matrix should be around 100000*99999/2*4 + 4 bytes, maybe plus another 4 bytes for size information. This is 20 GB. If you use the "easy to use" ASCII format, it will indeed be larger, but gzip compression may bring it down to about the same size; it is common for gzip to compress such data to 10-20% of the raw size, and in my experience gzipped ASCII can be as small as binary-encoded doubles. The main benefit of the binary format is that it actually resides on disk, and memory caching is handled by your operating system.

Either way, I recommend not computing distance matrices at all in the first place: if you go from 100k to 1 million points, the raw matrix grows to 2 TB, and at 10 million points it will be 200 TB. If you want double precision, double that.
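To make the scaling concrete, here is a small back-of-the-envelope calculation in Java; it merely restates the arithmetic above and is not ELKI-specific:

    // Size of a dense upper-triangular distance matrix, excluding the
    // diagonal: n*(n-1)/2 entries at 4 bytes (float) or 8 bytes (double).
    public class DistanceMatrixSize {
        public static void main(String[] args) {
            long[] sizes = {100_000L, 1_000_000L, 10_000_000L};
            for (long n : sizes) {
                long entries = n * (n - 1) / 2;
                System.out.printf("n=%,d: %,d entries, %.1f GB as float, %.1f GB as double%n",
                        n, entries, entries * 4 / 1e9, entries * 8 / 1e9);
            }
        }
    }

For n = 100,000 this prints 4,999,950,000 entries and 20.0 GB at float precision, matching the figures above.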
If you are using distance matrices, your method will be at best O(n^2), and thus will not scale. Avoiding the computation of all pairwise distances in the first place is an important speed factor.

I use indexes for everything. For kNN or radius-bound approaches (for OPTICS, use the epsilon parameter to make the indexes effective; choose a low epsilon!), you can precompute these queries once if you are going to need them repeatedly.
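As an illustration, here is a minimal sketch of running OPTICS through the Java API, written against the ELKI 0.7.x class names; package layout and constructors have changed between releases, so treat the exact signatures as assumptions and check the documentation of your version:

    // Sketch for ELKI 0.7.x; class and package names differ in other releases.
    import de.lmu.ifi.dbs.elki.algorithm.clustering.optics.ClusterOrder;
    import de.lmu.ifi.dbs.elki.algorithm.clustering.optics.OPTICSHeap;
    import de.lmu.ifi.dbs.elki.data.DoubleVector;
    import de.lmu.ifi.dbs.elki.database.Database;
    import de.lmu.ifi.dbs.elki.database.StaticArrayDatabase;
    import de.lmu.ifi.dbs.elki.datasource.ArrayAdapterDatabaseConnection;
    import de.lmu.ifi.dbs.elki.distance.distancefunction.minkowski.EuclideanDistanceFunction;

    public class OpticsSketch {
        public static void main(String[] args) {
            // Toy data; replace with your real 100k points.
            double[][] data = new double[1000][2];
            java.util.Random rnd = new java.util.Random(0);
            for (double[] row : data) {
                row[0] = rnd.nextDouble();
                row[1] = rnd.nextDouble();
            }
            // The second argument takes index factories; null means no index.
            // Pass e.g. an R*-tree or cover tree factory here (or use the
            // -db.index command line option) to accelerate the range queries.
            Database db = new StaticArrayDatabase(new ArrayAdapterDatabaseConnection(data), null);
            db.initialize();
            // A low epsilon bounds the range queries, which is what makes
            // the index effective for OPTICS.
            OPTICSHeap<DoubleVector> optics = new OPTICSHeap<>(EuclideanDistanceFunction.STATIC, 0.1, 10);
            ClusterOrder order = optics.run(db);
            // 'order' is the OPTICS cluster order; run OPTICSXi on top of it
            // to extract clusters from the reachability plot.
        }
    }

On the command line, the corresponding parameters are, to my recollection, -optics.epsilon and -optics.minpts.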
On a data set I frequently use, with 75k instances and 27 dimensions, the file storing the precomputed 101 nearest neighbors + ties, with double precision, is 81 MB (note: this can be seen as a sparse similarity matrix). Using an index to precompute this cache takes just a few minutes; then I can run most kNN-based algorithms, such as LOF, on this 75k data set in 108 ms (plus 262 ms for loading the kNN cache and 2364 ms for parsing the raw input data, for a total runtime of about 3 seconds, dominated by parsing the double values).