So here's the situation:
I have millions, possibly billions, of strings that I am trying to parse and put into a sorted structure, lets say I have 5,000,000 strings. I'm trying to write a fast program that can put all of these strings from an unsorted vector into an ordered data structure that can also search the structure fast, thus the reasoning for the AVL tree (which eventually I plan to use a hash table of a-z for even faster lookup, but that comes later). I get all of the strings into a vector first, but they are all jumbled up, unsorted and different lengths. I don't want any repeated strings in my tree, so if the program finds the strings "hello" and "hello" it would only have one AVL entry and would increment an integer holder for the frequency that this string has appeared.
So my question is this: would it be faster to sort the vector first (using something fast like a multi-threaded quicksort or something else) and then input it into the AVL tree, after all the words are sorted together with other same words, OR is it faster to just put all the data from the unsorted vector into the AVL tree, and continously checking the AVL tree for whether or not a word already exists, then incrementing it.
So to picture it in an order of operations here are the two cases:
CASE A:
> Get input/parse strings
> Put strings into vector (unsorted)
> Put vector into array or linked-list
> Quicksort that array/llist
> Input that sorted array into the AVL Tree
CASE B:
> Get input/parse strings
> Put strings into vector (unsorted)
> Insert vector data into AVL tree
> During insertion, check if there are duplicate words, if so, increment the counter
Which case is faster??
--EDIT-- So after hearing some of the comments, inserting a sorted array into an AVL tree from the beginning would be a bad idea, which makes sense due to how many rotations would be made. It seems that directly inserting into the AVL tree is probably a good idea, but what is the best way to efficiently insert when a word is already in the tree somewhere? How can I make sure that I find it? Is that where my sorting can come in?
Think of the way balancing works for AVL trees. It works best if the "middle values" come first. With a sorted input, you will need a lot of re-balancing, thus pre-sorting will probably do more harm than good.
For example, consider the following AVL tree holding the values 1-6:
If the input order is
4, 2, 5, 1, 3, 6
, you'll never need to balance the tree. In contrast, for a sorted input1, 2, 3, 4, 5, 6
, you'll need many re-balancing operations:UPDATE The original question was whether sorting data before inserting into an AVL tree would improve performance. Now the OP edited the question, shifting towards his specific problem.
The whole point of an AVL tree is to efficiently find data, so I don't understand the question. It should be obvious how to traverse the binary search tree to find a value. Why would you want to sort data for that?
Please note that binary search trees are a good data structure to store keys, but it can also manage arbitrary data associated with these keys. In your case, you want to store a count along with your keys. Therefore, you don't need a tree of words/strings, but a tree of pairs (string, integer) that represent the word and its count. For the tree order, just consider the string key, i.e. the word.
For each word to insert, look it up in the tree. If it already exists, update the word count. Otherwise, insert a new pair with a word count of one.
A final note: The C++ standard library comes with a
map
type that is usually (always?) implemented using a balancing tree (AVL or red-black). You'll save yourself a lot of work and bug-fixing by just using this implementation. Since C++11 there is also ununordered_map
, usually (always?) implemented using a hash table.