Explanation on Radix Sort Algorithm

Question

Explanation on Radix Sort Algorithm

934 Views Asked by Yasmin Jones At 01 October 2020 at 03:31

I'm new to progamming.I was searching for a radix sort implementation in c++ and i found this code here.

void countSort(string a[], int size, size_t k)
{
    string *b = NULL; int *c = NULL;
    b = new string[size];
    c = new int[257];

    for (int i = 0; i <257; i++){
        c[i] = 0;   
    }

    for (int j = 0; j <size; j++){   
        c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;
        //a[j] is a string
    }

    for (int f = 1; f <257; f++){
        c[f] += c[f - 1];
    }

    for (int r = size - 1; r >= 0; r--){
        b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];
        c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0]--;
    }

    for (int l = 0; l < size; l++){
        a[l] = b[l];
    }

    // avold memory leak
    delete[] b;
    delete[] c;
}

void radixSort(string b[], int r)
{
    size_t max = getMax(b, r);
    for (size_t digit = max; digit > 0; digit--){ 
        countSort(b, r, digit - 1);
    }
}

So my question is what that lines do:

c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;
b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];
c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0]--;

And is that a MSD or LSD radix Sort ?

Thanks.

Original Q&A

There are 1 best solutions below

**Scheff's Cat** · Accepted Answer · 2020-10-01T06:05:10.217000

This is a neat example of unnecessary compact and, hence, hard to read code.

To analyze it it helps to separate it a bit:

// what a mess...
c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;

First the argument of cs subscription is taken out:

// determine index for c
const int iC = k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0;
// post-increment c (as it is it could become a pre-increment as well)
c[iC]++;

The index calculation contains a condition:

// determine index for c
const int iC
  // check whether k is (not) exceeding the size of a
  = k < a[j].size()
  // then 
  ? (int)(unsigned char)a[j][k] + 1
  // else
  : 0;

The array a is an array of std::strings where std::string contains itself an array of char. So, a[j][k] results in a single char. A char might be signed or unsigned – that's left to the compiler. So, (unsigned char)a[j][k] doesn't change the bits of that char but interprets them as unsigned number. Then (int)(unsigned char)a[j][k] promotes that to an int.

Please, note that this may be different to (int)a[j][k] if the current compiler has signed chars because in that case the possible sign of a value would be kept. (This is called sign extension.) So, the whole thing is just responsible to convert the current character into a (positive) index and adds 1 finally.

Actually, I intended to leave the rest as practice for the reader but then I saw this:

b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];

Separating it like above, it results in:

const int iC = k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0;
const int iB = c[iC - 1]; // What?
b[iB] = a[r];

Considering that iC might result in 0 (while I didn't check the whole code whether this is possible at all), iC - 1 might result in -1. So, c[-1] would be accessed.

This might be correct if e.g. c where a pointer into a bigger array but not at its begin. So, a negative index would access valid storage. This seems not to be the case here:

c = new int[257];

and I couldn't see any other assignment to c.

This all doesn't look much trustworthy. At best, the condition is over-pessimistic and 0 is never assigned.

I'm quite sure that I could demonstrate that less compact code may improve readability if not help to easier uncover possible issues in it.

So, is the non-compact code slower? According to my experience, it is not on modern compilers with its amazing optimization capabilities.

I once read an article about optimization and Static single assignment form. As well I see all the funny $$ variables in Visual Studios debugger watch window from time to time when I debug my C++ code (which definitely doesn't contain any variable named $$). So, I believe the compiler will do internally something similar as well. – Doing this explicitly to improve readability shouldn't have the least impact on to performance.

If I really would be in doubt I still could check the assembler output. (Compiler Explorer is a good place for example.)

Btw. c = new int[257];?

Why not int c[257];?

257 int values are not that much that I would afraid to exceed the stack size immediately.

Not to mention that, arrays and especially arrays allocated with new are really bad C++ style and asking for U.B.. As if std::vector was not yet invented…

I somehow missed the lessons about Radix sorting when I was a student (while I must admit that I didn't yet miss this knowledge in daily business). So, out of curiosity, I had a look into Wikipedia and re-implemented the descriptions found there. This is intended to provide a (hopefully better) replacement for what OP found and exposed in the question.

Thereby, I implemented

a naïve approach according to the description on en.wikipedia.org: Radix sort – History
and then OPs shown approach (with counting sort) which I found on de.wikipedia.org: Countingsort – Algorithmus.

#include <iostream>
#include <sstream>
#include <string>
#include <vector>

/* helper to find max. length in data strings
 */
size_t maxLength(const std::vector<std::string> &data)
{
  size_t lenMax = 0;
  for (const std::string &value : data) {
    if (lenMax < value.size()) lenMax = value.size();
  }
  return lenMax;
}

/* a naive implementation of radix sort
 * like described in https://en.wikipedia.org/wiki/Radix_sort
 */
void radixSort(std::vector<std::string> &data)
{
  /* A char has 8 bits - which encode (unsigned) the numbers of [0, 255].
   * Hence, 256 buckets are used for sorting.
   */
  std::vector<std::string> buckets[256];
  // determine max. length of input data:
  const size_t len = maxLength(data);
  /* iterate over data for according to max. length
   */
  for (size_t i = len; i--;) { // i-- -> check for 0 and post-decrement
    // sort data into buckets according to the current "digit":
    for (std::string &value : data) {
      /* digits after end of string are considered as '\0'
       * because 0 is the usual end-marker of C strings
       * and the least possible value of an unsigned char.
       * This shall ensure that an string goes before a longer
       * string with same prefix.
       */
      const unsigned char digit = i < value.size() ? value[i] : '\0';
      // move current string into the corresponding bucket
      buckets[digit].push_back(std::move(value));
    }
    // store buckets back into data (preserving current order)
    data.clear();
    for (std::vector<std::string> &bucket : buckets) {
      // append bucket to the data
      data.insert(data.end(),
        std::make_move_iterator(bucket.begin()),
        std::make_move_iterator(bucket.end()));
      bucket.clear();
    }
  }
}

/* counting sort as helper for the not so naive radix sort
 */
void countSort(std::vector<std::string> &data, size_t i)
{
  /* There are 256 possible values for an unsigned char
   * (which may have a value in [0, 255]).
   */
  size_t counts[256] = { 0 }; // initialize all counters with 0.
  // count how often a certain charater appears at the place i
  for (const std::string &value : data) {
    /* digits after end of string are considered as '\0'
     * because 0 is the usual end-marker of C strings
     * and the least possible value of an unsigned char.
     * This shall ensure that an string goes before a longer
     * string with same prefix.
     */
    const unsigned char digit = i < value.size() ? value[i] : '\0';
    // count the resp. bucket counter
    ++counts[digit];
  }
  // turn counts of digits into offsets in data
  size_t total = 0;
  for (size_t &count : counts) {
#if 0 // could be compact (and, maybe, confusing):
    total = count += total; // as C++ assignment is right-associative
#else // but is the same as:
    count += total; // add previous total sum to count
    total = count; // remember new total
#endif // 0
  }
  // an auxiliary buffer to sort the input data into.
  std::vector<std::string> buffer(data.size());
  /* Move input into aux. buffer
   * while using the bucket offsets (the former counts)
   * for addressing of new positions.
   * This is done backwards intentionally as the offsets
   * are decremented from end to begin of partitions.
   */
  for (size_t j = data.size(); j--;) { // j-- -> check for 0 and post-decrement
    std::string &value = data[j];
    // see comment for digit above...
    const unsigned char digit = i < value.size() ? value[i] : '\0';
    /* decrement offset and use as index
     * Arrays (and vectors) in C++ are 0-based.
     * Hence, this is adjusted respectively (compared to the source of algorithm).
     */
    const size_t k = --counts[digit];
    // move input element into auxiliary buffer at the determined offset
    buffer[k] = std::move(value);
  }
  /* That's it.
   * Move aux. buffer back into data.
   */
  data = std::move(buffer);
}

/* radix sort using count sort internally
 */
void radixCountSort(std::vector<std::string> &data)
{
  // determine max. length of input data:
  const size_t len = maxLength(data);
  /* iterate over data according to max. length
   */
  for (size_t i = len; i--;) { // i-- -> check for 0 and post-decrement
    countSort(data, i);
  }
}

/* output of vector with strings
 */
std::ostream& operator<<(std::ostream &out, const std::vector<std::string> &data)
{
  const char *sep = " ";
  for (const std::string &value : data) {
    out << sep << '"' << value << '"';
    sep = ", ";
  }
  return out;
}

/* do a test for certain data
 */
void test(const std::vector<std::string> &data)
{
  std::cout << "Data: {" << data << " }\n";
  std::vector<std::string> data1 = data;
  radixSort(data1);
  std::cout << "Radix Sorted:       {" << data1 << " }\n";
  std::vector<std::string> data2 = data;
  radixCountSort(data2);
  std::cout << "Radix Count Sorted: {" << data2 << " }\n";
}

/* helper to turn a text into a vector of strings
 * (by separating at white spaces)
 */
std::vector<std::string> tokenize(const char *text)
{
  std::istringstream in(text);
  std::vector<std::string> tokens;
  for (std::string token; in >> token;) tokens.push_back(token);
  return tokens;
}

/* main program
 */
int main()
{
  // do some tests:
  test({ "Hi", "He", "Hello", "World", "Wide", "Web" });
  test({ });
  test(
    tokenize(
      "Radix sort dates back as far as 1887 to the work of Herman Hollerith on tabulating machines.\n"
      "Radix sorting algorithms came into common use as a way to sort punched cards as early as 1923.\n"
      "The first memory-efficient computer algorithm was developed in 1954 at MIT by Harold H. Seward.\n"
      "Computerized radix sorts had previously been dismissed as impractical "
      "because of the perceived need for variable allocation of buckets of unknown size.\n"
      "Seward's innovation was to use a linear scan to determine the required bucket sizes and offsets beforehand, "
      "allowing for a single static allocation of auxiliary memory.\n"
      "The linear scan is closely related to Seward's other algorithm - counting sort."));
}

Output:

Data: { "Hi", "He", "Hello", "World", "Wide", "Web" }
Radix Sorted:       { "He", "Hello", "Hi", "Web", "Wide", "World" }
Radix Count Sorted: { "He", "Hello", "Hi", "Web", "Wide", "World" }
Data: { }
Radix Sorted:       { }
Radix Count Sorted: { }
Data: { "Radix", "sort", "dates", "back", "as", "far", "as", "1887", "to", "the", "work", "of", "Herman", "Hollerith", "on", "tabulating", "machines.", "Radix", "sorting", "algorithms", "came", "into", "common", "use", "as", "a", "way", "to", "sort", "punched", "cards", "as", "early", "as", "1923.", "The", "first", "memory-efficient", "computer", "algorithm", "was", "developed", "in", "1954", "at", "MIT", "by", "Harold", "H.", "Seward.", "Computerized", "radix", "sorts", "had", "previously", "been", "dismissed", "as", "impractical", "because", "of", "the", "perceived", "need", "for", "variable", "allocation", "of", "buckets", "of", "unknown", "size.", "Seward's", "innovation", "was", "to", "use", "a", "linear", "scan", "to", "determine", "the", "required", "bucket", "sizes", "and", "offsets", "beforehand,", "allowing", "for", "a", "single", "static", "allocation", "of", "auxiliary", "memory.", "The", "linear", "scan", "is", "closely", "related", "to", "Seward's", "other", "algorithm", "-", "counting", "sort." }
Radix Sorted:       { "-", "1887", "1923.", "1954", "Computerized", "H.", "Harold", "Herman", "Hollerith", "MIT", "Radix", "Radix", "Seward's", "Seward's", "Seward.", "The", "The", "a", "a", "a", "algorithm", "algorithm", "algorithms", "allocation", "allocation", "allowing", "and", "as", "as", "as", "as", "as", "as", "at", "auxiliary", "back", "because", "been", "beforehand,", "bucket", "buckets", "by", "came", "cards", "closely", "common", "computer", "counting", "dates", "determine", "developed", "dismissed", "early", "far", "first", "for", "for", "had", "impractical", "in", "innovation", "into", "is", "linear", "linear", "machines.", "memory-efficient", "memory.", "need", "of", "of", "of", "of", "of", "offsets", "on", "other", "perceived", "previously", "punched", "radix", "related", "required", "scan", "scan", "single", "size.", "sizes", "sort", "sort", "sort.", "sorting", "sorts", "static", "tabulating", "the", "the", "the", "to", "to", "to", "to", "to", "unknown", "use", "use", "variable", "was", "was", "way", "work" }
Radix Count Sorted: { "-", "1887", "1923.", "1954", "Computerized", "H.", "Harold", "Herman", "Hollerith", "MIT", "Radix", "Radix", "Seward's", "Seward's", "Seward.", "The", "The", "a", "a", "a", "algorithm", "algorithm", "algorithms", "allocation", "allocation", "allowing", "and", "as", "as", "as", "as", "as", "as", "at", "auxiliary", "back", "because", "been", "beforehand,", "bucket", "buckets", "by", "came", "cards", "closely", "common", "computer", "counting", "dates", "determine", "developed", "dismissed", "early", "far", "first", "for", "for", "had", "impractical", "in", "innovation", "into", "is", "linear", "linear", "machines.", "memory-efficient", "memory.", "need", "of", "of", "of", "of", "of", "offsets", "on", "other", "perceived", "previously", "punched", "radix", "related", "required", "scan", "scan", "single", "size.", "sizes", "sort", "sort", "sort.", "sorting", "sorts", "static", "tabulating", "the", "the", "the", "to", "to", "to", "to", "to", "unknown", "use", "use", "variable", "was", "was", "way", "work" }

Live Demo on coliru

Please, note that the strings are sorted interpreting the numerical values of the characters. If instead an English-dictionary sorting would be intended than the digit to bucket mapping had to be modified. Thereby, the order of character values might be changed as well as mapping corresponding uppercase and lowercase characters to the same bucket.

Frequently copying strings (or other containers) is space and time consuming and something, I would prevent at best in productive code. The move semantics is one option to lower the stress for the CPU while keeping the code quite clean and comparable to the algorithm behind. This is what I tried to consider (to my best knowledge) in the sample code.

Explanation on Radix Sort Algorithm

There are 1 best solutions below

Related Questions in C++

Related Questions in SORTING

Related Questions in RADIX-SORT

Trending Questions

Popular # Hahtags

Popular Questions