Most votes on algorithm questions 2

Most votes on algorithm questions 2. #11 Ukkonen's suffix tree algorithm in plain English #12 Calculate distance between two latitude-longitude points? (Haversine formula) #13 How to find time complexity of an algorithm #14 Big O, how do you calculate/approximate it? #15 How to count the number of set bits in a 32-bit integer? #16 What is tail call optimization? #17 How do I determine whether my calculation of pi is accurate? #18 Sorting 1 million 8-decimal-digit numbers with 1 MB of RAM #19 Expand a random range from 1–5 to 1–7 #20 How do I create a URL shortener?

Read all the top votes questions and answers in a single page.

#11: Ukkonen's suffix tree algorithm in plain English (Score: 1151)

Created: 2012-02-26 Last updated: 2020-07-29

Tags: string, algorithm, data-structures, language-agnostic, suffix-tree

I feel a bit thick at this point. I’ve spent days trying to fully wrap my head around suffix tree construction, but because I don’t have a mathematical background, many of the explanations elude me as they start to make excessive use of mathematical symbology. The closest to a good explanation that I’ve found is Fast String Searching With Suffix Trees, but he glosses over various points and some aspects of the algorithm remain unclear.

A step-by-step explanation of this algorithm here on Stack Overflow would be invaluable for many others besides me, I’m sure.

For reference, here’s Ukkonen’s paper on the algorithm:

My basic understanding, so far:

  • I need to iterate through each prefix P of a given string T
  • I need to iterate through each suffix S in prefix P and add that to tree
  • To add suffix S to the tree, I need to iterate through each character in S, with the iterations consisting of either walking down an existing branch that starts with the same set of characters C in S and potentially splitting an edge into descendent nodes when I reach a differing character in the suffix, OR if there was no matching edge to walk down. When no matching edge is found to walk down for C, a new leaf edge is created for C.

The basic algorithm appears to be O(n2), as is pointed out in most explanations, as we need to step through all of the prefixes, then we need to step through each of the suffixes for each prefix. Ukkonen’s algorithm is apparently unique because of the suffix pointer technique he uses, though I think that is what I’m having trouble understanding.

I’m also having trouble understanding:

  • exactly when and how the “active point” is assigned, used and changed
  • what is going on with the canonization aspect of the algorithm
  • Why the implementations I’ve seen need to “fix” bounding variables that they are using

Here is the completed C# source code. It not only works correctly, but supports automatic canonization and renders a nicer looking text graph of the output. Source code and sample output is at:

Update 2017-11-04

After many years I’ve found a new use for suffix trees, and have implemented the algorithm in JavaScript. Gist is below. It should be bug-free. Dump it into a js file, npm install chalk from the same location, and then run with node.js to see some colourful output. There’s a stripped down version in the same Gist, without any of the debugging code.

#11 Best answer 1 of Ukkonen's suffix tree algorithm in plain English (Score: 2444)

Created: 2012-03-01 Last updated: 2019-11-08

The following is an attempt to describe the Ukkonen algorithm by first showing what it does when the string is simple (i.e. does not contain any repeated characters), and then extending it to the full algorithm.

First, a few preliminary statements.

  1. What we are building, is basically like a search trie. So there is a root node, edges going out of it leading to new nodes, and further edges going out of those, and so forth

  2. But: Unlike in a search trie, the edge labels are not single characters. Instead, each edge is labeled using a pair of integers [from,to]. These are pointers into the text. In this sense, each edge carries a string label of arbitrary length, but takes only O(1) space (two pointers).

Basic principle

I would like to first demonstrate how to create the suffix tree of a particularly simple string, a string with no repeated characters:


The algorithm works in steps, from left to right. There is one step for every character of the string. Each step might involve more than one individual operation, but we will see (see the final observations at the end) that the total number of operations is O(n).

So, we start from the left, and first insert only the single character a by creating an edge from the root node (on the left) to a leaf, and labeling it as [0,#], which means the edge represents the substring starting at position 0 and ending at the current end. I use the symbol # to mean the current end, which is at position 1 (right after a).

So we have an initial tree, which looks like this:

And what it means is this:

Now we progress to position 2 (right after b). Our goal at each step is to insert all suffixes up to the current position. We do this by

  • expanding the existing a-edge to ab
  • inserting one new edge for b

In our representation this looks like

enter image description here

And what it means is:

We observe two things:

  • The edge representation for ab is the same as it used to be in the initial tree: [0,#]. Its meaning has automatically changed because we updated the current position # from 1 to 2.
  • Each edge consumes O(1) space, because it consists of only two pointers into the text, regardless of how many characters it represents.

Next we increment the position again and update the tree by appending a c to every existing edge and inserting one new edge for the new suffix c.

In our representation this looks like

And what it means is:

We observe:

  • The tree is the correct suffix tree up to the current position after each step
  • There are as many steps as there are characters in the text
  • The amount of work in each step is O(1), because all existing edges are updated automatically by incrementing #, and inserting the one new edge for the final character can be done in O(1) time. Hence for a string of length n, only O(n) time is required.

First extension: Simple repetitions

Of course this works so nicely only because our string does not contain any repetitions. We now look at a more realistic string:


It starts with abc as in the previous example, then ab is repeated and followed by x, and then abc is repeated followed by d.

Steps 1 through 3: After the first 3 steps we have the tree from the previous example:

Step 4: We move # to position 4. This implicitly updates all existing edges to this:

and we need to insert the final suffix of the current step, a, at the root.

Before we do this, we introduce two more variables (in addition to #), which of course have been there all the time but we haven’t used them so far:

  • The active point, which is a triple (active_node,active_edge,active_length)
  • The remainder, which is an integer indicating how many new suffixes we need to insert

The exact meaning of these two will become clear soon, but for now let’s just say:

  • In the simple abc example, the active point was always (root,'\0x',0), i.e. active_node was the root node, active_edge was specified as the null character '\0x', and active_length was zero. The effect of this was that the one new edge that we inserted in every step was inserted at the root node as a freshly created edge. We will see soon why a triple is necessary to represent this information.
  • The remainder was always set to 1 at the beginning of each step. The meaning of this was that the number of suffixes we had to actively insert at the end of each step was 1 (always just the final character).

Now this is going to change. When we insert the current final character a at the root, we notice that there is already an outgoing edge starting with a, specifically: abca. Here is what we do in such a case:

  • We do not insert a fresh edge [4,#] at the root node. Instead we simply notice that the suffix a is already in our tree. It ends in the middle of a longer edge, but we are not bothered by that. We just leave things the way they are.
  • We set the active point to (root,'a',1). That means the active point is now somewhere in the middle of outgoing edge of the root node that starts with a, specifically, after position 1 on that edge. We notice that the edge is specified simply by its first character a. That suffices because there can be only one edge starting with any particular character (confirm that this is true after reading through the entire description).
  • We also increment remainder, so at the beginning of the next step it will be 2.

Observation: When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder). The tree is then not an accurate representation of the suffix tree up to the current position any more, but it contains all suffixes (because the final suffix a is contained implicitly). Hence, apart from updating the variables (which are all of fixed length, so this is O(1)), there was no work done in this step.

Step 5: We update the current position # to 5. This automatically updates the tree to this:

And because remainder is 2, we need to insert two final suffixes of the current position: ab and b. This is basically because:

  • The a suffix from the previous step has never been properly inserted. So it has remained, and since we have progressed one step, it has now grown from a to ab.
  • And we need to insert the new final edge b.

In practice this means that we go to the active point (which points to behind the a on what is now the abcab edge), and insert the current final character b. But: Again, it turns out that b is also already present on that same edge.

So, again, we do not change the tree. We simply:

  • Update the active point to (root,'a',2) (same node and edge as before, but now we point to behind the b)
  • Increment the remainder to 3 because we still have not properly inserted the final edge from the previous step, and we don’t insert the current final edge either.

To be clear: We had to insert ab and b in the current step, but because ab was already found, we updated the active point and did not even attempt to insert b. Why? Because if ab is in the tree, every suffix of it (including b) must be in the tree, too. Perhaps only implicitly, but it must be there, because of the way we have built the tree so far.

We proceed to step 6 by incrementing #. The tree is automatically updated to:

Because remainder is 3, we have to insert abx, bx and x. The active point tells us where ab ends, so we only need to jump there and insert the x. Indeed, x is not there yet, so we split the abcabx edge and insert an internal node:

The edge representations are still pointers into the text, so splitting and inserting an internal node can be done in O(1) time.

So we have dealt with abx and decrement remainder to 2. Now we need to insert the next remaining suffix, bx. But before we do that we need to update the active point. The rule for this, after splitting and inserting an edge, will be called Rule 1 below, and it applies whenever the active_node is root (we will learn rule 3 for other cases further below). Here is rule 1:

After an insertion from root,

  • active_node remains root
  • active_edge is set to the first character of the new suffix we need to insert, i.e. b
  • active_length is reduced by 1

Hence, the new active-point triple (root,'b',1) indicates that the next insert has to be made at the bcabx edge, behind 1 character, i.e. behind b. We can identify the insertion point in O(1) time and check whether x is already present or not. If it was present, we would end the current step and leave everything the way it is. But x is not present, so we insert it by splitting the edge:

Again, this took O(1) time and we update remainder to 1 and the active point to (root,'x',0) as rule 1 states.

But there is one more thing we need to do. We’ll call this Rule 2:

If we split an edge and insert a new node, and if that is not the first node created during the current step, we connect the previously inserted node and the new node through a special pointer, a suffix link. We will later see why that is useful. Here is what we get, the suffix link is represented as a dotted edge:

We still need to insert the final suffix of the current step, x. Since the active_length component of the active node has fallen to 0, the final insert is made at the root directly. Since there is no outgoing edge at the root node starting with x, we insert a new edge:

As we can see, in the current step all remaining inserts were made.

We proceed to step 7 by setting #=7, which automatically appends the next character, a, to all leaf edges, as always. Then we attempt to insert the new final character to the active point (the root), and find that it is there already. So we end the current step without inserting anything and update the active point to (root,'a',1).

In step 8, #=8, we append b, and as seen before, this only means we update the active point to (root,'a',2) and increment remainder without doing anything else, because b is already present. However, we notice (in O(1) time) that the active point is now at the end of an edge. We reflect this by re-setting it to (node1,'\0x',0). Here, I use node1 to refer to the internal node the ab edge ends at.

Then, in step #=9, we need to insert ‘c’ and this will help us to understand the final trick:

As always, the # update appends c automatically to the leaf edges and we go to the active point to see if we can insert ‘c’. It turns out ‘c’ exists already at that edge, so we set the active point to (node1,'c',1), increment remainder and do nothing else.

Now in step #=10, remainder is 4, and so we first need to insert abcd (which remains from 3 steps ago) by inserting d at the active point.

Attempting to insert d at the active point causes an edge split in O(1) time:

The active_node, from which the split was initiated, is marked in red above. Here is the final rule, Rule 3:

After splitting an edge from an active_node that is not the root node, we follow the suffix link going out of that node, if there is any, and reset the active_node to the node it points to. If there is no suffix link, we set the active_node to the root. active_edge and active_length remain unchanged.

So the active point is now (node2,'c',1), and node2 is marked in red below:

Since the insertion of abcd is complete, we decrement remainder to 3 and consider the next remaining suffix of the current step, bcd. Rule 3 has set the active point to just the right node and edge so inserting bcd can be done by simply inserting its final character d at the active point.

Doing this causes another edge split, and because of rule 2, we must create a suffix link from the previously inserted node to the new one:

We observe: Suffix links enable us to reset the active point so we can make the next remaining insert at O(1) effort. Look at the graph above to confirm that indeed node at label ab is linked to the node at b (its suffix), and the node at abc is linked to bc.

The current step is not finished yet. remainder is now 2, and we need to follow rule 3 to reset the active point again. Since the current active_node (red above) has no suffix link, we reset to root. The active point is now (root,'c',1).

Hence the next insert occurs at the one outgoing edge of the root node whose label starts with c: cabxabcd, behind the first character, i.e. behind c. This causes another split:

And since this involves the creation of a new internal node,we follow rule 2 and set a new suffix link from the previously created internal node:

(I am using Graphviz Dot for these little graphs. The new suffix link caused dot to re-arrange the existing edges, so check carefully to confirm that the only thing that was inserted above is a new suffix link.)

With this, remainder can be set to 1 and since the active_node is root, we use rule 1 to update the active point to (root,'d',0). This means the final insert of the current step is to insert a single d at root:

That was the final step and we are done. There are number of final observations, though:

  • In each step we move # forward by 1 position. This automatically updates all leaf nodes in O(1) time.

  • But it does not deal with a) any suffixes remaining from previous steps, and b) with the one final character of the current step.

  • remainder tells us how many additional inserts we need to make. These inserts correspond one-to-one to the final suffixes of the string that ends at the current position #. We consider one after the other and make the insert. Important: Each insert is done in O(1) time since the active point tells us exactly where to go, and we need to add only one single character at the active point. Why? Because the other characters are contained implicitly (otherwise the active point would not be where it is).

  • After each such insert, we decrement remainder and follow the suffix link if there is any. If not we go to root (rule 3). If we are at root already, we modify the active point using rule 1. In any case, it takes only O(1) time.

  • If, during one of these inserts, we find that the character we want to insert is already there, we don’t do anything and end the current step, even if remainder>0. The reason is that any inserts that remain will be suffixes of the one we just tried to make. Hence they are all implicit in the current tree. The fact that remainder>0 makes sure we deal with the remaining suffixes later.

  • What if at the end of the algorithm remainder>0? This will be the case whenever the end of the text is a substring that occurred somewhere before. In that case we must append one extra character at the end of the string that has not occurred before. In the literature, usually the dollar sign $ is used as a symbol for that. Why does that matter? –> If later we use the completed suffix tree to search for suffixes, we must accept matches only if they end at a leaf. Otherwise we would get a lot of spurious matches, because there are many strings implicitly contained in the tree that are not actual suffixes of the main string. Forcing remainder to be 0 at the end is essentially a way to ensure that all suffixes end at a leaf node. However, if we want to use the tree to search for general substrings, not only suffixes of the main string, this final step is indeed not required, as suggested by the OP’s comment below.

  • So what is the complexity of the entire algorithm? If the text is n characters in length, there are obviously n steps (or n+1 if we add the dollar sign). In each step we either do nothing (other than updating the variables), or we make remainder inserts, each taking O(1) time. Since remainder indicates how many times we have done nothing in previous steps, and is decremented for every insert that we make now, the total number of times we do something is exactly n (or n+1). Hence, the total complexity is O(n).

  • However, there is one small thing that I did not properly explain: It can happen that we follow a suffix link, update the active point, and then find that its active_length component does not work well with the new active_node. For example, consider a situation like this:

(The dashed lines indicate the rest of the tree. The dotted line is a suffix link.)

Now let the active point be (red,'d',3), so it points to the place behind the f on the defg edge. Now assume we made the necessary updates and now follow the suffix link to update the active point according to rule 3. The new active point is (green,'d',3). However, the d-edge going out of the green node is de, so it has only 2 characters. In order to find the correct active point, we obviously need to follow that edge to the blue node and reset to (blue,'f',1).

In a particularly bad case, the active_length could be as large as remainder, which can be as large as n. And it might very well happen that to find the correct active point, we need not only jump over one internal node, but perhaps many, up to n in the worst case. Does that mean the algorithm has a hidden O(n2) complexity, because in each step remainder is generally O(n), and the post-adjustments to the active node after following a suffix link could be O(n), too?

No. The reason is that if indeed we have to adjust the active point (e.g. from green to blue as above), that brings us to a new node that has its own suffix link, and active_length will be reduced. As we follow down the chain of suffix links we make the remaining inserts, active_length can only decrease, and the number of active-point adjustments we can make on the way can’t be larger than active_length at any given time. Since active_length can never be larger than remainder, and remainder is O(n) not only in every single step, but the total sum of increments ever made to remainder over the course of the entire process is O(n) too, the number of active point adjustments is also bounded by O(n).

#11 Best answer 2 of Ukkonen's suffix tree algorithm in plain English(Score: 134)

Created: 2013-01-29 Last updated: 2016-10-10

I tried to implement the Suffix Tree with the approach given in jogojapan’s answer, but it didn’t work for some cases due to wording used for the rules. Moreover, I’ve mentioned that nobody managed to implement an absolutely correct suffix tree using this approach. Below I will write an “overview” of jogojapan’s answer with some modifications to the rules. I will also describe the case when we forget to create important suffix links.

Additional variables used

  1. active point - a triple (active_node; active_edge; active_length), showing from where we must start inserting a new suffix.
  2. remainder - shows the number of suffixes we must add explicitly. For instance, if our word is ‘abcaabca’, and remainder = 3, it means we must process 3 last suffixes: bca, ca and a.

Let’s use a concept of an internal node - all the nodes, except the root and the leafs are internal nodes.

Observation 1

When the final suffix we need to insert is found to exist in the tree already, the tree itself is not changed at all (we only update the active point and remainder).

Observation 2

If at some point active_length is greater or equal to the length of current edge (edge_length), we move our active point down until edge_length is strictly greater than active_length.

Now, let’s redefine the rules:

Rule 1

If after an insertion from the active node = root, the active length is greater than 0, then:

  1. active node is not changed
  2. active length is decremented
  3. active edge is shifted right (to the first character of the next suffix we must insert)

Rule 2

If we create a new internal node OR make an inserter from an internal node, and this is not the first SUCH internal node at current step, then we link the previous SUCH node with THIS one through a suffix link.

This definition of the Rule 2 is different from jogojapan', as here we take into account not only the newly created internal nodes, but also the internal nodes, from which we make an insertion.

Rule 3

After an insert from the active node which is not the root node, we must follow the suffix link and set the active node to the node it points to. If there is no a suffix link, set the active node to the root node. Either way, active edge and active length stay unchanged.

In this definition of Rule 3 we also consider the inserts of leaf nodes (not only split-nodes).

And finally, Observation 3:

When the symbol we want to add to the tree is already on the edge, we, according to Observation 1, update only active point and remainder, leaving the tree unchanged. BUT if there is an internal node marked as needing suffix link, we must connect that node with our current active node through a suffix link.

Let’s look at the example of a suffix tree for cdddcdc if we add a suffix link in such case and if we don’t:

  1. If we DON’T connect the nodes through a suffix link:

    • before adding the last letter c:

    • after adding the last letter c:

  2. If we DO connect the nodes through a suffix link:

    • before adding the last letter c:

- after adding the last letter **c**:

Seems like there is no significant difference: in the second case there are two more suffix links. But these suffix links are correct, and one of them - from the blue node to the red one - is very important for our approach with active point. The problem is that if we don’t put a suffix link here, later, when we add some new letters to the tree, we might omit adding some nodes to the tree due to the Rule 3, because, according to it, if there’s no a suffix link, then we must put the active_node to the root.

When we were adding the last letter to the tree, the red node had already existed before we made an insert from the blue node (the edge labled ‘c’). As there was an insert from the blue node, we mark it as needing a suffix link. Then, relying on the active point approach, the active node was set to the red node. But we don’t make an insert from the red node, as the letter ‘c’ is already on the edge. Does it mean that the blue node must be left without a suffix link? No, we must connect the blue node with the red one through a suffix link. Why is it correct? Because the active point approach guarantees that we get to a right place, i.e., to the next place where we must process an insert of a shorter suffix.

Finally, here are my implementations of the Suffix Tree:

  1. Java
  2. C++

Hope that this “overview” combined with jogojapan’s detailed answer will help somebody to implement his own Suffix Tree.

See also original question in stackoverflow

#12: Calculate distance between two latitude-longitude points? (Haversine formula) (Score: 983)

Created: 2008-08-26 Last updated: 2017-09-11

Tags: algorithm, math, maps, latitude-longitude, haversine

How do I calculate the distance between two points specified by latitude and longitude?

For clarification, I’d like the distance in kilometers; the points use the WGS84 system and I’d like to understand the relative accuracies of the approaches available.

#12 Best answer 1 of Calculate distance between two latitude-longitude points? (Haversine formula) (Score: 1227)

Created: 2008-08-26 Last updated: 2014-07-31

This link might be helpful to you, as it details the use of the Haversine formula to calculate the distance.


This script [in Javascript] calculates great-circle distances between the two points – that is, the shortest distance over the earth’s surface – using the ‘Haversine’ formula.

function getDistanceFromLatLonInKm(lat1,lon1,lat2,lon2) {
  var R = 6371; // Radius of the earth in km
  var dLat = deg2rad(lat2-lat1);  // deg2rad below
  var dLon = deg2rad(lon2-lon1); 
  var a = 
    Math.sin(dLat/2) * Math.sin(dLat/2) +
    Math.cos(deg2rad(lat1)) * Math.cos(deg2rad(lat2)) * 
    Math.sin(dLon/2) * Math.sin(dLon/2)
  var c = 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1-a)); 
  var d = R * c; // Distance in km
  return d;

function deg2rad(deg) {
  return deg * (Math.PI/180)

#12 Best answer 2 of Calculate distance between two latitude-longitude points? (Haversine formula)(Score: 414)

Created: 2014-02-07 Last updated: 2020-03-13

I needed to calculate a lot of distances between the points for my project, so I went ahead and tried to optimize the code, I have found here. On average in different browsers my new implementation runs 2 times faster than the most upvoted answer.

function distance(lat1, lon1, lat2, lon2) {
  var p = 0.017453292519943295;    // Math.PI / 180
  var c = Math.cos;
  var a = 0.5 - c((lat2 - lat1) * p)/2 + 
          c(lat1 * p) * c(lat2 * p) * 
          (1 - c((lon2 - lon1) * p))/2;

  return 12742 * Math.asin(Math.sqrt(a)); // 2 * R; R = 6371 km

You can play with my jsPerf and see the results here.

Recently I needed to do the same in python, so here is a python implementation:

from math import cos, asin, sqrt, pi

def distance(lat1, lon1, lat2, lon2):
    p = pi/180
    a = 0.5 - cos((lat2-lat1)*p)/2 + cos(lat1*p) * cos(lat2*p) * (1-cos((lon2-lon1)*p))/2
    return 12742 * asin(sqrt(a)) #2*R*asin...

And for the sake of completeness: Haversine on wiki.

See also original question in stackoverflow

#13: How to find time complexity of an algorithm (Score: 952)

Created: 2012-06-14 Last updated: 2017-10-27

Tags: algorithm, time-complexity, complexity-theory

The Question

How to find time complexity of an algorithm?

What have I done before posting a question on SO ?

I have gone through this, this and many other links

But no where I was able to find a clear and straight forward explanation for how to calculate time complexity.

What do I know ?

Say for a code as simple as the one below:

char h = 'y'; // This will be executed 1 time
int abc = 0; // This will be executed 1 time

Say for a loop like the one below:

for (int i = 0; i < N; i++) {        
    Console.Write('Hello World !');

int i=0; This will be executed only once. The time is actually calculated to i=0 and not the declaration.

i < N; This will be executed N+1 times

i++ ; This will be executed N times

So the number of operations required by this loop are

{1+(N+1)+N} = 2N+2

Note: This still may be wrong, as I am not confident about my understanding on calculating time complexity

What I want to know ?

Ok, so these small basic calculations I think I know, but in most cases I have seen the time complexity as

O(N), O(n2), O(log n), O(n!)…. and many other,

Can anyone help me understand how does one calculate time complexity of an algorithm? I am sure there are plenty of newbies like me wanting to know this.

#13 Best answer 1 of How to find time complexity of an algorithm (Score: 424)

Created: 2012-06-14 Last updated: 2017-03-29

How to find time complexity of an algorithm

You add up how many machine instructions it will execute as a function of the size of its input, and then simplify the expression to the largest (when N is very large) term and can include any simplifying constant factor.

For example, lets see how we simplify 2N + 2 machine instructions to describe this as just O(N).

Why do we remove the two 2s ?

We are interested in the performance of the algorithm as N becomes large.

Consider the two terms 2N and 2.

What is the relative influence of these two terms as N becomes large? Suppose N is a million.

Then the first term is 2 million and the second term is only 2.

For this reason, we drop all but the largest terms for large N.

So, now we have gone from 2N + 2 to 2N.

Traditionally, we are only interested in performance up to constant factors.

This means that we don’t really care if there is some constant multiple of difference in performance when N is large. The unit of 2N is not well-defined in the first place anyway. So we can multiply or divide by a constant factor to get to the simplest expression.

So 2N becomes just N.

#13 Best answer 2 of How to find time complexity of an algorithm(Score: 417)

Created: 2013-01-18 Last updated: 2020-05-04

This is an excellent article :

The below answer is copied from above (in case the excellent link goes bust)

The most common metric for calculating time complexity is Big O notation. This removes all constant factors so that the running time can be estimated in relation to N as N approaches infinity. In general you can think of it like this:


Is constant. The running time of the statement will not change in relation to N.

for ( i = 0; i < N; i++ )

Is linear. The running time of the loop is directly proportional to N. When N doubles, so does the running time.

for ( i = 0; i < N; i++ ) {
  for ( j = 0; j < N; j++ )

Is quadratic. The running time of the two loops is proportional to the square of N. When N doubles, the running time increases by N * N.

while ( low <= high ) {
  mid = ( low + high ) / 2;
  if ( target < list[mid] )
    high = mid - 1;
  else if ( target > list[mid] )
    low = mid + 1;
  else break;

Is logarithmic. The running time of the algorithm is proportional to the number of times N can be divided by 2. This is because the algorithm divides the working area in half with each iteration.

void quicksort ( int list[], int left, int right )
  int pivot = partition ( list, left, right );
  quicksort ( list, left, pivot - 1 );
  quicksort ( list, pivot + 1, right );

Is N * log ( N ). The running time consists of N loops (iterative or recursive) that are logarithmic, thus the algorithm is a combination of linear and logarithmic.

In general, doing something with every item in one dimension is linear, doing something with every item in two dimensions is quadratic, and dividing the working area in half is logarithmic. There are other Big O measures such as cubic, exponential, and square root, but they’re not nearly as common. Big O notation is described as O ( <type> ) where <type> is the measure. The quicksort algorithm would be described as O ( N * log ( N ) ).

Note that none of this has taken into account best, average, and worst case measures. Each would have its own Big O notation. Also note that this is a VERY simplistic explanation. Big O is the most common, but it’s also more complex that I’ve shown. There are also other notations such as big omega, little o, and big theta. You probably won’t encounter them outside of an algorithm analysis course. ;)

See also original question in stackoverflow

#14: Big O, how do you calculate/approximate it? (Score: 918)

Created: 2008-08-06 Last updated: 2019-12-19

Tags: algorithm, optimization, complexity-theory, big-o, performance

Most people with a degree in CS will certainly know what Big O stands for. It helps us to measure how well an algorithm scales.

But I’m curious, how do you calculate or approximate the complexity of your algorithms?

#14 Best answer 1 of Big O, how do you calculate/approximate it? (Score: 1504)

Created: 2011-01-31 Last updated: 2019-03-05

I’ll do my best to explain it here on simple terms, but be warned that this topic takes my students a couple of months to finally grasp. You can find more information on the Chapter 2 of the Data Structures and Algorithms in Java book.

There is no mechanical procedure that can be used to get the BigOh.

As a “cookbook”, to obtain the BigOh from a piece of code you first need to realize that you are creating a math formula to count how many steps of computations get executed given an input of some size.

The purpose is simple: to compare algorithms from a theoretical point of view, without the need to execute the code. The lesser the number of steps, the faster the algorithm.

For example, let’s say you have this piece of code:

int sum(int* data, int N) {
    int result = 0;               // 1

    for (int i = 0; i < N; i++) { // 2
        result += data[i];        // 3

    return result;                // 4

This function returns the sum of all the elements of the array, and we want to create a formula to count the computational complexity of that function:

Number_Of_Steps = f(N)

So we have f(N), a function to count the number of computational steps. The input of the function is the size of the structure to process. It means that this function is called such as:

Number_Of_Steps = f(data.length)

The parameter N takes the data.length value. Now we need the actual definition of the function f(). This is done from the source code, in which each interesting line is numbered from 1 to 4.

There are many ways to calculate the BigOh. From this point forward we are going to assume that every sentence that doesn’t depend on the size of the input data takes a constant C number computational steps.

We are going to add the individual number of steps of the function, and neither the local variable declaration nor the return statement depends on the size of the data array.

That means that lines 1 and 4 takes C amount of steps each, and the function is somewhat like this:

f(N) = C + ??? + C

The next part is to define the value of the for statement. Remember that we are counting the number of computational steps, meaning that the body of the for statement gets executed N times. That’s the same as adding C, N times:

f(N) = C + (C + C + ... + C) + C = C + N * C + C

There is no mechanical rule to count how many times the body of the for gets executed, you need to count it by looking at what does the code do. To simplify the calculations, we are ignoring the variable initialization, condition and increment parts of the for statement.

To get the actual BigOh we need the Asymptotic analysis of the function. This is roughly done like this:

  1. Take away all the constants C.
  2. From f() get the polynomium in its standard form.
  3. Divide the terms of the polynomium and sort them by the rate of growth.
  4. Keep the one that grows bigger when N approaches infinity.

Our f() has two terms:

f(N) = 2 * C * N ^ 0 + 1 * C * N ^ 1

Taking away all the C constants and redundant parts:

f(N) = 1 + N ^ 1

Since the last term is the one which grows bigger when f() approaches infinity (think on limits) this is the BigOh argument, and the sum() function has a BigOh of:


There are a few tricks to solve some tricky ones: use summations whenever you can.

As an example, this code can be easily solved using summations:

for (i = 0; i < 2*n; i += 2) {  // 1
    for (j=n; j > i; j--) {     // 2
        foo();                  // 3

The first thing you needed to be asked is the order of execution of foo(). While the usual is to be O(1), you need to ask your professors about it. O(1) means (almost, mostly) constant C, independent of the size N.

The for statement on the sentence number one is tricky. While the index ends at 2 * N, the increment is done by two. That means that the first for gets executed only N steps, and we need to divide the count by two.

f(N) = Summation(i from 1 to 2 * N / 2)( ... ) = 
     = Summation(i from 1 to N)( ... )

The sentence number two is even trickier since it depends on the value of i. Take a look: the index i takes the values: 0, 2, 4, 6, 8, …, 2 * N, and the second for get executed: N times the first one, N - 2 the second, N - 4 the third… up to the N / 2 stage, on which the second for never gets executed.

On formula, that means:

f(N) = Summation(i from 1 to N)( Summation(j = ???)(  ) )

Again, we are counting the number of steps. And by definition, every summation should always start at one, and end at a number bigger-or-equal than one.

f(N) = Summation(i from 1 to N)( Summation(j = 1 to (N - (i - 1) * 2)( C ) )

(We are assuming that foo() is O(1) and takes C steps.)

We have a problem here: when i takes the value N / 2 + 1 upwards, the inner Summation ends at a negative number! That’s impossible and wrong. We need to split the summation in two, being the pivotal point the moment i takes N / 2 + 1.

f(N) = Summation(i from 1 to N / 2)( Summation(j = 1 to (N - (i - 1) * 2)) * ( C ) ) + Summation(i from 1 to N / 2) * ( C )

Since the pivotal moment i > N / 2, the inner for won’t get executed, and we are assuming a constant C execution complexity on its body.

Now the summations can be simplified using some identity rules:

  1. Summation(w from 1 to N)( C ) = N * C
  2. Summation(w from 1 to N)( A (+/-) B ) = Summation(w from 1 to N)( A ) (+/-) Summation(w from 1 to N)( B )
  3. Summation(w from 1 to N)( w * C ) = C * Summation(w from 1 to N)( w ) (C is a constant, independent of w)
  4. Summation(w from 1 to N)( w ) = (N * (N + 1)) / 2

Applying some algebra:

f(N) = Summation(i from 1 to N / 2)( (N - (i - 1) * 2) * ( C ) ) + (N / 2)( C )

f(N) = C * Summation(i from 1 to N / 2)( (N - (i - 1) * 2)) + (N / 2)( C )

f(N) = C * (Summation(i from 1 to N / 2)( N ) - Summation(i from 1 to N / 2)( (i - 1) * 2)) + (N / 2)( C )

f(N) = C * (( N ^ 2 / 2 ) - 2 * Summation(i from 1 to N / 2)( i - 1 )) + (N / 2)( C )

=> Summation(i from 1 to N / 2)( i - 1 ) = Summation(i from 1 to N / 2 - 1)( i )

f(N) = C * (( N ^ 2 / 2 ) - 2 * Summation(i from 1 to N / 2 - 1)( i )) + (N / 2)( C )

f(N) = C * (( N ^ 2 / 2 ) - 2 * ( (N / 2 - 1) * (N / 2 - 1 + 1) / 2) ) + (N / 2)( C )

=> (N / 2 - 1) * (N / 2 - 1 + 1) / 2 = 

   (N / 2 - 1) * (N / 2) / 2 = 

   ((N ^ 2 / 4) - (N / 2)) / 2 = 

   (N ^ 2 / 8) - (N / 4)

f(N) = C * (( N ^ 2 / 2 ) - 2 * ( (N ^ 2 / 8) - (N / 4) )) + (N / 2)( C )

f(N) = C * (( N ^ 2 / 2 ) - ( (N ^ 2 / 4) - (N / 2) )) + (N / 2)( C )

f(N) = C * (( N ^ 2 / 2 ) - (N ^ 2 / 4) + (N / 2)) + (N / 2)( C )

f(N) = C * ( N ^ 2 / 4 ) + C * (N / 2) + C * (N / 2)

f(N) = C * ( N ^ 2 / 4 ) + 2 * C * (N / 2)

f(N) = C * ( N ^ 2 / 4 ) + C * N

f(N) = C * 1/4 * N ^ 2 + C * N

And the BigOh is:


#14 Best answer 2 of Big O, how do you calculate/approximate it?(Score: 205)

Created: 2008-08-06

Big O gives the upper bound for time complexity of an algorithm. It is usually used in conjunction with processing data sets (lists) but can be used elsewhere.

A few examples of how it’s used in C code.

Say we have an array of n elements

int array[n];

If we wanted to access the first element of the array this would be O(1) since it doesn’t matter how big the array is, it always takes the same constant time to get the first item.

x = array[0];

If we wanted to find a number in the list:

for(int i = 0; i < n; i++){
    if(array[i] == numToFind){ return i; }

This would be O(n) since at most we would have to look through the entire list to find our number. The Big-O is still O(n) even though we might find our number the first try and run through the loop once because Big-O describes the upper bound for an algorithm (omega is for lower bound and theta is for tight bound).

When we get to nested loops:

for(int i = 0; i < n; i++){
    for(int j = i; j < n; j++){
        array[j] += 2;

This is O(n^2) since for each pass of the outer loop ( O(n) ) we have to go through the entire list again so the n’s multiply leaving us with n squared.

This is barely scratching the surface but when you get to analyzing more complex algorithms complex math involving proofs comes into play. Hope this familiarizes you with the basics at least though.

See also original question in stackoverflow

#15: How to count the number of set bits in a 32-bit integer? (Score: 918)

Created: 2008-09-20 Last updated: 2014-09-18

Tags: algorithm, binary, bit-manipulation, hammingweight, iec10967

8 bits representing the number 7 look like this:


Three bits are set.

What are algorithms to determine the number of set bits in a 32-bit integer?

#15 Best answer 1 of How to count the number of set bits in a 32-bit integer? (Score: 898)

Created: 2008-09-20 Last updated: 2021-03-13

This is known as the ‘Hamming Weight’, ‘popcount’ or ‘sideways addition’.

Some CPUs have a single built-in instruction to do it and others have parallel instructions which act on bit vectors. Instructions like x86’s popcnt (on CPUs where it’s supported) will almost certainly be fastest for a single integer. Some other architectures may have a slow instruction implemented with a microcoded loop that tests a bit per cycle (citation needed - hardware popcount is normally fast if it exists at all.).

The ‘best’ algorithm really depends on which CPU you are on and what your usage pattern is.

Your compiler may know how to do something that’s good for the specific CPU you’re compiling for, e.g. C++20 std::popcount(), or C++ std::bitset<32>::count(), as a portable way to access builtin / intrinsic functions (see another answer on this question). But your compiler’s choice of fallback for target CPUs that don’t have hardware popcnt might not be optimal for your use-case. Or your language (e.g. C) might not expose any portable function that could use a CPU-specific popcount when there is one.

Portable algorithms that don’t need (or benefit from) any HW support

A pre-populated table lookup method can be very fast if your CPU has a large cache and you are doing lots of these operations in a tight loop. However it can suffer because of the expense of a ‘cache miss’, where the CPU has to fetch some of the table from main memory. (Look up each byte separately to keep the table small.) If you want popcount for a contiguous range of numbers, only the low byte is changing for groups of 256 numbers, making this very good.

If you know that your bytes will be mostly 0’s or mostly 1’s then there are efficient algorithms for these scenarios, e.g. clearing the lowest set with a bithack in a loop until it becomes zero.

I believe a very good general purpose algorithm is the following, known as ‘parallel’ or ‘variable-precision SWAR algorithm’. I have expressed this in a C-like pseudo language, you may need to adjust it to work for a particular language (e.g. using uint32_t for C++ and »> in Java):

int numberOfSetBits(uint32_t i)
     // Java: use int, and use >>> instead of >>
     // C or C++: use uint32_t
     i = i - ((i >> 1) & 0x55555555);
     i = (i & 0x33333333) + ((i >> 2) & 0x33333333);
     return (((i + (i >> 4)) & 0x0F0F0F0F) * 0x01010101) >> 24;

For JavaScript: coerce to integer with |0 for performance: change the first line to i = (i|0) - ((i >> 1) & 0x55555555);

This has the best worst-case behaviour of any of the algorithms discussed, so will efficiently deal with any usage pattern or values you throw at it. (Its performance is not data-dependent on normal CPUs where all integer operations including multiply are constant-time. It doesn’t get any faster with “simple” inputs, but it’s still pretty decent.)


How this SWAR bithack works:

i = i - ((i >> 1) & 0x55555555);

The first step is an optimized version of masking to isolate the odd / even bits, shifting to line them up, and adding. This effectively does 16 separate additions in 2-bit accumulators (SWAR = SIMD Within A Register). Like (i & 0x55555555) + ((i>>1) & 0x55555555).

The next step takes the odd/even eight of those 16x 2-bit accumulators and adds again, producing 8x 4-bit sums. The i - ... optimization isn’t possible this time so it does just mask before / after shifting. Using the same 0x33... constant both times instead of 0xccc... before shifting is a good thing when compiling for ISAs that need to construct 32-bit constants in registers separately.

The final shift-and-add step of (i + (i >> 4)) & 0x0F0F0F0F widens to 4x 8-bit accumulators. It masks after adding instead of before, because the maximum value in any 4-bit accumulator is 4, if all 4 bits of the corresponding input bits were set. 4+4 = 8 which still fits in 4 bits, so carry between nibble elements is impossible in i + (i >> 4).

So far this is just fairly normal SIMD using SWAR techniques with a few clever optimizations. Continuing on with the same pattern for 2 more steps can widen to 2x 16-bit then 1x 32-bit counts. But there is a more efficient way on machines with fast hardware multiply:

Once we have few enough “elements”, a multiply with a magic constant can sum all the elements into the top element. In this case byte elements. Multiply is done by left-shifting and adding, so a multiply of x * 0x01010101 results in x + (x<<8) + (x<<16) + (x<<24). Our 8-bit elements are wide enough (and holding small enough counts) that this doesn’t produce carry into that top 8 bits.

A 64-bit version of this can do 8x 8-bit elements in a 64-bit integer with a 0x0101010101010101 multiplier, and extract the high byte with >>56. So it doesn’t take any extra steps, just wider constants. This is what GCC uses for __builtin_popcountll on x86 systems when the hardware popcnt instruction isn’t enabled. If you can use builtins or intrinsics for this, do so to give the compiler a chance to do target-specific optimizations.

With full SIMD for wider vectors (e.g. counting a whole array)

This bitwise-SWAR algorithm could parallelize to be done in multiple vector elements at once, instead of in a single integer register, for a speedup on CPUs with SIMD but no usable popcount instruction. (e.g. x86-64 code that has to run on any CPU, not just Nehalem or later.)

However, the best way to use vector instructions for popcount is usually by using a variable-shuffle to do a table-lookup for 4 bits at a time of each byte in parallel. (The 4 bits index a 16 entry table held in a vector register).

On Intel CPUs, the hardware 64bit popcnt instruction can outperform an SSSE3 PSHUFB bit-parallel implementation by about a factor of 2, but only if your compiler gets it just right. Otherwise SSE can come out significantly ahead. Newer compiler versions are aware of the popcnt false dependency problem on Intel.

#15 Best answer 2 of How to count the number of set bits in a 32-bit integer?(Score: 225)

Created: 2008-09-20 Last updated: 2021-03-13

Some languages portably expose the operation in a way that can use efficient hardware support if available, otherwise some library fallback that’s hopefully decent.

For example (from a table by language):

  • C++ has std::bitset<>::count(), or C++20 std::popcount(T x)
  • Java has java.lang.Integer.bitCount() (also for Long or BigInteger)
  • C# has System.Numerics.BitOperations.PopCount()
  • Python has int.bit_count()

Not all compilers / libraries actually manage to use HW support when it’s available, though. (Notably MSVC, even with options that make std::popcount inline as x86 popcnt, its std::bitset::count still always uses a lookup table. This will hopefully change in future versions.)

Also consider the built-in functions of your compiler when the portable language doesn’t have this basic bit operation. In GNU C for example:

int __builtin_popcount (unsigned int x);
int __builtin_popcountll (unsigned long long x);

In the worst case (no single-instruction HW support) the compiler will generate a call to a function (which in current GCC uses a shift/and bit-hack like this answer, at least for x86). In the best case the compiler will emit a cpu instruction to do the job. (Just like a * or / operator - GCC will use a hardware multiply or divide instruction if available, otherwise will call a libgcc helper function.) Or even better, if the operand is a compile-time constant after inlining, it can do constant-propagation to get a compile-time-constant popcount result.

The GCC builtins even work across multiple platforms. Popcount has almost become mainstream in the x86 architecture, so it makes sense to start using the builtin now so you can recompile to let it inline a hardware instruction when you compile with -mpopcnt or something that includes that (e.g. Other architectures have had popcount for years, but in the x86 world there are still some ancient Core 2 and similar vintage AMD CPUs in use.

On x86, you can tell the compiler that it can assume support for popcnt instruction with -mpopcnt (also implied by -msse4.2). See GCC x86 options. -march=nehalem -mtune=skylake (or -march= whatever CPU you want your code to assume and to tune for) could be a good choice. Running the resulting binary on an older CPU will result in an illegal-instruction fault.

To make binaries optimized for the machine you build them on, use -march=native (with gcc, clang, or ICC).

MSVC provides an intrinsic for the x86 popcnt instruction, but unlike gcc it’s really an intrinsic for the hardware instruction and requires hardware support.

Using std::bitset<>::count() instead of a built-in

In theory, any compiler that knows how to popcount efficiently for the target CPU should expose that functionality through ISO C++ std::bitset<>. In practice, you might be better off with the bit-hack AND/shift/ADD in some cases for some target CPUs.

For target architectures where hardware popcount is an optional extension (like x86), not all compilers have a std::bitset that takes advantage of it when available. For example, MSVC has no way to enable popcnt support at compile time, and it’s std::bitset<>::count always uses a table lookup, even with /Ox /arch:AVX (which implies SSE4.2, which in turn implies the popcnt feature.) (Update: see below; that does get MSVC’s C++20 std::popcount to use x86 popcnt, but still not its bitset<>::count. MSVC could fix that by updating their standard library headers to use std::popcount when available.)

But at least you get something portable that works everywhere, and with gcc/clang with the right target options, you get hardware popcount for architectures that support it.

#include <bitset>
#include <limits>
#include <type_traits>

template<typename T>
//static inline  // static if you want to compile with -mpopcnt in one compilation unit but not others
typename std::enable_if<std::is_integral<T>::value,  unsigned >::type 
popcount(T x)
    static_assert(std::numeric_limits<T>::radix == 2, "non-binary type");

    // sizeof(x)*CHAR_BIT
    constexpr int bitwidth = std::numeric_limits<T>::digits + std::numeric_limits<T>::is_signed;
    // std::bitset constructor was only unsigned long before C++11.  Beware if porting to C++03
    static_assert(bitwidth <= std::numeric_limits<unsigned long long>::digits, "arg too wide for std::bitset() constructor");

    typedef typename std::make_unsigned<T>::type UT;        // probably not needed, bitset width chops after sign-extension

    std::bitset<bitwidth> bs( static_cast<UT>(x) );
    return bs.count();

See asm from gcc, clang, icc, and MSVC on the Godbolt compiler explorer.

x86-64 gcc -O3 -std=gnu++11 -mpopcnt emits this:

unsigned test_short(short a) { return popcount(a); }
    movzx   eax, di      # note zero-extension, not sign-extension
    popcnt  rax, rax

unsigned test_int(int a) { return popcount(a); }
    mov     eax, edi
    popcnt  rax, rax        # unnecessary 64-bit operand size

unsigned test_u64(unsigned long long a) { return popcount(a); }
    xor     eax, eax     # gcc avoids false dependencies for Intel CPUs
    popcnt  rax, rdi

PowerPC64 gcc -O3 -std=gnu++11 emits (for the int arg version):

    rldicl 3,3,0,32     # zero-extend from 32 to 64-bit
    popcntd 3,3         # popcount

This source isn’t x86-specific or GNU-specific at all, but only compiles well with gcc/clang/icc, at least when targeting x86 (including x86-64).

Also note that gcc’s fallback for architectures without single-instruction popcount is a byte-at-a-time table lookup. This isn’t wonderful for ARM, for example.

C++20 has std::popcount(T)

Current libstdc++ headers unfortunately define it with a special case if(x==0) return 0; at the start, which clang doesn’t optimize away when compiling for x86:

#include <bit>
int bar(unsigned x) {
    return std::popcount(x);

clang 11.0.1 -O3 -std=gnu++20 -march=nehalem (

# clang 11
    bar(unsigned int):                                # @bar(unsigned int)
        popcnt  eax, edi
        cmove   eax, edi         # redundant: if popcnt result is 0, return the original 0 instead of the popcnt-generated 0...

But GCC compiles nicely:

# gcc 10
        xor     eax, eax         # break false dependency on Intel SnB-family before Ice Lake.
        popcnt  eax, edi

Even MSVC does well with it, as long as you use -arch:AVX or later (and enable C++20 with -std:c++latest).

int bar(unsigned int) PROC                                 ; bar, COMDAT
        popcnt  eax, ecx
        ret     0
int bar(unsigned int) ENDP                                 ; bar

See also original question in stackoverflow

#16: What is tail call optimization? (Score: 883)

Created: 2008-11-22 Last updated: 2020-04-18

Tags: algorithm, recursion, language-agnostic, tail-recursion, tail-call-optimization

Very simply, what is tail-call optimization?

More specifically, what are some small code snippets where it could be applied, and where not, with an explanation of why?

#16 Best answer 1 of What is tail call optimization? (Score: 798)

Created: 2008-11-22 Last updated: 2021-02-12

Tail-call optimization is where you are able to avoid allocating a new stack frame for a function because the calling function will simply return the value that it gets from the called function. The most common use is tail-recursion, where a recursive function written to take advantage of tail-call optimization can use constant stack space.

Scheme is one of the few programming languages that guarantee in the spec that any implementation must provide this optimization, so here are two examples of the factorial function in Scheme:

(define (fact x)
  (if (= x 0) 1
      (* x (fact (- x 1)))))

(define (fact x)
  (define (fact-tail x accum)
    (if (= x 0) accum
        (fact-tail (- x 1) (* x accum))))
  (fact-tail x 1))

The first function is not tail recursive because when the recursive call is made, the function needs to keep track of the multiplication it needs to do with the result after the call returns. As such, the stack looks as follows:

(fact 3)
(* 3 (fact 2))
(* 3 (* 2 (fact 1)))
(* 3 (* 2 (* 1 (fact 0))))
(* 3 (* 2 (* 1 1)))
(* 3 (* 2 1))
(* 3 2)

In contrast, the stack trace for the tail recursive factorial looks as follows:

(fact 3)
(fact-tail 3 1)
(fact-tail 2 3)
(fact-tail 1 6)
(fact-tail 0 6)

As you can see, we only need to keep track of the same amount of data for every call to fact-tail because we are simply returning the value we get right through to the top. This means that even if I were to call (fact 1000000), I need only the same amount of space as (fact 3). This is not the case with the non-tail-recursive fact, and as such large values may cause a stack overflow.

#16 Best answer 2 of What is tail call optimization?(Score: 588)

Created: 2012-03-22 Last updated: 2015-07-08

Let’s walk through a simple example: the factorial function implemented in C.

We start with the obvious recursive definition

unsigned fac(unsigned n)
    if (n < 2) return 1;
    return n * fac(n - 1);

A function ends with a tail call if the last operation before the function returns is another function call. If this call invokes the same function, it is tail-recursive.

Even though fac() looks tail-recursive at first glance, it is not as what actually happens is

unsigned fac(unsigned n)
    if (n < 2) return 1;
    unsigned acc = fac(n - 1);
    return n * acc;

ie the last operation is the multiplication and not the function call.

However, it’s possible to rewrite fac() to be tail-recursive by passing the accumulated value down the call chain as an additional argument and passing only the final result up again as the return value:

unsigned fac(unsigned n)
    return fac_tailrec(1, n);

unsigned fac_tailrec(unsigned acc, unsigned n)
    if (n < 2) return acc;
    return fac_tailrec(n * acc, n - 1);

Now, why is this useful? Because we immediately return after the tail call, we can discard the previous stackframe before invoking the function in tail position, or, in case of recursive functions, reuse the stackframe as-is.

The tail-call optimization transforms our recursive code into

unsigned fac_tailrec(unsigned acc, unsigned n)
    if (n < 2) return acc;
    acc = n * acc;
    n = n - 1;
    goto TOP;

This can be inlined into fac() and we arrive at

unsigned fac(unsigned n)
    unsigned acc = 1;

    if (n < 2) return acc;
    acc = n * acc;
    n = n - 1;
    goto TOP;

which is equivalent to

unsigned fac(unsigned n)
    unsigned acc = 1;

    for (; n > 1; --n)
        acc *= n;

    return acc;

As we can see here, a sufficiently advanced optimizer can replace tail-recursion with iteration, which is far more efficient as you avoid function call overhead and only use a constant amount of stack space.

See also original question in stackoverflow

#17: How do I determine whether my calculation of pi is accurate? (Score: 780)

Created: 2013-01-11 Last updated: 2018-04-06

Tags: algorithm, math, language-agnostic, pi

I was trying various methods to implement a program that gives the digits of pi sequentially. I tried the Taylor series method, but it proved to converge extremely slowly (when I compared my result with the online values after some time). Anyway, I am trying better algorithms.

So, while writing the program I got stuck on a problem, as with all algorithms: How do I know that the n digits that I’ve calculated are accurate?

#17 Best answer 1 of How do I determine whether my calculation of pi is accurate? (Score: 1639)

Created: 2013-01-11 Last updated: 2017-08-31

Since I’m the current world record holder for the most digits of pi, I’ll add my two cents:

Unless you’re actually setting a new world record, the common practice is just to verify the computed digits against the known values. So that’s simple enough.

In fact, I have a webpage that lists snippets of digits for the purpose of verifying computations against them:

But when you get into world-record territory, there’s nothing to compare against.

Historically, the standard approach for verifying that computed digits are correct is to recompute the digits using a second algorithm. So if either computation goes bad, the digits at the end won’t match.

This does typically more than double the amount of time needed (since the second algorithm is usually slower). But it’s the only way to verify the computed digits once you’ve wandered into the uncharted territory of never-before-computed digits and a new world record.

Back in the days where supercomputers were setting the records, two different AGM algorithms were commonly used:

These are both O(N log(N)^2) algorithms that were fairly easy to implement.

However, nowadays, things are a bit different. In the last three world records, instead of performing two computations, we performed only one computation using the fastest known formula (Chudnovsky Formula):

Enter image description here

This algorithm is much harder to implement, but it is a lot faster than the AGM algorithms.

Then we verify the binary digits using the BBP formulas for digit extraction.

Enter image description here

This formula allows you to compute arbitrary binary digits without computing all the digits before it. So it is used to verify the last few computed binary digits. Therefore it is much faster than a full computation.

The advantage of this is:

  1. Only one expensive computation is needed.

The disadvantage is:

  1. An implementation of the Bailey–Borwein–Plouffe (BBP) formula is needed.
  2. An additional step is needed to verify the radix conversion from binary to decimal.

I’ve glossed over some details of why verifying the last few digits implies that all the digits are correct. But it is easy to see this since any computation error will propagate to the last digits.

Now this last step (verifying the conversion) is actually fairly important. One of the previous world record holders actually called us out on this because, initially, I didn’t give a sufficient description of how it worked.

So I’ve pulled this snippet from my blog:

N = # of decimal digits desired
p = 64-bit prime number

Enter image description here

Compute A using base 10 arithmetic and B using binary arithmetic.

Enter image description here

If A = B, then with “extremely high probability”, the conversion is correct.

For further reading, see my blog post Pi - 5 Trillion Digits.

#17 Best answer 2 of How do I determine whether my calculation of pi is accurate?(Score: 48)

Created: 2013-01-11 Last updated: 2020-06-07

Undoubtedly, for your purposes (which I assume is just a programming exercise), the best thing is to check your results against any of the listings of the digits of pi on the web.

And how do we know that those values are correct? Well, I could say that there are computer-science-y ways to prove that an implementation of an algorithm is correct.

More pragmatically, if different people use different algorithms, and they all agree to (pick a number) a thousand (million, whatever) decimal places, that should give you a warm fuzzy feeling that they got it right.

Historically, William Shanks published pi to 707 decimal places in 1873. Poor guy, he made a mistake starting at the 528th decimal place.

Very interestingly, in 1995 an algorithm was published that had the property that would directly calculate the nth digit (base 16) of pi without having to calculate all the previous digits!

Finally, I hope your initial algorithm wasn’t pi/4 = 1 - 1/3 + 1/5 - 1/7 + ... That may be the simplest to program, but it’s also one of the slowest ways to do so. Check out the pi article on Wikipedia for faster approaches.

See also original question in stackoverflow

#18: Sorting 1 million 8-decimal-digit numbers with 1 MB of RAM (Score: 754)

Created: 2012-10-05 Last updated: 2020-03-30

Tags: algorithm, sorting, embedded, ram

I have a computer with 1 MB of RAM and no other local storage. I must use it to accept 1 million 8-digit decimal numbers over a TCP connection, sort them, and then send the sorted list out over another TCP connection.

The list of numbers may contain duplicates, which I must not discard. The code will be placed in ROM, so I need not subtract the size of my code from the 1 MB. I already have code to drive the Ethernet port and handle TCP/IP connections, and it requires 2 KB for its state data, including a 1 KB buffer via which the code will read and write data. Is there a solution to this problem?

Sources Of Question And Answer:

#18 Best answer 1 of Sorting 1 million 8-decimal-digit numbers with 1 MB of RAM (Score: 773)

Created: 2012-10-21 Last updated: 2020-03-30

There is one rather sneaky trick not mentioned here so far. We assume that you have no extra way to store data, but that is not strictly true.

One way around your problem is to do the following horrible thing, which should not be attempted by anyone under any circumstances: Use the network traffic to store data. And no, I don’t mean NAS.

You can sort the numbers with only a few bytes of RAM in the following way:

  • First take 2 variables: COUNTER and VALUE.
  • First set all registers to 0;
  • Every time you receive an integer I, increment COUNTER and set VALUE to max(VALUE, I);
  • Then send an ICMP echo request packet with data set to I to the router. Erase I and repeat.
  • Every time you receive the returned ICMP packet, you simply extract the integer and send it back out again in another echo request. This produces a huge number of ICMP requests scuttling backward and forward containing the integers.

Once COUNTER reaches 1000000, you have all of the values stored in the incessant stream of ICMP requests, and VALUE now contains the maximum integer. Pick some threshold T >> 1000000. Set COUNTER to zero. Every time you receive an ICMP packet, increment COUNTER and send the contained integer I back out in another echo request, unless I=VALUE, in which case transmit it to the destination for the sorted integers. Once COUNTER=T, decrement VALUE by 1, reset COUNTER to zero and repeat. Once VALUE reaches zero you should have transmitted all integers in order from largest to smallest to the destination, and have only used about 47 bits of RAM for the two persistent variables (and whatever small amount you need for the temporary values).

I know this is horrible, and I know there can be all sorts of practical issues, but I thought it might give some of you a laugh or at least horrify you.

#18 Best answer 2 of Sorting 1 million 8-decimal-digit numbers with 1 MB of RAM(Score: 428)

Created: 2012-10-25 Last updated: 2015-02-25

Here’s some working C++ code which solves the problem.

Proof that the memory constraints are satisfied:

Editor: There is no proof of the maximum memory requirements offered by the author either in this post or in his blogs. Since the number of bits necessary to encode a value depends on the values previously encoded, such a proof is likely non-trivial. The author notes that the largest encoded size he could stumble upon empirically was 1011732, and chose the buffer size 1013000 arbitrarily.

typedef unsigned int u32;

namespace WorkArea
    static const u32 circularSize = 253250;
    u32 circular[circularSize] = { 0 };         // consumes 1013000 bytes

    static const u32 stageSize = 8000;
    u32 stage[stageSize];                       // consumes 32000 bytes


Together, these two arrays take 1045000 bytes of storage. That leaves 1048576 - 1045000 - 2×1024 = 1528 bytes for remaining variables and stack space.

It runs in about 23 seconds on my Xeon W3520. You can verify that the program works using the following Python script, assuming a program name of sort1mb.exe.

from subprocess import *
import random

sequence = [random.randint(0, 99999999) for i in xrange(1000000)]

sorter = Popen('sort1mb.exe', stdin=PIPE, stdout=PIPE)
for value in sequence:
    sorter.stdin.write('%08d\n' % value)

result = [int(line) for line in sorter.stdout]
print('OK!' if result == sorted(sequence) else 'Error!')

A detailed explanation of the algorithm can be found in the following series of posts:

See also original question in stackoverflow

#19: Expand a random range from 1–5 to 1–7 (Score: 703)

Created: 2008-09-26 Last updated: 2012-09-14

Tags: algorithm, random, puzzle

Given a function which produces a random integer in the range 1 to 5, write a function which produces a random integer in the range 1 to 7.

  1. What is a simple solution?
  2. What is an effective solution to reduce memory usage or run on a slower CPU?

#19 Best answer 1 of Expand a random range from 1–5 to 1–7 (Score: 581)

Created: 2009-05-08 Last updated: 2010-10-05

This is equivalent to Adam Rosenfield’s solution, but may be a bit more clear for some readers. It assumes rand5() is a function that returns a statistically random integer in the range 1 through 5 inclusive.

int rand7()
    int vals[5][5] = {
        { 1, 2, 3, 4, 5 },
        { 6, 7, 1, 2, 3 },
        { 4, 5, 6, 7, 1 },
        { 2, 3, 4, 5, 6 },
        { 7, 0, 0, 0, 0 }

    int result = 0;
    while (result == 0)
        int i = rand5();
        int j = rand5();
        result = vals[i-1][j-1];
    return result;

How does it work? Think of it like this: imagine printing out this double-dimension array on paper, tacking it up to a dart board and randomly throwing darts at it. If you hit a non-zero value, it’s a statistically random value between 1 and 7, since there are an equal number of non-zero values to choose from. If you hit a zero, just keep throwing the dart until you hit a non-zero. That’s what this code is doing: the i and j indexes randomly select a location on the dart board, and if we don’t get a good result, we keep throwing darts.

Like Adam said, this can run forever in the worst case, but statistically the worst case never happens. :)

#19 Best answer 2 of Expand a random range from 1–5 to 1–7(Score: 360)

Created: 2008-09-26

There is no (exactly correct) solution which will run in a constant amount of time, since 1/7 is an infinite decimal in base 5. One simple solution would be to use rejection sampling, e.g.:

int i;
  i = 5 * (rand5() - 1) + rand5();  // i is now uniformly random between 1 and 25
} while(i > 21);
// i is now uniformly random between 1 and 21
return i % 7 + 1;  // result is now uniformly random between 1 and 7

This has an expected runtime of 25/21 = 1.19 iterations of the loop, but there is an infinitesimally small probability of looping forever.

See also original question in stackoverflow

#20: How do I create a URL shortener? (Score: 703)

Created: 2009-04-12 Last updated: 2018-10-17

Tags: algorithm, url

I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to “”.

Instead of “abcdef” there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 56~57 billion possible strings.

My approach:

I have a database table with three columns:

  1. id, integer, auto-increment
  2. long, string, the long URL the user entered
  3. short, string, the shortened URL (or just the six characters)

I would then insert the long URL into the table. Then I would select the auto-increment value for “id” and build a hash of it. This hash should then be inserted as “short”. But what sort of hash should I build? Hash algorithms like MD5 create too long strings. I don’t use these algorithms, I think. A self-built algorithm will work, too.

My idea:

For “” I get the auto-increment id 239472. Then I do the following steps:

short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.

That could be repeated until the number isn’t divisible any more. Do you think this is a good approach? Do you have a better idea?

Due to the ongoing interest in this topic, I’ve published an efficient solution to GitHub, with implementations for JavaScript, PHP, Python and Java. Add your solutions if you like :)

#20 Best answer 1 of How do I create a URL shortener? (Score: 862)

Created: 2009-04-12 Last updated: 2019-07-08

I would continue your “convert number to string” approach. However, you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.

Theoretical background

You need a Bijective Function f. This is necessary so that you can find a inverse function g(‘abc’) = 123 for your f(123) = ‘abc’ function. This means:

  • There must be no x1, x2 (with x1 ≠ x2) that will make f(x1) = f(x2),
  • and for every y you must be able to find an x so that f(x) = y.

How to convert the ID to a shortened URL

  1. Think of an alphabet we want to use. In your case, that’s [a-zA-Z0-9]. It contains 62 letters.

  2. Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).

    For this example, I will use 12510 (125 with a base of 10).

  3. Now you have to convert 12510 to X62 (base 62).

    12510 = 2×621 + 1×620 = [2,1]

    This requires the use of integer division and modulo. A pseudo-code example:

    digits = []
    while num > 0
      remainder = modulo(num, 62)
      num = divide(num, 62)
    digits = digits.reverse

    Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an array for example) could look like:

    0  → a
    1  → b
    25 → z
    52 → 0
    61 → 9

    With 2 → c and 1 → b, you will receive cb62 as the shortened URL.


How to resolve a shortened URL to the initial ID

The reverse is even easier. You just do a reverse lookup in your alphabet.

  1. e9a62 will be resolved to “4th, 61st, and 0th letter in the alphabet”.

    e9a62 = [4,61,0] = 4×622 + 61×621 + 0×620 = 1915810

  2. Now find your database-record with WHERE id = 19158 and do the redirect.

Example implementations (provided by commenters)

#20 Best answer 2 of How do I create a URL shortener?(Score: 57)

Created: 2009-04-12 Last updated: 2020-07-24

Why would you want to use a hash?

You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say you character space (A-Z, a-z, 0-9, etc.) has 62 characters, convert the id to a base-40 number and use the characters as the digits.

See also original question in stackoverflow

  1. This page use API to get the relevant data from stackoverflow community.
  2. Content license on this page is CC BY-SA 3.0.
  3. score = up votes - down votes.