Most votes on algorithm questions 7

Most votes on algorithm questions 7. #61 Extremely small or NaN values appear in training neural network #62 How do you rotate a two dimensional array? #63 Why do we use Base64? #64 Cost of len() function #65 What are the underlying data structures used for Redis? #66 Big-O for Eight Year Olds? #67 Algorithm to randomly generate an aesthetically-pleasing color palette #68 What is the difference between tree depth and height? #69 Write a program to find 100 largest numbers out of an array of 1 billion numbers #70 Peak signal detection in realtime timeseries data

Read all the top votes questions and answers in a single page.

#61: Extremely small or NaN values appear in training neural network (Score: 328)

Created: 2017-06-21 Last updated: 2020-08-04

Tags: algorithm, haskell, neural-network, backpropagation

I’m trying to implement a neural network architecture in Haskell, and use it on MNIST.

I’m using the hmatrix package for linear algebra. My training framework is built using the pipes package.

My code compiles and doesn’t crash. But the problem is, certain combinations of layer size (say, 1000), minibatch size, and learning rate give rise to NaN values in the computations. After some inspection, I see that extremely small values (order of 1e-100) eventually appear in the activations. But, even when that doesn’t happen, the training still doesn’t work. There’s no improvement over its loss or accuracy.

I checked and rechecked my code, and I’m at a loss as to what the root of the problem could be.

Here’s the backpropagation training, which computes the deltas for each layer:

backward lf n (out,tar) das = do
    let δout = tr (derivate lf (tar, out)) -- dE/dy
        deltas = scanr (\(l, a') δ ->
                         let w = weights l
                         in (tr a') * (w <> δ)) δout (zip (tail $ toList n) das)
    return (deltas)

lf is the loss function, n is the network (weight matrix and bias vector for each layer), out and tar are the actual output of the network and the target (desired) output, and das are the activation derivatives of each layer.

In batch mode, out, tar are matrices (rows are output vectors), and das is a list of the matrices.

Here’s the actual gradient computation:

  grad lf (n, (i,t)) = do
    -- Forward propagation: compute layers outputs and activation derivatives
    let (as, as') = unzip $ runLayers n i
        (out) = last as
    (ds) <- backward lf n (out, t) (init as') -- Compute deltas with backpropagation
    let r  = fromIntegral $ rows i -- Size of minibatch
    let gs = zipWith (\δ a -> tr (δ <> a)) ds (i:init as) -- Gradients for weights
    return $ GradBatch ((recip r .*) <$> gs, (recip r .*) <$> squeeze <$> ds)

Here, lf and n are the same as above, i is the input, and t is the target output (both in batch form, as matrices).

squeeze transforms a matrix into a vector by summing over each row. That is, ds is a list of matrices of deltas, where each column corresponds to the deltas for a row of the minibatch. So, the gradients for the biases are the average of the deltas over all the minibatch. The same thing for gs, which corresponds to the gradients for the weights.

Here’s the actual update code:

move lr (n, (i,t)) (GradBatch (gs, ds)) = do
    -- Update function
    let update = (\(FC w b af) g δ -> FC (w + (lr).*g) (b + (lr).*δ) af)
        n' = Network.fromList $ zipWith3 update (Network.toList n) gs ds
    return (n', (i,t))

lr is the learning rate. FC is the layer constructor, and af is the activation function for that layer.

The gradient descent algorithm makes sure to pass in a negative value for the learning rate. The actual code for the gradient descent is simply a loop around a composition of grad and move, with a parameterized stop condition.

Finally, here’s the code for a mean square error loss function:

mse :: (Floating a) => LossFunction a a
mse = let f (y,y') = let gamma = y'-y in gamma**2 / 2
          f' (y,y') = (y'-y)
      in  Evaluator f f'

Evaluator just bundles a loss function and its derivative (for calculating the delta of the output layer).

The rest of the code is up on GitHub: NeuralNetwork.

So, if anyone has an insight into the problem, or even just a sanity check that I’m correctly implementing the algorithm, I’d be grateful.

#61 Best answer of Extremely small or NaN values appear in training neural network (Score: 4)

Created: 2020-08-11 Last updated: 2020-08-11

Do you know about “vanishing” and “exploding” gradients in backpropagation? I’m not too familiar with Haskell so I can’t easily see what exactly your backprop is doing, but it does look like you are using a logistic curve as your activation function.

If you look at the plot of this function you’ll see that the gradient of this function is nearly 0 at the ends (as input values get very large or very small, the slope of the curve is almost flat), so multiplying or dividing by this during backpropagation will result in a very big or very small number. Doing this repeatedly as you pass through multiple layers causes the activations to approach zero or infinity. Since backprop updates your weights by doing this during training, you end up with a lot of zeros or infinities in your network.

Solution: there are loads of methods out there that you can search for to solve the vanishing gradient problem, but one easy thing to try is to change the type of activation function you are using to a non-saturating one. ReLU is a popular choice as it mitigates this particular problem (but might introduce others).

See also original question in stackoverflow

#62: How do you rotate a two dimensional array? (Score: 325)

Created: 2008-09-03 Last updated: 2020-02-22

Tags: algorithm, matrix, multidimensional-array

Inspired by Raymond Chen’s post, say you have a 4x4 two dimensional array, write a function that rotates it 90 degrees. Raymond links to a solution in pseudo code, but I’d like to see some real world stuff.




Update: Nick’s answer is the most straightforward, but is there a way to do it better than n^2? What if the matrix was 10000x10000?

#62 Best answer 1 of How do you rotate a two dimensional array? (Score: 428)

Created: 2011-12-29 Last updated: 2016-12-22

O(n^2) time and O(1) space algorithm ( without any workarounds and hanky-panky stuff! )

Rotate by +90:

  1. Transpose
  2. Reverse each row

Rotate by -90:

Method 1 :

  1. Transpose
  2. Reverse each column

Method 2 :

  1. Reverse each row
  2. Transpose

Rotate by +180:

Method 1: Rotate by +90 twice

Method 2: Reverse each row and then reverse each column (Transpose)

Rotate by -180:

Method 1: Rotate by -90 twice

Method 2: Reverse each column and then reverse each row

Method 3: Rotate by +180 as they are same

#62 Best answer 2 of How do you rotate a two dimensional array?(Score: 212)

Created: 2016-02-16 Last updated: 2020-12-16

I’d like to add a little more detail. In this answer, key concepts are repeated, the pace is slow and intentionally repetitive. The solution provided here is not the most syntactically compact, it is however, intended for those who wish to learn what matrix rotation is and the resulting implementation.

Firstly, what is a matrix? For the purposes of this answer, a matrix is just a grid where the width and height are the same. Note, the width and height of a matrix can be different, but for simplicity, this tutorial considers only matrices with equal width and height (square matrices). And yes, matrices is the plural of matrix.

Example matrices are: 2×2, 3×3 or 5×5. Or, more generally, N×N. A 2×2 matrix will have 4 squares because 2×2=4. A 5×5 matrix will have 25 squares because 5×5=25. Each square is called an element or entry. We’ll represent each element with a period (.) in the diagrams below:

2×2 matrix

. .
. .

3×3 matrix

. . .
. . .
. . .

4×4 matrix

. . . .
. . . .
. . . .
. . . .

So, what does it mean to rotate a matrix? Let’s take a 2×2 matrix and put some numbers in each element so the rotation can be observed:

0 1
2 3

Rotating this by 90 degrees gives us:

2 0
3 1

We literally turned the whole matrix once to the right just like turning the steering wheel of a car. It may help to think of “tipping” the matrix onto its right side. We want to write a function, in Python, that takes a matrix and rotates in once to the right. The function signature will be:

def rotate(matrix):
	# Algorithm goes here.

The matrix will be defined using a two-dimensional array:

matrix = [

Therefore the first index position accesses the row. The second index position accesses the column:


We’ll define a utility function to print a matrix.

def print_matrix(matrix):
	for row in matrix:
        print row

One method of rotating a matrix is to do it a layer at a time. But what is a layer? Think of an onion. Just like the layers of an onion, as each layer is removed, we move towards the center. Other analogies is a Matryoshka doll or a game of pass-the-parcel.

The width and height of a matrix dictate the number of layers in that matrix. Let’s use different symbols for each layer:

A 2×2 matrix has 1 layer

. .
. .

A 3×3 matrix has 2 layers

. . .
. x .
. . .

A 4×4 matrix has 2 layers

. . . .
. x x .
. x x .
. . . .

A 5×5 matrix has 3 layers

. . . . .
. x x x .
. x O x .
. x x x .
. . . . .

A 6×6 matrix has 3 layers

. . . . . .
. x x x x .
. x O O x .
. x O O x .
. x x x x .
. . . . . .

A 7×7 matrix has 4 layers

. . . . . . .
. x x x x x .
. x O O O x .
. x O - O x .
. x O O O x .
. x x x x x .
. . . . . . .

You may notice that incrementing the width and height of a matrix by one, does not always increase the number of layers. Taking the above matrices and tabulating the layers and dimensions, we see the number of layers increases once for every two increments of width and height:

| N×N | Layers |
| 1×1 |      1 |
| 2×2 |      1 |
| 3×3 |      2 |
| 4×4 |      2 |
| 5×5 |      3 |
| 6×6 |      3 |
| 7×7 |      4 |

However, not all layers need rotating. A 1×1 matrix is the same before and after rotation. The central 1×1 layer is always the same before and after rotation no matter how large the overall matrix:

| N×N | Layers | Rotatable Layers |
| 1×1 |      1 |                0 |
| 2×2 |      1 |                1 |
| 3×3 |      2 |                1 |
| 4×4 |      2 |                2 |
| 5×5 |      3 |                2 |
| 6×6 |      3 |                3 |
| 7×7 |      4 |                3 |

Given N×N matrix, how can we programmatically determine the number of layers we need to rotate? If we divide the width or height by two and ignore the remainder we get the following results.

| N×N | Layers | Rotatable Layers |   N/2   |
| 1×1 |      1 |                0 | 1/2 = 0 |
| 2×2 |      1 |                1 | 2/2 = 1 |
| 3×3 |      2 |                1 | 3/2 = 1 |
| 4×4 |      2 |                2 | 4/2 = 2 |
| 5×5 |      3 |                2 | 5/2 = 2 |
| 6×6 |      3 |                3 | 6/2 = 3 |
| 7×7 |      4 |                3 | 7/2 = 3 |

Notice how N/2 matches the number of layers that need to be rotated? Sometimes the number of rotatable layers is one less the total number of layers in the matrix. This occurs when the innermost layer is formed of only one element (i.e. a 1×1 matrix) and therefore need not be rotated. It simply gets ignored.

We will undoubtedly need this information in our function to rotate a matrix, so let’s add it now:

def rotate(matrix):
	size = len(matrix)
	# Rotatable layers only.
    layer_count = size / 2

Now we know what layers are and how to determine the number of layers that actually need rotating, how do we isolate a single layer so we can rotate it? Firstly, we inspect a matrix from the outermost layer, inwards, to the innermost layer. A 5×5 matrix has three layers in total and two layers that need rotating:

. . . . .
. x x x .
. x O x .
. x x x .
. . . . .

Let’s look at columns first. The position of the columns defining the outermost layer, assuming we count from 0, are 0 and 4:

| Column | 0 1 2 3 4 |
|        | . . . . . |
|        | . x x x . |
|        | . x O x . |
|        | . x x x . |
|        | . . . . . |

0 and 4 are also the positions of the rows for the outermost layer.

| Row |           |
|   0 | . . . . . |
|   1 | . x x x . |
|   2 | . x O x . |
|   3 | . x x x . |
|   4 | . . . . . |

This will always be the case since the width and height are the same. Therefore we can define the column and row positions of a layer with just two values (rather than four).

Moving inwards to the second layer, the position of the columns are 1 and 3. And, yes, you guessed it, it’s the same for rows. It’s important to understand we had to both increment and decrement the row and column positions when moving inwards to the next layer.

|   Layer   |  Rows   | Columns | Rotate? |
| Outermost | 0 and 4 | 0 and 4 | Yes	  |
| Inner     | 1 and 3 | 1 and 3 | Yes	  |
| Innermost | 2       | 2       | No      |

So, to inspect each layer, we want a loop with both increasing and decreasing counters that represent moving inwards, starting from the outermost layer. We’ll call this our ‘layer loop’.

def rotate(matrix):
	size = len(matrix)
	layer_count = size / 2

	for layer in range(0, layer_count):
        first = layer
        last = size - first - 1
        print 'Layer %d: first: %d, last: %d' % (layer, first, last)

# 5x5 matrix
matrix = [
    [ 0, 1, 2, 3, 4],
	[ 5, 6, 6, 8, 9],


The code above loops through the (row and column) positions of any layers that need rotating.

Layer 0: first: 0, last: 4
Layer 1: first: 1, last: 3

We now have a loop providing the positions of the rows and columns of each layer. The variables first and last identify the index position of the first and last rows and columns. Referring back to our row and column tables:

| Column | 0 1 2 3 4 |
|        | . . . . . |
|        | . x x x . |
|        | . x O x . |
|        | . x x x . |
|        | . . . . . |

| Row |           |
|   0 | . . . . . |
|   1 | . x x x . |
|   2 | . x O x . |
|   3 | . x x x . |
|   4 | . . . . . |

So we can navigate through the layers of a matrix. Now we need a way of navigating within a layer so we can move elements around that layer. Note, elements never ‘jump’ from one layer to another, but they do move within their respective layers.

Rotating each element in a layer rotates the entire layer. Rotating all layers in a matrix rotates the entire matrix. This sentence is very important, so please try your best to understand it before moving on.

Now, we need a way of actually moving elements, i.e. rotate each element, and subsequently the layer, and ultimately the matrix. For simplicity, we’ll revert to a 3x3 matrix — that has one rotatable layer.

0 1 2
3 4 5
6 7 8

Our layer loop provides the indexes of the first and last columns, as well as first and last rows:

| Col | 0 1 2 |
|     | 0 1 2 |
|     | 3 4 5 |
|     | 6 7 8 |

| Row |       |
|   0 | 0 1 2 |
|   1 | 3 4 5 |
|   2 | 6 7 8 |

Because our matrices are always square, we need just two variables, first and last, since index positions are the same for rows and columns.

def rotate(matrix):
	size = len(matrix)
	layer_count = size / 2

	# Our layer loop i=0, i=1, i=2
	for layer in range(0, layer_count):

        first = layer
        last = size - first - 1
        # We want to move within a layer here.

The variables first and last can easily be used to reference the four corners of a matrix. This is because the corners themselves can be defined using various permutations of first and last (with no subtraction, addition or offset of those variables):

| Corner 		| Position 		    | 3x3 Values  |
| top left 		| (first, first)	| (0,0)		  |
| top right		| (first, last)	    | (0,2)		  |
| bottom right	| (last, last)	    | (2,2)		  |
| bottom left	| (last, first)	    | (2,0)		  |

For this reason, we start our rotation at the outer four corners — we’ll rotate those first. Let’s highlight them with *.

* 1 *
3 4 5
* 7 *

We want to swap each * with the * to the right of it. So let’s go ahead a print out our corners defined using only various permutations of first and last:

def rotate(matrix):
	size = len(matrix)
	layer_count = size / 2
	for layer in range(0, layer_count):

        first = layer
        last = size - first - 1

        top_left = (first, first)
        top_right = (first, last)
        bottom_right = (last, last)
        bottom_left = (last, first)

        print 'top_left: %s' % (top_left)
        print 'top_right: %s' % (top_right)
        print 'bottom_right: %s' % (bottom_right)
        print 'bottom_left: %s' % (bottom_left)

matrix = [
[0, 1, 2],
[3, 4, 5],
[6, 7, 8]


Output should be:

top_left: (0, 0)
top_right: (0, 2)
bottom_right: (2, 2)
bottom_left: (2, 0)

Now we could quite easily swap each of the corners from within our layer loop:

def rotate(matrix):
	size = len(matrix)
	layer_count = size / 2
	for layer in range(0, layer_count):
        first = layer
        last = size - first - 1

        top_left = matrix[first][first]
        top_right = matrix[first][last]
        bottom_right = matrix[last][last]
        bottom_left = matrix[last][first]

        # bottom_left -> top_left
        matrix[first][first] = bottom_left
        # top_left -> top_right
        matrix[first][last] = top_left
        # top_right -> bottom_right
        matrix[last][last] = top_right
        # bottom_right -> bottom_left
        matrix[last][first] = bottom_right

print '---------'

Matrix before rotating corners:

[0, 1, 2]
[3, 4, 5]
[6, 7, 8]

Matrix after rotating corners:

[6, 1, 0]
[3, 4, 5]
[8, 7, 2]

Great! We have successfully rotated each corner of the matrix. But, we haven’t rotated the elements in the middle of each layer. Clearly we need a way of iterating within a layer.

The problem is, the only loop in our function so far (our layer loop), moves to the next layer on each iteration. Since our matrix has only one rotatable layer, the layer loop exits after rotating only the corners. Let’s look at what happens with a larger, 5×5 matrix (where two layers need rotating). The function code has been omitted, but it remains the same as above:

matrix = [
[0, 1, 2, 3, 4],
[5, 6, 7, 8, 9],
[10, 11, 12, 13, 14],
[15, 16, 17, 18, 19],
[20, 21, 22, 23, 24]
print '--------------------'

The output is:

[20,  1,  2,  3,  0]
[ 5, 16,  7,  6,  9]
[10, 11, 12, 13, 14]
[15, 18, 17,  8, 19]
[24, 21, 22, 23,  4]

It shouldn’t be a surprise that the corners of the outermost layer have been rotated, but, you may also notice the corners of the next layer (inwards) have also been rotated. This makes sense. We’ve written code to navigate through layers and also to rotate the corners of each layer. This feels like progress, but unfortunately we must take a step back. It’s just no good moving onto the next layer until the previous (outer) layer has been fully rotated. That is, until each element in the layer has been rotated. Rotating only the corners won’t do!

Take a deep breath. We need another loop. A nested loop no less. The new, nested loop, will use the first and last variables, plus an offset to navigate within a layer. We’ll call this new loop our ‘element loop’. The element loop will visit each element along the top row, each element down the right side, each element along the bottom row and each element up the left side.

  • Moving forwards along the top row requires the column index to be incremented.
  • Moving down the right side requires the row index to be incremented.
  • Moving backwards along the bottom requires the column index to be decremented.
  • Moving up the left side requires the row index to be decremented.

This sounds complex, but it’s made easy because the number of times we increment and decrement to achieve the above remains the same along all four sides of the matrix. For example:

  • Move 1 element across the top row.
  • Move 1 element down the right side.
  • Move 1 element backwards along the bottom row.
  • Move 1 element up the left side.

This means we can use a single variable in combination with the first and last variables to move within a layer. It may help to note that moving across the top row and down the right side both require incrementing. While moving backwards along the bottom and up the left side both require decrementing.

def rotate(matrix):
	size = len(matrix)
	layer_count = size / 2
    # Move through layers (i.e. layer loop).
    for layer in range(0, layer_count):
            first = layer
            last = size - first - 1

            # Move within a single layer (i.e. element loop).
            for element in range(first, last):
                offset = element - first

                # 'element' increments column (across right)
                top = (first, element)
                # 'element' increments row (move down)
                right_side = (element, last)
                # 'last-offset' decrements column (across left)
                bottom = (last, last-offset)
                # 'last-offset' decrements row (move up)
                left_side = (last-offset, first)

                print 'top: %s' % (top)
                print 'right_side: %s' % (right_side)
                print 'bottom: %s' % (bottom)
                print 'left_side: %s' % (left_side)

Now we simply need to assign the top to the right side, right side to the bottom, bottom to the left side, and left side to the top. Putting this all together we get:

def rotate(matrix):
	size = len(matrix)
	layer_count = size / 2

	for layer in range(0, layer_count):
        first = layer
        last = size - first - 1

        for element in range(first, last):
            offset = element - first

            top = matrix[first][element]
            right_side = matrix[element][last]
            bottom = matrix[last][last-offset]
            left_side = matrix[last-offset][first]

            matrix[first][element] = left_side
            matrix[element][last] = top
            matrix[last][last-offset] = right_side
            matrix[last-offset][first] = bottom

Given the matrix:

0,  1,  2  
3,  4,  5  
6,  7,  8 

Our rotate function results in:

6,  3,  0  
7,  4,  1  
8,  5,  2  

See also original question in stackoverflow

#63: Why do we use Base64? (Score: 325)

Created: 2010-08-21 Last updated: 2017-09-11

Tags: algorithm, character-encoding, binary, ascii, base64

Wikipedia says

Base64 encoding schemes are commonly used when there is a need to encode binary data that needs be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport.

But is it not that data is always stored/transmitted in binary because the memory that our machines have store binary and it just depends how you interpret it? So, whether you encode the bit pattern 010011010110000101101110 as Man in ASCII or as TWFu in Base64, you are eventually going to store the same bit pattern.

If the ultimate encoding is in terms of zeros and ones and every machine and media can deal with them, how does it matter if the data is represented as ASCII or Base64?

What does it mean “media that are designed to deal with textual data”? They can deal with binary => they can deal with anything.

Thanks everyone, I think I understand now.

When we send over data, we cannot be sure that the data would be interpreted in the same format as we intended it to be. So, we send over data coded in some format (like Base64) that both parties understand. That way even if sender and receiver interpret same things differently, but because they agree on the coded format, the data will not get interpreted wrongly.

From Mark Byers example

If I want to send


One way is to send it in ASCII like

72 101 108 108 111 10 119 111 114 108 100 33

But byte 10 might not be interpreted correctly as a newline at the other end. So, we use a subset of ASCII to encode it like this

83 71 86 115 98 71 56 115 67 110 100 118 99 109 120 107 73 61 61

which at the cost of more data transferred for the same amount of information ensures that the receiver can decode the data in the intended way, even if the receiver happens to have different interpretations for the rest of the character set.

#63 Best answer 1 of Why do we use Base64? (Score: 363)

Created: 2010-08-21 Last updated: 2020-11-15

Your first mistake is thinking that ASCII encoding and Base64 encoding are interchangeable. They are not. They are used for different purposes.

  • When you encode text in ASCII, you start with a text string and convert it to a sequence of bytes.
  • When you encode data in Base64, you start with a sequence of bytes and convert it to a text string.

To understand why Base64 was necessary in the first place we need a little history of computing.

Computers communicate in binary - 0s and 1s - but people typically want to communicate with more rich forms data such as text or images. In order to transfer this data between computers it first has to be encoded into 0s and 1s, sent, then decoded again. To take text as an example - there are many different ways to perform this encoding. It would be much simpler if we could all agree on a single encoding, but sadly this is not the case.

Originally a lot of different encodings were created (e.g. Baudot code) which used a different number of bits per character until eventually ASCII became a standard with 7 bits per character. However most computers store binary data in bytes consisting of 8 bits each so ASCII is unsuitable for tranferring this type of data. Some systems would even wipe the most significant bit. Furthermore the difference in line ending encodings across systems mean that the ASCII character 10 and 13 were also sometimes modified.

To solve these problems Base64 encoding was introduced. This allows you to encode arbitrary bytes to bytes which are known to be safe to send without getting corrupted (ASCII alphanumeric characters and a couple of symbols). The disadvantage is that encoding the message using Base64 increases its length - every 3 bytes of data is encoded to 4 ASCII characters.

To send text reliably you can first encode to bytes using a text encoding of your choice (for example UTF-8) and then afterwards Base64 encode the resulting binary data into a text string that is safe to send encoded as ASCII. The receiver will have to reverse this process to recover the original message. This of course requires that the receiver knows which encodings were used, and this information often needs to be sent separately.

Historically it has been used to encode binary data in email messages where the email server might modify line-endings. A more modern example is the use of Base64 encoding to embed image data directly in HTML source code. Here it is necessary to encode the data to avoid characters like ‘<’ and ‘>’ being interpreted as tags.

Here is a working example:

I wish to send a text message with two lines:


If I send it as ASCII (or UTF-8) it will look like this:

72 101 108 108 111 10 119 111 114 108 100 33

The byte 10 is corrupted in some systems so we can base 64 encode these bytes as a Base64 string:


Which when encoded using ASCII looks like this:

83 71 86 115 98 71 56 75 100 50 57 121 98 71 81 104

All the bytes here are known safe bytes, so there is very little chance that any system will corrupt this message. I can send this instead of my original message and let the receiver reverse the process to recover the original message.

#63 Best answer 2 of Why do we use Base64?(Score: 67)

Created: 2010-08-21 Last updated: 2018-06-19

Encoding binary data in XML

Suppose you want to embed a couple images within an XML document. The images are binary data, while the XML document is text. But XML cannot handle embedded binary data. So how do you do it?

One option is to encode the images in base64, turning the binary data into text that XML can handle.

Instead of:

  <image name="Sally">{binary gibberish that breaks XML parsers}</image>
  <image name="Bobby">{binary gibberish that breaks XML parsers}</image>

you do:

  <image name="Sally" encoding="base64">j23894uaiAJSD3234kljasjkSD...</image>
  <image name="Bobby" encoding="base64">Ja3k23JKasil3452AsdfjlksKsasKD...</image>

And the XML parser will be able to parse the XML document correctly and extract the image data.

See also original question in stackoverflow

#64: Cost of len() function (Score: 324)

Created: 2009-07-12 Last updated: 2014-08-13

Tags: python, algorithm, collections, complexity-theory

What is the cost of len() function for Python built-ins? (list/tuple/string/dictionary)

#64 Best answer 1 of Cost of len() function (Score: 395)

Created: 2009-07-12 Last updated: 2017-02-27

It’s O(1) (constant time, not depending of actual length of the element - very fast) on every type you’ve mentioned, plus set and others such as array.array.

#64 Best answer 2 of Cost of len() function(Score: 146)

Created: 2009-07-12

Calling len() on those data types is O(1) in CPython, the most common implementation of the Python language. Here’s a link to a table that provides the algorithmic complexity of many different functions in CPython:

TimeComplexity Python Wiki Page

See also original question in stackoverflow

#65: What are the underlying data structures used for Redis? (Score: 320)

Created: 2012-03-08 Last updated: 2012-03-08

Tags: algorithm, data-structures, redis

I’m trying to answer two questions in a definitive list:

  1. What are the underlying data structures used for Redis?
  2. And what are the main advantages/disadvantages/use cases for each type?

So, I’ve read the Redis lists are actually implemented with linked lists. But for other types, I’m not able to dig up any information. Also, if someone were to stumble upon this question and not have a high level summary of the pros and cons of modifying or accessing different data structures, they’d have a complete list of when to best use specific types to reference as well.

Specifically, I’m looking to outline all types: string, list, set, zset and hash.

Oh, I’ve looked at these article, among others, so far:

#65 Best answer 1 of What are the underlying data structures used for Redis? (Score: 624)

Created: 2012-03-08 Last updated: 2013-09-04

I’ll try to answer your question, but I’ll start with something that may look strange at first: if you are not interested in Redis internals you should not care about how data types are implemented internally. This is for a simple reason: for every Redis operation you’ll find the time complexity in the documentation and, if you have the set of operations and the time complexity, the only other thing you need is some clue about memory usage (and because we do many optimizations that may vary depending on data, the best way to get these latter figures are doing a few trivial real world tests).

But since you asked, here is the underlying implementation of every Redis data type.

  • Strings are implemented using a C dynamic string library so that we don’t pay (asymptotically speaking) for allocations in append operations. This way we have O(N) appends, for instance, instead of having quadratic behavior.
  • Lists are implemented with linked lists.
  • Sets and Hashes are implemented with hash tables.
  • Sorted sets are implemented with skip lists (a peculiar type of balanced trees).

But when lists, sets, and sorted sets are small in number of items and size of the largest values, a different, much more compact encoding is used. This encoding differs for different types, but has the feature that it is a compact blob of data that often forces an O(N) scan for every operation. Since we use this format only for small objects this is not an issue; scanning a small O(N) blob is cache oblivious so practically speaking it is very fast, and when there are too many elements the encoding is automatically switched to the native encoding (linked list, hash, and so forth).

But your question was not really just about internals, your point was What type to use to accomplish what?.


This is the base type of all the types. It’s one of the four types but is also the base type of the complex types, because a List is a list of strings, a Set is a set of strings, and so forth.

A Redis string is a good idea in all the obvious scenarios where you want to store an HTML page, but also when you want to avoid converting your already encoded data. So for instance, if you have JSON or MessagePack you may just store objects as strings. In Redis 2.6 you can even manipulate this kind of object server side using Lua scripts.

Another interesting usage of strings is bitmaps, and in general random access arrays of bytes, since Redis exports commands to access random ranges of bytes, or even single bits. For instance check this good blog post: Fast Easy real time metrics using Redis.


Lists are good when you are likely to touch only the extremes of the list: near tail, or near head. Lists are not very good to paginate stuff, because random access is slow, O(N). So good uses of lists are plain queues and stacks, or processing items in a loop using RPOPLPUSH with same source and destination to “rotate” a ring of items.

Lists are also good when we want just to create a capped collection of N items where usually we access just the top or bottom items, or when N is small.


Sets are an unordered data collection, so they are good every time you have a collection of items and it is very important to check for existence or size of the collection in a very fast way. Another cool thing about sets is support for peeking or popping random elements (SRANDMEMBER and SPOP commands).

Sets are also good to represent relations, e.g., “What are friends of user X?” and so forth. But other good data structures for this kind of stuff are sorted sets as we’ll see.

Sets support complex operations like intersections, unions, and so forth, so this is a good data structure for using Redis in a “computational” manner, when you have data and you want to perform transformations on that data to obtain some output.

Small sets are encoded in a very efficient way.


Hashes are the perfect data structure to represent objects, composed of fields and values. Fields of hashes can also be atomically incremented using HINCRBY. When you have objects such as users, blog posts, or some other kind of item, hashes are likely the way to go if you don’t want to use your own encoding like JSON or similar.

However, keep in mind that small hashes are encoded very efficiently by Redis, and you can ask Redis to atomically GET, SET or increment individual fields in a very fast fashion.

Hashes can also be used to represent linked data structures, using references. For instance check the implementation of comments.

Sorted Sets

Sorted sets are the only other data structures, besides lists, to maintain ordered elements. You can do a number of cool stuff with sorted sets. For instance, you can have all kinds of Top Something lists in your web application. Top users by score, top posts by pageviews, top whatever, but a single Redis instance will support tons of insertion and get-top-elements operations per second.

Sorted sets, like regular sets, can be used to describe relations, but they also allow you to paginate the list of items and to remember the ordering. For instance, if I remember friends of user X with a sorted set I can easily remember them in order of accepted friendship.

Sorted sets are good for priority queues.

Sorted sets are like more powerful lists where inserting, removing, or getting ranges from the the middle of the list is always fast. But they use more memory, and are O(log(N)) data structures.


I hope that I provided some info in this post, but it is far better to download the source code of lamernews from and understand how it works. Many data structures from Redis are used inside Lamer News, and there are many clues about what to use to solve a given task.

Sorry for grammar typos, it’s midnight here and too tired to review the post ;)

#65 Best answer 2 of What are the underlying data structures used for Redis?(Score: 84)

Created: 2012-05-11 Last updated: 2017-05-23

Most of the time, you don’t need to understand the underlying data structures used by Redis. But a bit of knowledge helps you make CPU v/s Memory trade offs. It also helps you model your data in an efficient manner.

Internally, Redis uses the following data structures :

  1. String
  2. Dictionary
  3. Doubly Linked List
  4. Skip List
  5. Zip List
  6. Int Sets
  7. Zip Maps (deprecated in favour of zip list since Redis 2.6)

To find the encoding used by a particular key, use the command object encoding <key>.

1. Strings

In Redis, Strings are called Simple Dynamic Strings, or SDS. It’s a smallish wrapper over a char * that allows you to store the length of the string and number of free bytes as a prefix.

Because the length of the string is stored, strlen is an O(1) operation. Also, because the length is known, Redis strings are binary safe. It is perfectly legal for a string to contain the null character.

Strings are the most versatile data structure available in Redis. A String is all of the following:

  1. A string of characters that can store text. See SET and GET commands.
  2. A byte array that can store binary data.
  3. A long that can store numbers. See INCR, DECR, INCRBY and DECRBY commands.
  4. An Array (of chars, ints, longs or any other data type) that can allow efficient random access. See SETRANGE and GETRANGE commands.
  5. A bit array that allows you to set or get individual bits. See SETBIT and GETBIT commands.
  6. A block of memory that you can use to build other data structures. This is used internally to build ziplists and intsets, which are compact, memory-efficient data structures for small number of elements. More on this below.

2. Dictionary

Redis uses a Dictionary for the following:

  1. To map a key to its associated value, where value can be a string, hash, set, sorted set or list.
  2. To map a key to its expiry timestamp.
  3. To implement Hash, Set and Sorted Set data types.
  4. To map Redis commands to the functions that handle those commands.
  5. To map a Redis key to a list of clients that are blocked on that key. See BLPOP.

Redis Dictionaries are implemented using Hash Tables. Instead of explaining the implementation, I will just explain the Redis specific things :

  1. Dictionaries use a structure called dictType to extend the behaviour of a hash table. This structure has function pointers, and so the following operations are extendable: a) hash function, b) key comparison, c) key destructor, and d) value destructor.
  2. Dictionaries use the murmurhash2. (Previously they used the djb2 hash function, with seed=5381, but then the hash function was switched to murmur2. See this question for an explanation of the djb2 hash algorithm.)
  3. Redis uses Incremental Hashing, also known as Incremental Resizing. The dictionary has two hash tables. Every time the dictionary is touched, one bucket is migrated from the first (smaller) hash table to the second. This way, Redis prevents an expensive resize operation.

The Set data structure uses a Dictionary to guarantee there are no duplicates. The Sorted Set uses a dictionary to map an element to its score, which is why ZSCORE is an O(1) operation.

3. Doubly Linked Lists

The list data type is implemented using Doubly Linked Lists. Redis' implementation is straight-from-the-algorithm-textbook. The only change is that Redis stores the length in the list data structure. This ensures that LLEN has O(1) complexity.

4. Skip Lists

Redis uses Skip Lists as the underlying data structure for Sorted Sets. Wikipedia has a good introduction. William Pugh’s paper Skip Lists: A Probabilistic Alternative to Balanced Trees has more details.

Sorted Sets use both a Skip List and a Dictionary. The dictionary stores the score of each element.

Redis' Skip List implementation is different from the standard implementation in the following ways:

  1. Redis allows duplicate scores. If two nodes have the same score, they are sorted by the lexicographical order.
  2. Each node has a back pointer at level 0. This allows you to traverse elements in reverse order of the score.

5. Zip List

A Zip List is like a doubly linked list, except it does not use pointers and stores the data inline.

Each node in a doubly linked list has at 3 pointers - one forward pointer, one backward pointer and one pointer to reference the data stored at that node. Pointers require memory (8 bytes on a 64 bit system), and so for small lists, a doubly linked list is very inefficient.

A Zip List stores elements sequentially in a Redis String. Each element has a small header that stores the length and data type of the element, the offset to the next element and the offset to the previous element. These offsets replace the forward and backward pointers. Since the data is stored inline, we don’t need a data pointer.

The Zip list is used to store small lists, sorted sets and hashes. Sorted sets are flattened into a list like [element1, score1, element2, score2, element3, score3] and stored in the Zip List. Hashes are flattened into a list like [key1, value1, key2, value2] etc.

With Zip Lists you have the power to make a tradeoff between CPU and Memory. Zip Lists are memory-efficient, but they use more CPU than a linked list (or Hash table/Skip List). Finding an element in the zip list is O(n). Inserting a new element requires reallocating memory. Because of this, Redis uses this encoding only for small lists, hashes and sorted sets. You can tweak this behaviour by altering the values of <datatype>-max-ziplist-entries and <datatype>-max-ziplist-value> in redis.conf. See Redis Memory Optimization, section “Special encoding of small aggregate data types” for more information.

The comments on ziplist.c are excellent, and you can understand this data structure completely without having to read the code.

6. Int Sets

Int Sets are a fancy name for “Sorted Integer Arrays”.

In Redis, sets are usually implemented using hash tables. For small sets, a hash table is inefficient memory wise. When the set is composed of integers only, an array is often more efficient.

An Int Set is a sorted array of integers. To find an element a binary search algorithm is used. This has a complexity of O(log N). Adding new integers to this array may require a memory reallocation, which can become expensive for large integer arrays.

As a further memory optimization, Int Sets come in 3 variants with different integer sizes: 16 bits, 32 bits and 64 bits. Redis is smart enough to use the right variant depending on the size of the elements. When a new element is added and it exceeds the current size, Redis automatically migrates it to the next size. If a string is added, Redis automatically converts the Int Set to a regular Hash Table based set.

Int Sets are a tradeoff between CPU and Memory. Int Sets are extremely memory efficient, and for small sets they are faster than a hash table. But after a certain number of elements, the O(log N) retrieval time and the cost of reallocating memory become too much. Based on experiments, the optimal threshold to switch over to a regular hash table was found to be 512. However, you can increase this threshold (decreasing it doesn’t make sense) based on your application’s needs. See set-max-intset-entries in redis.conf.

7. Zip Maps

Zip Maps are dictionaries flattened and stored in a list. They are very similar to Zip Lists.

Zip Maps have been deprecated since Redis 2.6, and small hashes are stored in Zip Lists. To learn more about this encoding, refer to the comments in zipmap.c.

See also original question in stackoverflow

#66: Big-O for Eight Year Olds? (Score: 311)

Created: 2008-09-20 Last updated: 2019-05-15

Tags: algorithm, theory, big-o, metrics

I’m asking more about what this means to my code. I understand the concepts mathematically, I just have a hard time wrapping my head around what they mean conceptually. For example, if one were to perform an O(1) operation on a data structure, I understand that the number of operations it has to perform won’t grow because there are more items. And an O(n) operation would mean that you would perform a set of operations on each element. Could somebody fill in the blanks here?

  • Like what exactly would an O(n^2) operation do?
  • And what the heck does it mean if an operation is O(n log(n))?
  • And does somebody have to smoke crack to write an O(x!)?

#66 Best answer 1 of Big-O for Eight Year Olds? (Score: 295)

Created: 2008-09-20 Last updated: 2019-05-15

One way of thinking about it is this:

O(N^2) means for every element, you’re doing something with every other element, such as comparing them. Bubble sort is an example of this.

O(N log N) means for every element, you’re doing something that only needs to look at log N of the elements. This is usually because you know something about the elements that let you make an efficient choice. Most efficient sorts are an example of this, such as merge sort.

O(N!) means to do something for all possible permutations of the N elements. Traveling salesman is an example of this, where there are N! ways to visit the nodes, and the brute force solution is to look at the total cost of every possible permutation to find the optimal one.

#66 Best answer 2 of Big-O for Eight Year Olds?(Score: 269)

Created: 2008-09-20 Last updated: 2017-12-05

The big thing that Big-O notation means to your code is how it will scale when you double the amount of “things” it operates on. Here’s a concrete example:

Big-O       |  computations for 10 things |  computations for 100 things
O(1)        |   1                         |     1
O(log(n))   |   3                         |     7
O(n)        |  10                         |   100
O(n log(n)) |  30                         |   700
O(n^2)      | 100                         | 10000

So take quicksort which is O(n log(n)) vs bubble sort which is O(n^2). When sorting 10 things, quicksort is 3 times faster than bubble sort. But when sorting 100 things, it’s 14 times faster! Clearly picking the fastest algorithm is important then. When you get to databases with million rows, it can mean the difference between your query executing in 0.2 seconds, versus taking hours.

Another thing to consider is that a bad algorithm is one thing that Moore’s law cannot help. For example, if you’ve got some scientific calculation that’s O(n^3) and it can compute 100 things a day, doubling the processor speed only gets you 125 things in a day. However, knock that calculation to O(n^2) and you’re doing 1000 things a day.

clarification: Actually, Big-O says nothing about comparative performance of different algorithms at the same specific size point, but rather about comparative performance of the same algorithm at different size points:

                 computations     computations       computations
Big-O       |   for 10 things |  for 100 things |  for 1000 things
O(1)        |        1        |        1        |         1
O(log(n))   |        1        |        3        |         7
O(n)        |        1        |       10        |       100
O(n log(n)) |        1        |       33        |       664
O(n^2)      |        1        |      100        |     10000

See also original question in stackoverflow

#67: Algorithm to randomly generate an aesthetically-pleasing color palette (Score: 310)

Created: 2008-09-04 Last updated: 2013-02-11

Tags: algorithm, colors

I’m looking for a simple algorithm to generate a large number of random, aesthetically pleasing colors. So no crazy neon colors, colors reminiscent of feces, etc.

I’ve found solutions to this problem but they rely on alternative color palettes than RGB. I would rather just use straight RGB than mapping back and forth. These other solutions also can at most generate only 32 or so pleasing random colors.

Any ideas would be great.

#67 Best answer 1 of Algorithm to randomly generate an aesthetically-pleasing color palette (Score: 433)

Created: 2008-09-04 Last updated: 2012-04-02

You could average the RGB values of random colors with those of a constant color:

(example in Java)

public Color generateRandomColor(Color mix) {
    Random random = new Random();
    int red = random.nextInt(256);
    int green = random.nextInt(256);
    int blue = random.nextInt(256);

    // mix the color
    if (mix != null) {
        red = (red + mix.getRed()) / 2;
        green = (green + mix.getGreen()) / 2;
        blue = (blue + mix.getBlue()) / 2;

    Color color = new Color(red, green, blue);
    return color;

Mixing random colors with white (255, 255, 255) creates neutral pastels by increasing the lightness while keeping the hue of the original color. These randomly generated pastels usually go well together, especially in large numbers.

Here are some pastel colors generated using the above method:


You could also mix the random color with a constant pastel, which results in a tinted set of neutral colors. For example, using a light blue creates colors like these:


Going further, you could add heuristics to your generator that take into account complementary colors or levels of shading, but it all depends on the impression you want to achieve with your random colors.

Some additional resources:

#67 Best answer 2 of Algorithm to randomly generate an aesthetically-pleasing color palette(Score: 88)

Created: 2011-02-15 Last updated: 2013-02-12

I would use a color wheel and given a random position you could add the golden angle (137,5 degrees)

in order to get different colours each time that do not overlap.

Adjusting the brightness for the color wheel you could get also different bright/dark color combinations.

I’ve found this blog post that explains really well the problem and the solution using the golden ratio.

UPDATE: I’ve just found this other approach:

It’s called RYB(red, yellow, blue) method and it’s described in this paper:

as “Paint Inspired Color Compositing”.

The algorithm generates the colors and each new color is chosen to maximize its euclidian distance to the previously selected ones.

Here you can find a a good implementation in javascript:


The Sciences Po Medialb have just released a tool called “I want Hue” that generate color palettes for data scientists. Using different color spaces and generating the palettes by using k-means clustering or force vectors ( repulsion graphs) The results from those methods are very good, they show the theory and an implementation in their web page.

See also original question in stackoverflow

#68: What is the difference between tree depth and height? (Score: 309)

Created: 2010-04-08 Last updated: 2019-01-27

Tags: algorithm, data-structures, tree, nodes, terminology

This is a simple question from algorithms theory.
The difference between them is that in one case you count number of nodes and in other number of edges on the shortest path between root and concrete node.
Which is which?

#68 Best answer 1 of What is the difference between tree depth and height? (Score: 766)

Created: 2010-04-08 Last updated: 2017-11-06

I learned that depth and height are properties of a node:

  • The depth of a node is the number of edges from the node to the tree’s root node.
    A root node will have a depth of 0.

  • The height of a node is the number of edges on the longest path from the node to a leaf.
    A leaf node will have a height of 0.

Properties of a tree:

  • The height of a tree would be the height of its root node,
    or equivalently, the depth of its deepest node.

  • The diameter (or width) of a tree is the number of nodes on the longest path between any two leaf nodes. The tree below has a diameter of 6 nodes.

A tree, with height and depth of each node

#68 Best answer 2 of What is the difference between tree depth and height?(Score: 48)

Created: 2014-09-14 Last updated: 2017-07-15

height and depth of a tree is equal…

but height and depth of a node is not equal because…

the height is calculated by traversing from the given node to the deepest possible leaf.

depth is calculated from traversal from root to the given node…..

See also original question in stackoverflow

#69: Write a program to find 100 largest numbers out of an array of 1 billion numbers (Score: 303)

Created: 2013-10-07 Last updated: 2013-10-18

Tags: algorithm, sorting

I recently attended an interview where I was asked “write a program to find 100 largest numbers out of an array of 1 billion numbers.”

I was only able to give a brute force solution which was to sort the array in O(nlogn) time complexity and take the last 100 numbers.


The interviewer was looking for a better time complexity, I tried a couple of other solutions but failed to answer him. Is there a better time complexity solution?

#69 Best answer 1 of Write a program to find 100 largest numbers out of an array of 1 billion numbers (Score: 330)

Created: 2013-10-07 Last updated: 2013-10-09

You can keep a priority queue of the 100 biggest numbers, iterate through the billion numbers, whenever you encounter a number greater than the smallest number in the queue (the head of the queue), remove the head of the queue and add the new number to the queue.

EDIT: as Dev noted, with a priority queue implemented with a heap, the complexity of insertion to queue is O(logN)

In the worst case you get

billionlog2(100) which is better than 

In general, if you need the largest K numbers from a set of N numbers, the complexity is O(NlogK) rather than O(NlogN), this can be very significant when K is very small comparing to N.


The expected time of this algorithm is pretty interesting, since in each iteration an insertion may or may not occur. The probability of the i’th number to be inserted to the queue is the probability of a random variable being larger than at least i-K random variables from the same distribution (the first k numbers are automatically added to the queue). We can use order statistics (see link) to calculate this probability. For example, lets assume the numbers were randomly selected uniformly from {0, 1}, the expected value of (i-K)th number (out of i numbers) is (i-k)/i, and chance of a random variable being larger than this value is 1-[(i-k)/i] = k/i.

Thus, the expected number of insertions is:

enter image description here

And the expected running time can be expressed as:

enter image description here

(k time to generate the queue with the first k elements, then n-k comparisons, and the expected number of insertions as described above, each takes an average log(k)/2 time)

Note that when N is very large comparing to K, this expression is a lot closer to n rather than NlogK. This is somewhat intuitive, as in the case of the question, even after 10000 iterations (which is very small comparing to a billion), the chance of a number to be inserted to the queue is very small.

#69 Best answer 2 of Write a program to find 100 largest numbers out of an array of 1 billion numbers(Score: 140)

Created: 2013-10-08 Last updated: 2016-01-18

If this is asked in an interview, I think the interviewer probably wants to see your problem solving process, not just your knowledge of algorithms.

The description is quite general so maybe you can ask him the range or meaning of these numbers to make the problem clear. Doing this may impress an interviewer. If, for example, these numbers stands for people’s age of within a country (e.g. China),then it’s a much easier problem. With a reasonable assumption that nobody alive is older than 200, you can use an int array of size 200(maybe 201) to count the number of people with the same age in just one iteration. Here the index means the age. After this it’s a piece of cake to find 100 largest number. By the way this algo is called counting sort.

Anyway, making the question more specific and clearer is good for you in an interview.

See also original question in stackoverflow

#70: Peak signal detection in realtime timeseries data (Score: 302)

Created: 2014-03-22 Last updated: 2021-03-08

Tags: algorithm, language-agnostic, time-series, signal-processing, data-analysis

Update: The best performing algorithm so far is this one.

This question explores robust algorithms for detecting sudden peaks in real-time timeseries data.

Consider the following example data:

Plot of data

Example of this data is in Matlab format (but this question is not about the language but about the algorithm):

p = [1 1 1.1 1 0.9 1 1 1.1 1 0.9 1 1.1 1 1 0.9 1 1 1.1 1 1 1 1 1.1 0.9 1 1.1 1 1 0.9, ...
     1 1.1 1 1 1.1 1 0.8 0.9 1 1.2 0.9 1 1 1.1 1.2 1 1.5 1 3 2 5 3 2 1 1 1 0.9 1 1, ... 
     3 2.6 4 3 3.2 2 1 1 0.8 4 4 2 2.5 1 1 1];

You can clearly see that there are three large peaks and some small peaks. This dataset is a specific example of the class of timeseries datasets that the question is about. This class of datasets has two general features:

  1. There is basic noise with a general mean
  2. There are large ‘peaks’ or ‘higher data points’ that significantly deviate from the noise.

Let’s also assume the following:

  • The width of the peaks cannot be determined beforehand
  • The height of the peaks significantly deviates from the other values
  • The algorithm updates in realtime (so updates with each new datapoint)

For such a situation, a boundary value needs to be constructed which triggers signals. However, the boundary value cannot be static and must be determined realtime based on an algorithm.

My Question: what is a good algorithm to calculate such thresholds in realtime? Are there specific algorithms for such situations? What are the most well-known algorithms?

Robust algorithms or useful insights are all highly appreciated. (can answer in any language: it’s about the algorithm)

#70 Best answer 1 of Peak signal detection in realtime timeseries data (Score: 427)

Created: 2014-03-25 Last updated: 2021-03-19

Robust peak detection algorithm (using z-scores)

I came up with an algorithm that works very well for these types of datasets. It is based on the principle of dispersion: if a new datapoint is a given x number of standard deviations away from some moving mean, the algorithm signals (also called z-score). The algorithm is very robust because it constructs a separate moving mean and deviation, such that signals do not corrupt the threshold. Future signals are therefore identified with approximately the same accuracy, regardless of the amount of previous signals. The algorithm takes 3 inputs: lag = the lag of the moving window, threshold = the z-score at which the algorithm signals and influence = the influence (between 0 and 1) of new signals on the mean and standard deviation. For example, a lag of 5 will use the last 5 observations to smooth the data. A threshold of 3.5 will signal if a datapoint is 3.5 standard deviations away from the moving mean. And an influence of 0.5 gives signals half of the influence that normal datapoints have. Likewise, an influence of 0 ignores signals completely for recalculating the new threshold. An influence of 0 is therefore the most robust option (but assumes stationarity); putting the influence option at 1 is least robust. For non-stationary data, the influence option should therefore be put somewhere between 0 and 1.

It works as follows:


# Let y be a vector of timeseries data of at least length lag+2
# Let mean() be a function that calculates the mean
# Let std() be a function that calculates the standard deviaton
# Let absolute() be the absolute value function

# Settings (the ones below are examples: choose what is best for your data)
set lag to 5;          # lag 5 for the smoothing functions
set threshold to 3.5;  # 3.5 standard deviations for signal
set influence to 0.5;  # between 0 and 1, where 1 is normal influence, 0.5 is half

# Initialize variables
set signals to vector 0,...,0 of length of y;   # Initialize signal results
set filteredY to y(1),...,y(lag)                # Initialize filtered series
set avgFilter to null;                          # Initialize average filter
set stdFilter to null;                          # Initialize std. filter
set avgFilter(lag) to mean(y(1),...,y(lag));    # Initialize first value
set stdFilter(lag) to std(y(1),...,y(lag));     # Initialize first value

for i=lag+1,...,t do
  if absolute(y(i) - avgFilter(i-1)) > threshold*stdFilter(i-1) then
    if y(i) > avgFilter(i-1) then
      set signals(i) to +1;                     # Positive signal
      set signals(i) to -1;                     # Negative signal
    set filteredY(i) to influence*y(i) + (1-influence)*filteredY(i-1);
    set signals(i) to 0;                        # No signal
    set filteredY(i) to y(i);
  set avgFilter(i) to mean(filteredY(i-lag+1),...,filteredY(i));
  set stdFilter(i) to std(filteredY(i-lag+1),...,filteredY(i));

Rules of thumb for selecting good parameters for your data can be found below.


Demonstration of robust thresholding algorithm

The Matlab code for this demo can be found here. To use the demo, simply run it and create a time series yourself by clicking on the upper chart. The algorithm starts working after drawing lag number of observations.


For the original question, this algorithm will give the following output when using the following settings: lag = 30, threshold = 5, influence = 0:

Thresholding algorithm example

Implementations in different programming languages:

Rules of thumb for configuring the algorithm

lag: the lag parameter determines how much your data will be smoothed and how adaptive the algorithm is to changes in the long-term average of the data. The more stationary your data is, the more lags you should include (this should improve the robustness of the algorithm). If your data contains time-varying trends, you should consider how quickly you want the algorithm to adapt to these trends. I.e., if you put lag at 10, it takes 10 ‘periods’ before the algorithm’s treshold is adjusted to any systematic changes in the long-term average. So choose the lag parameter based on the trending behavior of your data and how adaptive you want the algorithm to be.

influence: this parameter determines the influence of signals on the algorithm’s detection threshold. If put at 0, signals have no influence on the threshold, such that future signals are detected based on a threshold that is calculated with a mean and standard deviation that is not influenced by past signals. If put at 0.5, signals have half the influence of normal data points. Another way to think about this is that if you put the influence at 0, you implicitly assume stationarity (i.e. no matter how many signals there are, you always expect the time series to return to the same average over the long term). If this is not the case, you should put the influence parameter somewhere between 0 and 1, depending on the extent to which signals can systematically influence the time-varying trend of the data. E.g., if signals lead to a structural break of the long-term average of the time series, the influence parameter should be put high (close to 1) so the threshold can react to structural breaks quickly.

threshold: the threshold parameter is the number of standard deviations from the moving mean above which the algorithm will classify a new datapoint as being a signal. For example, if a new datapoint is 4.0 standard deviations above the moving mean and the threshold parameter is set as 3.5, the algorithm will identify the datapoint as a signal. This parameter should be set based on how many signals you expect. For example, if your data is normally distributed, a threshold (or: z-score) of 3.5 corresponds to a signaling probability of 0.00047 (from this table), which implies that you expect a signal once every 2128 datapoints (1/0.00047). The threshold therefore directly influences how sensitive the algorithm is and thereby also determines how often the algorithm signals. Examine your own data and choose a sensible threshold that makes the algorithm signal when you want it to (some trial-and-error might be needed here to get to a good threshold for your purpose).

WARNING: The code above always loops over all datapoints everytime it runs. When implementing this code, make sure to split the calculation of the signal into a separate function (without the loop). Then when a new datapoint arrives, update filteredY, avgFilter and stdFilter once. Do not recalculate the signals for all data everytime there is a new datapoint (like in the example above), that would be extremely inefficient and slow in real-time applications.

Other ways to modify the algorithm (for potential improvements) are:

  1. Use median instead of mean
  2. Use a robust measure of scale, such as the MAD, instead of the standard deviation
  3. Use a signalling margin, so the signal doesn’t switch too often
  4. Change the way the influence parameter works
  5. Treat up and down signals differently (asymmetric treatment)
  6. Create a separate influence parameter for the mean and std (as in this Swift translation)

(Known) academic citations to this StackOverflow answer:

Other works using the algorithm from this answer

Other applications of the algorithm from this answer

Links to other peak detection algorithms

How to reference this algorithm:

Brakel, J.P.G. van (2014). “Robust peak detection algorithm using z-scores”. Stack Overflow. Available at: (version: 2020-11-08).

If you use this function somewhere, please credit me by using above reference. If you have any questions about the algorithm, post them in the comments below or reach out to me on LinkedIn.

#70 Best answer 2 of Peak signal detection in realtime timeseries data(Score: 51)

Created: 2017-04-20 Last updated: 2019-07-28

Here is the Python / numpy implementation of the smoothed z-score algorithm (see answer above). You can find the gist here.

#!/usr/bin/env python
# Implementation of algorithm from
import numpy as np
import pylab

def thresholding_algo(y, lag, threshold, influence):
    signals = np.zeros(len(y))
    filteredY = np.array(y)
    avgFilter = [0]*len(y)
    stdFilter = [0]*len(y)
    avgFilter[lag - 1] = np.mean(y[0:lag])
    stdFilter[lag - 1] = np.std(y[0:lag])
    for i in range(lag, len(y)):
        if abs(y[i] - avgFilter[i-1]) > threshold * stdFilter [i-1]:
            if y[i] > avgFilter[i-1]:
                signals[i] = 1
                signals[i] = -1

            filteredY[i] = influence * y[i] + (1 - influence) * filteredY[i-1]
            avgFilter[i] = np.mean(filteredY[(i-lag+1):i+1])
            stdFilter[i] = np.std(filteredY[(i-lag+1):i+1])
            signals[i] = 0
            filteredY[i] = y[i]
            avgFilter[i] = np.mean(filteredY[(i-lag+1):i+1])
            stdFilter[i] = np.std(filteredY[(i-lag+1):i+1])

    return dict(signals = np.asarray(signals),
                avgFilter = np.asarray(avgFilter),
                stdFilter = np.asarray(stdFilter))

Below is the test on the same dataset that yields the same plot as in the original answer for R/Matlab

# Data
y = np.array([1,1,1.1,1,0.9,1,1,1.1,1,0.9,1,1.1,1,1,0.9,1,1,1.1,1,1,1,1,1.1,0.9,1,1.1,1,1,0.9,

# Settings: lag = 30, threshold = 5, influence = 0
lag = 30
threshold = 5
influence = 0

# Run algo with settings from above
result = thresholding_algo(y, lag=lag, threshold=threshold, influence=influence)

# Plot result
pylab.plot(np.arange(1, len(y)+1), y)

pylab.plot(np.arange(1, len(y)+1),
           result["avgFilter"], color="cyan", lw=2)

pylab.plot(np.arange(1, len(y)+1),
           result["avgFilter"] + threshold * result["stdFilter"], color="green", lw=2)

pylab.plot(np.arange(1, len(y)+1),
           result["avgFilter"] - threshold * result["stdFilter"], color="green", lw=2)

pylab.step(np.arange(1, len(y)+1), result["signals"], color="red", lw=2)
pylab.ylim(-1.5, 1.5)

See also original question in stackoverflow

  1. This page use API to get the relevant data from stackoverflow community.
  2. Content license on this page is CC BY-SA 3.0.
  3. score = up votes - down votes.