Most votes on algorithm questions 4

Most votes on algorithm questions 4. #31 What is the most efficient/elegant way to parse a flat table into a tree? #32 How to replace all occurrences of a character in string? #33 Why does Java's hashCode() in String use 31 as a multiplier? #34 Best way to reverse a string #35 What's the Hi/Lo algorithm? #36 Algorithm to detect overlapping periods #37 Why do we check up to the square root of a prime number to determine if it is prime? #38 What is an NP-complete in computer science? #39 How to detect a loop in a linked list? #40 How does the Google "Did you mean?" Algorithm work?

Read all the top votes questions and answers in a single page.

#31: What is the most efficient/elegant way to parse a flat table into a tree? (Score: 539)

Created: 2008-10-10 Last updated: 2017-05-23

Tags: sql, algorithm, recursion, tree, hierarchical-data

Assume you have a flat table that stores an ordered tree hierarchy:

Id   Name         ParentId   Order
 1   'Node 1'            0      10
 2   'Node 1.1'          1      10
 3   'Node 2'            0      20
 4   'Node 1.1.1'        2      10
 5   'Node 2.1'          3      10
 6   'Node 1.2'          1      20

Here’s a diagram, where we have [id] Name. Root node 0 is fictional.

                       [0] ROOT
                          /    \ 
              [1] Node 1          [3] Node 2
              /       \                   \
    [2] Node 1.1     [6] Node 1.2      [5] Node 2.1
 [4] Node 1.1.1

What minimalistic approach would you use to output that to HTML (or text, for that matter) as a correctly ordered, correctly indented tree?

Assume further you only have basic data structures (arrays and hashmaps), no fancy objects with parent/children references, no ORM, no framework, just your two hands. The table is represented as a result set, which can be accessed randomly.

Pseudo code or plain English is okay, this is purely a conceptional question.

Bonus question: Is there a fundamentally better way to store a tree structure like this in a RDBMS?


To answer one commenter’s (Mark Bessey’s) question: A root node is not necessary, because it is never going to be displayed anyway. ParentId = 0 is the convention to express “these are top level”. The Order column defines how nodes with the same parent are going to be sorted.

The “result set” I spoke of can be pictured as an array of hashmaps (to stay in that terminology). For my example was meant to be already there. Some answers go the extra mile and construct it first, but thats okay.

The tree can be arbitrarily deep. Each node can have N children. I did not exactly have a “millions of entries” tree in mind, though.

Don’t mistake my choice of node naming (‘Node 1.1.1’) for something to rely on. The nodes could equally well be called ‘Frank’ or ‘Bob’, no naming structure is implied, this was merely to make it readable.

I have posted my own solution so you guys can pull it to pieces.

#31 Best answer 1 of What is the most efficient/elegant way to parse a flat table into a tree? (Score: 471)

Created: 2008-10-10 Last updated: 2020-01-01

Now that MySQL 8.0 supports recursive queries, we can say that all popular SQL databases support recursive queries in standard syntax.

    SELECT m.* FROM MyTABLE AS m JOIN MyTree AS t ON m.ParentId = t.Id

I tested recursive queries in MySQL 8.0 in my presentation Recursive Query Throwdown in 2017.

Below is my original answer from 2008:

There are several ways to store tree-structured data in a relational database. What you show in your example uses two methods:

  • Adjacency List (the “parent” column) and
  • Path Enumeration (the dotted-numbers in your name column).

Another solution is called Nested Sets, and it can be stored in the same table too. Read “Trees and Hierarchies in SQL for Smarties” by Joe Celko for a lot more information on these designs.

I usually prefer a design called Closure Table (aka “Adjacency Relation”) for storing tree-structured data. It requires another table, but then querying trees is pretty easy.

I cover Closure Table in my presentation Models for Hierarchical Data with SQL and PHP and in my book SQL Antipatterns: Avoiding the Pitfalls of Database Programming.

CREATE TABLE ClosureTable (
  ancestor_id   INT NOT NULL REFERENCES FlatTable(id),
  descendant_id INT NOT NULL REFERENCES FlatTable(id),
  PRIMARY KEY (ancestor_id, descendant_id)

Store all paths in the Closure Table, where there is a direct ancestry from one node to another. Include a row for each node to reference itself. For example, using the data set you showed in your question:

INSERT INTO ClosureTable (ancestor_id, descendant_id) VALUES
  (1,1), (1,2), (1,4), (1,6),
  (2,2), (2,4),
  (3,3), (3,5),

Now you can get a tree starting at node 1 like this:

FROM FlatTable f 
  JOIN ClosureTable a ON ( = a.descendant_id)
WHERE a.ancestor_id = 1;

The output (in MySQL client) looks like the following:

| id |
|  1 | 
|  2 | 
|  4 | 
|  6 | 

In other words, nodes 3 and 5 are excluded, because they’re part of a separate hierarchy, not descending from node 1.

Re: comment from e-satis about immediate children (or immediate parent). You can add a “path_length” column to the ClosureTable to make it easier to query specifically for an immediate child or parent (or any other distance).

INSERT INTO ClosureTable (ancestor_id, descendant_id, path_length) VALUES
  (1,1,0), (1,2,1), (1,4,2), (1,6,1),
  (2,2,0), (2,4,1),
  (3,3,0), (3,5,1),

Then you can add a term in your search for querying the immediate children of a given node. These are descendants whose path_length is 1.

FROM FlatTable f 
  JOIN ClosureTable a ON ( = a.descendant_id)
WHERE a.ancestor_id = 1
  AND path_length = 1;

| id |
|  2 | 
|  6 | 

Re comment from @ashraf: “How about sorting the whole tree [by name]?”

Here’s an example query to return all nodes that are descendants of node 1, join them to the FlatTable that contains other node attributes such as name, and sort by the name.

FROM FlatTable f 
JOIN ClosureTable a ON ( = a.descendant_id)
WHERE a.ancestor_id = 1

Re comment from @Nate:

SELECT, GROUP_CONCAT(b.ancestor_id order by b.path_length desc) AS breadcrumbs
FROM FlatTable f 
JOIN ClosureTable a ON ( = a.descendant_id) 
JOIN ClosureTable b ON (b.descendant_id = a.descendant_id) 
WHERE a.ancestor_id = 1 
GROUP BY a.descendant_id 

| name       | breadcrumbs |
| Node 1     | 1           |
| Node 1.1   | 1,2         |
| Node 1.1.1 | 1,2,4       |
| Node 1.2   | 1,6         |

A user suggested an edit today. SO moderators approved the edit, but I am reversing it.

The edit suggested that the ORDER BY in the last query above should be ORDER BY b.path_length,, presumably to make sure the ordering matches the hierarchy. But this doesn’t work, because it would order “Node 1.1.1” after “Node 1.2”.

If you want the ordering to match the hierarchy in a sensible way, that is possible, but not simply by ordering by the path length. For example, see my answer to MySQL Closure Table hierarchical database - How to pull information out in the correct order.

#31 Best answer 2 of What is the most efficient/elegant way to parse a flat table into a tree?(Score: 58)

Created: 2008-10-11 Last updated: 2011-11-15

If you use nested sets (sometimes referred to as Modified Pre-order Tree Traversal) you can extract the entire tree structure or any subtree within it in tree order with a single query, at the cost of inserts being more expensive, as you need to manage columns which describe an in-order path through thee tree structure.

For django-mptt, I used a structure like this:

id  parent_id  tree_id  level  lft  rght
--  ---------  -------  -----  ---  ----
 1       null        1      0    1    14
 2          1        1      1    2     7
 3          2        1      2    3     4
 4          2        1      2    5     6
 5          1        1      1    8    13
 6          5        1      2    9    10
 7          5        1      2    11   12

Which describes a tree which looks like this (with id representing each item):

 +-- 2
 |   +-- 3
 |   +-- 4
 +-- 5
     +-- 6
     +-- 7

Or, as a nested set diagram which makes it more obvious how the lft and rght values work:

|  Root 1                                                                  |
|   ________________________________    ________________________________   |
|  |  Child 1.1                     |  |  Child 1.2                     |  |
|  |   ___________    ___________   |  |   ___________    ___________   |  |
|  |  |  C 1.1.1  |  |  C 1.1.2  |  |  |  |  C 1.2.1  |  |  C 1.2.2  |  |  |
1  2  3___________4  5___________6  7  8  9___________10 11__________12 13 14
|  |________________________________|  |________________________________|  |

As you can see, to get the entire subtree for a given node, in tree order, you simply have to select all rows which have lft and rght values between its lft and rght values. It’s also simple to retrieve the tree of ancestors for a given node.

The level column is a bit of denormalisation for convenience more than anything and the tree_id column allows you to restart the lft and rght numbering for each top-level node, which reduces the number of columns affected by inserts, moves and deletions, as the lft and rght columns have to be adjusted accordingly when these operations take place in order to create or close gaps. I made some development notes at the time when I was trying to wrap my head around the queries required for each operation.

In terms of actually working with this data to display a tree, I created a tree_item_iterator utility function which, for each node, should give you sufficient information to generate whatever kind of display you want.

More info about MPTT:

See also original question in stackoverflow

#32: How to replace all occurrences of a character in string? (Score: 524)

Created: 2010-05-24 Last updated: 2016-07-22

Tags: c++, algorithm, str-replace, stdstring

What is the effective way to replace all occurrences of a character with another character in std::string?

#32 Best answer 1 of How to replace all occurrences of a character in string? (Score: 815)

Created: 2010-05-24 Last updated: 2014-02-28

std::string doesn’t contain such function but you could use stand-alone replace function from algorithm header.

#include <algorithm>
#include <string>

void some_func() {
  std::string s = "example string";
  std::replace( s.begin(), s.end(), 'x', 'y'); // replace all 'x' to 'y'

#32 Best answer 2 of How to replace all occurrences of a character in string?(Score: 140)

Created: 2014-06-19 Last updated: 2020-06-20

The question is centered on character replacement, but, as I found this page very useful (especially Konrad’s remark), I’d like to share this more generalized implementation, which allows to deal with substrings as well:

std::string ReplaceAll(std::string str, const std::string& from, const std::string& to) {
    size_t start_pos = 0;
    while((start_pos = str.find(from, start_pos)) != std::string::npos) {
        str.replace(start_pos, from.length(), to);
        start_pos += to.length(); // Handles case where 'to' is a substring of 'from'
    return str;


std::cout << ReplaceAll(string("Number Of Beans"), std::string(" "), std::string("_")) << std::endl;
std::cout << ReplaceAll(string("ghghjghugtghty"), std::string("gh"), std::string("X")) << std::endl;
std::cout << ReplaceAll(string("ghghjghugtghty"), std::string("gh"), std::string("h")) << std::endl;






The above can be implemented in a more suitable way, in case performances are of your concern, by returning nothing (void) and performing the changes directly on the string str given as argument, passed by address instead of by value. This would avoid useless and costly copy of the original string, while returning the result. Your call, then…

Code :

static inline void ReplaceAll2(std::string &str, const std::string& from, const std::string& to)
    // Same inner code...
    // No return statement

Hope this will be helpful for some others…

See also original question in stackoverflow

#33: Why does Java's hashCode() in String use 31 as a multiplier? (Score: 513)

Created: 2008-11-18 Last updated: 2019-01-03

Tags: java, string, algorithm, hash

Per the Java documentation, the hash code for a String object is computed as:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation.

Why is 31 used as a multiplier?

I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?

#33 Best answer 1 of Why does Java's hashCode() in String use 31 as a multiplier? (Score: 430)

Created: 2008-11-18

According to Joshua Bloch’s Effective Java (a book that can’t be recommended enough, and which I bought thanks to continual mentions on stackoverflow):

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

(from Chapter 3, Item 9: Always override hashcode when you override equals, page 48)

#33 Best answer 2 of Why does Java's hashCode() in String use 31 as a multiplier?(Score: 83)

Created: 2008-11-18 Last updated: 2020-09-12

Goodrich and Tamassia computed from over 50,000 English words (formed as the union of the word lists provided in two variants of Unix) that using the constants 31, 33, 37, 39, and 41 will produce fewer than 7 collisions in each case. This may be the reason that so many Java implementations choose such constants.

See section 9.2 Hash Tables (page 522) of Data Structures and Algorithms in Java.

See also original question in stackoverflow

#34: Best way to reverse a string (Score: 482)

Created: 2008-10-23 Last updated: 2013-03-03

Tags: c#, .net, performance, algorithm, unicode

I’ve just had to write a string reverse function in C# 2.0 (i.e. LINQ not available) and came up with this:

public string Reverse(string text)
    char[] cArray = text.ToCharArray();
    string reverse = String.Empty;
    for (int i = cArray.Length - 1; i > -1; i--)
        reverse += cArray[i];
    return reverse;

Personally I’m not crazy about the function and am convinced that there’s a better way to do it. Is there?

#34 Best answer 1 of Best way to reverse a string (Score: 658)

Created: 2008-10-23 Last updated: 2020-11-15

public static string Reverse( string s )
    char[] charArray = s.ToCharArray();
    Array.Reverse( charArray );
    return new string( charArray );

#34 Best answer 2 of Best way to reverse a string(Score: 199)

Created: 2013-02-27 Last updated: 2013-07-24

Here a solution that properly reverses the string "Les Mise\u0301rables" as "selbare\u0301siM seL". This should render just like selbarésiM seL, not selbaŕesiM seL (note the position of the accent), as would the result of most implementations based on code units (Array.Reverse, etc) or even code points (reversing with special care for surrogate pairs).

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class Test
    private static IEnumerable<string> GraphemeClusters(this string s) {
        var enumerator = StringInfo.GetTextElementEnumerator(s);
        while(enumerator.MoveNext()) {
            yield return (string)enumerator.Current;
    private static string ReverseGraphemeClusters(this string s) {
        return string.Join("", s.GraphemeClusters().Reverse().ToArray());
    public static void Main()
        var s = "Les Mise\u0301rables";
        var r = s.ReverseGraphemeClusters();

(And live running example here:

It simply uses the .NET API for grapheme cluster iteration, which has been there since ever, but a bit “hidden” from view, it seems.

See also original question in stackoverflow

#35: What's the Hi/Lo algorithm? (Score: 481)

Created: 2008-11-11 Last updated: 2020-12-24

Tags: database, algorithm, primary-key, identifier, hilo

What’s the Hi/Lo algorithm?

I’ve found this in the NHibernate documentation (it’s one method to generate unique keys, section, but I haven’t found a good explanation of how it works.

I know that Nhibernate handles it, and I don’t need to know the inside, but I’m just curious.

#35 Best answer 1 of What's the Hi/Lo algorithm? (Score: 565)

Created: 2008-11-11

The basic idea is that you have two numbers to make up a primary key- a “high” number and a “low” number. A client can basically increment the “high” sequence, knowing that it can then safely generate keys from the entire range of the previous “high” value with the variety of “low” values.

For instance, supposing you have a “high” sequence with a current value of 35, and the “low” number is in the range 0-1023. Then the client can increment the sequence to 36 (for other clients to be able to generate keys while it’s using 35) and know that keys 35/0, 35/1, 35/2, 35/3… 35/1023 are all available.

It can be very useful (particularly with ORMs) to be able to set the primary keys on the client side, instead of inserting values without primary keys and then fetching them back onto the client. Aside from anything else, it means you can easily make parent/child relationships and have the keys all in place before you do any inserts, which makes batching them simpler.

#35 Best answer 2 of What's the Hi/Lo algorithm?(Score: 163)

Created: 2008-11-11

In addition to Jon’s answer:

It is used to be able to work disconnected. A client can then ask the server for a hi number and create objects increasing the lo number itself. It does not need to contact the server until the lo range is used up.

See also original question in stackoverflow

#36: Algorithm to detect overlapping periods (Score: 478)

Created: 2012-11-22 Last updated: 2018-05-22

Tags: c#, .net, algorithm, datetime, time

I’ve to detect if two time periods are overlapping.
Every period has a start date and an end date.
I need to detect if my first time period (A) is overlapping with another one(B/C).
In my case, if the start of B is equal to the end of A, they are not overlapping(the inverse too)
I found the following cases:

enter image description here

So actually I’m doing this like this:

tStartA < tStartB && tStartB < tEndA //For case 1
tStartA < tEndB && tEndB <= tEndA //For case 2
tStartB < tStartA  && tEndB > tEndA //For case 3

(The case 4 is taken in the account either in case 1 or in case 2)

It works, but it seems not very efficient.

So, first is there an existing class in c# that can modelize this(a time period), something like a timespan, but with a fixed start date.

Secondly: Is there already a c# code(like in the DateTime class) which can handle this?

Third: if no, what would be your approach to make this comparison the most fast?

#36 Best answer 1 of Algorithm to detect overlapping periods (Score: 831)

Created: 2012-11-22

Simple check to see if two time periods overlap:

bool overlap = a.start < b.end && b.start < a.end;

or in your code:

bool overlap = tStartA < tEndB && tStartB < tEndA;

(Use <= instead of < if you change your mind about wanting to say that two periods that just touch each other overlap.)

#36 Best answer 2 of Algorithm to detect overlapping periods(Score: 48)

Created: 2012-11-22 Last updated: 2015-11-29

There is a wonderful library with good reviews on CodeProject:

That library does a lot of work concerning overlap, intersecting them, etc. It’s too big to copy/paste all of it, but I’ll see which specific parts which can be useful to you.

See also original question in stackoverflow

#37: Why do we check up to the square root of a prime number to determine if it is prime? (Score: 460)

Created: 2011-04-27 Last updated: 2016-10-26

Tags: algorithm, primes, primality-test

To test whether a number is prime or not, why do we have to test whether it is divisible only up to the square root of that number?

#37 Best answer 1 of Why do we check up to the square root of a prime number to determine if it is prime? (Score: 755)

Created: 2011-04-27 Last updated: 2020-05-26

If a number n is not a prime, it can be factored into two factors a and b:

n = a * b

Now a and b can’t be both greater than the square root of n, since then the product a * b would be greater than sqrt(n) * sqrt(n) = n. So in any factorization of n, at least one of the factors must be smaller than the square root of n, and if we can’t find any factors less than or equal to the square root, n must be a prime.

#37 Best answer 2 of Why do we check up to the square root of a prime number to determine if it is prime?(Score: 360)

Created: 2011-04-28 Last updated: 2019-04-18

Let’s say m = sqrt(n) then m × m = n. Now if n is not a prime then n can be written as n = a × b, so m × m = a × b. Notice that m is a real number whereas n, a and b are natural numbers.

Now there can be 3 cases:

  1. a > m ⇒ b < m
  2. a = m ⇒ b = m
  3. a < m ⇒ b > m

In all 3 cases, min(a, b) ≤ m. Hence if we search till m, we are bound to find at least one factor of n, which is enough to show that n is not prime.

See also original question in stackoverflow

#38: What is an NP-complete in computer science? (Score: 456)

Created: 2008-10-17 Last updated: 2017-06-15

Tags: algorithm, language-agnostic, mathematical-optimization, theory, np-complete

What is an NP-complete problem? Why is it such an important topic in computer science?

#38 Best answer 1 of What is an NP-complete in computer science? (Score: 447)

Created: 2008-11-24 Last updated: 2020-06-20

What is NP?

NP is the set of all decision problems (questions with a yes-or-no answer) for which the ‘yes’-answers can be verified in polynomial time (O(nk) where n is the problem size, and k is a constant) by a deterministic Turing machine. Polynomial time is sometimes used as the definition of fast or quickly.

What is P?

P is the set of all decision problems which can be solved in polynomial time by a deterministic Turing machine. Since they can be solved in polynomial time, they can also be verified in polynomial time. Therefore P is a subset of NP.

What is NP-Complete?

A problem x that is in NP is also in NP-Complete if and only if every other problem in NP can be quickly (ie. in polynomial time) transformed into x.

In other words:

  1. x is in NP, and
  2. Every problem in NP is reducible to x

So, what makes NP-Complete so interesting is that if any one of the NP-Complete problems was to be solved quickly, then all NP problems can be solved quickly.

See also the post What’s “P=NP?”, and why is it such a famous question?

What is NP-Hard?

NP-Hard are problems that are at least as hard as the hardest problems in NP. Note that NP-Complete problems are also NP-hard. However not all NP-hard problems are NP (or even a decision problem), despite having NP as a prefix. That is the NP in NP-hard does not mean non-deterministic polynomial time. Yes, this is confusing, but its usage is entrenched and unlikely to change.

#38 Best answer 2 of What is an NP-complete in computer science?(Score: 218)

Created: 2008-10-17 Last updated: 2019-12-12

NP stands for Non-deterministic Polynomial time.

This means that the problem can be solved in Polynomial time using a Non-deterministic Turing machine (like a regular Turing machine but also including a non-deterministic “choice” function). Basically, a solution has to be testable in poly time. If that’s the case, and a known NP problem can be solved using the given problem with modified input (an NP problem can be reduced to the given problem) then the problem is NP complete.

The main thing to take away from an NP-complete problem is that it cannot be solved in polynomial time in any known way. NP-Hard/NP-Complete is a way of showing that certain classes of problems are not solvable in realistic time.

Edit: As others have noted, there are often approximate solutions for NP-Complete problems. In this case, the approximate solution usually gives an approximation bound using special notation which tells us how close the approximation is.

See also original question in stackoverflow

#39: How to detect a loop in a linked list? (Score: 452)

Created: 2010-04-18 Last updated: 2013-05-05

Tags: java, algorithm, data-structures, linked-list

Say you have a linked list structure in Java. It’s made up of Nodes:

class Node {
    Node next;
    // some user data

and each Node points to the next node, except for the last Node, which has null for next. Say there is a possibility that the list can contain a loop - i.e. the final Node, instead of having a null, has a reference to one of the nodes in the list which came before it.

What’s the best way of writing

boolean hasLoop(Node first)

which would return true if the given Node is the first of a list with a loop, and false otherwise? How could you write so that it takes a constant amount of space and a reasonable amount of time?

Here’s a picture of what a list with a loop looks like:

alt text

#39 Best answer 1 of How to detect a loop in a linked list? (Score: 565)

Created: 2010-04-18 Last updated: 2020-02-27

You can make use of Floyd’s cycle-finding algorithm, also known as tortoise and hare algorithm.

The idea is to have two references to the list and move them at different speeds. Move one forward by 1 node and the other by 2 nodes.

  • If the linked list has a loop they will definitely meet.
  • Else either of the two references(or their next) will become null.

Java function implementing the algorithm:

boolean hasLoop(Node first) {
	if(first == null) // list does not no loop either
        return false;

	Node slow, fast; // create two references.
	slow = fast = first; // make both refer to the start of the list
	while(true) {

        slow =;          // 1 hop

        if( != null)
            fast =; // 2 hops
            return false;          // next node null => no loop

        if(slow == null || fast == null) // if either hits loop
            return false;

        if(slow == fast) // if the two ever meet...we must have a loop
            return true;

#39 Best answer 2 of How to detect a loop in a linked list?(Score: 135)

Created: 2010-07-21

Here’s a refinement of the Fast/Slow solution, which correctly handles odd length lists and improves clarity.

boolean hasLoop(Node first) {
	Node slow = first;
    Node fast = first;
	while(fast != null && != null) {
        slow =;          // 1 hop
        fast =;     // 2 hops 

        if(slow == fast)  // fast caught up to slow, so there is a loop
            return true;
    return false;  // fast reached null, so the list terminates

See also original question in stackoverflow

#40: How does the Google "Did you mean?" Algorithm work? (Score: 450)

Created: 2008-11-20 Last updated: 2010-12-06

Tags: algorithm, machine-learning, nlp, spell-checking, text-search

I’ve been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I’ve been really impressed with some search engines ability to very quickly respond to queries with “Did you mean: xxxx”.

I need to be able to intelligently take a user query and respond with not only raw search results but also with a “Did you mean?” response when there is a highly likely alternative answer etc

[I’m developing in ASP.NET (VB - don’t hold it against me! )]

UPDATE: OK, how can I mimic this without the millions of ‘unpaid users’?

  • Generate typos for each ‘known’ or ‘correct’ term and perform lookups?
  • Some other more elegant method?

#40 Best answer 1 of How does the Google "Did you mean?" Algorithm work? (Score: 379)

Created: 2008-11-20 Last updated: 2010-08-06

Here’s the explanation directly from the source ( almost )

Search 101!

at min 22:03

Worth watching!

Basically and according to Douglas Merrill former CTO of Google it is like this:

  1. You write a ( misspelled ) word in google

  2. You don’t find what you wanted ( don’t click on any results )

  3. You realize you misspelled the word so you rewrite the word in the search box.

  4. You find what you want ( you click in the first links )

This pattern multiplied millions of times, shows what are the most common misspells and what are the most “common” corrections.

This way Google can almost instantaneously, offer spell correction in every language.

Also this means if overnight everyone start to spell night as “nigth” google would suggest that word instead.


@ThomasRutter: Douglas describe it as “statistical machine learning”.

They know who correct the query, because they know which query comes from which user ( using cookies )

If the users perform a query, and only 10% of the users click on a result and 90% goes back and type another query ( with the corrected word ) and this time that 90% clicks on a result, then they know they have found a correction.

They can also know if those are “related” queries of two different, because they have information of all the links they show.

Furthermore, they are now including the context into the spell check, so they can even suggest different word depending on the context.

See this demo of google wave ( @ 44m 06s ) that shows how the context is taken into account to automatically correct the spelling.

Here it is explained how that natural language processing works.

And finally here is an awesome demo of what can be done adding automatic machine translation ( @ 1h 12m 47s ) to the mix.

I've added anchors of minute and seconds to the videos to skip directly to the content, if they don't work, try reloading the page or scrolling by hand to the mark.

#40 Best answer 2 of How does the Google "Did you mean?" Algorithm work?(Score: 109)

Created: 2008-11-20 Last updated: 2018-05-10

I found this article some time ago: How to Write a Spelling Corrector, written by Peter Norvig (Director of Research at Google Inc.).

It’s an interesting read about the “spelling correction” topic. The examples are in Python but it’s clear and simple to understand, and I think that the algorithm can be easily translated to other languages.

Below follows a short description of the algorithm. The algorithm consists of two steps, preparation and word checking.

Step 1: Preparation - setting up the word database

Best is if you can use actual search words and their occurence. If you don’t have that a large set of text can be used instead. Count the occurrence (popularity) of each word.

Step 2. Word checking - finding words that are similar to the one checked

Similar means that the edit distance is low (typically 0-1 or 0-2). The edit distance is the minimum number of inserts/deletes/changes/swaps needed to transform one word to another.

Choose the most popular word from the previous step and suggest it as a correction (if other than the word itself).

See also original question in stackoverflow

  1. This page use API to get the relevant data from stackoverflow community.
  2. Content license on this page is CC BY-SA 3.0.
  3. score = up votes - down votes.