Most votes on algorithm questions 4
Read all the top votes questions and answers in a single page.
#31: What is the most efficient/elegant way to parse a flat table into a tree? (Score: 539)
Created: 20081010 Last updated: 20170523
Tags: sql, algorithm, recursion, tree, hierarchicaldata
Assume you have a flat table that stores an ordered tree hierarchy:
Id Name ParentId Order
1 'Node 1' 0 10
2 'Node 1.1' 1 10
3 'Node 2' 0 20
4 'Node 1.1.1' 2 10
5 'Node 2.1' 3 10
6 'Node 1.2' 1 20
Here’s a diagram, where we have [id] Name
. Root node 0 is fictional.
[0] ROOT / \ [1] Node 1 [3] Node 2 / \ \ [2] Node 1.1 [6] Node 1.2 [5] Node 2.1 / [4] Node 1.1.1
What minimalistic approach would you use to output that to HTML (or text, for that matter) as a correctly ordered, correctly indented tree?
Assume further you only have basic data structures (arrays and hashmaps), no fancy objects with parent/children references, no ORM, no framework, just your two hands. The table is represented as a result set, which can be accessed randomly.
Pseudo code or plain English is okay, this is purely a conceptional question.
Bonus question: Is there a fundamentally better way to store a tree structure like this in a RDBMS?
EDITS AND ADDITIONS
To answer one commenter’s (Mark Bessey’s) question: A root node is not necessary, because it is never going to be displayed anyway. ParentId = 0 is the convention to express “these are top level”. The Order column defines how nodes with the same parent are going to be sorted.
The “result set” I spoke of can be pictured as an array of hashmaps (to stay in that terminology). For my example was meant to be already there. Some answers go the extra mile and construct it first, but thats okay.
The tree can be arbitrarily deep. Each node can have N children. I did not exactly have a “millions of entries” tree in mind, though.
Don’t mistake my choice of node naming (‘Node 1.1.1’) for something to rely on. The nodes could equally well be called ‘Frank’ or ‘Bob’, no naming structure is implied, this was merely to make it readable.
I have posted my own solution so you guys can pull it to pieces.
#31 Best answer 1 of What is the most efficient/elegant way to parse a flat table into a tree? (Score: 471)
Created: 20081010 Last updated: 20200101
Now that MySQL 8.0 supports recursive queries, we can say that all popular SQL databases support recursive queries in standard syntax.
WITH RECURSIVE MyTree AS (
SELECT * FROM MyTable WHERE ParentId IS NULL
UNION ALL
SELECT m.* FROM MyTABLE AS m JOIN MyTree AS t ON m.ParentId = t.Id
)
SELECT * FROM MyTree;
I tested recursive queries in MySQL 8.0 in my presentation Recursive Query Throwdown in 2017.
Below is my original answer from 2008:
There are several ways to store treestructured data in a relational database. What you show in your example uses two methods:
 Adjacency List (the “parent” column) and
 Path Enumeration (the dottednumbers in your name column).
Another solution is called Nested Sets, and it can be stored in the same table too. Read “Trees and Hierarchies in SQL for Smarties” by Joe Celko for a lot more information on these designs.
I usually prefer a design called Closure Table (aka “Adjacency Relation”) for storing treestructured data. It requires another table, but then querying trees is pretty easy.
I cover Closure Table in my presentation Models for Hierarchical Data with SQL and PHP and in my book SQL Antipatterns: Avoiding the Pitfalls of Database Programming.
CREATE TABLE ClosureTable (
ancestor_id INT NOT NULL REFERENCES FlatTable(id),
descendant_id INT NOT NULL REFERENCES FlatTable(id),
PRIMARY KEY (ancestor_id, descendant_id)
);
Store all paths in the Closure Table, where there is a direct ancestry from one node to another. Include a row for each node to reference itself. For example, using the data set you showed in your question:
INSERT INTO ClosureTable (ancestor_id, descendant_id) VALUES
(1,1), (1,2), (1,4), (1,6),
(2,2), (2,4),
(3,3), (3,5),
(4,4),
(5,5),
(6,6);
Now you can get a tree starting at node 1 like this:
SELECT f.*
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1;
The output (in MySQL client) looks like the following:
++
 id 
++
 1 
 2 
 4 
 6 
++
In other words, nodes 3 and 5 are excluded, because they’re part of a separate hierarchy, not descending from node 1.
Re: comment from esatis about immediate children (or immediate parent). You can add a “path_length
” column to the ClosureTable
to make it easier to query specifically for an immediate child or parent (or any other distance).
INSERT INTO ClosureTable (ancestor_id, descendant_id, path_length) VALUES
(1,1,0), (1,2,1), (1,4,2), (1,6,1),
(2,2,0), (2,4,1),
(3,3,0), (3,5,1),
(4,4,0),
(5,5,0),
(6,6,0);
Then you can add a term in your search for querying the immediate children of a given node. These are descendants whose path_length
is 1.
SELECT f.*
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
AND path_length = 1;
++
 id 
++
 2 
 6 
++
Re comment from @ashraf: “How about sorting the whole tree [by name]?”
Here’s an example query to return all nodes that are descendants of node 1, join them to the FlatTable that contains other node attributes such as name
, and sort by the name.
SELECT f.name
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
WHERE a.ancestor_id = 1
ORDER BY f.name;
Re comment from @Nate:
SELECT f.name, GROUP_CONCAT(b.ancestor_id order by b.path_length desc) AS breadcrumbs
FROM FlatTable f
JOIN ClosureTable a ON (f.id = a.descendant_id)
JOIN ClosureTable b ON (b.descendant_id = a.descendant_id)
WHERE a.ancestor_id = 1
GROUP BY a.descendant_id
ORDER BY f.name
+++
 name  breadcrumbs 
+++
 Node 1  1 
 Node 1.1  1,2 
 Node 1.1.1  1,2,4 
 Node 1.2  1,6 
+++
A user suggested an edit today. SO moderators approved the edit, but I am reversing it.
The edit suggested that the ORDER BY in the last query above should be ORDER BY b.path_length, f.name
, presumably to make sure the ordering matches the hierarchy. But this doesn’t work, because it would order “Node 1.1.1” after “Node 1.2”.
If you want the ordering to match the hierarchy in a sensible way, that is possible, but not simply by ordering by the path length. For example, see my answer to MySQL Closure Table hierarchical database  How to pull information out in the correct order.
#31 Best answer 2 of What is the most efficient/elegant way to parse a flat table into a tree?(Score: 58)
Created: 20081011 Last updated: 20111115
If you use nested sets (sometimes referred to as Modified Preorder Tree Traversal) you can extract the entire tree structure or any subtree within it in tree order with a single query, at the cost of inserts being more expensive, as you need to manage columns which describe an inorder path through thee tree structure.
For djangomptt, I used a structure like this:
id parent_id tree_id level lft rght       1 null 1 0 1 14 2 1 1 1 2 7 3 2 1 2 3 4 4 2 1 2 5 6 5 1 1 1 8 13 6 5 1 2 9 10 7 5 1 2 11 12
Which describes a tree which looks like this (with id
representing each item):
1 + 2  + 3  + 4  + 5 + 6 + 7
Or, as a nested set diagram which makes it more obvious how the lft
and rght
values work:
__________________________________________________________________________  Root 1   ________________________________ ________________________________    Child 1.1   Child 1.2     ___________ ___________   ___________ ___________      C 1.1.1   C 1.1.2     C 1.2.1   C 1.2.2    1 2 3___________4 5___________6 7 8 9___________10 11__________12 13 14  ________________________________ ________________________________  __________________________________________________________________________
As you can see, to get the entire subtree for a given node, in tree order, you simply have to select all rows which have lft
and rght
values between its lft
and rght
values. It’s also simple to retrieve the tree of ancestors for a given node.
The level
column is a bit of denormalisation for convenience more than anything and the tree_id
column allows you to restart the lft
and rght
numbering for each toplevel node, which reduces the number of columns affected by inserts, moves and deletions, as the lft
and rght
columns have to be adjusted accordingly when these operations take place in order to create or close gaps. I made some development notes at the time when I was trying to wrap my head around the queries required for each operation.
In terms of actually working with this data to display a tree, I created a tree_item_iterator
utility function which, for each node, should give you sufficient information to generate whatever kind of display you want.
More info about MPTT:
See also original question in stackoverflow
#32: How to replace all occurrences of a character in string? (Score: 524)
Created: 20100524 Last updated: 20160722
Tags: c++, algorithm, strreplace, stdstring
What is the effective way to replace all occurrences of a character with another character in std::string
?
#32 Best answer 1 of How to replace all occurrences of a character in string? (Score: 815)
Created: 20100524 Last updated: 20140228
std::string
doesn’t contain such function but you could use standalone replace
function from algorithm
header.
#include <algorithm>
#include <string>
void some_func() {
std::string s = "example string";
std::replace( s.begin(), s.end(), 'x', 'y'); // replace all 'x' to 'y'
}
#32 Best answer 2 of How to replace all occurrences of a character in string?(Score: 140)
Created: 20140619 Last updated: 20200620
The question is centered on character
replacement, but, as I found this page very useful (especially Konrad’s remark), I’d like to share this more generalized implementation, which allows to deal with substrings
as well:
std::string ReplaceAll(std::string str, const std::string& from, const std::string& to) {
size_t start_pos = 0;
while((start_pos = str.find(from, start_pos)) != std::string::npos) {
str.replace(start_pos, from.length(), to);
start_pos += to.length(); // Handles case where 'to' is a substring of 'from'
}
return str;
}
Usage:
std::cout << ReplaceAll(string("Number Of Beans"), std::string(" "), std::string("_")) << std::endl;
std::cout << ReplaceAll(string("ghghjghugtghty"), std::string("gh"), std::string("X")) << std::endl;
std::cout << ReplaceAll(string("ghghjghugtghty"), std::string("gh"), std::string("h")) << std::endl;
Outputs:
Number_Of_Beans
XXjXugtXty
hhjhugthty
EDIT:
The above can be implemented in a more suitable way, in case performances are of your concern, by returning nothing (void
) and performing the changes directly on the string str
given as argument, passed by address instead of by value. This would avoid useless and costly copy of the original string, while returning the result. Your call, then…
Code :
static inline void ReplaceAll2(std::string &str, const std::string& from, const std::string& to)
{
// Same inner code...
// No return statement
}
Hope this will be helpful for some others…
See also original question in stackoverflow
#33: Why does Java's hashCode() in String use 31 as a multiplier? (Score: 513)
Created: 20081118 Last updated: 20190103
Tags: java, string, algorithm, hash
Per the Java documentation, the hash code for a String
object is computed as:
s[0]*31^(n1) + s[1]*31^(n2) + ... + s[n1]
using
int
arithmetic, wheres[i]
is the ith character of the string,n
is the length of the string, and^
indicates exponentiation.
Why is 31 used as a multiplier?
I understand that the multiplier should be a relatively large prime number. So why not 29, or 37, or even 97?
#33 Best answer 1 of Why does Java's hashCode() in String use 31 as a multiplier? (Score: 430)
Created: 20081118
According to Joshua Bloch’s Effective Java (a book that can’t be recommended enough, and which I bought thanks to continual mentions on stackoverflow):
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance:
31 * i == (i << 5)  i
. Modern VMs do this sort of optimization automatically.
(from Chapter 3, Item 9: Always override hashcode when you override equals, page 48)
#33 Best answer 2 of Why does Java's hashCode() in String use 31 as a multiplier?(Score: 83)
Created: 20081118 Last updated: 20200912
Goodrich and Tamassia computed from over 50,000 English words (formed as the union of the word lists provided in two variants of Unix) that using the constants 31, 33, 37, 39, and 41 will produce fewer than 7 collisions in each case. This may be the reason that so many Java implementations choose such constants.
See section 9.2 Hash Tables (page 522) of Data Structures and Algorithms in Java.
See also original question in stackoverflow
#34: Best way to reverse a string (Score: 482)
Created: 20081023 Last updated: 20130303
Tags: c#, .net, performance, algorithm, unicode
I’ve just had to write a string reverse function in C# 2.0 (i.e. LINQ not available) and came up with this:
public string Reverse(string text)
{
char[] cArray = text.ToCharArray();
string reverse = String.Empty;
for (int i = cArray.Length  1; i > 1; i)
{
reverse += cArray[i];
}
return reverse;
}
Personally I’m not crazy about the function and am convinced that there’s a better way to do it. Is there?
#34 Best answer 1 of Best way to reverse a string (Score: 658)
Created: 20081023 Last updated: 20201115
public static string Reverse( string s )
{
char[] charArray = s.ToCharArray();
Array.Reverse( charArray );
return new string( charArray );
}
#34 Best answer 2 of Best way to reverse a string(Score: 199)
Created: 20130227 Last updated: 20130724
Here a solution that properly reverses the string "Les Mise\u0301rables"
as "selbare\u0301siM seL"
. This should render just like selbarésiM seL
, not selbaŕesiM seL
(note the position of the accent), as would the result of most implementations based on code units (Array.Reverse
, etc) or even code points (reversing with special care for surrogate pairs).
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
public static class Test
{
private static IEnumerable<string> GraphemeClusters(this string s) {
var enumerator = StringInfo.GetTextElementEnumerator(s);
while(enumerator.MoveNext()) {
yield return (string)enumerator.Current;
}
}
private static string ReverseGraphemeClusters(this string s) {
return string.Join("", s.GraphemeClusters().Reverse().ToArray());
}
public static void Main()
{
var s = "Les Mise\u0301rables";
var r = s.ReverseGraphemeClusters();
Console.WriteLine(r);
}
}
(And live running example here: https://ideone.com/DqAeMJ)
It simply uses the .NET API for grapheme cluster iteration, which has been there since ever, but a bit “hidden” from view, it seems.
See also original question in stackoverflow
#35: What's the Hi/Lo algorithm? (Score: 481)
Created: 20081111 Last updated: 20201224
Tags: database, algorithm, primarykey, identifier, hilo
What’s the Hi/Lo algorithm?
I’ve found this in the NHibernate documentation (it’s one method to generate unique keys, section 5.1.4.2), but I haven’t found a good explanation of how it works.
I know that Nhibernate handles it, and I don’t need to know the inside, but I’m just curious.
#35 Best answer 1 of What's the Hi/Lo algorithm? (Score: 565)
Created: 20081111
The basic idea is that you have two numbers to make up a primary key a “high” number and a “low” number. A client can basically increment the “high” sequence, knowing that it can then safely generate keys from the entire range of the previous “high” value with the variety of “low” values.
For instance, supposing you have a “high” sequence with a current value of 35, and the “low” number is in the range 01023. Then the client can increment the sequence to 36 (for other clients to be able to generate keys while it’s using 35) and know that keys 35/0, 35/1, 35/2, 35/3… 35/1023 are all available.
It can be very useful (particularly with ORMs) to be able to set the primary keys on the client side, instead of inserting values without primary keys and then fetching them back onto the client. Aside from anything else, it means you can easily make parent/child relationships and have the keys all in place before you do any inserts, which makes batching them simpler.
#35 Best answer 2 of What's the Hi/Lo algorithm?(Score: 163)
Created: 20081111
In addition to Jon’s answer:
It is used to be able to work disconnected. A client can then ask the server for a hi number and create objects increasing the lo number itself. It does not need to contact the server until the lo range is used up.
See also original question in stackoverflow
#36: Algorithm to detect overlapping periods (Score: 478)
Created: 20121122 Last updated: 20180522
Tags: c#, .net, algorithm, datetime, time
I’ve to detect if two time periods are overlapping.
Every period has a start date and an end date.
I need to detect if my first time period (A) is overlapping with another one(B/C).
In my case, if the start of B is equal to the end of A, they are not overlapping(the inverse too)
I found the following cases:
So actually I’m doing this like this:
tStartA < tStartB && tStartB < tEndA //For case 1
OR
tStartA < tEndB && tEndB <= tEndA //For case 2
OR
tStartB < tStartA && tEndB > tEndA //For case 3
(The case 4 is taken in the account either in case 1 or in case 2)
It works, but it seems not very efficient.
So, first is there an existing class in c# that can modelize this(a time period), something like a timespan, but with a fixed start date.
Secondly: Is there already a c# code(like in the DateTime
class) which can handle this?
Third: if no, what would be your approach to make this comparison the most fast?
#36 Best answer 1 of Algorithm to detect overlapping periods (Score: 831)
Created: 20121122
Simple check to see if two time periods overlap:
bool overlap = a.start < b.end && b.start < a.end;
or in your code:
bool overlap = tStartA < tEndB && tStartB < tEndA;
(Use <=
instead of <
if you change your mind about wanting to say that two periods that just touch each other overlap.)
#36 Best answer 2 of Algorithm to detect overlapping periods(Score: 48)
Created: 20121122 Last updated: 20151129
There is a wonderful library with good reviews on CodeProject: http://www.codeproject.com/Articles/168662/TimePeriodLibraryforNET
That library does a lot of work concerning overlap, intersecting them, etc. It’s too big to copy/paste all of it, but I’ll see which specific parts which can be useful to you.
See also original question in stackoverflow
#37: Why do we check up to the square root of a prime number to determine if it is prime? (Score: 460)
Created: 20110427 Last updated: 20161026
Tags: algorithm, primes, primalitytest
To test whether a number is prime or not, why do we have to test whether it is divisible only up to the square root of that number?
#37 Best answer 1 of Why do we check up to the square root of a prime number to determine if it is prime? (Score: 755)
Created: 20110427 Last updated: 20200526
If a number n
is not a prime, it can be factored into two factors a
and b
:
n = a * b
Now a
and b
can’t be both greater than the square root of n
, since then the product a * b
would be greater than sqrt(n) * sqrt(n) = n
. So in any factorization of n
, at least one of the factors must be smaller than the square root of n
, and if we can’t find any factors less than or equal to the square root, n
must be a prime.
#37 Best answer 2 of Why do we check up to the square root of a prime number to determine if it is prime?(Score: 360)
Created: 20110428 Last updated: 20190418
Let’s say m = sqrt(n)
then m × m = n
. Now if n
is not a prime then n
can be written as n = a × b
, so m × m = a × b
. Notice that m
is a real number whereas n
, a
and b
are natural numbers.
Now there can be 3 cases:
 a > m ⇒ b < m
 a = m ⇒ b = m
 a < m ⇒ b > m
In all 3 cases, min(a, b) ≤ m
. Hence if we search till m
, we are bound to find at least one factor of n
, which is enough to show that n
is not prime.
See also original question in stackoverflow
#38: What is an NPcomplete in computer science? (Score: 456)
Created: 20081017 Last updated: 20170615
Tags: algorithm, languageagnostic, mathematicaloptimization, theory, npcomplete
What is an NPcomplete problem? Why is it such an important topic in computer science?
#38 Best answer 1 of What is an NPcomplete in computer science? (Score: 447)
Created: 20081124 Last updated: 20200620
What is NP?
NP is the set of all decision problems (questions with a yesorno answer) for which the ‘yes’answers can be verified in polynomial time (O(n^{k}) where n is the problem size, and k is a constant) by a deterministic Turing machine. Polynomial time is sometimes used as the definition of fast or quickly.
What is P?
P is the set of all decision problems which can be solved in polynomial time by a deterministic Turing machine. Since they can be solved in polynomial time, they can also be verified in polynomial time. Therefore P is a subset of NP.
What is NPComplete?
A problem x that is in NP is also in NPComplete if and only if every other problem in NP can be quickly (ie. in polynomial time) transformed into x.
In other words:
 x is in NP, and
 Every problem in NP is reducible to x
So, what makes NPComplete so interesting is that if any one of the NPComplete problems was to be solved quickly, then all NP problems can be solved quickly.
See also the post What’s “P=NP?”, and why is it such a famous question?
What is NPHard?
NPHard are problems that are at least as hard as the hardest problems in NP. Note that NPComplete problems are also NPhard. However not all NPhard problems are NP (or even a decision problem), despite having NP
as a prefix. That is the NP in NPhard does not mean nondeterministic polynomial time. Yes, this is confusing, but its usage is entrenched and unlikely to change.
#38 Best answer 2 of What is an NPcomplete in computer science?(Score: 218)
Created: 20081017 Last updated: 20191212
NP stands for Nondeterministic Polynomial time.
This means that the problem can be solved in Polynomial time using a Nondeterministic Turing machine (like a regular Turing machine but also including a nondeterministic “choice” function). Basically, a solution has to be testable in poly time. If that’s the case, and a known NP problem can be solved using the given problem with modified input (an NP problem can be reduced to the given problem) then the problem is NP complete.
The main thing to take away from an NPcomplete problem is that it cannot be solved in polynomial time in any known way. NPHard/NPComplete is a way of showing that certain classes of problems are not solvable in realistic time.
Edit: As others have noted, there are often approximate solutions for NPComplete problems. In this case, the approximate solution usually gives an approximation bound using special notation which tells us how close the approximation is.
See also original question in stackoverflow
#39: How to detect a loop in a linked list? (Score: 452)
Created: 20100418 Last updated: 20130505
Tags: java, algorithm, datastructures, linkedlist
Say you have a linked list structure in Java. It’s made up of Nodes:
class Node {
Node next;
// some user data
}
and each Node points to the next node, except for the last Node, which has null for next. Say there is a possibility that the list can contain a loop  i.e. the final Node, instead of having a null, has a reference to one of the nodes in the list which came before it.
What’s the best way of writing
boolean hasLoop(Node first)
which would return true
if the given Node is the first of a list with a loop, and false
otherwise? How could you write so that it takes a constant amount of space and a reasonable amount of time?
Here’s a picture of what a list with a loop looks like:
#39 Best answer 1 of How to detect a loop in a linked list? (Score: 565)
Created: 20100418 Last updated: 20200227
You can make use of Floyd’s cyclefinding algorithm, also known as tortoise and hare algorithm.
The idea is to have two references to the list and move them at different speeds. Move one forward by 1
node and the other by 2
nodes.
 If the linked list has a loop they will definitely meet.
 Else either of
the two references(or their
next
) will becomenull
.
Java function implementing the algorithm:
boolean hasLoop(Node first) {
if(first == null) // list does not exist..so no loop either
return false;
Node slow, fast; // create two references.
slow = fast = first; // make both refer to the start of the list
while(true) {
slow = slow.next; // 1 hop
if(fast.next != null)
fast = fast.next.next; // 2 hops
else
return false; // next node null => no loop
if(slow == null  fast == null) // if either hits null..no loop
return false;
if(slow == fast) // if the two ever meet...we must have a loop
return true;
}
}
#39 Best answer 2 of How to detect a loop in a linked list?(Score: 135)
Created: 20100721
Here’s a refinement of the Fast/Slow solution, which correctly handles odd length lists and improves clarity.
boolean hasLoop(Node first) {
Node slow = first;
Node fast = first;
while(fast != null && fast.next != null) {
slow = slow.next; // 1 hop
fast = fast.next.next; // 2 hops
if(slow == fast) // fast caught up to slow, so there is a loop
return true;
}
return false; // fast reached null, so the list terminates
}
See also original question in stackoverflow
#40: How does the Google "Did you mean?" Algorithm work? (Score: 450)
Created: 20081120 Last updated: 20101206
Tags: algorithm, machinelearning, nlp, spellchecking, textsearch
I’ve been developing an internal website for a portfolio management tool. There is a lot of text data, company names etc. I’ve been really impressed with some search engines ability to very quickly respond to queries with “Did you mean: xxxx”.
I need to be able to intelligently take a user query and respond with not only raw search results but also with a “Did you mean?” response when there is a highly likely alternative answer etc
[I’m developing in ASP.NET (VB  don’t hold it against me! )]
UPDATE: OK, how can I mimic this without the millions of ‘unpaid users’?
 Generate typos for each ‘known’ or ‘correct’ term and perform lookups?
 Some other more elegant method?
#40 Best answer 1 of How does the Google "Did you mean?" Algorithm work? (Score: 379)
Created: 20081120 Last updated: 20100806
Here’s the explanation directly from the source ( almost )
Search 101!
at min 22:03
Worth watching!
Basically and according to Douglas Merrill former CTO of Google it is like this:

You write a ( misspelled ) word in google

You don’t find what you wanted ( don’t click on any results )

You realize you misspelled the word so you rewrite the word in the search box.

You find what you want ( you click in the first links )
This pattern multiplied millions of times, shows what are the most common misspells and what are the most “common” corrections.
This way Google can almost instantaneously, offer spell correction in every language.
Also this means if overnight everyone start to spell night as “nigth” google would suggest that word instead.
EDIT
@ThomasRutter: Douglas describe it as “statistical machine learning”.
They know who correct the query, because they know which query comes from which user ( using cookies )
If the users perform a query, and only 10% of the users click on a result and 90% goes back and type another query ( with the corrected word ) and this time that 90% clicks on a result, then they know they have found a correction.
They can also know if those are “related” queries of two different, because they have information of all the links they show.
Furthermore, they are now including the context into the spell check, so they can even suggest different word depending on the context.
See this demo of google wave ( @ 44m 06s ) that shows how the context is taken into account to automatically correct the spelling.
Here it is explained how that natural language processing works.
And finally here is an awesome demo of what can be done adding automatic machine translation ( @ 1h 12m 47s ) to the mix.
_{ I've added anchors of minute and seconds to the videos to skip directly to the content, if they don't work, try reloading the page or scrolling by hand to the mark. }#40 Best answer 2 of How does the Google "Did you mean?" Algorithm work?(Score: 109)
Created: 20081120 Last updated: 20180510
I found this article some time ago: How to Write a Spelling Corrector, written by Peter Norvig (Director of Research at Google Inc.).
It’s an interesting read about the “spelling correction” topic. The examples are in Python but it’s clear and simple to understand, and I think that the algorithm can be easily translated to other languages.
Below follows a short description of the algorithm. The algorithm consists of two steps, preparation and word checking.
Step 1: Preparation  setting up the word database
Best is if you can use actual search words and their occurence. If you don’t have that a large set of text can be used instead. Count the occurrence (popularity) of each word.
Step 2. Word checking  finding words that are similar to the one checked
Similar means that the edit distance is low (typically 01 or 02). The edit distance is the minimum number of inserts/deletes/changes/swaps needed to transform one word to another.
Choose the most popular word from the previous step and suggest it as a correction (if other than the word itself).
See also original question in stackoverflow
Notes:
 This page use API to get the relevant data from stackoverflow community.
 Content license on this page is CC BYSA 3.0.
score
=up votes
down votes
.
Feedback
Was this page helpful?
Glad to hear it!
Sorry to hear that.