## A creative question for Udacians

In unit X we learned how to sort keys into buckets based on the sum of the values of their characters (for example, "abd" has the same value as "acc", since "b" + 1 = "c" = "d" - 1). That's it in a nutshell; not too precise, I know.
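In code, the bucketing I mean looks roughly like this (the function names are my own, not the course's):

```python
# Bucket words by the sum of their character codes.
# (Illustrative sketch; the course code may differ.)
def char_sum(word):
    """Sum of the character code points, e.g. ord('a') == 97."""
    return sum(ord(c) for c in word)

def add_to_buckets(buckets, word):
    buckets.setdefault(char_sum(word), []).append(word)

buckets = {}
for w in ["abd", "acc", "udacity"]:
    add_to_buckets(buckets, w)

# "abd" and "acc" land in the same bucket: 97+98+100 == 97+99+99 == 295
print(buckets[295])  # ['abd', 'acc']
```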

Let's suppose I want the search engine to be able to handle queries such as "U?acit?" (which would match the keys Udacity, Ubacity, Udacita, etc.). Without buckets this wouldn't be much of a problem with what we've learned so far in Python, but as the intro video for the unit suggests, we want to "make things faster", which is harder without a hashtable.

So to conclude, my question is: do you have a suggestion for what solution could be applied here? The bucketing we've learned so far won't work, since we don't know the value of some characters, and without a hashtable this can get pretty slow with giant indexes. Is there a hashtable that will work for this situation? (Or perhaps a totally different solution?)

Note: this is just a hypothetical question. If this is way beyond what we have learned so far, please tell me.



There are some pretty smart guys/girls on this forum who will probably have better answers. What makes Google so fast is that they cache a lot of searches, meaning they don't have to query their database very often; they just return the cached result. Your question regarding wildcards may not be a huge deal: if Google doesn't have your search request in their cache, then maybe they have to actually go through and find the results manually. I believe your question is how to make this manual search as fast as possible.

Here is one man's solution to this problem:

http://drdobbs.com/architecture-and-design/210200888
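The caching idea above can be sketched with a plain dict (all names here are illustrative, not from any real search engine):

```python
# A toy query cache: return a stored result if we've seen the query before,
# otherwise fall back to the (slow) full search.
def make_cached_search(slow_search):
    cache = {}
    def search(query):
        if query not in cache:
            cache[query] = slow_search(query)  # only pay the cost once
        return cache[query]
    return search

# Pretend full scan over a small index.
index = ["udacity", "ubacity", "python", "hashtable"]
def slow_search(query):
    return [w for w in index if query in w]

search = make_cached_search(slow_search)
search("city")   # computed once: ['udacity', 'ubacity']
search("city")   # served straight from the cache
```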

jksdrum

Suppose you had an index that could return a list of all keywords starting with 'U', and another that could list all keywords with 'a' in the third position, and another that could list all keywords with 'c' in the fourth position, etc. Then the problem reduces to finding the intersection of all the returned lists. This is basically how bibliographic retrieval systems work: intersect works by the specified author, works with certain title words, works with specified words in the text, etc. This is also how Google works, with lists of pages that contain specified words. (Or specified word roots.) If you ask for a quoted phrase, Google then scans the retrieved pages for one that contains the keywords in exactly the required order. (And, I assume, they don't compute beyond the first page of answers until you request a second page.)
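The per-position indexes described above can be sketched like this (all names are my own; the course code may differ):

```python
# Build one posting set per (position, letter), then answer "u?acit?" by
# intersecting the sets for the letters we do know.
from collections import defaultdict

def build_positional_index(words):
    index = defaultdict(set)          # (position, letter) -> set of words
    for w in words:
        for i, ch in enumerate(w):
            index[(i, ch)].add(w)
    return index

def wildcard_lookup(index, pattern, words):
    known = [(i, ch) for i, ch in enumerate(pattern) if ch != "?"]
    if not known:                     # all wildcards: match on length alone
        return {w for w in words if len(w) == len(pattern)}
    result = set.intersection(*(index[key] for key in known))
    return {w for w in result if len(w) == len(pattern)}

words = ["udacity", "ubacity", "udacita", "unicorn"]
idx = build_positional_index(words)
wildcard_lookup(idx, "u?acit?", words)  # {'udacity', 'ubacity', 'udacita'}
```

Each lookup touches only the posting sets for the known letters, so it avoids scanning every keyword, at the cost of the extra index.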

For your hypothetical alphabetic search, however, it would be more common to set up a regular expression function that can be applied quickly to each keyword in the dictionary. One pass through a list of a million words doesn't take very long. I do such searches on my home computer to answer word puzzles and (if absolutely necessary) to finish Saturday Stumper crossword puzzles.
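That one-pass scan can be sketched by turning the wildcard query into a regular expression and testing every keyword (the helper name is my own):

```python
# '?' matches exactly one character; escape everything else and anchor
# the match with fullmatch so "u?acit?" doesn't match longer words.
import re

def wildcard_scan(pattern, words):
    regex = re.compile(re.escape(pattern).replace(r"\?", "."))
    return [w for w in words if regex.fullmatch(w)]

words = ["udacity", "ubacity", "udacita", "unicorn"]
wildcard_scan("u?acit?", words)  # ['udacity', 'ubacity', 'udacita']
```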

Kenneth I. L...

Crosswords were actually the thought behind my question, but with crosswords scanning through the answers is not a problem (because, as you've mentioned, the scan time is pretty short), so I intended my question for the search engine we're building (which turns it into a practical problem).

BTW, do you have a crossword-answers DB on your PC? Or did you mean that you use a designated website for that?

(24 Mar '12, 18:46)

How would you do regular expressions over the keys of a dictionary?

(24 Mar '12, 18:52)

My Panorama database program ($300 from ProVUE, and worth every cent, despite a few quirks and bugs) has a built-in wildcard query for retrieving records. I don't know the code behind it. It's not quite regular expressions, which is a pity, but I can write my own search functions when I need to.

I do have my own linguistic database. I started with the Moby dictionary from Grady Ward. (It was an enormous effort on his part, pretty much his life's work. Not much money to be made in selling it, though, and eventually he put it into the public domain.) I've done a lot of work cleaning it up, adding cities, names, technical terms, etc., and assigning semantic tags to some of the word classes. I'm currently doing some inferencing on phonetic encodings. It's not a world-class word list and it doesn't include definitions, but it suits my purposes.

(24 Mar '12, 19:07)

That is actually no problem for Python's built-in dictionary type, because it does not hash strings by the sum of their characters: it uses the built-in hash() function, which depends on the full contents of the string, so the computed hash is different from the bucketing scheme we learned, e.g.:

hash("abd") == hash("acc")  # False: the hashes differ (exact values vary per run)


And actually Python's dict doesn't use buckets at all. Buckets (separate chaining) are one collision resolution strategy among several; Python uses a different one, called open addressing, which still gives O(1) average lookup time but, when a collision occurs, probes several slots further along the table.
Here's an exhaustive explanation of the built-in dictionary: Python dictionary implementation
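A quick check that the character sums collide while the built-in string hash does not (the inequality of the two hashes is near-certain rather than guaranteed, since string hashing is seeded per run):

```python
# "abd" and "acc" have equal character sums, so a character-sum scheme
# would bucket them together -- but dict keys are hashed by full contents,
# so the two keys never clash.
a, b = "abd", "acc"
assert sum(ord(c) for c in a) == sum(ord(c) for c in b)

hash(a), hash(b)     # almost certainly different values
d = {a: 1, b: 2}
d["abd"], d["acc"]   # (1, 2): two distinct keys, no confusion
```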

implicit_kno...
