Friday, June 19, 2009

Anagrams Part 3 – Using a Map

So how can we speed up our anagram algorithm? Well, how about we rearrange the input word list so that it stores all anagrams for us? This would be a one-off operation at the start of the program. From then on we would simply find all anagrams with a simple look up operation.

Hmmm how can we do that?

Well all anagrams of a particular word are made up of the same letters. If we sort all the anagrams into lexicographical order we get the same ordered sequence of characters. For example the following three words:

part trap rapt

are made up of the sorted sequence aprt. Lets call this the key, and the three original words the values.

What we need is a data structure that can store the key and values like this:

aprt => part trap rapt

This can be represented nicely in C++ as follows:

string => vector<string>

We can use a map to store this:

map<string, vector<string> >

We build the map by reading each word from the word list, sorting it, then adding it to the map and appending the unsorted word to the associated vector. We will end up with an in-memory representation like this:

aprt => part trap rapt
act => act cat
aeprs => spare pares

The code can explain this much better than I can, here is the complete program:

#include <iostream>
#include <hash_map>
#include <vector>
#include <string>
#include <fstream>
#include <algorithm>

using namespace std;
using stdext::hash_map;

typedef vector<string> value_type;
typedef hash_map<string, value_type> Dict;
typedef Dict::const_iterator dict_iter;

int main()
{	
	Dict dict;
	int word_count = 0;
	
	ifstream in_file("words.txt");
	if (!in_file) {
		cerr << "Input file not found" << endl;
		exit(1);
	}
	
	string s, sorted_s;
	// read a word, sort the chars then use it as a key to the hash table. Append the original
	// unsorted word to the vector that is the value.
	while (in_file >> s) {
		sorted_s = s;
		sort(sorted_s.begin(), sorted_s.end());
		dict[sorted_s].push_back(s);
		word_count++;
	}
	
	cout << word_count << " words read" << endl;
	
	string str;
	while (str != "q") {
		cout << "\nEnter letters (q to quit): ";
		cin >> str;
		sort(str.begin(), str.end());
		dict_iter iter = dict.find(str);		// find the sorted string in the hash table
		if (iter != dict.end()) {				// if it exists...
			value_type vec = (*iter).second;	// ...grab the vector 
			cout << vec.size() << (vec.size() == 1 ? " anagram -> " : " anagrams -> ");
			// output the anagrams
			copy (vec.begin(), vec.end(), ostream_iterator<string>(cout, " "));
			cout << endl;
		}
		else 
			if (str != "q") cout << "word not found" << endl;
	}
	return 0;
}

You can see the map is built in only 4 lines of C++ (lines 29 to 32). To keep things interesting I have used a hashmap instead. This is a drop-in replacement for std::map offering potentially better performance. Note that it lives in the stdext namespace if you use Visual Studio.

If you run this program you will find it returns answers instantaneously. Even if you enter a long nonsensical string it will determine that there are no anagrams instantly. In fact, using a hashmap can offer us potentially O(1) performance i.e. constant time. This means the performance does not depend on the length of the input string.

What this teaches us is that it is sometimes beneficial to change the representation of the input data in order to speed up the program. A map (or hashmap) is an extremely efficient way of storing related data. If you think about it we have actually created a database, where each <key, value> pair is a record in the database. The value is the data we require (the anagrams) and the key is the index (the sorted string).

Sunday, June 7, 2009

Anagrams Part 2 – Binary Search

One of the problems with the previous algorithm (see Anagrams Part 1) is that it performed a linear search of the word list to find each permutation. That is, if there are 38000 words then the worst case will be 38000 comparisons. And for anagrams most cases will be worst cases. That’s fine if the word list is a random jumble, but most word lists are stored in alphabetical order. That means we can choose a much better search algorithm – the binary search.

Binary search works with sorted data. Imagine if you had a phone book and wanted to find if “Flubberblubber” is in the book. You wouldn’t start at the beginning and scan through to the end would you? No, you’d probably open the book roughly halfway through. You might see that all the names start with “M”. You’d then open the book again roughly halfway between the start and the “M” page. You might hit names that start with “H”. So again you’d open the book halfway to “H”. This time you hit “E”. Now you’d open the book halfway between “E” and “H”. See how you are homing in on the desired “F” pages after only 3 lookups?

In fact, if the phone book contained 1 million numbers you’d get to the desired name (or find it didn’t exist) within 20 lookups if you followed this method! So a worst case of 20 compared to 1,000,000 using linear search. Even better, binary search scales remarkably well. If the phone book contained 1 billion numbers, the number of lookups would only increase to about 30. Now that’s impressive. It should be obvious that the reason this works is that we can discard 50% of the remaining pages of the phone book every time we do a lookup. In fact binary search is an O(log n) algorithm, while linear search is O(n).

So does the C++ Standard Library contain this useful facility? Of course it does! There is a binary_search() function in the <algorithm> header that does exactly what we want.

Simply replace the following lines from the original program:

vector<string>::const_iterator iter = find(dict.begin(), dict.end(), str);
if (iter != dict.end())
    // we have found an anagram so print it
    cout << *iter << " ";

with:

if (binary_search(dict.begin(), dict.end(), str))
    // we have found an anagram so print it
    cout << str << " ";

Not only is it faster it’s also a bit clearer (no iterators). Running the program to find anagrams of integral results in a near instantaneous result, as opposed to 21 secs with linear search.

So is this the best we can do? Not really. Running the program on a 16 character string still takes minutes to return an answer.

We’ve solved the problem of searching but unfortunately we are still generating in the order of n! permutations of the input string. This is now the bottleneck, so in the next part we’ll look at reducing that.

Thursday, May 28, 2009

Anagrams Part 1 - Brute Force

Beginner programmers are often surprised when a supposedly simple program takes a lot longer to execute than they expect. After all if a modern PC can simulate an entire battle consisting of hundreds of soldiers in real-time 3D, surely it can display all anagrams of a phrase without pausing to think for hours? Consider the problem of finding all anagrams of a word that are the same length as the word. For example, the word "integral" has 8 characters and has the following 8 character anagrams:

alerting
altering
integral
relating
triangle

For completeness we include the original string in the list of anagrams. I used a word list of around 38,000 common English words to get the anagrams. So how can we program a computer to find these anagrams? A first guess algorithm might be:
for each permutation p of characters in string s
...if p exists in the wordlist then print anagram
Luckily the C++ Standard Library contains a handy function called next_permutation(first, last) that takes a pair of iterators as arguments. The function changes the order of the elements in the range [first, last) to the next lexicographic permutation and returns true. If there is no next_permutation, it arranges the sequence to be the first permutation and returns false. For all permutations to be generated the starting sequence must be initially sorted in ascending order.

So, using this function we can easily generate all permutations of a word and check them against a word list.

Such a program might look like this:
#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <algorithm>

using namespace std;

int main() 
{ 
    vector<string> dict;
    int word_count = 0;
    
    // Make sure words.txt is in your current directory.
    // Search the web for a suitable word list.
    ifstream in_file("words.txt");
    if (!in_file) {
        cerr << "word list not found" << endl;
        exit(1);
    }
    // read word list and store words in a vector
    string s;
    while (in_file >> s) {
        dict.push_back(s);
        word_count++;
    }
    
    cout << word_count << " words read" << endl;
    string str;
    while (true) {
        int count = 0;
        cout << "\nEnter letters (q to quit): ";
        cin >> str;
        if (str == "q") break;
        bool perms_remaining = true;
        sort(str.begin(), str.end()); // necessary so that next_permutation finds all permutations
        while (perms_remaining) {
            count++;
            vector<string>::const_iterator iter = find(dict.begin(), dict.end(), str);
            if (iter != dict.end())
                // we have found an anagram so print it
                cout << *iter << " ";
            perms_remaining = next_permutation(str.begin(), str.end());
        }
        cout << endl << count << " permutations" << endl;
    }
}


When I run this on my Core 2 Duo 2.1GHz PC it takes 21 seconds to find all 5 anagrams. There are noticeable pauses between each found word. Adding another letter to the input to make, say, integrals and the time rockets to 3 min 12 sec. So going from 8 characters to 9 characters results in a 9-fold increase in time. Yes folks this is an O(n!) algorithm, one of the worst performing possible.

There are n! permutations if the string has n unique characters, so for n = 8 there are 40320 permutations. For each of these the program scans the word list of 38000 words to find a match. Considering there are only a handful of anagrams for any given word, that means in effect the full list is being scanned every time. That makes 40320*38000 = 1,532,160,000 string comparisons. Now that's brute-force! That's where the 21 seconds goes. And that's for a relatively short word. Surely there must be a better way? Of course there is! See Part 2 to find out.

Friday, May 15, 2009

TTMath - a bignum library for C++

Came across this handy BigNum library for arbitraily large numbers in C++. All you need to do is include a single header and then declare a bignum type such as:
typedef ttmath::UInt<100> BigInt;
which creates a type that can hold unsigned integers between 0 and 2 ^ (32*100)-1. Which is absolutely huge. Actually you can go even bigger if the stack size on your OS permits. Yes, this library uses the stack for all calculations, there are no dynamic allocations. So now you can define a factorial function like so
BigInt factorial(BigInt n)
{
    if (n == 1 || n == 0)
        return 1;
    n *= factorial(n-1);
    return n;
}
and display all those fun factorials that we love so much. Did you know 35! = 10333147966386144929666651337523200000000 Fascinating eh!

Wednesday, April 29, 2009

Qt Part 3 - Configuration for Visual Studio 2008

Before the Nokia buyout Qt could only be integrated with Visual Studio if you purchased the Commercial License. Opensource users were limited to the MinGW toolset. That has now changed with the availability of the Visual Studio Add-In. This is a kind of cut-down version of the full integration that is still available in the Commercial edition, however from what I can see it does offer most of what you need, namely:
  • API Help integration with MSDN (so F1 context sensitive help works)
  • Wizards for Qt projects and skeleton apps
  • Qt Designer integration for forms and layout (although this runs as a separate app)
  • Automated build process (so it runs the Qt pre-processors then compiles and links the app)
Partner this with Whole Tomato Visual Assist X and you have a pretty nice Qt dev environment.
But before you can actually use Visual Studio you need to compile Qt from source so that you get the libs and DLLs that are compatible with VS. The best way to do this is to download the source Framework (not the prebuilt SDK). I think the only difference is that the framework does not contain Qt Creator. These steps assume you already have Visual Studio 2008 installed (any version except Express, which doesn't support add-ins), and that Qt is not already installed.
  1. Download and run qt-win-opensource-4.5.1-mingw.exe (the version number may have changed).
  2. Choose a destination folder, I chose D:\Qt\4.5.1
  3. Select the checkbox to download and install MinGW - it is needed for the build process.
  4. The actual archive unpacking can take quite a while, up to 20 mins or so.
  5. Add Qt to your system PATH - this means the bin folder, so D:\Qt\4.5.1\bin in my case.
  6. Add the system environment variable QTDIR=D:\Qt\4.5.1 (substitute your path)
  7. Reboot your computer for the changes to take effect.
  8. Run the Visual Studio 2008 Command Prompt (or open an ordinary cmd.exe and run vsvars32.bat from Microsoft Visual Studio 9.0\Common7\Tools folder). This will set the environment variables for the Microsoft compiler tools.
  9. cd to D:\Qt\4.5.1 (not the bin folder).
  10. Run the command: configure -platform win32-msvc2008
  11. Answer O for Opensource, and agree to the license.
  12. It will make a new qmake.exe, then build pro files for each library and app.
  13. Run nmake. This can take upwards of 5 hours on a P4 3GHz machine to give you some idea. Also, it will require around 10GB of temporary scratch space to build.
  14. Wait for hours for the compile to finish...it will build debug and release versions of the libraries and DLLs in the \lib folder, and also all the tools and examples.
  15. You can delete the *.obj and *.pdb files to claim back around 9GB of disk space. The easiest way to do this is to open a Command Prompt and cd to the Qt folder. Then enter del /s /q *.obj *.pdb
  16. That's it! If you're a completist you can now uninstall MinGW and also delete mingwm10.dll from the Qt \bin folder, as they are not needed. Confirm everything worked by running qtdemo.exe from \bin.
Now you need to download and install the Visual Studio Add-In. Just double-click the .exe and follow the wizard.
  1. Start Visual Studio 2008.
  2. Click Help -> Contents. This will start MSDN and perform an update of the help files to merge the Qt documentation. This can take quite a while. As in a big cup of coffee.
  3. In VS select Qt -> Qt Options. Click Add and browse to the installation folder (eg D:\Qt\4.5.1). Enter a version name, eg Qt-4.5.1. Click Ok.
All done! You now have an excellent VS 2008 dev environment for creating Qt apps. When you click File -> New -> Project you'll see a new entry for Qt4 Projects, along with some predefined templates. If you use Whole Tomato Visual Assist X (like I do, it's a necessity for any serious VS work) read this blog link on their site which describes how to configure Visual Assist to parse the Qt headers.

The screenshot above shows a Qt app being developed in VS. Notice the Qt menu and the test2.ui form file in Solution Explorer. Double-clicking the .ui file will open Qt Designer as an external app. Here you can create your GUI, then just save, head back to VS and compile. Job's a good 'un!

Monday, April 27, 2009

Qt Part 2 - Installation and MinGW

So I downloaded the Qt SDK 4.5.1 for Windows, a measly 167MB. It's packaged as an .exe which when run expands out to a folder of nearly 1GB. It set up some Start Menu items including the Qt Creator and the all-important Qt Demo which showcases the API features. This is actually pretty impressive, featuring a window which resizes the contents on the fly (including all text and images) and fancy transitions and animations.
I suspect it uses OpenGL to do this because when I ran QtDemo.exe through Dependency Walker it showed the following DLLs that were linked in:
As you can see QTOPENGL4.DLL is a linked DLL. Notice also the other dependencies - this is something to be aware of when deploying Qt apps. The above five QT*.DLLs total 15MB. There are many more that can end up being referenced depending on what your app does. You can try linking statically to reduce the DLL count but this has it's own issues (the main one being I couldn't get it to work!)
So...how do you actually create programs now that the SDK is installed? Well, by default you are expected to use the Qt Creator to create your programs and compile using the mingw C++ compiler, both of which are included in the SDK. Mingw is the Windows port of the famous Unix GCC compiler. Qt Creator looks good, I haven't had much chance to play with it yet, but nice to see it has built-in support for Perforce, Git and Subversion source control systems. It also has the integrated Qt Designer for visual forms design, code completion, syntax highlighting and lots of other goodies.
The other way to create apps is to use a standard text editor like Notepad++ to create the C++ code, and then compile manually using the command-line build tools. This is actually the best method when learning the basics of Qt because there is no hand-holding. It's pretty straightforward though, just cd to the folder containing the cpp files, and run
qmake -project
qmake
mingw32-make (default is debug build, for release add -f Makefile.Release)
This will create a debug (or release) folder containing your exe which you can run. Caveat - do all this using the Qt Command Prompt shortcut, which sets up the appropriate env variables. Of course you can add the vars to your system globally, they are referenced in qtenv.bat.
Running the exe you will notice the nice thing about C++ native code - almost instant startup compared to non-native managed languages. For small utility apps this is very important, less so for big apps where users don't mind a splash screen for a few seconds.
Now mingw is all well and good, but 99% of Windows developers use Visual Studio (yes I made that stat up but I'm sure it's not far off) and so in Part 3 I'll go over setting up Qt with VS2008 support.

Qt Part 1 - licenses and motivation

Qt is a cross-platform application and UI framework primarily aimed at C++ users (although plenty of other bindings exist). In plain English that means we can use it to create GUIs. It can do a ton of other stuff including networking, database access and OpenGL to name a few. It's been around since the early 90s and some of it's concepts pre-date the usage of "modern" C++ (unsurprising since the C++ standard was only ratified in 1998). The history of Qt is quite a read - it took a while for the product to gain momentum, but two events helped it along. First it was chosen by the European Space Agency in 1996 and then it was chosen as the toolkit upon which the fledgling KDE Linux desktop environment was built. From then it was upwards all the way to the present day where we are at v4.5.
In 2008 Trolltech (the original owner and creater of Qt) was bought by Nokia, the telecomms giant. Nokia changed the business plan for Qt. Originally the product was only available under two licenses - commercial (big bucks) or GPL (open source). GPL means any product created with Qt must also be open source. So the only way to create and sell a commercial product was to pay hundreds of dollars. This obviously limited adoption of Qt. Nokia decided that in order to spread the usage of Qt it was necessary to allow people to create commercial software either for free or at least very cheaply. In the end they chose the free option (phew!) and used the LGPL (Lesser GPL ) license. This means it now doesn't cost a penny to create commercial apps with Qt. Of course, you might want to stump up for Visual Studio Professional but that's a different story.
Anyway, next I'll see how easy or difficult it is to get Qt up and running...