Software developer blog

How to find maximal cliques

Facebook used to have this awesome list of programming puzzles that were freely available. One of them was peak-traffic, which can be rephrased as the maximal cliques problem pretty easily. Unfortunately a graph can have exponential number of maximal cliques, so it's not surprising that we can not find a polynomial algorithm for listing all possible maximal cliques in a graph. However there are output sensitive algorithms: as long as the number of maximal cliques is relatively low each can be reported in a short time.

A complete sub-graph of a graph is called a clique, much like we call a group of friends a clique if they all know each other.

I knew that I was looking for such an algorithm, but it was kind of hard to find a detailed enough description on the Internet.  At the time of writing Wikipedia does provide a short description of the algorithm devised by Tsukiyama et al. but there are a number of details omitted, making it kind of hard to come up with the right implementation.

My aim is to provide a simple to understand description of the algorithm (without proof that it works correctly) and share my implementation in C++. (Type `make smoke` to compile and test.) The source I've been using - apart from Wikipedia - was a 2004 paper written by Kazuhisa Makino and Takeaki Uno: New Algorithms for Enumerating All Maximal Cliques.

The algorithm

Skiena in his book - The Algorithm Design Manual - jokingly notes that a computer scientist is a mathematician, who proves everything by induction, since most algorithms are either incremental or recursive. This algorithm is no exception: we will add graph nodes one by one. To do that we have to describe how the addition of a node can change the set of maximal cliques in a graph.

First lets examine what happens when we delete a node. It's obvious, that all cliques will still be cliques. On the other hand they might not be maximal, in which case they can be extended to become maximal. The only problem is that if the clique is not maximal after deletion, than there might be more than one maximal clique that contains it.

If we reverse the above reasoning, than we can arrive to a pretty simple incremental algorithm. A graph with a single node has only one clique, and it is maximal. When we add a node, we should take it's neighbours, and intersect them with each clique in the old graph. If the intersection is the original clique, than by adding the new node to it, we get a maximal clique in the new graph. If the intersection is smaller, than we have a candidate clique: add the new node to the intersection. Unfortunately this result clique may not be maximal. (See figure below.) The second problem arises from the fact that the new candidate clique may be produced from several different source cliques: we need to make sure that each new clique is added once, and only once.

Left: An example where the newly added node (red) and the selected clique (encircled) generate a non maximal candidate clique. Right: An example where the newly added node (red) and two different old cliques (encircled) generate the same candidate clique twice.

Based on the above description we can come up with the following algorithm:

vector<MaximalCliqueGenerator::Clique> MaximalCliqueGenerator::generate() const
{
    vector<Clique> maximalCliques;
    Graph::const_iterator currentNodeIt = graph.begin();
 
    maximalCliques.push_back(makeClique(**currentNodeIt));
 
    while( (++currentNodeIt) != graph.end() )
    {
        for(int i = 0, nofOldCliques = maximalCliques.size(); i < nofOldCliques; ++i)
        {
            Clique candidateClique = 
                maximalCliques[i].intersectWithNodesNeighbours(**currentNodeIt);
 
            if( candidateClique.size() == maximalCliques[i].size() )
            {
                maximalCliques[i].addNode(**currentNodeIt);
            }
            else if( candidateClique.isMaximalWhenExtended(**currentNodeIt) &&
                     candidateClique.isGeneratedFromTheFirstSourceClique(maximalCliques[i], 
                                                                         **currentNodeIt) )
            {
                candidateClique.addNode(**currentNodeIt);
                maximalCliques.push_back(candidateClique);
            }
        }
    }
 
    return maximalCliques;
}

That might seem pretty straightforward, but there is a catch: with the naive implementation of isMaximalWhenExtended and isGeneratedFromTheFirstSourceClique it will still be slow. If your implementation is not entirely naive, odds are you test if the new clique is maximal by only examining the neighbours of the new node, and not by going through each node of the graph. So we are good with that.

The tricky part is how to make sure that we add each clique only once. The easy way out would be to use an STL set, but that turns out to be inefficient, since a comparison of two cliques is proportional to the size of the clique, and we need to do that log(n) times, where n is the number of cliques detected so far. It's slightly better to use an unordered_set (one of the most important additions of C++11) but it still won't be constant time: the time needed to generate the hash is linear in clique size, just as well as the final comparison to make sure we didn't just have a hash collision. This idea can still be taken further by employing a rolling hash that is calculated as nodes are added one by one: this is more efficient since most cliques get incrementally grown.

The approach described above will give a fairly usable algorithm. However the original Tsukiyama et al. algorithm takes a different approach, and one that turns out to be even faster. The idea is to check if the source clique used to generate the candidate clique, is the first among possible source cliques when sorted in lexicographic order. To do that the algorithm will just try to add each earlier graph node to the clique that wasn't in the source clique and would not change the candidate clique. If we find such a node, than the source clique is not lexicographically minimal.

bool MaximalCliqueGenerator::Clique::isGeneratedFromTheFirstSourceClique(
        const Clique& sourceClique, const Graph::Node& newNode) const
{
    const_iterator sourceCliqueMemberIt = sourceClique.begin();
    Clique testClique = *this;
    iterator insertLocation = testClique.begin();
    Graph::const_iterator nodeIt;
 
    for (nodeIt = graph->begin(); *nodeIt != &newNode; ++nodeIt)
    {
        if (nodeWasInSourceClique(sourceCliqueMemberIt, sourceClique, nodeIt))
        {
            insertLocation = 
                testClique.insertNodeIfNotAtLocation(**sourceCliqueMemberIt, insertLocation);
            ++insertLocation;
            ++sourceCliqueMemberIt;
        }
        else
        {
            if(testClique.cliqueExtendableByNode(**nodeIt))
                return false;
        }
    }
    return true;
}

Hacking the constant

The other functions are mostly just boilerplate code, and some other easy to understand bits. However there are still some tiny details that can decrease the running time at least by a constant factor. The most important among them is the usage of sorted vectors to represent the graph instead of a linked data structure. When I first read the suggestion in Scott Meyers' book - Effective STL - that one should always consider to use a sorted vector instead of a set I was rather sceptic. This exercise proves to be a good example where that suggestion is worth a 10 times speed up, so never under estimate the power of vectors.

Representing the graph nodes and their neighbours as a sorted vector raises another problem: the data files contain a list of edges represented by the names of the two ends of them. To build the graph we need a sorted data structure that allows us to add nodes dynamically (when we first see them) and also allows to add the nodes quickly. To do that the simplest would be to use STL maps and sets.

This is a classic example where the builder pattern can help us. We have two different graph representations: one for loading the data from file, and another one that is used by our algorithm. Once the data is loaded we pass the graph builder (map/set based representation) to the vector based graphs constructor, which will create the graph we need.

Another trick that could give me a 2-3 times speed up was less tricky, and I wouldn't have come up with this if it wasn't for firing up a profiler to see where I could save up a little more. Examine the function below:

bool MaximalCliqueGenerator::Clique::cliqueExtendableByNode(const Graph::Node& node) const
{
    Clique nodeNeighboursInClique = intersectWithNodesNeighbours(node);
    return nodeNeighboursInClique.size() == size();
}

That works perfectly fine, and it's a nice way of code reuse. However this way we generate the entire intersection (and also copy it) even though we are only interested if it is identical with the original set. Unfortunately I couldn't come up with any better solution than to copy the intersectWithNodesNeighbours function, and change a few things in it. A little code duplication is a price worth paying for the speed up here.

Refactoring lessons

Of course the source code currently available is far from being the first version. Once I had a working algorithm - that was still a lot slower that it is now - I started to refactor. The most important lesson was that clean code is a lot faster. Not only did the code almost spontaneously speed up by about 10-15%, but also as the concepts behind the algorithm started to take a shape in forms of classes it became easier to spot any inefficiencies.

When coding in C++ typedef can make it really easy to fall into the trap of primitive obsession. In my first version of the code I had the concept of Clique and of Graph, but all they really were, was some kind of STL data structure of nodes hidden behind a hideous typedef. The problem with that was, that it added a lot of complexity to the code. The MaximalCliqueGenerator class was huge, and messy. Once I declared a war on primitive obsession, and reorganized my code accordingly the code got a lot simpler, and pretty expressive.

There was one thing though that disturbed me: the interface of clique class is not what you would expect from such a class. In fact it is something that only makes sense in the context of this algorithm. So instead of putting it in a separate file, I just made it a nested class, along with some other similar small classes. I'm still not entirely sure that it was the right call to make - and at most companies the coding guidelines would have made that illegal to begin with - I still feel that this is a nice way to imply that the Clique class is not something that you might want to use in any other context. Also this way it doesn't pollute the global namespace.

Closing thoughts

I think that the enumeration of maximal cliques is an exciting problem, and the algorithm of Tsukiyama et al. is an interesting one. I learned a lot from implementing this, and expect to learn even more from it in the future. I encourage everyone to try and challenge my implementation either by tweaking it even further, or by reimplementing it. If you have any toughs on that, please let me know.

Finally I'd like to say thanks to Athos who took a look at my implementation before I published it, and had some very useful suggestions to make things even clearer.