(and hope not to fall off the cliff)
* * *
I recently worked on the implementation of ‘Stacked Skews Model’, an algorithm proposed by Andrew Carlson and Charles Schafer.
The idea is to train a web page wrapper induction algorithm (let’s call that a ‘wrapper’) at extracting information using a small number of already trained wrappers for sites in the same domain. For instance, if you already have in hands four wrappers for hotel booking web sites then you can use them to bootstrap new wrappers for virtually any hotel booking web site out there.
sample web page wrapper annotations
What’s clever in Carlson and Schafer’s solution is overcoming the lack of annotated examples, given the huge search space for such a problem, by working on features distribution and distribution divergences instead on relying directly on surface evidences. In other words, when the system learns what the name of a hotel is, it learns how each feature is distributed and how similar the solution must be (e.g., hotel name length is around 20 characters, hotel name often contains the trigram ‘hotel’ or ‘resort’, etc.). It is basically equivalent to creating one classifier for each feature and, as the authors suggest, stack them using linear regression.
My implementation didn’t exactly worked as advertised, which is normal ;) Even if stacked models reduce the feature space and diminish overfitting, the problem is still enormous and one or two features tend to rule out the stack. However, I did some important progress by playing around the published ideas.
First, do connect on an ontology. Ok, I’m not a big fan of ontological features and only use them in the last resort but here, it did a good difference. When wrapping hotel web sites, connect on WordNet synset ‘hotel’ and use all synonyms and related words as features.
Also, do use DOM tree features. In their article, Carlson and Schafer limit the learning to features on textual information (the current node text and the previous node text). However, DOM tree is very useful here. For instance, desirable information tend to be deep and almost in juxtaposition. Also, an hotel name is more likely to be in its own HTML tag (bold, header, etc.) while amenities are often enumerated (lists, table, etc.).
Finally, in order to reduce overfitting further, I split the feature space in independent groups and applied a voting scheme over the ensemble.