pattern

Pattern is a web mining module for the Python programming language.

It bundles tools for data retrieval (Google + Twitter + Wikipedia API, web spider, HTML DOM parser), text analysis (rule-based shallow parser, WordNet interface, syntactical + semantical n-gram search algorithm, tf-idf + cosine similarity + LSA metrics), clustering and classification (k-means, k-NN, SVM), and data visualization (graph networks).

The module is bundled with 30+ example scripts and 350+ unit tests.


Download

download Pattern 2.5 | download (19MB)
  • Requires: Python 2.5+ on Windows | Mac | Linux
  • Licensed under BSD
  • Latest releases: 2.5 | 2.4 | 2.3 | 2.2 | 2.1 | 2.0
  • Author: Tom De Smedt (tom at organisms.be)

Reference: De Smedt, T. & Daelemans, W. (2012).
Pattern for Python. Journal of Machine Learning Research, 13: 2031–2035.

 



 


Installation

Pattern is written for Python 2.5+ (no support for Python 3 yet). The module has no external dependencies except when using LSA in the vector module, which requires NumPy (installed by default on Mac OS X).

To install it so that the module is available in all your scripts, open a terminal and do:

> cd pattern-2.5
> python setup.py install 

If you have pip, you can automatically download and install from the PyPI repository:

> pip install pattern

If none of the above works, you can make Python aware of the module in three ways:

  • Put the pattern subfolder in the same folder as your script.
  • Put the pattern subfolder in the standard location for modules so it is available to all scripts:
    c:\python25\Lib\site-packages\ (Windows),
    /Library/Python/2.5/site-packages/ (Mac OS X),
    /usr/lib/python2.5/site-packages/ (Unix).
  • Add the location of the module to sys.path in your script, before importing it:
>>> MODULE = '/users/tom/desktop/pattern'
>>> import sys
>>> if MODULE not in sys.path:
>>>     sys.path.append(MODULE)
>>> from pattern.en import parse, Sentence 
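
The sys.path option can be wrapped in a small helper so that repeated calls never add the same folder twice. This is only a sketch: add_module_path is a hypothetical name, not part of Pattern, and the path is the placeholder from the example above.

```python
import sys

def add_module_path(path):
    """Append path to sys.path unless it is already listed.

    Returns True if the path was added, False if it was already present.
    (Hypothetical helper for illustration; not part of Pattern.)
    """
    if path in sys.path:
        return False
    sys.path.append(path)
    return True

# Placeholder location taken from the example above; adjust to your own setup.
MODULE = '/users/tom/desktop/pattern'
add_module_path(MODULE)
add_module_path(MODULE)  # the second call changes nothing
```

After this runs, an import of the module would be attempted from the listed folder as well as from the standard locations.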

 


Quick overview

pattern.web

Module pattern.web is a web toolkit that bundles various APIs (Google, Gmail, Bing, Twitter, Facebook, Wikipedia, Flickr) with a robust HTML parser and a crawler. The module's purpose is to retrieve online content in an easy-to-use, uniform way.

>>> from pattern.web import Twitter, plaintext
>>> for tweet in Twitter().search('"more important than"', cached=False):
>>>     print plaintext(tweet.description)

'The mobile web is more important than mobile apps.'
'Start slowly, direction is more important than speed.'
'Imagination is more important than knowledge. - Albert Einstein'
... 
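
To give a feel for what stripping markup from a retrieved result involves, here is a rough stand-alone approximation. It is not how pattern.web implements plaintext(): strip_tags is a hypothetical helper, the HTML snippet is invented, and HTML entities are deliberately left untouched.

```python
import re

def strip_tags(html):
    """Crude tag stripper: drop anything that looks like an HTML tag
    and collapse the remaining whitespace. (Illustrative only.)"""
    text = re.sub(r'<[^>]+>', ' ', html)  # replace each tag with a space
    text = re.sub(r'\s+', ' ', text)      # collapse runs of whitespace
    return text.strip()

# Invented snippet standing in for the description of a search result.
snippet = '<p>Imagination is <b>more important than</b> knowledge.</p>'
print(strip_tags(snippet))
```

A real HTML parser copes with comments, scripts and malformed markup, which a regular expression like this does not.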

pattern.en

Module pattern.en is a natural language processing (NLP) toolkit for English. It is based on regular expressions, meaning that it is fast but on occasion also prone to incorrect results (see MBSP for a robust approach compatible with Pattern). It has functionality for word inflection (for example: verb conjugation and noun pluralization), a Python interface to the WordNet database, and a Brill-based shallow parser. A shallow parser analyzes a sentence and identifies the constituents (nouns, verbs, etc.).

>>> from pattern.en import parse, pprint
>>> s = 'The mobile web is more important than mobile apps.'
>>> s = parse(s, relations=True, lemmata=True)
>>> pprint(s)

          WORD   TAG    CHUNK    ROLE   ID     PNP    LEMMA       
                                                                  
           The   DT     NP       SBJ    1      -      the         
        mobile   JJ     NP ^     SBJ    1      -      mobile      
           web   NN     NP ^     SBJ    1      -      web         
            is   VBZ    VP       -      1      -      be          
          more   RBR    ADJP     -      -      -      more        
     important   JJ     ADJP ^   -      -      -      important   
          than   IN     PP       -      -      PNP    than        
        mobile   JJ     NP       -      -      PNP    mobile      
          apps   NNS    NP ^     -      -      PNP    app     


Note how the sentence has been annotated with various tags, discerning for example nouns (NN), adjectives (JJ), determiners (DT), verbs (VB), noun phrases (NP), sentence subject (SBJ), and a prepositional noun phrase (PNP). A parse tree represents the parsed text as nested Python objects (sentences, chunks, words):

>>> from pattern.en import parsetree
>>> s = 'The mobile web is more important than mobile apps.'
>>> t = parsetree(s)
>>> for chunk in t.sentences[0].chunks:
>>>     for word in chunk.words:
>>>         print word,
>>>     print

Word(u'The/DT') Word(u'mobile/JJ') Word(u'web/NN')
Word(u'is/VBZ')
Word(u'more/RBR') Word(u'important/JJ')
Word(u'than/IN')
Word(u'mobile/JJ') Word(u'apps/NNS')
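
The relation between the tagged table and the chunk-by-chunk output can be sketched without Pattern. The rows below are copied by hand from the table, with a boolean standing in for the ^ continuation marker; group_chunks is a hypothetical helper, not a Pattern function.

```python
# (word, part-of-speech tag, chunk tag, continues previous chunk?)
# Copied by hand from the parser output; True stands in for the "^" marker.
ROWS = [
    ('The',       'DT',  'NP',   False),
    ('mobile',    'JJ',  'NP',   True),
    ('web',       'NN',  'NP',   True),
    ('is',        'VBZ', 'VP',   False),
    ('more',      'RBR', 'ADJP', False),
    ('important', 'JJ',  'ADJP', True),
    ('than',      'IN',  'PP',   False),
    ('mobile',    'JJ',  'NP',   False),
    ('apps',      'NNS', 'NP',   True),
]

def group_chunks(rows):
    """Group tagged words into (chunk tag, [(word, tag), ...]) pairs."""
    chunks = []
    for word, tag, chunk, continues in rows:
        if continues and chunks:
            chunks[-1][1].append((word, tag))   # extend the current chunk
        else:
            chunks.append((chunk, [(word, tag)]))  # start a new chunk
    return chunks

chunks = group_chunks(ROWS)
for chunk, words in chunks:
    print(chunk + ': ' + ' '.join(w + '/' + t for w, t in words))
```

Each printed line corresponds to one chunk, mirroring the five lines of output above.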

Parsers for Spanish, German and Dutch are also available: pattern.es | pattern.de | pattern.nl.

pattern.search

Module pattern.search contains an elegant search algorithm to retrieve sequences of words (called n-grams) from a parsed sentence.

>>> from pattern.en import parsetree
>>> from pattern.search import search
>>> s = 'The mobile web is more important than mobile apps.'
>>> t = parsetree(s, relations=True, lemmata=True)
>>>
>>> for match in search('NP be (RB)+ important than NP', t):
>>>     print match.constituents()[-1], "=>", \
>>>           match.constituents()[0]

Chunk('mobile apps/NP') => Chunk('The mobile web/NP-SBJ-1')

Observe the given search pattern: "NP be (RB)+ important than NP". It means: any noun phrase followed by the verb to be (is, was, ...), followed by zero or more adverbs (e.g. much, more), followed by the words important than, followed by any noun phrase. It will match any of the following variations:

  • "the mobile web will be much more important than mobile apps"
  • "mobile apps are less important than the mobile web"
  • "a good blog is more important than a fancy facebook page", etc.
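
How such a pattern is matched can be illustrated with a deliberately simplified matcher over hand-annotated tokens. This is a sketch under stated assumptions, not the algorithm in pattern.search: the token tuples, match_at and first_match are hypothetical, adjacent NP tokens are treated as a single noun phrase, and only a single-word form of "to be" is handled (so "will be" would not match here).

```python
# Hand-annotated tokens: (word, part-of-speech tag, chunk tag, lemma).
TOKENS = [
    ('The', 'DT', 'NP', 'the'), ('mobile', 'JJ', 'NP', 'mobile'),
    ('web', 'NN', 'NP', 'web'), ('is', 'VBZ', 'VP', 'be'),
    ('more', 'RBR', 'ADJP', 'more'), ('important', 'JJ', 'ADJP', 'important'),
    ('than', 'IN', 'PP', 'than'), ('mobile', 'JJ', 'NP', 'mobile'),
    ('apps', 'NNS', 'NP', 'app'),
]

def skip_while(tokens, i, test):
    """Return the index of the first token at or after i that fails test."""
    while i < len(tokens) and test(tokens[i]):
        i += 1
    return i

def match_at(tokens, start):
    """Try to match 'NP be (RB)+ important than NP' beginning at start."""
    in_np = lambda t: t[2] == 'NP'
    is_adverb = lambda t: t[1].startswith('RB')
    i = skip_while(tokens, start, in_np)           # left noun phrase
    if i == start:
        return None
    left = tokens[start:i]
    if i == len(tokens) or tokens[i][3] != 'be':   # a form of "to be"
        return None
    i = skip_while(tokens, i + 1, is_adverb)       # zero or more adverbs
    for literal in ('important', 'than'):          # the literal words
        if i == len(tokens) or tokens[i][0].lower() != literal:
            return None
        i += 1
    j = skip_while(tokens, i, in_np)               # right noun phrase
    if j == i:
        return None
    return left, tokens[i:j]

def first_match(tokens):
    """Return (left NP, right NP) as strings for the first match, or None."""
    for start in range(len(tokens)):
        found = match_at(tokens, start)
        if found:
            return tuple(' '.join(t[0] for t in part) for part in found)
    return None

print(first_match(TOKENS))
```

Matching on the lemma rather than the word is what lets a single "be" in the pattern cover is, was, are and so on.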

pattern.vector

Module pattern.vector is a toolkit for machine learning (classification and clustering). It contains functionality to implement a vector space model of bag-of-words documents with tf-idf weighted features, cosine similarity, information gain, Latent Semantic Analysis (LSA), k-means and hierarchical clustering algorithms, and Naive Bayes, k-NN and SVM classifiers.

>>> from pattern.web    import Twitter
>>> from pattern.en     import tag
>>> from pattern.vector import kNN, count
>>> 
>>> knn = kNN()
>>> 
>>> for i in range(1, 10):
>>>     for tweet in Twitter().search('#win OR #fail', start=i, count=100):
>>>         s = tweet.description.lower()
>>>         p = 'WIN' if '#win' in s else 'FAIL'
>>>         v = tag(s)
>>>         v = [word for word, pos in v if pos == 'JJ'] # JJ = adjective
>>>         v = count(v) 
>>>         if len(v) > 0:
>>>             knn.train(v, type=p)
>>> 
>>> print knn.classify('sweet')
>>> print knn.classify('stupid')

'WIN'
'FAIL'

This example combines three Pattern modules to train a classifier on adjectives mined from Twitter. First, it mines up to 900 tweets (nine result pages of 100 each) with the hashtag #win or #fail (the classes). For example: “$20 tip off a sweet little old lady today #win”. Each tweet is part-of-speech tagged and everything but adjectives is filtered out. Each tweet is then transformed to a vector: a dictionary of adjective → count items, with type WIN or FAIL. The vectors are used to train the classifier, which learns which adjectives are commonly associated with either WIN or FAIL. It predicts “sweet” as WIN and “stupid” as FAIL (results may vary depending on what is buzzing on Twitter).
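
The idea behind the classifier can be sketched in a few lines of plain Python. This is a minimal illustration, assuming cosine similarity and a majority vote among the k nearest training vectors; it is not the implementation in pattern.vector, TinyKNN is a hypothetical class, and the training data is invented rather than mined from Twitter.

```python
from collections import Counter
from math import sqrt

def count(words):
    """Bag-of-words vector: a dictionary of word -> count."""
    return dict(Counter(words))

def cosine(v1, v2):
    """Cosine similarity between two sparse word -> count vectors."""
    dot = sum(n * v2.get(w, 0) for w, n in v1.items())
    norm1 = sqrt(sum(n * n for n in v1.values()))
    norm2 = sqrt(sum(n * n for n in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

class TinyKNN(object):
    """Toy k-nearest-neighbour classifier (illustrative only)."""

    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (vector, label)

    def train(self, vector, label):
        self.examples.append((vector, label))

    def classify(self, words):
        v = count(words)
        # Rank training vectors by similarity, most similar first.
        ranked = sorted(self.examples,
                        key=lambda example: cosine(v, example[0]),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]

# Invented adjective lists standing in for tagged tweets.
knn = TinyKNN(k=3)
for adjectives in (['sweet', 'little', 'old'], ['sweet', 'awesome'], ['great', 'sweet']):
    knn.train(count(adjectives), 'WIN')
for adjectives in (['stupid', 'slow'], ['stupid', 'broken'], ['bad', 'stupid']):
    knn.train(count(adjectives), 'FAIL')

print(knn.classify(['sweet']))
print(knn.classify(['stupid']))
```

With this toy data the three nearest neighbours of "sweet" are all WIN vectors and those of "stupid" are all FAIL vectors, which is the same effect the Twitter example relies on at a larger scale.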

pattern.graph

Module pattern.graph provides a data structure to represent relationships between nodes (e.g. words, concepts, entities). The relative importance (or centrality) of each node can then be calculated. Graphs can be exported as an interactive web page using the HTML <canvas> element.

The screenshot shows an exported graph with nodes pointing to more important nodes (data mined from Bing). Nodes with a lot of "traffic" are marked with a shadow (money, football, life); important nodes are marked in blue (experience, I, nothing, money).

Note: the "nothing" result could use some extra post-processing. For example, in "nothing is more important than life", the important word is "life", not "nothing".

Source code:

>>> from pattern.web    import Bing, plaintext
>>> from pattern.en     import parsetree
>>> from pattern.search import search
>>> from pattern.graph  import Graph, Node, Edge, export
>>>  
>>> g = Graph()
>>> for i in range(10):
>>>     for r in Bing().search('"more important than"', start=i+1, count=50):
>>>         s = r.description.lower() 
>>>         s = plaintext(s)
>>>         t = parsetree(s)
>>>         p = '{NP} (VP) more important than {NP}'
>>>         for m in search(p, t):
>>>             a = m.group(1).string # Left NP.
>>>             b = m.group(2).string # Right NP.
>>>             if a not in g:
>>>                 g.add_node(a, radius=5, stroke=(0,0,0,0.8))
>>>             if b not in g:
>>>                 g.add_node(b, radius=5, stroke=(0,0,0,0.8))
>>>             g.add_edge(g[b], g[a], stroke=(0,0,0,0.6))
>>>  
>>> g = g.split()[0] # Largest subgraph.
>>>  
>>> for n in g.sorted()[:40]: # Sorted by Node.weight.
>>>     n.fill = (0.0, 0.5, 1.0, 0.7 * n.weight)
>>>  
>>> export(g, 'test', directed=True, weighted=0.6, distance=6) 
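
The notion of edges pointing to more important nodes can be illustrated with a small PageRank-style calculation in plain Python. This is a sketch under assumptions: the edge list is invented, pagerank is a hypothetical function, and the score is not necessarily the same measure that pattern.graph uses for Node.weight.

```python
# Invented edges (less important, more important), as in "b -> a" above.
EDGES = [
    ('mobile apps', 'mobile web'),
    ('speed', 'direction'),
    ('knowledge', 'imagination'),
    ('money', 'life'),
    ('football', 'life'),
    ('work', 'life'),
]

def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted(set(node for edge in edges for node in edge))
    outgoing = dict((node, []) for node in nodes)
    for source, target in edges:
        outgoing[source].append(target)
    n = len(nodes)
    rank = dict((node, 1.0 / n) for node in nodes)
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not outgoing[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = dict((node, base) for node in nodes)
        for node in nodes:
            for target in outgoing[node]:
                new_rank[target] += damping * rank[node] / len(outgoing[node])
        rank = new_rank
    return rank

rank = pagerank(EDGES)
for node in sorted(rank, key=rank.get, reverse=True)[:3]:
    print(node)
```

A node that many other nodes point to, such as "life" in this toy data, ends up with the highest score.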

 


Examples & experiments

Belgian elections, June 13, 2010 – Twitter opinion mining

After the fall of the previous government, the New Flemish Alliance emerged as the plurality party with 27 seats. In the week before the elections we analyzed 7,600 tweets that mentioned the name of a Belgian politician.

November 2010 – March 2011, 100 days of web mining

During a 100-day period, we collected 6,400 Google News items and 70,000 tweets with the goal of finding a correlation between important news items and personal opinions on Twitter. What we got was profanity, mostly.