Timmy Miner 0.52
Firefox 3.5 compatibility
Timmy 0.5 dev
Beginning of the project
- Get feedback on Launchpad
Timmy Miner is a free Firefox add-on that analyzes all text content loaded while browsing. With the help of its C++ text-mining engine, it detects the language, builds a corpus of web pages, and displays the most important keywords and expressions. Data is displayed in a side panel in the Firefox browser and updated each time a page is loaded. Built-in functions export data as frequency lists (CSV, TXT) and thesaurus graphs.
The main interest of the extension is that it offers both page and corpus analysis. A corpus is a group of web pages. Timmy Miner lets you discover the vocabulary used in the corpus and can, for instance, help create a folksonomy.
Thanks to its efficient core, the extension can process hundreds of thousands of words and a large number of pages. This makes it easy to extract the vocabulary used by a group of websites or blogs. Many things can be done with this data, for instance studying the areas of interest of a given community or conducting a survey on a given theme.
In addition, Timmy Miner detects the page language and can build expression graphs (thesauri) that represent co-occurrences of expressions across the pages of the corpus.
Timmy Miner 0.52
Firefox 3.0/3.5 Windows XP/Vista
Timmy Miner 0.52
Firefox 3.0/3.5 Linux Available soon.
Timmy Miner is released under the GPL 3 licence.
Get source code, bugs, answers on Launchpad.
- Text gathering from the page: The extension gets all the text from the loaded web page and keeps expressions of up to three words. Words with fewer than 3 letters are ignored.
- Language detection: The complete text is compared with language fingerprints to find the best match.
- Filtering with the language's stop words: For each supported language, Timmy Miner contains a stopword list (i.e. a stoplist) for rejecting words without specific meaning. In English, for instance, 'the', 'you' and 'since' are stopwords.
- Expression frequency count: Expressions are ranked by frequency. They can be either single keywords or expressions of more than one word.
- Append page to corpus: Results of the text analysis are added to the corpus.
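The steps above can be sketched in a few lines of Python. This is only an illustration, not the Jimmy engine's code: the function name `extract_expressions`, the three-word stoplist and the choice to let a rejected word break an expression are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical, tiny stoplist; the real lists ship with the extension.
STOPWORDS = {"the", "you", "since"}

def extract_expressions(text, stopwords=STOPWORDS, max_words=3, min_letters=3):
    """Count expressions made of 1..max_words consecutive kept words."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    counts = Counter()
    run = []  # current run of consecutive kept words
    for word in words + [None]:  # the trailing None flushes the last run
        if word is not None and len(word) >= min_letters and word not in stopwords:
            run.append(word)
            continue
        # Assumption: a short word or stopword ends the current expression.
        for n in range(1, max_words + 1):
            for i in range(len(run) - n + 1):
                counts[" ".join(run[i:i + n])] += 1
        run = []
    return counts

page_counts = extract_expressions("The text mining engine, the text mining")
ranking = page_counts.most_common()  # most frequent expressions first
```

Calling `most_common()` on the result gives a page ranking comparable to the one shown in the side panel.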
Results
Here is the data you can export from the "Export" tab:
- Export of the page frequency list.
- Export of the corpus frequency list, with the number of pages on which each expression is present.
- Export of the expression graphs (see next section).
Page results are a simple count of expression frequencies. The corpus score is based on both presence on pages and total occurrences: the more frequently an expression appears on different pages, the better its corpus score. Rankings obtained from Timmy Miner can be exported as CSV or TXT files. For expression networks, see the next section.
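As a rough illustration of the corpus results, the sketch below accumulates per-page counts and writes a CSV frequency list with the number of pages for each expression. The `Corpus` class, the column names and the ordering (pages first, then total occurrences) are assumptions made for the example; the exact scoring formula used by Timmy Miner is not described here.

```python
import csv
import io
from collections import Counter

class Corpus:
    """Hypothetical corpus: total occurrences and page presence per expression."""

    def __init__(self):
        self.total = Counter()  # total occurrences over all pages
        self.pages = Counter()  # number of pages containing the expression

    def add_page(self, page_counts):
        for expression, n in page_counts.items():
            self.total[expression] += n
            self.pages[expression] += 1

    def ranking(self):
        # Assumed ordering: presence on many pages first, then total occurrences.
        return sorted(self.total, key=lambda e: (-self.pages[e], -self.total[e], e))

    def to_csv(self):
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        writer.writerow(["expression", "occurrences", "pages"])
        for e in self.ranking():
            writer.writerow([e, self.total[e], self.pages[e]])
        return buffer.getvalue()

corpus = Corpus()
corpus.add_page({"open source": 5})
corpus.add_page({"open source": 1, "firefox": 3})
print(corpus.to_csv())
```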
- Expression occurrence threshold: To be considered "present" on a page, an expression must occur on it at least this many times.
- Page repeat threshold: The expression must be present on at least this many different pages. Hence, to appear in the network, an expression must occur on enough different pages and with enough frequency.
- Maximum expressions per page: This parameter has a great influence on the time needed to compute the network. It limits the number of expressions analyzed for each page, regardless of the previous threshold. For instance, if this parameter is set to 50, only the fifty most frequent expressions of each page will be compared with the fifty most frequent of the other pages.
- Language identification: Lets you choose whether the language is detected automatically or set manually. Note that the list of available languages is only shown after the 'Start' button has been pressed for the first time.
- Occurrence threshold: Minimum number of occurrences for an expression to be kept in the page and corpus results.
- Stop-list: Opens the stop-list directory included in the extension package. There is one file per language, plus general.txt, which contains common stop words. You can edit these files to customize the rejected vocabulary. Be careful to always save these text files in UTF-8 encoding. Firefox has to be restarted after editing them so that Timmy Miner can initialize again.
- Folksonomies: Why do we need controlled vocabulary?
- Creating Custom Firefox Extensions with the Mozilla Build System
- Mozilla Build Documentation
- XPCOM array guide
- Mozilla internal string guide
- A few good C++ coding practices for Mozilla
- XPCOM Objects
- Introduction to XPCOM for the DOM
- Creating Applications with Mozilla
- N-gram-based text categorization
- Adding XPCOM components to Mozilla build system - Makefiles
- Building Firefox with Debug Symbols
- Travaux divers Xul
- Xul Periodic Table
- C++ Portability Guide
- Using Dependent Libraries In Extension Components
Text treatment
Timmy Miner can analyze all types of text. It does not depend on page structure and can read any kind of characters, including non-Latin scripts such as Chinese or Arabic.
The text-mining engine is a purpose-built library. Named Jimmy, it is written in C++ and has been integrated as an XPCOM component.
Technically, the engine uses n-grams for both language detection and frequency counting. The computation is based on a vector model.
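To give an idea of how n-gram fingerprints can identify a language, here is a minimal sketch of the classic rank-based "out-of-place" comparison of character n-gram profiles. It is not taken from the Jimmy source: the function names, the profile size of 300 and the two toy fingerprints are assumptions, and the real engine may compute its match differently.

```python
from collections import Counter

PROFILE_SIZE = 300  # assumed number of n-grams kept per profile

def profile(text, max_n=3, size=PROFILE_SIZE):
    """Map the most frequent character n-grams (1..max_n) to their rank."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(size)]
    return {gram: rank for rank, gram in enumerate(ranked)}

def out_of_place(doc, fingerprint, penalty=PROFILE_SIZE):
    """Sum of rank differences; n-grams missing from the fingerprint cost `penalty`."""
    return sum(abs(rank - fingerprint[g]) if g in fingerprint else penalty
               for g, rank in doc.items())

def detect(text, fingerprints):
    """Return the name of the fingerprint closest to the text."""
    doc = profile(text)
    return min(fingerprints, key=lambda name: out_of_place(doc, fingerprints[name]))

# Toy fingerprints built from one sentence each; real ones need far more text.
fingerprints = {
    "en": profile("the quick brown fox jumps over the lazy dog and then "
                  "the cat sat on the mat with the other cats"),
    "fr": profile("le renard brun saute par dessus le chien paresseux et puis "
                  "le chat est sur le tapis avec les autres chats"),
}
language = detect("the dog and the cat", fingerprints)
```

With such short fingerprints the result is only reliable on text that resembles the samples, which matches the remark that longer pages lead to fewer detection errors.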
Expressions network / Thesaurus
Data gathered by Timmy Miner while analyzing web pages can be used to build a thesaurus. A thesaurus is a listing of related words; here it takes the form of a graph in which the neighbors of a term are its associated terms. These graphs represent co-occurrences of expressions in the pages of the corpus. A link between two nodes means that the two expressions are frequently used together, so links express the proximity between expressions. Search engines apply similar techniques to suggest related keywords.
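The following sketch shows one way such a co-occurrence graph could be computed from per-page expression counts, using the three network settings (expression occurrence threshold, page repeat threshold, maximum expressions per page). The function `build_network`, its default values and the choice of the number of shared pages as edge weight are assumptions made for illustration, not the extension's actual algorithm.

```python
from collections import Counter
from itertools import combinations

def build_network(pages, occurrence_threshold=2, page_threshold=2, max_per_page=50):
    """pages: list of {expression: count} dicts.

    Returns a Counter mapping (expression_a, expression_b) to the number
    of pages on which both expressions are present.
    """
    present_sets = []
    for counts in pages:
        # Keep only the most frequent expressions of the page ...
        top = sorted(counts, key=lambda e: (-counts[e], e))[:max_per_page]
        # ... and among them those repeated often enough to count as present.
        present_sets.append({e for e in top if counts[e] >= occurrence_threshold})

    # An expression becomes a node only if it is present on enough pages.
    page_count = Counter(e for present in present_sets for e in present)
    nodes = {e for e, n in page_count.items() if n >= page_threshold}

    edges = Counter()
    for present in present_sets:
        for a, b in combinations(sorted(present & nodes), 2):
            edges[(a, b)] += 1
    return edges

pages = [
    {"free software": 3, "open source": 2, "licence": 1},
    {"free software": 2, "open source": 4},
    {"open source": 5, "firefox": 2},
]
network = build_network(pages)
```

Because every pair of kept expressions is compared on each page, the cost grows roughly with the square of the per-page limit, which is why that setting weighs so heavily on computing time.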
With Timmy Miner, when browsing the pages of a community, you can draw a graph of "how this community communicates" and which expressions its debates revolve around.
The possibilities are huge, but correct settings are necessary. Without paying attention to them, one risks getting no results or launching a computation that runs for 3 days.
The larger the number of pages, the more worthwhile it is to build networks. Given the huge amount of text data, it is not possible to be exhaustive. On the other hand, well-focused studies may give excellent results.
Here is an example of a small network built with Timmy Miner and visualized with Gephi.
A dozen Wikipedia pages around democracy, state...
The Timmy Miner side panel is opened with the button that appears in the status bar. The panel does not need to stay open while the extension is working. The 'Start' button lets the extension analyze each newly loaded web page. The analysis can be stopped by pressing the 'Stop' button and restarted with the 'Start' button. Only the 'Reset' button deletes the current corpus.
While a page is being processed, a progress bar appears. It means Timmy Miner is working, but users can still browse normally. However, it is recommended to wait until the analysis has finished before stopping it. The extension has to be stopped before data can be exported.
The panel has not yet been initialized
Extension is stopped
Started and 'listening' for web page input
Analysis is running
Timmy Miner can recognize the following languages in web pages. The more text a page contains, the fewer detection errors occur. French, German, English, Spanish, Finnish, Hungarian, Italian, Dutch, Norwegian, Portuguese, Romanian, Russian, Swedish, Arabic, Bulgarian, Czech, Polish.
FAQ
What types of documents are supported?
Only HTML pages. Content in frames and iframes is ignored by default.
What are stopwords? Where can I find them?
See Wikipedia for the definition. Timmy Miner stopwords can be found on this page.
How do I open and process expression network files?
With Gephi or Guess. The examples presented here were made with Gephi.
How can a language be added to Timmy Miner?
Two things are needed: first, a language fingerprint, which is a small text typical of the language (in terms of syllables); second, a correct list of stopwords for the language.
When will the Linux and Mac OS X versions be released?
We encountered some problems and may ask Linux and Mac OS X specialists for help with building for Firefox.
What is the Timmy Miner licence?
GPL 3. See licence.