Nutch is open-source web-search software built on top of Lucene, which provides an API for text indexing and searching. It is written entirely in Java, but its data is stored in language-independent formats.
Nutch has a highly modular architecture that lets developers write plugins for media-type parsing, data retrieval, querying, and clustering.
Nutch divides naturally into two pieces: the crawler and the searcher. The crawler fetches pages and turns them into an inverted index, which the searcher uses to answer users’ search queries.
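To make the crawler/searcher split concrete, here is a minimal sketch of an inverted index in plain Java. It is a toy illustration only, not Nutch or Lucene code: the class name InvertedIndexSketch, its methods addPage and search, and the example URLs are all hypothetical, and a real index also stores positions, scores, and on-disk segments.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Illustrative only: a toy in-memory inverted index, not Nutch or Lucene code.
class InvertedIndexSketch {
    // term -> sorted set of URLs of the pages that contain that term
    private final Map<String, Set<String>> postings = new TreeMap<>();

    // "Crawler" side: record every term of a fetched page under its URL.
    public void addPage(String url, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(url);
            }
        }
    }

    // "Searcher" side: return the pages that contain every query term.
    public List<String> search(String query) {
        Set<String> result = null;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> pages = postings.getOrDefault(term, Collections.emptySet());
            if (result == null) {
                result = new TreeSet<>(pages);   // first term: start from its pages
            } else {
                result.retainAll(pages);         // later terms: intersect
            }
        }
        return result == null ? new ArrayList<>() : new ArrayList<>(result);
    }

    public static void main(String[] args) {
        InvertedIndexSketch index = new InvertedIndexSketch();
        index.addPage("http://example.com/a", "Nutch is built on Lucene");
        index.addPage("http://example.com/b", "Lucene indexes text");
        System.out.println(index.search("lucene"));
        // prints [http://example.com/a, http://example.com/b]
        System.out.println(index.search("nutch lucene"));
        // prints [http://example.com/a]
    }
}
```

The point of the structure is that a query never scans page text: each term is looked up directly, and multi-term queries reduce to intersecting the sets of matching pages.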
Reasons to use Nutch rather than a commercial search engine such as Google, Bing, or Yahoo include:
1. Transparency – Nutch is open source, so anyone can see how the ranking algorithms work.
2. Understanding – Because the source is available, anybody can study how a large search engine works. Nutch is also attractive to researchers who want to try out new search algorithms, since it is easy to extend.
3. Extensibility – Nutch is very flexible: it can be customized and incorporated into your application.
Nutch installations typically operate at one of three scales: local filesystem, intranet, or whole web.
Introduction to Nutch