Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is suitable for nearly any application that requires full-text search, especially one that must run cross-platform.
Bug Fixes, Enhancements (LUCENE-2134, 2133, 1918, 1186, 1185, 954)
Apache Tika™ is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide to start using Tika.
Enhancements (TIKA-420, 477)
Designed for extension while providing robust support for the base HTTP protocol, the HttpClient component may be of interest to anyone building HTTP-aware client applications such as web browsers, web service clients, or systems that leverage or extend the HTTP protocol for distributed communication.
Bug Fixes, Simple HTTP server for JUnit testing
Heritrix is the Internet Archive's open-source, extensible, web-scale, archival-quality web crawler project.
Bug Fixes, Host-level Bucket-Queuing, Admin UI enhancements, Frontiers refactoring