WEB Content Extractor 1.4 (WebCe)

WEB Content Extractor (WEBCE) is an open source project that provides two algorithms to eliminate uninformative blocks and efficiently extract content blocks from web pages. In addition, WEBCE produces an XML file that contains the main content, the headline, and information about the article for a given web page.

Green parts represent important blocks, whereas red parts represent noisy blocks.

HTML DOM

WEBCE is a two-step algorithm. In the first step, we remove noisy blocks and then classify each block according to the features given in the previous section.

In the second step, a rule-based parser uses the output of the first step (a well-formed structure) to extract the main content.
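To illustrate the idea of rule-based extraction, here is a minimal Python sketch. It is not WEBCE's parser: the rule format (a tag name plus one attribute/value pair), the class `RuleExtractor`, and the function `extract_by_rule` are all assumptions made for this example. Given such a rule, it returns the text of the first matching element and tracks nested elements of the same tag so that the block closes at the right place.

```python
from html.parser import HTMLParser


class RuleExtractor(HTMLParser):
    """Collect the text inside the first element matching a (tag, attr, value) rule.

    Hypothetical helper for illustration; the rule format is an assumption.
    """

    def __init__(self, tag, attr, value):
        super().__init__()
        self.rule = (tag, attr, value)
        self.depth = 0      # > 0 while we are inside the matched element
        self.done = False   # True once the matched element has been closed
        self.chunks = []    # text fragments found inside the matched element

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        rule_tag, rule_attr, rule_value = self.rule
        if self.depth:
            # Count nested elements with the same tag name so that their
            # closing tags do not end the matched block too early.
            if tag == rule_tag:
                self.depth += 1
        elif tag == rule_tag and dict(attrs).get(rule_attr) == rule_value:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.rule[0]:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and data.strip():
            self.chunks.append(data.strip())


def extract_by_rule(html, tag, attr, value):
    """Return the text of the first element that satisfies the rule."""
    parser = RuleExtractor(tag, attr, value)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


if __name__ == "__main__":
    page = (
        '<div class="menu"><a href="/">Home</a></div>'
        '<div class="article"><h1>Title</h1><p>Body text.</p></div>'
    )
    print(extract_by_rule(page, "div", "class", "article"))
```

Once a rule is known for a site, applying it only needs a single pass over the page, which is why a rule-based second step can be fast.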

In the first step, we use the sub-tree raising method of decision tree learning to extract the content blocks. We apply our learning method to DIV and TD HTML tags. Therefore, in the second step we can parse web pages effectively by using these tags.
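The following Python sketch shows one way DIV and TD blocks could be collected and described by simple features before classification. It is only an illustration under stated assumptions: the features (word count and link density), the thresholds, and the names `BlockCollector`, `collect_blocks`, `link_density`, and `looks_informative` are hypothetical and are not taken from WEBCE. In particular, the hand-written threshold rule is a stand-in for the learned decision tree, not a reproduction of it.

```python
from html.parser import HTMLParser

BLOCK_TAGS = {"div", "td"}  # the block-level tags named in the text


class BlockCollector(HTMLParser):
    """Collect DIV and TD blocks with two illustrative (assumed) features."""

    def __init__(self):
        super().__init__()
        self.open_blocks = []  # stack of blocks that are currently open
        self.blocks = []       # finished blocks, in the order they were closed
        self.link_depth = 0    # > 0 while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.open_blocks.append(
                {"tag": tag, "attrs": dict(attrs), "words": 0, "link_words": 0}
            )
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag == "a":
            if self.link_depth:
                self.link_depth -= 1
        elif tag in BLOCK_TAGS and any(b["tag"] == tag for b in self.open_blocks):
            # Close blocks up to and including the nearest one with this tag,
            # which tolerates omitted closing tags inside it.
            while self.open_blocks:
                block = self.open_blocks.pop()
                self.blocks.append(block)
                if block["tag"] == tag:
                    break

    def handle_data(self, data):
        n = len(data.split())
        # Text counts toward every enclosing block, not only the innermost one.
        for block in self.open_blocks:
            block["words"] += n
            if self.link_depth:
                block["link_words"] += n


def collect_blocks(html):
    """Return a list of feature dictionaries, one per DIV or TD block."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    # Blocks that were never closed are still reported.
    return parser.blocks + parser.open_blocks[::-1]


def link_density(block):
    """Fraction of a block's words that sit inside links (0.0 if no words)."""
    return block["link_words"] / block["words"] if block["words"] else 0.0


def looks_informative(block, min_words=10, max_link_density=0.5):
    """Hand-written stand-in for a learned classifier; thresholds are assumptions."""
    return block["words"] >= min_words and link_density(block) <= max_link_density


if __name__ == "__main__":
    page = (
        '<div id="nav"><a href="/a">Home</a> <a href="/b">News</a></div>'
        '<div id="story">' + "word " * 20 + "</div>"
    )
    for b in collect_blocks(page):
        print(b["attrs"], b["words"], round(link_density(b), 2), looks_informative(b))
```

In a learned setting, feature dictionaries like these would be the training input, and the attributes of blocks labeled as informative would suggest the tag-based rules used in the second step.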

The WebCE paper has been published in Information Processing & Management.

Uzun, Erdinç; Agun, H. Volkan & Yerlikaya, Tarık (2013) “A hybrid approach for extracting informative content from web pages”, Information Processing & Management, Volume 49, Issue 4, July 2013, Pages 928–944, DOI: 10.1016/j.ipm.2013.02.005