Review content ExtrActor (REA) ver 1.2

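Review content ExtrActor (REA) is an open-source project that implements a novel review extraction algorithm. The algorithm works in two steps: a learning stage that discovers the layout (the main HTML tag) holding the reviews, and an extraction stage that uses the discovered layout to extract the review content efficiently.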

*Green parts represent important blocks, whereas red parts represent noisy blocks.

First Step (Learning Stage)

// 1. Create the DOM object
HTMLMarkerClass.DOM _dom = new HTMLMarkerClass.DOM();
// 2. Load the HTML text (html_doc is a string)
_dom.prepareDOM(html_doc);
// 3. Find the main review tag with the find_Review_Main_Tag method of the decision class
_ri = HTMLMarkerClass.desicionClass_Review.find_Review_Main_Tag(_dom._list);

Second Step (Extraction Stage)

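// html_doc is a string value. The pattern value is obtained from the first stage.
// The indexed result is assigned to a variable because a bare indexer expression is not a valid C# statement.
var content = HTMLMarkerClass.webfilter.Contents_of_givenLayout_Tags_TESTER(html_doc, pattern, false)[0];
// webfilter class: the Contents_of_givenLayout_Tags_TESTER method uses the obtained pattern to extract the content efficiently.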
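
To make the idea behind the extraction stage concrete, here is a minimal, self-contained sketch of pulling out the contents of a known layout tag from an HTML string. It is not the REA implementation: the class LayoutExtractionSketch and its methods are hypothetical names introduced for illustration, and the sketch assumes that the pattern is the literal opening tag (for example <div class="review">) and that the markup is reasonably well balanced. Self-closing elements are not handled.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of HTMLMarkerClass: a simplified stand-in
// for an extraction step that runs once the opening tag is known.
public static class LayoutExtractionSketch
{
    // Returns the inner HTML of every element whose opening tag equals `pattern`.
    // Nested elements with the same tag name are handled by counting tags.
    public static List<string> ContentsOfTag(string html, string pattern)
    {
        var results = new List<string>();

        // Derive the tag name ("div") from the pattern ("<div class=...>").
        int nameEnd = pattern.IndexOfAny(new[] { ' ', '>' });
        if (!pattern.StartsWith("<") || nameEnd < 2) return results;
        string name = pattern.Substring(1, nameEnd - 1);
        string open = "<" + name;
        string close = "</" + name + ">";

        int searchFrom = 0;
        while (true)
        {
            int start = html.IndexOf(pattern, searchFrom, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;

            int contentStart = start + pattern.Length;
            int pos = contentStart;
            int depth = 1;          // we are inside the matched element
            int contentEnd = -1;

            while (depth > 0)
            {
                int nextOpen = IndexOfOpenTag(html, open, pos);
                int nextClose = html.IndexOf(close, pos, StringComparison.OrdinalIgnoreCase);
                if (nextClose < 0) break; // unbalanced markup: give up on this match

                if (nextOpen >= 0 && nextOpen < nextClose)
                {
                    depth++;                       // a nested element of the same name opens
                    pos = nextOpen + open.Length;
                }
                else
                {
                    depth--;                       // an element of the same name closes
                    if (depth == 0) contentEnd = nextClose;
                    pos = nextClose + close.Length;
                }
            }

            if (contentEnd < 0) break;
            results.Add(html.Substring(contentStart, contentEnd - contentStart));
            searchFrom = pos;       // continue after the element just extracted
        }
        return results;
    }

    // Finds "<name" followed by whitespace or '>', so "<div" does not match "<divider".
    private static int IndexOfOpenTag(string html, string open, int from)
    {
        int i = from;
        while ((i = html.IndexOf(open, i, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            int after = i + open.Length;
            if (after < html.Length && (char.IsWhiteSpace(html[after]) || html[after] == '>'))
                return i;
            i = after;
        }
        return -1;
    }
}
```

Calling LayoutExtractionSketch.ContentsOfTag(html_doc, "<div class=\"review\">") returns one string per matching block. Because the tag is already known, the page only needs a linear string scan rather than a full DOM build, which is one plausible reason a learned pattern makes extraction cheap.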

The paper describing REA has been published in the Journal of Information Science.

Uçar, Erdem; Uzun, Erdinç & Tüfekci, Pınar (2016) “A novel algorithm for extracting the user reviews from web pages”, Journal of Information Science, first published on September 2, 2016 as DOI: 10.1177/0165551516666446