Review content ExtrActor (REA) is an open-source project that implements a novel review extraction algorithm. The algorithm works in two steps: it first discovers the review layout of a web page, then uses that layout to extract the review content efficiently.
*Green parts represent important blocks, whereas red parts represent noisy blocks.
First Step (Learning Stage)
//1. Create DOM Object
HTMLMarkerClass.DOM _dom = new HTMLMarkerClass.DOM();
//2. Load HTML Text
//3. Find the review main tag with the find_Review_Main_Tag method of the decision class
_ri = HTMLMarkerClass.desicionClass_Review.find_Review_Main_Tag(_dom._list);
Second Step (Extraction Stage)
//html_doc is a string holding the HTML text. The pattern value is obtained from the first stage.
HTMLMarkerClass.webfilter.Contents_of_givenLayout_Tags_TESTER(html_doc, pattern, false);
//webfilter class: the Contents_of_givenLayout_Tags_TESTER method efficiently extracts the content of the given layout by using the obtained pattern.
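To make the idea of the extraction stage more concrete, here is a minimal sketch of pattern-based extraction. It is not the project's implementation: the class LayoutExtractionSketch and the method ExtractByOpeningTag are hypothetical names, and the sketch assumes that the pattern is an opening tag written exactly as it appears in the page (for example `<div class="review">`). It returns the inner HTML of every element that starts with that tag, counting nested tags of the same name to find the matching close.

```csharp
using System;
using System.Collections.Generic;

public static class LayoutExtractionSketch
{
    // Hypothetical helper, not part of HTMLMarkerClass.
    // Assumption: `pattern` is a literal opening tag such as <div class="review">.
    public static List<string> ExtractByOpeningTag(string html, string pattern)
    {
        var results = new List<string>();

        // The tag name is the text between '<' and the first space or '>'.
        int nameEnd = pattern.IndexOfAny(new[] { ' ', '>' });
        if (nameEnd < 2) return results; // not a usable opening tag
        string name = pattern.Substring(1, nameEnd - 1);
        string open = "<" + name;         // naive: would also match longer tag names
        string close = "</" + name + ">";

        int pos = 0;
        while ((pos = html.IndexOf(pattern, pos, StringComparison.OrdinalIgnoreCase)) >= 0)
        {
            int contentStart = pos + pattern.Length;
            int scan = contentStart;
            int depth = 1;       // we are inside one matching element
            int contentEnd = -1;

            while (depth > 0)
            {
                int nextOpen = html.IndexOf(open, scan, StringComparison.OrdinalIgnoreCase);
                int nextClose = html.IndexOf(close, scan, StringComparison.OrdinalIgnoreCase);
                if (nextClose < 0) break; // unbalanced markup, give up on this element

                if (nextOpen >= 0 && nextOpen < nextClose)
                {
                    depth++;                       // nested element with the same name
                    scan = nextOpen + open.Length;
                }
                else
                {
                    depth--;
                    if (depth == 0) contentEnd = nextClose;
                    scan = nextClose + close.Length;
                }
            }

            if (contentEnd < 0) break;
            results.Add(html.Substring(contentStart, contentEnd - contentStart));
            pos = contentEnd + close.Length; // continue after this element
        }
        return results;
    }
}
```

Because the sketch works on the raw string, it never builds a DOM during extraction. That is the reason a learned pattern can make the second stage cheap, although the real method may work differently.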
The paper describing REA has been published in the Journal of Information Science.
- Uçar, Erdem; Uzun, Erdinç & Tüfekci, Pınar (2016) “A novel algorithm for extracting the user reviews from web pages”, Journal of Information Science, first published on September 2, 2016 as DOI: 10.1177/0165551516666446