Review content ExtrActor (REA) is an open-source project that implements a novel Review Extraction Algorithm. The algorithm works in two steps: it first discovers the layout that holds the reviews on a web page, then uses that layout to extract the review content efficiently.
*Green parts represent important blocks, whereas red parts represent noisy blocks.
First Step (Learning Stage)
    // 1. Create the DOM object
    HTMLMarkerClass.DOM _dom = new HTMLMarkerClass.DOM();
    // 2. Load the HTML text
    _dom.prepareDOM(html_doc);
    // 3. Find the review's main tag with the find_Review_Main_Tag method of the decision class
    _ri = HTMLMarkerClass.desicionClass_Review.find_Review_Main_Tag(_dom._list);
Second Step (Extraction Stage)
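    // html_doc is a string value; pattern is obtained from the first stage.
    // The result is assigned to a local (the name "content" is illustrative) so the line is a valid statement.
    var content = HTMLMarkerClass.webfilter.Contents_of_givenLayout_Tags_TESTER(html_doc, pattern, false)[0];
    // webfilter class: the Contents_of_givenLayout_Tags_TESTER method extracts the content efficiently by using the obtained pattern.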
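To give a feel for why an extraction stage driven by a known pattern can be cheap, here is a minimal sketch of pattern-based extraction with plain string scanning and no DOM construction. It is not the HTMLMarkerClass implementation. The class `LayoutExtractorSketch`, the method `ExtractByPattern`, and the assumption that the pattern is a literal opening tag such as `<div class="review">` are all hypothetical and chosen only for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, NOT part of HTMLMarkerClass.
public static class LayoutExtractorSketch
{
    // Returns the inner HTML of every element whose opening tag equals
    // `pattern` (assumed to be a literal opening tag, e.g. <div class="review">).
    public static List<string> ExtractByPattern(string html, string pattern)
    {
        var results = new List<string>();

        // Derive the tag name from the pattern: "<div class=...>" -> "div".
        int nameEnd = pattern.IndexOfAny(new[] { ' ', '>' });
        if (!pattern.StartsWith("<", StringComparison.Ordinal) || nameEnd < 2)
            return results;
        string tagName = pattern.Substring(1, nameEnd - 1);
        string open = "<" + tagName;   // simplification: also matches longer tag names with this prefix
        string close = "</" + tagName;

        int searchFrom = 0;
        while (true)
        {
            int start = html.IndexOf(pattern, searchFrom, StringComparison.OrdinalIgnoreCase);
            if (start < 0) break;

            int contentStart = start + pattern.Length;
            int pos = contentStart;
            int depth = 1;          // we are inside one matching element
            int contentEnd = -1;

            // Walk forward, counting nested open/close tags of the same name.
            while (depth > 0)
            {
                int nextOpen = html.IndexOf(open, pos, StringComparison.OrdinalIgnoreCase);
                int nextClose = html.IndexOf(close, pos, StringComparison.OrdinalIgnoreCase);
                if (nextClose < 0) break; // unbalanced markup: give up on this match

                if (nextOpen >= 0 && nextOpen < nextClose)
                {
                    depth++;
                    pos = nextOpen + open.Length;
                }
                else
                {
                    depth--;
                    if (depth == 0) contentEnd = nextClose;
                    pos = nextClose + close.Length;
                }
            }

            if (contentEnd < 0) break;
            results.Add(html.Substring(contentStart, contentEnd - contentStart));
            searchFrom = pos; // continue after the element just extracted
        }
        return results;
    }
}
```

For example, calling `LayoutExtractorSketch.ExtractByPattern(html, "<div class=\"review\">")` on a page with two such blocks returns a list with the inner HTML of each block, including any nested `div` elements. Because the scan only uses `IndexOf` and `Substring`, it avoids the cost of parsing the whole page, which is the kind of saving a learned pattern makes possible.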
The paper describing REA has been published in the Journal of Information Science.
- Uçar, Erdem; Uzun, Erdinç & Tüfekci, Pınar (2016) “A novel algorithm for extracting the user reviews from web pages”, Journal of Information Science, first published on September 2, 2016 as DOI: 10.1177/0165551516666446