Detecting Semantic Cloaking on the Web
By supplying different versions of a web page to search engines and to browsers, a content provider attempts to cloak the real content from the view of the search engine. Semantic cloaking refers to differences in meaning between pages which have the effect of deceiving search engine ranking algorithms. In this paper, we propose an automated two-step method to detect semantic cloaking pages based on different copies of the same page downloaded by a web crawler and a web browser. The first step is a filtering step, which generates a candidate list of semantic cloaking pages. In the second step, a classifier is used to detect semantic cloaking pages from the candidates generated by the filtering step. Experiments on manually labeled data sets show that we can generate a classifier with a precision of 93% and a recall of 85%. We apply our approach to links from the dmoz Open Directory Project and estimate that more than 50,000 of these pages employ semantic cloaking.
Other items being presented by these speakers
Sponsor of The CIO Dinner