Finding Related Communities in the Web
Masashi Toyoda and Masaru Kitsuregawa
Institute of Industrial Science, University of Tokyo
We propose a new web search technique, which finds related communities
from a given URL. A community is a set of web pages written by
authors who have a common interest on a specific topic, such as fan
pages of a professional baseball team. Our technique finds a
community that includes a given URL, and communities on related
topics, using hyperlink analysis. For example, when a fan page of a
baseball team is given, our technique finds a fan community of the
team, and a community of official homepages of baseball teams.
Our technique allows the user to perform new type of navigation
through the web. It provides additional ways not only to related
pages, but also to related communities.
There are some previous work that find related pages from a given URL.
Kleinberg proposed the HITS algorithm , which extracts strongly
connected web pages from search results of altavista and their
neighbors. He also showed that it can be used for finding related
pages to a given URL. Dean proposed the Companion and the Cocitation
algorithm  that are based on the same concept of the HITS. We have
extended these algorithms to find not only related pages but also
Kumar performed trawling  on a huge snapshot of the web, and found
over 100,000 communities. The trawling puts emphasis on locating
communities on a huge data set, and does not handle relations between
communities. On the other hand, our purpose is finding related
communities near the given URL.
Our algorithm uses a modified version of the HITS , which
calculates the top N related pages to a given URL. It first creates
subgraph of the web by following hyperlinks twice, from a given URL,
in both forward and backward directions. Then it calculates hub and
authority scores of each nodes on the subgraph, and returns top N
authority pages. Our modified HITS is similar to the Companion
algorithm , but have differences in creating subgraph from the
given URL and in putting weights on edges.
The main idea of our algorithm is applying the modified HITS on each
URL in the top N URLs, then clustering these results. The process is
- Apply the modified HITS on a given URL, and get a list (L) of the top N URLs.
- Apply the modified HITS on each URL in L, and collect N result lists (L1,,, LN).
- Create clusters of the result lists (L1,,, LN). In our current implementation, we put two lists into a cluster, when the lists share at least M URLs.
- Output each cluster, removing duplicate URLs in its lists.
In this algorithm, it is important to select appropriate parameters for
the modified HITS and clustering. We omit details of parameter tuning
in this abstract because of the limit in space.
For experiments, we built our own connectivity server on Japanese web
pages. Using a web crawler, we collected 17 million web pages in
Japanese from July to September, 1999. Then, we mapped connectivity
information of those pages on main memory, so that our search engine
can access directly. We implemented the connectivity server and our
algorithm on Sun Enterprise Server 6500 with 8 CPU and 4GB memory.
The current implementation returns a result in 5-10 seconds.
From our experiments, we found that this algorithm first finds a
community near the given URL, then finds communities on broader
topics. For example, when a fan page of a baseball team is given, the
result includes a fan community of the team, and a community of
official homepages of baseball teams. The following is a result when
a fan page of SONY VAIO computers is given.
In this case, the top 10 result of the process (1) includes the VAIO
official page, and this page becomes a seed for the community of
computer makers. This phenomenon happens in many cases.
- Input: http://www.sonyvaio.com/
- Community 1 (Fan pages of SONY VAIO computers):
- Community 2 (Homepages of computer makers):
We have proposed a new web search technique, which finds related
communities from a given URL. Our algorithm first finds a community
near the URL, then finds broader communities. We are now trying to
find smaller communities from a broader community, and construct
navigation ways in both directions.
- Jon M. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1998.
- Jeffrey Dean and Monika R. Henzinger. Finding related pages in the World Wide Web. Proceedings of the 8th World-Wide Web Conference, 1999.
- Ravi Kumar, Prabhakar Raghavan, Sridhar Rajagopalan and Andrew Tomkins. Trawling the Web for emerging cyber-communities. Proceedings of the 8th World-Wide Web Conference, 1999.