WWW 2007 / Poster Paper Topic: Search
MedSearch: A Specialized Search Engine for Medical Information
University of Massachusetts – Amherst²
{luog, ctang, haoyang}@us.ibm.com [email protected]
ABSTRACT
People are thirsty for medical information. Existing Web search engines cannot handle medical search well because they do not consider its special requirements. Often a medical information searcher is uncertain about his exact questions and unfamiliar with medical terminology. Therefore, he prefers to pose long queries, describing his symptoms and situation in plain English, and to receive comprehensive, relevant information from search results. This paper presents MedSearch, a specialized medical Web search engine, to address these challenges. MedSearch helps ordinary Internet users search for medical information by accepting queries of extended length, providing diversified search results, and suggesting related medical phrases. A full version of this paper is available in [1].
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: search process
General Terms: Algorithms, Experimentation
Keywords: medical query, medical Web search engine
Copyright is held by the author/owner(s).
WWW 2007, May 8–12, 2007, Banff, Alberta, Canada.
1. INTRODUCTION
Health care is a major business in many countries, and a large part of this business is related to the management and retrieval of medical information. To help people acquire medical information in the Web era, many medical Web search engines (e.g., Healthline and Google Health) have come into existence. While these systems have their own merits, they all treat medical search in much the same way as traditional Web search.
Medical search has several unique requirements that distinguish it from traditional Web search. A common scenario in which a person performs medical search is that he feels uncomfortable but is uncertain about his exact medical problems. In this case, the searcher usually prefers to learn all kinds of knowledge related to his situation. However, existing medical Web search engines are optimized for precision and concentrate their search results on a few topics. This lack-of-diversity problem is aggravated by the nature of medical Web pages. When discussing a medical topic, many medical Web sites use similar, but not identical, descriptions by paraphrasing contents in medical textbooks and research papers. Hence, search results provided by existing medical Web search engines often contain much semantic redundancy, which cannot be easily handled by existing methods for identifying near-duplicate documents or result diversification. To find useful medical information, the searcher often has to go through a large [...]
Another unique feature of medical search is due to the fact that most Internet users do not have much medical knowledge. A medical information searcher is often unclear about the problem that he is facing and unaware of the related medical terminology (e.g., panophthalmitis). As a result, it is difficult for him to choose a few accurate medical phrases as a starting point for his search. Instead, considering the importance of his health, the searcher is typically willing to take his time to describe his situation in detail (e.g., his medical history, where and how he feels uncomfortable, and what happened in the last several days) by posing long queries in plain English, much like the way he talks to a doctor. In fact, many medical questions that people post on medical forums contain several hundred words, and a recent study on medical queries [2] has reported that medical information searchers prefer to pose detailed long questions to Web search engines. Figure 1 shows one such question.
www.medhelp.org/forums/RespiratoryDisorders/messages/2584.html
"My 23 month old son has been coughing since 6 months old … Seems to be constantly on antibiotics for every kind of chest infection, on pulmicort, albuterol 2x's a day, constant ear infections (tubes, adnoids, and tonsils are scheduled), chronic loose stools. Seen an allergist, he has lots of environmental allergies, did all the mattres covers, rugs are gone, air purifier in. All this to no avail. Chest xray showed streaking in the main bronch tubes (?) perihilar stuff hazy areas, left lobe is alot grayer than the right. … Went to pedi pulmonologist in Boston, scheduled for sweat test on Friday, he doesnt think he has it, but wants to rule out CF. He wants to do CT and bronchoscope next week. Mentioned something about poss. deformed broch tubes, or weak lung walls, or even a cyst compressing his lungs causing this cough … what are the possibilities he has a verison of pulmonary micobacterial infection?"
Figure 1. An exemplary medical question posted on the Med Help International Medical and Health Forum (www.medhelp.org/forums.htm).
Even after stopword removal, the above query still cannot be fed directly into existing medical Web search engines, because they all impose certain limits on query length for various reasons. For instance, Google truncates long queries to its length limit of 32 words. Such a low limit on query length is a serious obstacle for medical information searchers. Moreover, a medical information searcher often prefers the search engine to automatically suggest diversified, related medical phrases that can help him quickly digest search results and refine his query. However, this cannot be done with existing medical Web search engines when the query is written using plain English description and has a terminological [...]
2. MEDSEARCH
In this paper, we present MedSearch, a prototype medical Web search engine that addresses the aforementioned limitations of existing systems. MedSearch uses several key techniques that greatly improve its usability and the quality of search results. First, MedSearch accepts queries of extended length and supports the use of plain English description. This is a great convenience for the majority of Internet users who do not have much medical knowledge. MedSearch automatically rewrites long queries into moderate-length queries by selectively dropping unimportant terms (i.e., words). Since unimportant terms not only appear in a large number of Web pages but also obscure the main theme of the query,
dropping them can both greatly increase the query processing speed and improve the quality of search results. Second, MedSearch returns diversified Web pages without significantly increasing query processing time or deteriorating the quality of the returned top Web pages, which allows the searcher to see various aspects related to his situation. Third, MedSearch automatically suggests diversified, related medical phrases to the searcher based on information from several sources: the standard MeSH medical ontology (www.nlm.nih.gov/mesh/meshhome.html), the collection of crawled [...]
There are several key challenges in designing MedSearch. In order to rewrite long queries into moderate-length queries, we must aggressively drop unimportant terms yet avoid losing much useful information. For this purpose, we rank all the terms in the query according to the Okapi term weighting formula. Those terms with small weights are treated as unimportant ones and dropped from the query.
One major challenge in providing diversified search results is to efficiently handle the excessive redundancy among different medical Web pages. For this purpose, all the crawled Web pages are clustered into multiple clusters in a pre-processing step. Each of these clusters roughly corresponds to a different topic. When ranking Web pages, each cluster can contribute only a limited number of results to the returned top few Web pages. Then the searcher is likely to see different aspects in the top results.
The process of suggesting related medical phrases consists of two sub-steps. The first sub-step is to generate the candidate set S of related medical phrases in the MeSH ontology. The second sub-step is to rank the medical phrases in S. In the first sub-step, MedSearch selects V=60 medical phrases from the returned top-20 Web pages. The suggested medical phrases need to be both relevant and diverse in order to provide the greatest convenience to the searcher. Intuitively, to ensure that a medical phrase M is relevant, it is better for M to appear in one of the returned top Web pages with a large tf×idf value that is computed using the Okapi formula. To ensure enough diversity in the list of suggested medical phrases, a single Web page should not contribute too many medical phrases to that list. We use a continuous discounting method to achieve these two goals. Each time a medical phrase is chosen from a Web page P, a discount is given to the tf×idf values of the remaining medical phrases in P. As a result, the more medical phrases have already been chosen from P, the harder it becomes for the remaining medical phrases in P to be chosen later. We select V medical phrases in V passes. In each pass, we select a medical phrase with the largest [...]
The main challenge in the second sub-step of ranking the suggested medical phrases is to resolve the terminological discrepancy between medical phrases and queries written in plain English. For this purpose, a set of representative Web pages is computed offline for each medical phrase M, by using M to retrieve the top-ranked Web pages. Since a large part of these high-quality representative Web pages are written in plain English, they provide good linkages between medical terminology and plain English words. The relevance between a query Q and a medical phrase M is computed as a function of the relevance scores between Q and M’s representative Web pages. Then all the suggested medical phrases are sorted in descending order of their relevance scores. A detailed description of our techniques is available in [1].
3. RESULTS
To demonstrate the effectiveness of our techniques, we conducted experiments using 30 representative medical questions that people posted on a popular medical forum, the Med Help International Medical and Health Forum (www.medhelp.org/forums.htm). One such query is shown in Figure 1. We crawled 6GB of Web pages from WebMD (www.webmd.com), one of the most popular medical Web sites.
Both relevance and diversity are judged using a single metric: usefulness. A returned Web page P is useful if P is relevant to the query, and much of P’s relevant content has not been mentioned in the Web pages that are ranked higher. If P is useful, its usefulness score is score(P) = 1; otherwise, score(P) = 0. A similar definition of usefulness holds for the suggested medical phrases. For the returned top-20 Web pages, the weighted average usefulness score is
score = ∑ score(P_i) / [...]
For the suggested 60 medical phrases, their weighted average usefulness score is defined similarly. The mean of the weighted average usefulness scores over the 30 queries is the main quality metric for the returned pages and the suggested phrases. Five colleagues served as assessors and independently determined the usefulness scores of the returned Web pages and the suggested medical phrases. None of them has formal medical training.
To give the reader a feel for the contents returned by MedSearch, we present detailed results of the returned Web pages and the suggested medical phrases for the query in Figure 1. Table 1 shows some of the returned relevant Web pages. The suggested relevant medical phrases include bronchoscopy (rank 1), bronchitis (rank 2), and sarcoidosis (rank 4). In general, for a medical query Q, MedSearch can find several relevant Web pages and medical phrases that cover multiple aspects of Q.
Table 1. Some of the returned relevant Web pages.
The means of the weighted average usefulness scores over the 30 queries for the returned top-20 Web pages and the suggested 60 medical phrases are 7.9 and 6.1, respectively. We present a simple calculation below to give the reader some intuition on these numbers. Let ws_i denote the weighted average usefulness score when the returned top-i Web pages (or medical phrases) are useful while the others are not useful. Then ws [...]
Our results show that MedSearch can process long queries efficiently, at a speed roughly comparable to that of existing medical Web search engines in processing short queries. Our experiments also show that users’ satisfaction is crucially tied to MedSearch’s capability of returning diversified Web pages and suggesting diversified, related medical phrases that can help users quickly understand the returned pages and refine their queries.
4. REFERENCES
[1] Full version of this paper is available at http://www.cs.wisc.edu/~gangluo/medsearch.pdf.
[2] A. Spink, Y. Yang, J. Jansen, et al. A Study of Medical and Health Queries to Web Search Engines. Health Information and [...]
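As an illustration of the query-rewriting step described in Section 2, the following minimal Python sketch ranks query terms by an Okapi-style weight and keeps only the highest-weighted ones. The paper states only that terms are ranked by the Okapi term weighting formula and that low-weight terms are dropped. The specific weight used here (the Robertson/Sparck Jones idf from Okapi BM25), the `doc_freqs` table, the collection size `n_docs`, and the cutoff `max_terms` are assumptions made for illustration.

```python
import math

def okapi_idf(doc_freq, n_docs):
    """Robertson/Sparck Jones idf as used in Okapi BM25 (assumed weighting)."""
    return math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def rewrite_query(terms, doc_freqs, n_docs, max_terms):
    """Keep the max_terms distinct terms with the largest weights, in query order."""
    distinct = list(dict.fromkeys(terms))  # de-duplicate, preserving order
    weights = {t: okapi_idf(doc_freqs.get(t, 0), n_docs) for t in distinct}
    ranked = sorted(distinct, key=lambda t: weights[t], reverse=True)
    keep = set(ranked[:max_terms])
    return [t for t in distinct if t in keep]

# Hypothetical document frequencies in a 1,000-page collection.
doc_freqs = {"son": 400, "coughing": 60, "antibiotics": 45,
             "day": 700, "bronchoscope": 3}
query = ["son", "coughing", "antibiotics", "day", "bronchoscope"]
short_query = rewrite_query(query, doc_freqs, n_docs=1000, max_terms=3)
print(short_query)
```

With these made-up frequencies, the common words "son" and "day" receive the smallest weights and are dropped, while the rarer, more topical terms survive.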
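The cluster-based diversification described in Section 2 can be sketched in a few lines of Python. The sketch walks a relevance-ranked list and lets each cluster contribute at most `per_cluster` pages to the top `k`. The function name, the `(page_id, cluster_id)` input format, and the cap values are illustrative assumptions, and the clustering itself is assumed to have been done in the offline pre-processing step.

```python
from collections import Counter

def diversify(ranked, k, per_cluster):
    """Select up to k pages from a relevance-ranked list of
    (page_id, cluster_id) pairs, with at most per_cluster pages per cluster."""
    taken = Counter()  # pages already selected from each cluster
    result = []
    for page_id, cluster_id in ranked:
        if len(result) == k:
            break
        if taken[cluster_id] < per_cluster:
            taken[cluster_id] += 1
            result.append(page_id)
    return result

# Hypothetical ranking in which one topic dominates the top positions.
ranked = [("p1", "asthma"), ("p2", "asthma"), ("p3", "asthma"),
          ("p4", "cystic fibrosis"), ("p5", "allergy"), ("p6", "asthma")]
top = diversify(ranked, k=4, per_cluster=2)
print(top)
```

In this toy ranking the third asthma page is skipped, so pages from two other clusters reach the top four.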
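The continuous discounting method for selecting suggested phrases (Section 2) might look roughly like the Python sketch below. It is a sketch under stated assumptions, not the authors' implementation: the paper does not give the form of the discount, so a multiplicative factor `discount` is assumed, and the input format (a mapping from page to per-phrase tf×idf values) and all names are hypothetical.

```python
def suggest_phrases(page_scores, v, discount=0.5):
    """Pick up to v phrases in v passes.

    page_scores maps page_id -> {phrase: tf-idf value}. After a phrase is
    chosen from a page, the page's remaining values are multiplied by
    `discount` (an assumed multiplicative form of the discount).
    """
    # Work on a copy so the caller's scores are not modified.
    current = {page: dict(scores) for page, scores in page_scores.items()}
    chosen = []
    for _ in range(v):
        candidates = [(score, page, phrase)
                      for page, scores in current.items()
                      for phrase, score in scores.items()
                      if phrase not in chosen]
        if not candidates:
            break
        _, page, phrase = max(candidates)  # largest discounted value wins
        chosen.append(phrase)
        del current[page][phrase]
        # Discount every remaining phrase of the page that just contributed.
        for other in current[page]:
            current[page][other] *= discount
    return chosen

# Hypothetical tf-idf values for two returned pages.
page_scores = {
    "pA": {"bronchoscopy": 9.0, "bronchitis": 8.0, "cystic fibrosis": 7.0},
    "pB": {"sarcoidosis": 6.0, "asthma": 3.0},
}
suggested = suggest_phrases(page_scores, v=3)
print(suggested)
```

Without discounting, all three suggestions in this example would come from page pA. With the assumed factor of 0.5, the second pick comes from pB instead, which is the diversity effect the method aims for.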
FM 7/2005 Being happy, being successful, performing: those are the demands of our time. When we face setbacks or depression, we reach for Prozac so that we can quickly carry on. But according to psychoanalyst and philosopher Antoine Mooij, in doing so we deny what is essential to being human: learning to live with melancholy and lack. [...] to think that all psychological phenomena can be reduced to the [...]
“I was about 50 meters away when it blew up. The blast knocked me off my feet and into the side of a Humvee. I must have blacked out for a minute or two, but when I came to there was nothing left of the vehicle. No remnants; just char and a crater.” When 28-year-old Christopher Harmon was discharged on May 26, 2006, after eight years in the Marine Corps, he had a chestful of decoration