Joon2.0

Wednesday, May 03, 2006

Semantic Web Ontologies: What Works and What Doesn't

Peter Norvig: (Mr. Norvig is director of search quality at Google.) [There are] four individual challenges. First is a chicken-and-egg problem: How do we build this information? Because what's the point of building the tools unless you've got the information, and what's the point of putting the information in there unless you have tools? A friend of mine just asked if I could send him all the URLs on the web that have dot-RDF, dot-OWL, and a couple of other extensions on them; he couldn't find them all. I looked, and it turns out there are only around 200,000 of them. That's about 0.005% of the web. We've got a ways to go. The next problem is competing ontologies. Everybody's got a different way to look at it. You have some tools to address it. We'll see how far that will scale. Then the Cyc problem, which is a problem of background knowledge, and the spam problem. That's something I have to face every day. As you get out of the lab and into the real world, there are people who have a monetary incentive to try to defeat you.

So, the chicken-and-egg problem. That's "What interesting information is in these kinds of semantic technologies, and where is the other information?" It turns out most of the interesting information is still in text. What we concentrate on is how you get it out of text. Here's an example of a little demo called IO Knot. You can type a natural language question, and it pulls out documents from text and pulls out semantic entities. And you see, it's not quite perfect; it couldn't quite resolve the spelling problem. But this is all automated, so there's no work in putting this information into the right place.
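As a rough check on the 0.005% figure quoted above, here is a minimal sketch of the arithmetic. It assumes the denominator is the 4.3 billion page count cited later in the talk; the talk does not say so explicitly, and the constant and function names are made up for illustration.

```python
# Count of .rdf/.owl-style URLs quoted in the talk.
SEMANTIC_FILES = 200_000
# ASSUMPTION: total web size taken from the "4.3 billion" figure used later in the talk.
ASSUMED_WEB_PAGES = 4_300_000_000


def percent_share(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


share = percent_share(SEMANTIC_FILES, ASSUMED_WEB_PAGES)
print(f"{share:.4f}% of pages")
```

Under that assumption the share rounds to the "about 0.005%" given in the talk.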

In general, it seems like semantic technology is good for defining schemas, but then what goes into the schemas? There's a lot of work to get it there. Here's another example. This is a Google News page from last night, and what we've done here is apply clustering technology to put the news stories together in categories, so you see the top story there about Blair, and there are 658 related stories that we've clustered together. Now imagine what it would be like if instead of using our algorithms we relied on the news suppliers to put in all the right metadata and label their stories the way they wanted to. "Is my story a story that's going to be buried on page 20, or is it a top story? I'll put my metadata in. Are the people I'm talking about terrorists or freedom fighters? What's the definition of patriot? What's the definition of marriage?"

Just defining these kinds of ontologies, when you're talking about these kinds of political questions rather than about part numbers, becomes a political statement. People get killed over less than this. These are places where ontologies are not going to work. There are going to be arguments over them. And you've got to fall back on some other kinds of approaches. The best place where ontologies will work is when you have an oligarchy of consumers who can force the providers to play the game. Something like the auto parts industry, where the auto manufacturers can get together and say, "Everybody who wants to sell to us do this." They can do that because there are only a couple of them. In other industries, if there's one major player, then they don't want to play the game because they don't want everybody else to catch up. And if there are too many minor players, then it's hard for them to get together.

Semantic technologies are good for essentially breaking up information into chunks. But essentially you get down to the part that's in between the angle brackets.
And one of our founders, Sergey Brin, was quoted as saying, "Putting angle brackets around things is not a technology by itself." The problem is what goes into the angle brackets. You can say, "Well, my database has a person name field, and your database has a first name field and a last name field, and we'll have a concatenation between them to match them up." But it doesn't always work that smoothly. Here's an example of a couple days' worth of queries at Google for which we've spelling-corrected all to one canonical form. It's one of our more popular queries, and there were something like 4,000 different spelling variations over the course of a week. Somebody's got to do that kind of canonicalization. So the problem of understanding content hasn't gone away; it's just been forced down to smaller pieces between angle brackets. So there's a problem of spelling correction; there's a problem of transliteration from another alphabet such as Arabic into a Roman alphabet; there's a problem of abbreviations, HP versus Hewlett Packard versus Hewlett-Packard, and so on. And there's a problem with identical names: Michael Jordan the basketball player, the CEO, and the Berkeley professor.

And now we get to this problem of background knowledge. The Cyc project went about trying to define all the knowledge that was in a dictionary, a Dublin Core type of thing, and then found that what we needed was the stuff that wasn't in the dictionary or encyclopedia. Lenat and Guha said there's this vast storehouse of general knowledge that you rarely talk about, common-sense things like, "Water flows downhill" and "Living things get diseases." I thought we could launch a big project to try to do this kind of thing. Then I decided to simplify a little: just put quote marks around it and type it in. So I typed "water flows downhill" and I got 1,200 hits. [That first hit] says, "lesson plan by Emily, kindergarten teacher."
It actually explains why water flows downhill, and it's the kind of thing that you don't find in an encyclopedia. The conclusion here is Lenat was 99.999993% right, because only 1,200 out of those 4.3 billion cases actually talked about water flowing downhill. But that's enough, and you can go on from there. You can use the web to do voting, so you say this pump goes uphill and that only happens 275 times, so the downhill wins, 1,200 to 275.

Essentially what we're doing here is using the power of masses of untrained people who you aren't paying to do all your work for you, as opposed to trying to get trained people to use a well-defined formalism and write text in that formalism. Let's just use the stuff that's already out there. I'm all for this idea of harvesting this "unskilled labor" and trying to put it to use using statistical techniques over masses of large data and filtering through that yourself, rather than trying to closely define it on your own.

The last issue is the spam issue. When you're in the lab and you're defining your ontology, everything looks nice and neat. But then you unleash it on the world, and you find out how devious some people are. This is an example; it looks like two pages here. This is actually one page. On the left is the page as Googlebot sees it, and on the right is a page as any other user agent sees it. This website, when it sees Googlebot.com, serves up the page that it thinks will most convince us to match against it, and then when a regular user comes, it shows the page that it wants to show.

What this indicates is, one, we've got a lot of work to do to deal with this kind of thing, but also you can't trust the metadata. You can't trust what people are going to say. In general, search engines have turned away from metadata, and they try to home in more on what's exactly perceivable to the user.
For the most part we throw away the meta tags, unless there's a good reason to believe them, because they tend to be more deceptive than they are helpful. And the more there's a marketplace in which people can make money off of this deception, the more it's going to happen. Humans are very good at detecting this kind of spam, and machines aren't necessarily that good. So if more of the information flows between machines, this is something you're going to have to look out for more and more.
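
The two-views-of-one-page example above suggests a simple check: fetch the same URL as a crawler and as an ordinary browser, then compare what comes back. The sketch below illustrates only the comparison step. It is not how Google detects cloaking; the function names, the crude tag stripping, the Jaccard word-overlap measure, and the 0.5 threshold are all assumptions chosen for the example.

```python
import re


def visible_words(html: str) -> set:
    """Crudely strip tags and return the set of lowercase words that remain."""
    text = re.sub(r"<[^>]+>", " ", html)
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Word-set overlap: 1.0 means identical vocabularies, 0.0 means disjoint."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_cloaked(page_for_crawler: str, page_for_user: str,
                  threshold: float = 0.5) -> bool:
    """Flag a page when the two views share too little vocabulary.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    overlap = jaccard(visible_words(page_for_crawler),
                      visible_words(page_for_user))
    return overlap < threshold


# Hypothetical page bodies standing in for the two fetches.
crawler_view = "<html><body><h1>Cheap flights</h1><p>cheap flights cheap hotels best deals</p></body></html>"
user_view = "<html><body><h1>Win a prize</h1><p>click here to claim your prize now</p></body></html>"

print(looks_cloaked(crawler_view, user_view))
print(looks_cloaked(user_view, user_view))
```

A real check would also have to tolerate legitimate differences between fetches, such as rotating ads or timestamps, which is one reason the threshold here should be read as a placeholder.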

