== More details needed for personal project ==
Well, this is about the basics of a web crawler. I have to design a web crawler that will work in a client/server architecture, and I have to write it in Java. I am confused about how to implement the client/server architecture. What I have in mind is a lightweight Swing component for client interaction and an EJB that will get instructions from the client to start crawling. The server will have another GUI that will monitor and administer the web crawler. Does anyone have a simpler or different way of doing this?
As far as I am concerned, the question is: how does a Web crawler automatically collect as many URLs as possible?

== Transwikied Content ==
The following came from b:Web crawler. The sole contributor was 129.186.93.50.

Web Crawler

A program that downloads pages from the Internet by following links.

Examples:
- Google bot
- Yahoo ...

In general, all search engines have a web crawler that collects pages from the web for them. This is done by starting with a page, then downloading the pages that it points to, then downloading the pages that those pages point to, and so on. The names of the already downloaded pages are kept in a database in order to avoid redownloading them. The reach of this whole technology (the pages from the web that are downloaded) depends on the initial pages where the downloading starts. Basically, the downloaded pages are all the pages reachable from those initial pages (unless additional constraints are specified). For this reason, the current eight billion pages that Google crawls are estimated to be only 30% of the web.
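The crawl described in the transwikied text above (start from some pages, follow their links, remember what has been downloaded) can be sketched roughly as follows. This is only an illustration, not how any particular search engine does it: `crawl`, `get_links`, `max_pages`, and the toy link table `TOY_WEB` are hypothetical names, and an in-memory set stands in for the database of downloaded page names.

```python
from collections import deque

def crawl(seeds, get_links, max_pages=1000):
    """Breadth-first crawl; returns pages in the order they were downloaded."""
    seeds = list(dict.fromkeys(seeds))   # drop duplicate seeds, keep order
    seen = set(seeds)                    # pages already downloaded or queued
    queue = deque(seeds)                 # pages discovered but not yet downloaded
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)                # "download" the page
        for link in get_links(url):      # then look at the pages it points to
            if link not in seen:         # skip anything already known
                seen.add(link)
                queue.append(link)
    return order

# Hypothetical toy web: page name -> pages it links to.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["a", "d"],
    "c": ["d"],
    "d": [],
    "e": ["a"],   # "e" links out, but no page links to it
}

def toy_links(url):
    return TOY_WEB.get(url, [])

print(crawl(["a"], toy_links))   # ['a', 'b', 'c', 'd']
```

Starting from "a", page "e" is never found because nothing reachable links to it, which is the point made above about reach depending on the initial pages. A real crawler would replace `toy_links` with code that fetches the page and parses its links.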
== Anti-merge ==
I disapprove of merging this article, as not all web crawlers are search bots; there are, for example, maintenance bots and spam bots! The Neokid 09:55, 28 January 2006 (UTC)
== Merge with spidering ==
The new article on Spidering should definitely be moved into this article. Fmccown 18:47, 8 May 2006 (UTC)

Absolutely not! Spidering and web crawling are exactly opposite terms. Spidering = the network of web pages and their interconnection to each other. Web crawling = the art of finding specific information from that web or Internet. I guess this is the most comprehensive that I can say! Any comments or suggestions are welcome. Raza Kashif (l1f05mscs1025@ucp.edu.pk) 203.161.72.40 11:57, 21 May 2007 (UTC)

== Verifiability ==
"Some spiders have been known to cause viruses." No citation, examples, or explanation for how this is possible. I'm removing this sentence, as I don't believe it is true. Requesting a document by URL can't give the server a virus! (Of course, if somebody knows something I don't, please restore the sentence, and cite your sources!)

--Sorry, no source for this, but I have heard of several cases where a spider has literally flooded a server with requests, resulting in the server going down temporarily. It's not a virus in any way, but it is certainly possible to try to overwhelm a server with thousands of requests. A simple way to prevent this would be a cap on the number of requests a server will accept per minute.

== Question ==
How come WebBase is considered an open source crawler when its source is unknown?

== Vandalism ==
I've never come across a vandalized page before and was not quite sure what to do about it. I removed some of the material on the vandalized page, but did not revert the content. If someone with more experience could assist, I would be grateful. AarrowOM 16:16, 20 February 2007 (UTC) AarrowOM 11:15 EST, 20 February 2007

== Seemingly contradictory section ==
I added {{confusing}} to the section Crawling policies because it essentially seems to say both that the nature of the Web makes crawling very easy, and that the nature of the Web makes crawling difficult.
Can someone rewrite it in a way that clarifies things, or is the problem with how I am reading it? ˜ Lenoxus " * " 13:06, 24 March 2007 (UTC)
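
On the per-minute request cap suggested in the Verifiability thread above: a minimal sketch of one way to do it, assuming a sliding window kept per client. The class name `RequestCap`, its parameters, and the client id "spider" are made up for illustration, and the current time is passed in explicitly so the example is repeatable (a server would more likely use something like `time.monotonic()`).

```python
from collections import defaultdict, deque

class RequestCap:
    """Accept at most max_requests per client within any window_seconds span."""

    def __init__(self, max_requests, window_seconds=60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        # client id -> timestamps of that client's accepted requests
        self._recent = defaultdict(deque)

    def allow(self, client, now):
        recent = self._recent[client]
        # Forget accepted requests that have fallen out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_requests:
            return False                 # over the cap: reject this request
        recent.append(now)
        return True

cap = RequestCap(max_requests=3)         # 3 requests per 60 seconds
print([cap.allow("spider", t) for t in (0, 1, 2, 3, 61)])
# [True, True, True, False, True]
```

The fourth request is refused because three were already accepted within the last minute; by second 61 the two oldest have expired, so the spider is let through again. Other clients are counted separately.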
== Bad PDF links ==
Many of the PDF links at the bottom of the page are bad. I am not an experienced wiki editor, but they should have their hyperlinks removed or fixed or something. 70.142.217.250 13:33, 15 July 2007 (UTC)

== "Web" as a proper noun ==
As a matter of style I believe that "Web" should be capitalized when used as a proper noun -- for example as in "World Wide Web" (meaning the single largest connected graph of HTML documents available by HTTP), or "the Web" (short for the above) -- but not when used in a compound noun such as "web crawler", "web page", "web server", where it acts like an adjective meaning something more like "HTML/HTTP". -- 86.138.3.231 11:58, 13 October 2007 (UTC)

It should also be noted that 'Internet' is always a proper noun. Someone with some time ought to clean up this page. -- Preceding unsigned comment added by 24.96.244.134 (talk) 21:56, 10 August 2008 (UTC)

== SEOENGBot ==
SEOENGBot, originally created for the purpose of providing a focused crawler for the SEOENG engine on a per-website basis (2004-2007), was later retrofitted as a general-purpose, highly distributed crawler which is responsible for crawling millions of webpages, while archiving both webpages and links. The archived data is injected into the SEOENG engine for its own commercial use. SEOENGBot, as well as SEOENG, remains a highly guarded system and its source and location are not currently published. Seoeng (talk) 04:45, 10 May 2008 (UTC)