Commit Graph

  • 0e13022147 - enhanced solr field documentation - added xml api button to IndexFederated_p - the solr schema.xml file can be generated by YaCy Michael Peter Christen 2012-04-26 15:25:07 +02:00
  • 08dcf3e5d1 hack to get all results if the actual number is between 10 and 64 Michael Peter Christen 2012-04-26 00:27:21 +02:00
  • 19efbf1b0f - apply directDocByURL to NOLOAD Queue - choose pushing to NOLOAD as default for site crawl Michael Peter Christen 2012-04-26 00:23:18 +02:00
  • 5c66880be2 fix for search result selection in case that contentdom is not set Michael Peter Christen 2012-04-26 00:04:23 +02:00
  • 659178942f - Redesigned crawler and parser to accept embedded links from the NOLOAD queue and not from virtual documents generated by the parser. - The parser now generates nice description texts for NOLOAD entries which shall make it possible to find media content using the search index and not using the media prefetch algorithm during search (which was costly) - Removed the media-search prefetch process from image search Michael Peter Christen 2012-04-24 16:07:03 +02:00
  • 3bea25c513 increased image preview size Michael Peter Christen 2012-04-24 16:04:13 +02:00
  • a3badd3205 changed search process for images: no more media snippet load process, show only links from index which had been on the text search page before. This creates a superfast search process for images! Michael Peter Christen 2012-04-24 12:55:58 +02:00
  • f5efdb21fd refactoring Michael Peter Christen 2012-04-24 12:54:41 +02:00
  • 4aa0eedead one more scroogle... Michael Peter Christen 2012-04-24 12:05:37 +02:00
  • 347612ddd4 removed scroogle parser Michael Peter Christen 2012-04-24 12:04:44 +02:00
  • c1f6b4fb52 lookupByIP: prevent comparing of port parameter if called with port -1 (=unknown) reger 2012-04-24 00:05:01 +02:00
  • f8cd57c92f new indexing strategy: ALL links that appear anywhere are indexed, not only links where the content can be parsed. All non-parseable links are placed into the noload queue. The search process must therefore be able to filter out non-text search results. - This fixes the problem that image search results appeared in the text search. - The interactive search can retrieve now ALL types of links - The p2p interface is now extended to retrieve only certain types of links (text, image, video, apps) - The search process has an extension to filter the right document type according to the search query Michael Peter Christen 2012-04-22 02:05:17 +02:00
  • 14f67f217c refactoring of ContentDomain: now subclass of Classification Michael Peter Christen 2012-04-22 00:04:36 +02:00
  • 8a08c96a82 removed dependency from logging Michael Peter Christen 2012-04-21 21:32:31 +02:00
  • a1a5b015d8 refactoring: moved document Classification to cora package Michael Peter Christen 2012-04-21 21:31:13 +02:00
  • a5d7da68a0 refactoring: removed dependency from switchboard in Balancer/CrawlQueues Michael Peter Christen 2012-04-21 13:47:48 +02:00
  • 33d1062c79 refactoring: the cache belongs to the crawler Michael Peter Christen 2012-04-21 13:34:07 +02:00
  • 8429967ea7 no more SVN Michael Peter Christen 2012-04-19 13:29:08 +02:00
  • 0466bb0ddf no more SVN.. Michael Peter Christen 2012-04-19 13:28:12 +02:00
  • 4844e124b1 one more warning in case that crawling is paused because of low disk space Michael Peter Christen 2012-04-19 12:35:11 +02:00
  • 0ec2713af8 'download' Michael Peter Christen 2012-04-19 11:50:24 +02:00
  • 2be327b5ab update location update Michael Peter Christen 2012-04-19 11:49:43 +02:00
  • f30c577fdb add hint to speed up search results Michael Peter Christen 2012-04-19 11:11:14 +02:00
  • 6b133de3e9 add hint for consulting support Michael Peter Christen 2012-04-19 11:10:48 +02:00
  • 4d5da75814 fix for parser problem if a <a>-tag is 'within' html tags with unclosed tags. That prevented the <a> tags from being recognized. This is a fix for http://forum.yacy-websuche.de/viewtopic.php?p=25516#p25516 Michael Peter Christen 2012-04-18 10:30:04 +02:00
  • eb2c8ffa62 display is not used any more Michael Peter Christen 2012-04-17 12:30:14 +02:00
  • 91a86f0b06 fixed to network graph testing Michael Peter Christen 2012-04-17 11:46:14 +02:00
  • f31ad84d98 automatic generation of blacklist pattern, see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2685&p=25305#p25305 Michael Peter Christen 2012-04-17 11:22:19 +02:00
  • 7b5b9baee0 added citation rank to ranking profile Michael Peter Christen 2012-04-16 23:43:50 +02:00
  • 046f3a7e8d check if httpc has decompressed the release file and rename the file from .tar.gz to .tar if that happened Michael Peter Christen 2012-04-16 09:50:55 +02:00
  • 06951ef751 remove heuristic scroogle from search option help text in index.html reger 2012-04-16 04:00:04 +02:00
  • e377092198 fix to xml output format Michael Peter Christen 2012-04-13 09:02:18 +02:00
  • 41be98dc9d extended webstructure api to show together with incoming links also outgoing links Michael Christen 2012-04-13 11:53:34 +02:00
  • 02e4dedff2 fix to url citation collection Michael Christen 2012-04-13 11:52:59 +02:00
  • e32055aa15 added stub classes for - a new database for url reference data ('seen links') - a new database extending the references to the full url metadata attributes set which shall replace the old metadata database if it is finished - migration help classes stub to use old and new metadata databases simultaneously Michael Christen 2012-04-13 07:09:15 +02:00
  • ac5d124ee0 experimental implementation of a citation ranking as post-ranking method. (ranking coefficient fixed, need to be made configurable) Michael Christen 2012-04-13 06:47:33 +02:00
  • 8f89c8ef07 added information about inbound, outbound and citation links into yacydoc api servlet Michael Christen 2012-03-31 07:38:49 +02:00
  • 71649a1296 added an api to retrieve the new citation.index with the webstructure.xml api. This api will respond with details about a single URL if requested with 'webstructure.xml?about=[url|urlhash|host]'. Michael Christen 2012-03-29 17:22:31 +02:00
  • 8fc86fe397 added storage of full anchor link structure: the links between all pages are now stored. The same index structure as used for the word index is used to make a reverse link index. The new file(s) in SEGMENT/default/citation.index.*.blob store the citation index. This will be used to create much more detailed link structures for the YaCy apis and to create a better ranking. A ranking using the citation.index should provide better results especially for portal indexes and intranets. Michael Christen 2012-03-29 17:20:14 +02:00
  • 22f05c83ff fixed default must-match filter for full domain crawls - the old filter was too restrictive and did not allow intranet crawls Michael Christen 2012-03-28 21:50:00 +02:00
  • 3e61287326 some better feedback on properties change Lotus 2012-03-25 22:21:42 +02:00
  • 96ac95cff9 added hint how to change integration options Lotus 2012-03-23 17:02:50 +01:00
  • 4f61b8fd82 Fixes for compare-search Thomas 2012-03-21 21:43:47 +01:00
  • e0680de7b3 Remove Scroogle from compare-search, Scroogle is dead Thomas 2012-03-20 23:00:06 +01:00
  • 78f0d8f046 no focus on preview frames for search integration fixes bug http://bugs.yacy.net/view.php?id=161 Lotus 2012-03-17 21:10:29 +01:00
  • 0b3f39136e allow custom ppm lower than minimum button on /Crawler_p.html fixes http://bugs.yacy.net/view.php?id=166 Lotus 2012-03-17 20:43:19 +01:00
  • e14eb9de82 checkalive.sh: try to fetch only once (default: 20) Lotus 2012-03-12 09:30:44 +01:00
  • 7792ac6406 fix links & bug #163 Lotus 2012-03-10 10:59:56 +01:00
  • 532c7cf827 added physics experiment to the graph plotter. not active by default Michael Peter Christen 2012-02-28 13:18:46 +01:00
  • aba9b1bfa0 better names for elements of a linked graph Michael Peter Christen 2012-02-27 21:27:17 +01:00
  • 0cc0290978 bugfix for a must-not-match pattern check. This bug did not make the check semantically wrong, but a trick that prevented an IP lookup in case that the filter was not used did not work. That bugfix causes that crawling gets a huge speed boost for noload urls! Michael Peter Christen 2012-02-27 00:52:44 +01:00
  • 2fc8ecee36 ConcurrentLinkedQueue has a VERY long return time on the .size() method. See http://docs.oracle.com/javase/6/docs/api/java/util/concurrent/ConcurrentLinkedQueue.html Michael Peter Christen 2012-02-27 00:42:32 +01:00
  • 8aba045ba1 if a new pop-up page is set in config portal, then this page applies also to the default page configuration for the httpd if no path is given. Michael Peter Christen 2012-02-26 20:53:32 +01:00
  • fa7b3481b3 better navigation in file search: less results by first try, but much faster. after the first search is done, buttons appear to get more results for the same search Michael Peter Christen 2012-02-26 17:32:45 +01:00
  • 5fd2c30318 adjust Netbeans project class path settings to updated httpclient and commons jars reger 2012-02-26 00:06:57 +01:00
  • aae75def69 fix: prevent logging of Solr doc content reger 2012-02-26 00:04:25 +01:00
  • 8c06925984 animation of the web structure picture Release_1.02 Michael Peter Christen 2012-02-25 15:42:29 +01:00
  • 898fa7c3f3 use tld heuristic to check if a domain is local or global Michael Peter Christen 2012-02-25 15:41:20 +01:00
  • 213c8d97f2 use less processes in process pool Michael Peter Christen 2012-02-25 14:07:20 +01:00
  • c639248c23 protection against strange answers from remote peers during search Michael Peter Christen 2012-02-25 14:07:02 +01:00
  • 9c51db4243 Release_1.02 Michael Peter Christen 2012-02-25 12:59:19 +01:00
  • 36e4d82b27 changed ranking Michael Peter Christen 2012-02-25 12:58:12 +01:00
  • 99c74699de removed scroogle (scroogle is dead) Michael Peter Christen 2012-02-25 12:57:59 +01:00
  • f7ed050771 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-02-25 12:44:02 +01:00
  • 096c17e7cd added test code Michael Peter Christen 2012-02-25 12:42:13 +01:00
  • 84f506da68 update installed jre version Lotus 2012-02-24 09:11:48 +01:00
  • 6e51a00a2f Revert "fix for page navigation: show only as much pages as are available for given navigation constraints, not as given by total results size" Michael Peter Christen 2012-02-24 02:46:56 +01:00
  • 73f5a9e8b3 fix for page navigation: show only as much pages as are available for given navigation constraints, not as given by total results size Michael Peter Christen 2012-02-24 02:31:03 +01:00
  • 9c51dc0f13 fixed a bug with navigation: if a navigation was applied to file type or protocol, then it was not possible to remove that again. This is the fix for that. Michael Peter Christen 2012-02-24 02:28:40 +01:00
  • 665626a51b catch OOM errors during scanning Michael Peter Christen 2012-02-24 02:15:27 +01:00
  • 8bfc987374 enhanced hint how to enter file:// urls Michael Peter Christen 2012-02-24 02:14:54 +01:00
  • f838997126 updated commons io from 2.0.1 to 2.1 Michael Peter Christen 2012-02-24 01:35:01 +01:00
  • 1cd711d005 added classes for citation references (for new citation ranking) Michael Peter Christen 2012-02-24 01:07:15 +01:00
  • eeb57ae824 updated http client libraries Michael Peter Christen 2012-02-24 01:06:30 +01:00
  • 33a405dab8 ipv6 bugfix Michael Peter Christen 2012-02-24 00:50:46 +01:00
  • c6c61be3f0 fix for http://bugs.yacy.net/view.php?id=148 Michael Peter Christen 2012-02-24 00:38:57 +01:00
  • edaa8ac94c Merge commit 'e15e633a0128b8d31011283a65b4ef26a6dddcd8' Michael Peter Christen 2012-02-23 10:07:13 +01:00
  • e15e633a01 Bugfix for IE9 (doesn't accept html form within form) changes of API schedule row data changed form input form to unique field names using row pk. Fix for issue 96 http://bugs.yacy.net/view.php?id=96 reger 2012-02-23 02:40:07 +01:00
  • 716db3b79a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-02-23 01:26:09 +01:00
  • e0f1e7d904 added new citation reference data structure that shall be used for a citation ranking Michael Peter Christen 2012-02-23 01:22:29 +01:00
  • e18a4f6b74 more tolerant merge iterator Michael Peter Christen 2012-02-23 01:21:24 +01:00
  • 0d148c3353 more logging in resource observer Michael Peter Christen 2012-02-23 01:20:42 +01:00
  • 2fa037ae1d enhanced crawler Michael Peter Christen 2012-02-23 01:20:24 +01:00
  • 43ffae6590 delete yacy.running after kill as requested in http://forum.yacy-websuche.de/viewtopic.php?t=3835 Lotus 2012-02-22 18:41:32 +01:00
  • e101c2e0e2 added changes from copperdust (submitted by email): 1. Improved and fixed language detection: 1.1 Identificator.java - recognition fix (improved) 1.2 DCEntry.java - fix (changed detection order due to detection from tld in many cases is incorrect) 1.3 MultiProtocolURI.java - fixed and enhanced language from tld detection (all currently used top-level domains; ccTLD added but not tested). 2. Ukrainian language update. 3. Main Slavic languages langstats (tested and works fine). Michael Peter Christen 2012-02-22 12:21:27 +01:00
  • 58e08b1211 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-02-21 22:31:49 +01:00
  • a9b4d49b75 removed debug output Michael Peter Christen 2012-02-21 22:31:14 +01:00
  • 2120db289a *) Small change which should solve problem with cgitb module in Python CGI scripts. low012 2012-02-14 20:54:19 +01:00
  • ee89cf5ae5 fix must match filter for full domain crawl Lotus 2012-02-07 16:13:13 +01:00
  • 4f92389550 Merge branch 'master' of git://gitorious.org/yacy/rc1.git reger 2012-02-03 23:34:24 +01:00
  • 8d63a5887c bugfixes Michael Peter Christen 2012-02-02 23:38:23 +01:00
  • 9ad1d8dde2 complete redesign of crawl queue monitoring: do not look at a ready-prepared crawl list but at the stacks of the domains that are stored for balanced crawling. This affects also the balancer since that does not need to prepare the pre-selected crawl list for monitoring. As an effect: - it is no more possible to see the correct order of next to-be-crawled links, since that depends on the actual state of the balancer stack the next time another url is requested for loading - the balancer works better since the next url can be selected according to the current situation and not according to a pre-selected order. Michael Peter Christen 2012-02-02 21:33:42 +01:00
  • 7e4e3fe5b6 free some memory after parsing html Michael Peter Christen 2012-02-02 09:55:27 +01:00
  • 4540174fe0 memory hacks Michael Peter Christen 2012-02-02 07:37:00 +01:00
  • b4409cc803 small redesign of blob column index and usage Michael Peter Christen 2012-02-02 06:43:57 +01:00
  • d5c1f2746e performance hack Michael Peter Christen 2012-02-02 06:43:15 +01:00
  • 803963aebd performance hack: better space grow in CharBuffer (speeds up html parser) Michael Peter Christen 2012-02-01 23:27:59 +01:00
  • 8b0920b0b5 tried to fix the ipv6 problem as reported in bug Michael Peter Christen 2012-02-01 22:26:19 +01:00
  • e2f8f263e8 changed storage of search words: keep order Michael Peter Christen 2012-02-01 18:13:31 +01:00
  • ed39ef2890 changed generation of protocol information Michael Peter Christen 2012-02-01 18:12:59 +01:00