Commit Graph

  • 81dc2aa536 add current css to HTMLResponseWriter to fix metadata view (using css from metas.template except js links) reger 2014-04-23 23:41:10 +02:00
  • 2fd8a0ead6 Merge branch 'master' of git@gitorious.org:yacy/rc1.git orbiter 2014-04-23 23:13:23 +02:00
  • 8e5ce7cd51 fixed a situation where finished crawls had not been detected. orbiter 2014-04-23 23:13:07 +02:00
  • c6f0bd05f8 better removal of stored urls when doing a crawl start orbiter 2014-04-23 23:12:08 +02:00
  • 2f63bd0261 enhanced Host Balancer strategy: fair round robin orbiter 2014-04-23 23:11:37 +02:00
  • 0c88a32c36 do not apply lazy value instantiation for numeric or boolean values because that is misleading and confusing in case of 0- or false-values and may cause NPEs in retrieval functions. orbiter 2014-04-23 08:41:36 +02:00
  • 8e04030596 in case of short memory, do not cut down robinson peers to 1, just reduce by 50% orbiter 2014-04-23 08:37:19 +02:00
  • 86f6975edc exclude html tags in in/outboundlinks_anchortext_txt parsed text - some outboundlinks_anchortext_txt in index contain e.g. <span>text</span> or more tags, remove all tags for text property (inline img tags are still parsed) - added test case for above (to htmlParserTest) - fix solr test case reger 2014-04-23 00:55:16 +02:00
  • 469e0a62f1 added new button to terminate all crawls orbiter 2014-04-22 23:14:54 +02:00
  • ccb1864d55 catch IllegalArgumentException for wrong process types (that is needed for migrations when new process types are introduced or disappear) orbiter 2014-04-22 23:14:05 +02:00
  • 4ee4ba1576 fix for NPE in IndexCreateParserErrors_p.html caused by bad handling of lazy value instantiation of 0-value in crawldepth_i orbiter 2014-04-22 19:48:49 +02:00
  • 12ba890205 removed warnings orbiter 2014-04-22 19:35:15 +02:00
  • d51f9cc863 add custom Jetty errorhandler to provide custom error page footer line - remove redundant mime check in UrlProxyServlet reger 2014-04-21 17:28:21 +02:00
  • c193a02023 defer creation of new ArrayList after possible early return (to skip not used object allocation) reger 2014-04-21 17:16:06 +02:00
  • 727dfb5875 refactor URIMetadataNode to further unify interaction with index - URIMetadataNode extending SolrDocument - use language as stored (String), reducing conversion to string - optimize debug code in transferIndex reger 2014-04-20 01:41:30 +02:00
  • 79e7947442 - remove empty http0_9 status text array and unused default_charset = ISO-8859-1 reger 2014-04-18 22:03:16 +02:00
  • 2dabe2009d - remove unused manual http KeepAlive config (reducing references to obsolete httpdemon) - add port info to settings_http reger 2014-04-18 19:57:35 +02:00
  • 5746aae3db add canonical links to the same crawldepth, not the next crawldepth Michael Peter Christen 2014-04-18 06:51:46 +02:00
  • 74ab5ef9fa increased runtime for postprocessing query job Michael Peter Christen 2014-04-18 06:51:10 +02:00
  • 8b32dd5f9e special strategy for balancer: do not remove targets with zero wait time from the queue Michael Peter Christen 2014-04-18 06:50:07 +02:00
  • 9c6228d948 fix for deadlocks in crawler Michael Peter Christen 2014-04-17 16:58:17 +02:00
  • 7a2f3e2353 increased resource.disk.used.max.steadystate and resource.disk.used.max.overshot by 4 times because first users reached that limit and wondered why the crawler was paused automatically :) Michael Peter Christen 2014-04-17 16:19:38 +02:00
  • 10cf8215bd added crawl depth for failed documents Michael Peter Christen 2014-04-17 13:21:43 +02:00
  • 7fefebaeca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-17 12:55:38 +02:00
  • c2f62e783f - better subgraph handling, less overhead for crawls without the webgraph - usage of crawler crawldepth cache for the linkgraph target depth computation Michael Peter Christen 2014-04-17 12:54:18 +02:00
  • 06afb568e2 new Strategies in Balancer: - doublecheck cache now records the crawl depth as well - doublecheck cache is available from the outside (made static) - no more need to crawl hosts with lowest depth first, instead all hosts which have only singleton entries are preferred to reduce the number of files. Michael Peter Christen 2014-04-17 12:52:54 +02:00
  • 1aea01fe5b fix for Table in case that requested file does not exist and paths also do not exist Michael Peter Christen 2014-04-17 12:44:05 +02:00
  • 710054bb37 implement gzip input handling directly in defaultservlet (making reference to legacy httpdemon obsolete) reger 2014-04-17 03:20:29 +02:00
  • b4b0d14c04 fix for display bug Michael Peter Christen 2014-04-16 22:24:04 +02:00
  • 9a5ab4e2c1 removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy. Michael Peter Christen 2014-04-16 22:16:20 +02:00
  • da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create tens of thousands of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader should keep the number of open file pointers below the system limit, which is usually about 10,000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services. Michael Peter Christen 2014-04-16 21:34:28 +02:00
  • 075b6f9278 refactoring of the crawl balancer: the balancer is turned into an interface and the old balancer class is moved into LegacyBalancer to make room for a fresh implementation of a crawl balancer. Michael Peter Christen 2014-04-14 13:32:35 +02:00
  • 8470dfe3f8 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-14 12:17:52 +02:00
  • 46016fa153 autoupdate fails to download latest release (1.71) due to default release blacklist - removed the default version blacklist regex from init (for future versions) reger 2014-04-13 07:32:32 +02:00
  • 8aeef73d49 fix for virtual root nodes Michael Peter Christen 2014-04-11 15:12:34 +02:00
  • 7c7fbb9818 find depth-matches also for edge targets Michael Peter Christen 2014-04-11 12:27:21 +02:00
  • dd12dd392f introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph. Michael Peter Christen 2014-04-11 12:09:33 +02:00
  • 6ea8bb7348 using MultiProtocolURL for edge data which is faster (hash computation is now much easier) and smaller in size Michael Peter Christen 2014-04-11 10:58:37 +02:00
  • b21c208b4d enhanced hashcode computation for MultiProtocolURL Michael Peter Christen 2014-04-11 10:23:48 +02:00
  • ce1d1b2fa0 fix for maximum tag length in parser Michael Peter Christen 2014-04-11 09:56:44 +02:00
  • 17e0956312 refactoring of SystemLoad calls (only one backend tool) Michael Peter Christen 2014-04-11 09:25:18 +02:00
  • a37d067692 refactoring Michael Peter Christen 2014-04-10 23:46:35 +02:00
  • 95780eed32 Merge branch 'master' of git@gitorious.org:yacy/rc1.git orbiter 2014-04-10 21:40:54 +02:00
  • 67beef657f strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references. Michael Peter Christen 2014-04-10 18:58:03 +02:00
  • 6bd8c6f195 fix for wrong status codes of error pages Michael Peter Christen 2014-04-10 09:08:59 +02:00
  • 9e503b3376 also delete the robots.txt file from the cache when a new crawl is started Michael Peter Christen 2014-04-09 21:59:54 +02:00
  • 67501c9dda Merge branch 'master' of git@gitorious.org:yacy/rc1.git orbiter 2014-04-09 19:58:54 +02:00
  • 1c21b3256d fix for robots.txt handling: delete old entry before starting a new crawl. Michael Peter Christen 2014-04-09 18:33:48 +02:00
  • c250fac9f4 linkstructure refactoring to get more options for clickdepth analysis orbiter 2014-04-09 17:52:51 +02:00
  • 8068e68474 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-09 12:45:15 +02:00
  • bd886054cb new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information Michael Peter Christen 2014-04-09 12:45:04 +02:00
  • f326a67561 fix: typo in default charset in metadata2solr update pom and NB build to Solr 4.7.1 libs reger 2014-04-06 22:31:22 +02:00
  • df138084c0 do solr optimization independently from memory and load constraints: - not doing an optimization will likely cause a too many files exception - without optimization performance will be even worse which would prevent optimization in the future as well (prevent a deadlock situation) Michael Peter Christen 2014-04-06 11:04:23 +02:00
  • ebd44a7080 replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47 Michael Peter Christen 2014-04-06 10:45:03 +02:00
  • 0f3fbae438 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-06 08:56:31 +02:00
  • 1a6e0354db update commons-compress.jar to 1.8 reger 2014-04-06 03:59:11 +02:00
  • 68417a05c5 different algorithm to test checkalive as it depends less on the existence of wget (or curl) on the OS. Michael Peter Christen 2014-04-06 01:20:03 +02:00
  • 6b0e62ec59 Emergency bugfix for killYACY.sh, as the file yacy00.log does not exist when a too-many-open-files error occurs; in that case only the file yacy00.log.lck exists. In the long term a different solution should be found. Michael Peter Christen 2014-04-06 01:00:09 +02:00
  • ee92d748b5 test using compound file format, see UseCompoundFile in https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig This appears to be necessary as many times a java.io.FileNotFoundException: (Too many open files) appears. See also: https://issues.apache.org/jira/browse/SOLR-4 and desperate users at http://stackoverflow.com/questions/3828343/too-many-open-file-exception-while-indexin-using-solr We cannot force users to do a "ulimit -n 1000000", so this action seems to be required. Michael Peter Christen 2014-04-06 00:35:35 +02:00
  • d2055f3d4b next development version 1.71 It's nowhere explained or declared, but for some time we have followed the scheme that odd version numbers are used for development versions and even numbers for release versions. That concept may change sometime, but it is used at this time to distinguish development from main. Michael Peter Christen 2014-04-06 00:32:10 +02:00
  • d1b5180dd9 upd version in pom reger 2014-04-06 00:20:12 +02:00
  • d051d2d85f release 1.7 Release_1.7 Michael Peter Christen 2014-04-04 17:05:03 +02:00
  • 0a95fd27f3 update of seed list Michael Peter Christen 2014-04-04 17:04:49 +02:00
  • 6e84770fd9 Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1 Michael Peter Christen 2014-04-04 16:42:50 +02:00
  • f509cd4aab Update russian translation malykhin.dmitry 2014-04-04 18:28:57 +04:00
  • f296a529d5 update to german locale Michael Peter Christen 2014-04-04 15:45:08 +02:00
  • 734778c0c8 fixed a time-out problem in the default servlet which is also a logging problem because the error log showed the wrong reason (file not found) instead of the actual reason (time-out). Michael Peter Christen 2014-04-04 15:27:29 +02:00
  • 466d90ad42 fixed a problem with resource observer; probably coming from uncaught exceptions within the apache library which appear only in concurrent environments. Michael Peter Christen 2014-04-04 15:26:39 +02:00
  • c8d4a63604 eliminating the word 'Facet' from the interface because it is ugly. If people do not know what search navigation is, then they also do not know what a 'facet' is. Michael Peter Christen 2014-04-04 15:25:37 +02:00
  • e8ddd415a8 enhanced the new link structure graph Michael Peter Christen 2014-04-04 14:43:54 +02:00
  • 926d28dd3f fixed a bug which prevented crawl starts after a network switch Michael Peter Christen 2014-04-04 14:43:35 +02:00
  • 8443255e18 better link structure limit calibration Michael Peter Christen 2014-04-04 12:48:55 +02:00
  • 7f5733638b fix for linkstructure computation: now also detecting dead links Michael Peter Christen 2014-04-04 12:47:29 +02:00
  • 3ce8eff21b another fix for inbound/outbound detection Michael Peter Christen 2014-04-04 12:41:59 +02:00
  • d4b5c457e4 NPE fix Michael Peter Christen 2014-04-04 12:34:34 +02:00
  • 36a66b0704 fix for parsing of numeric value in case that boolean values are given Michael Peter Christen 2014-04-04 11:59:51 +02:00
  • 41730c8048 better logging in template engine: shows filename of servlets where errors in templates occur orbiter 2014-04-04 10:55:46 +02:00
  • 3c1274057d fixed thread dump in case of wrong seeds orbiter 2014-04-04 10:54:56 +02:00
  • 18f9c40302 moved Edge class out of linkstructure servlet as this does not work on non-eclipse driven environments (all non-dev cases) orbiter 2014-04-04 10:54:11 +02:00
  • de95e5e524 reduced search activity corona strength in network image orbiter 2014-04-04 10:08:44 +02:00
  • da413af664 move baseurl after parsing orig source in urlproxyservlet to calculate absolute href links for rewrite from unmodified source. reger 2014-04-04 03:11:16 +02:00
  • af6ad20728 fix: remove obsolete ref to yacy.home (use Switchboard instead) reger 2014-04-04 02:45:04 +02:00
  • a6bb9be97e - added d3.js for visualizations using embedded svg - added a servlet api/linkstructure.json which generates link graph information in json - added a javascript link graph renderer hypertree.js using d3 and the new servlet linkstructure.json - embedded the new link graph in the crawler monitor and the host browser Michael Peter Christen 2014-04-03 14:51:19 +02:00
  • 74ab094587 fix for solr query size; too many documents had been retrieved in case that less than _pagesize_ had been requested. Michael Peter Christen 2014-04-03 13:42:10 +02:00
  • c64c10ef00 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-03 01:58:06 +02:00
  • 48fbfa60c1 bugfix to inbound/outbound identification Michael Peter Christen 2014-04-03 01:21:43 +02:00
  • 227c42bc96 eliminate obsolete URIMetaDataRow class by joining it with/into URIMetaDataNode. reger 2014-04-03 00:35:15 +02:00
  • cca851a417 introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document. Michael Peter Christen 2014-04-02 23:37:01 +02:00
  • d321b0314e added missing servlet html Michael Peter Christen 2014-04-02 17:37:21 +02:00
  • b1ba764d81 fix for first start options and added german translation for popup texts orbiter 2014-04-02 17:10:59 +02:00
  • 043d274af5 fixed crawl start path for cloned crawls orbiter 2014-04-02 16:06:29 +02:00
  • 429a874222 - added COLS field in GSA response (non-gsa standard by customer request) - updated document link in GSA response writer orbiter 2014-04-02 16:05:44 +02:00
  • 1b9ec9a1c5 - added popover to p2p/stealth mode button to explain the peer mode and privacy issues. - added popover to first-time use case to explain that specific servlets are only visible after customization and/or crawl starts Michael Peter Christen 2014-04-02 13:33:43 +02:00
  • 8d35fcb1c7 transition.js is also included in bootstrap.js Michael Peter Christen 2014-04-02 12:19:26 +02:00
  • 3abc3c4c4c removed alert.js, modal.js and tooltip.js as these libraries are all included in bootstrap.js Michael Peter Christen 2014-04-02 12:18:33 +02:00
  • 898f78258e fix for naming bug Michael Peter Christen 2014-04-02 04:06:35 +02:00
  • 62a36fa584 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-02 03:27:08 +02:00
  • c9f92abddc fix: application link count (URIMetadataNode) reger 2014-04-02 03:21:51 +02:00
  • a267c46e1a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-02 02:35:58 +02:00
  • 5b83887da8 npe fix Michael Peter Christen 2014-04-02 02:34:55 +02:00
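
The balancing scheme described in commits da86f150ab and 2f63bd0261 (one queue per host, split by crawl depth, lowest depth served first, hosts visited in fair round-robin order) can be illustrated with a small in-memory sketch. This is not YaCy's implementation, which is written in Java and backed by files in the QUEUES folder. The class name `HostBalancerSketch`, the `host:port` key format, and the default port table are assumptions made only for this illustration.

```python
from collections import OrderedDict, deque
from urllib.parse import urlsplit


class HostBalancerSketch:
    """Hypothetical in-memory model of a per-host, per-depth crawl queue."""

    # Assumed default ports, used only to build a host key for this sketch.
    DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

    def __init__(self):
        # host key -> {crawl depth -> deque of urls}.
        # The OrderedDict order doubles as the round-robin order of hosts.
        self._hosts = OrderedDict()

    @classmethod
    def host_key(cls, url):
        # Host name plus port, so http, https and ftp services on the
        # same host end up in different queues (key format is an assumption).
        parts = urlsplit(url)
        port = parts.port or cls.DEFAULT_PORTS.get(parts.scheme, 0)
        return f"{parts.hostname}:{port}"

    def push(self, url, depth):
        queues = self._hosts.setdefault(self.host_key(url), {})
        queues.setdefault(depth, deque()).append(url)

    def pop(self):
        """Return (url, depth) from the next host in turn, or None if empty."""
        if not self._hosts:
            return None
        host, queues = next(iter(self._hosts.items()))
        depth = min(queues)  # lowest crawl depth of this host first
        url = queues[depth].popleft()
        if not queues[depth]:
            del queues[depth]
        if queues:
            self._hosts.move_to_end(host)  # host waits until all others had a turn
        else:
            del self._hosts[host]
        return url, depth
```

In this sketch the minimal-depth rule is applied within each host and the round-robin rule across hosts; how the real balancer weighs the two rules against each other, and how it handles wait times, is not covered here.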