Commit Graph

  • b4b0d14c04 fix for display bug Michael Peter Christen 2014-04-16 22:24:04 +02:00
  • 9a5ab4e2c1 removed clickdepth_i field and related postprocessing. This information is now available in the crawldepth_i field which is identical to clickdepth_i because of a specific crawler strategy. Michael Peter Christen 2014-04-16 22:16:20 +02:00
  • da86f150ab - added a new Crawler Balancer: HostBalancer and HostQueues: This organizes all urls to be loaded in separate queues for each host. Each host separates the crawl depth into its own queue. The primary rule for urls taken from any queue is that the crawl depth is minimal. This produces a crawl depth which is identical to the clickdepth. Furthermore, the crawl is able to create a much better balancing over all hosts which is fair to all hosts that are in the queue. This process will create a very large number of files for wide crawls in the QUEUES folder: for each host a directory, for each crawl depth a file inside the directory. A crawl with maxdepth = 4 will be able to create tens of thousands of files. To be able to use that many file readers, it was necessary to implement a new index data structure which opens the file only if an access is wanted (OnDemandOpenFileIndex). The usage of such an on-demand file reader shall prevent the number of file pointers from exceeding the system limit, which is usually about 10,000 open files. Some parts of YaCy had to be adapted to handle the crawl depth number correctly. The logging and the IndexCreateQueues servlet had to be adapted to show the crawl queues differently, because the host name is attached to the port on the host to differentiate between http, https, and ftp services. Michael Peter Christen 2014-04-16 21:34:28 +02:00
  • 075b6f9278 refactoring of the crawl balancer: the balancer is turned into an interface and the old balancer class is moved into LegacyBalancer to make room for a fresh implementation of a crawl balancer. Michael Peter Christen 2014-04-14 13:32:35 +02:00
  • 8470dfe3f8 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-14 12:17:52 +02:00
  • 46016fa153 autoupdate fails to download latest release (1.71) due to default release blacklist - removed the default version blacklist regex from init (for future versions) reger 2014-04-13 07:32:32 +02:00
  • 8aeef73d49 fix for virtual root nodes Michael Peter Christen 2014-04-11 15:12:34 +02:00
  • 7c7fbb9818 find depth-matches also for edge targets Michael Peter Christen 2014-04-11 12:27:21 +02:00
  • dd12dd392f introduction of a data structure for HyperlinkEdges which should use less memory as it does no double-storage of source links for each edge of the graph. Michael Peter Christen 2014-04-11 12:09:33 +02:00
  • 6ea8bb7348 using MultiProtocolURL for edge data which is faster (hash computation is now much easier) and smaller in size Michael Peter Christen 2014-04-11 10:58:37 +02:00
  • b21c208b4d enhanced hashcode computation for MultiProtocolURL Michael Peter Christen 2014-04-11 10:23:48 +02:00
  • ce1d1b2fa0 fix for maximum tag length in parser Michael Peter Christen 2014-04-11 09:56:44 +02:00
  • 17e0956312 refactoring of SystemLoad calls (only one backend tool) Michael Peter Christen 2014-04-11 09:25:18 +02:00
  • a37d067692 refactoring Michael Peter Christen 2014-04-10 23:46:35 +02:00
  • 95780eed32 Merge branch 'master' of git@gitorious.org:yacy/rc1.git orbiter 2014-04-10 21:40:54 +02:00
  • 67beef657f strong redesign of html parser: object recursion is now made using a stack on html tag objects, not using a recursive parse-again method which may cause bad performance and huge memory allocation. The new method also produced better parsed image objects with exact anchor text references. Michael Peter Christen 2014-04-10 18:58:03 +02:00
  • 6bd8c6f195 fix for wrong status codes of error pages Michael Peter Christen 2014-04-10 09:08:59 +02:00
  • 9e503b3376 also delete the robots.txt file from the cache when a new crawl is started Michael Peter Christen 2014-04-09 21:59:54 +02:00
  • 67501c9dda Merge branch 'master' of git@gitorious.org:yacy/rc1.git orbiter 2014-04-09 19:58:54 +02:00
  • 1c21b3256d fix for robots.txt handling: delete old entry before starting a new crawl. Michael Peter Christen 2014-04-09 18:33:48 +02:00
  • c250fac9f4 linkstructure refactoring to get more options for clickdepth analysis orbiter 2014-04-09 17:52:51 +02:00
  • 8068e68474 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-09 12:45:15 +02:00
  • bd886054cb new structure and enhancements for link graph computation: - added order option to solr queries to be able to retrieve document lists in specific order, here: link length - added HyperlinkEdge class which manages the link structure - integrated the HyperlinkEdge class into clickdepth computation - extended the linkstructure.json servlet to show also the clickdepth and other statistic information Michael Peter Christen 2014-04-09 12:45:04 +02:00
  • f326a67561 fix: typo in default charset in metadata2solr; update pom and NB build to Solr 4.7.1 libs reger 2014-04-06 22:31:22 +02:00
  • df138084c0 do solr optimization independently from memory and load constraints: - not doing an optimization will likely cause a too many files exception - without optimization performance will be even worse which would prevent optimization in the future as well (prevent a deadlock situation) Michael Peter Christen 2014-04-06 11:04:23 +02:00
  • ebd44a7080 replaced solr 4.6.1 with solr 4.7.1 and added index migration to lucene_47 Michael Peter Christen 2014-04-06 10:45:03 +02:00
  • 0f3fbae438 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-06 08:56:31 +02:00
  • 1a6e0354db update commons-compress.jar to 1.8 reger 2014-04-06 03:59:11 +02:00
  • 68417a05c5 different algorithm to test checkalive as it depends less on the existence of wget (or curl) on the OS. Michael Peter Christen 2014-04-06 01:20:03 +02:00
  • 6b0e62ec59 Emergency bugfix for killYACY.sh as the file yacy00.log does not exist in case a too many open files error exists. In such a case, only the file yacy00.log.lck exists. In the long term a different solution should be found. Michael Peter Christen 2014-04-06 01:00:09 +02:00
  • ee92d748b5 test using compound file format, see UseCompoundFile in https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig This appears to be necessary as many times a java.io.FileNotFoundException: (Too many open files) appears. See also: https://issues.apache.org/jira/browse/SOLR-4 and desperate users at http://stackoverflow.com/questions/3828343/too-many-open-file-exception-while-indexin-using-solr We cannot force users to do a "ulimit -n 1000000", so this action seems to be required. Michael Peter Christen 2014-04-06 00:35:35 +02:00
  • d2055f3d4b next development version 1.71 It's nowhere explained or declared, but for some time we have followed the scheme that odd version numbers are used for development versions and even numbers for release versions. That concept may change sometime but this is used at this time to distinguish development from main. Michael Peter Christen 2014-04-06 00:32:10 +02:00
  • d1b5180dd9 upd version in pom reger 2014-04-06 00:20:12 +02:00
  • d051d2d85f release 1.7 Release_1.7 Michael Peter Christen 2014-04-04 17:05:03 +02:00
  • 0a95fd27f3 update of seed list Michael Peter Christen 2014-04-04 17:04:49 +02:00
  • 6e84770fd9 Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1 Michael Peter Christen 2014-04-04 16:42:50 +02:00
  • f509cd4aab Update russian translation malykhin.dmitry 2014-04-04 18:28:57 +04:00
  • f296a529d5 update to german locale Michael Peter Christen 2014-04-04 15:45:08 +02:00
  • 734778c0c8 fixed a time-out problem in the default servlet which is also a logging problem because the error log showed the wrong reason (file not found) instead of the actual reason (time-out). Michael Peter Christen 2014-04-04 15:27:29 +02:00
  • 466d90ad42 fixed a problem with resource observer; probably coming from uncaught exceptions within the apache library which appear only in concurrency environments. Michael Peter Christen 2014-04-04 15:26:39 +02:00
  • c8d4a63604 eliminating the word 'Facet' from the interface because it is ugly. If people do not know what search navigation is, then they also do not know what a 'facet' is. Michael Peter Christen 2014-04-04 15:25:37 +02:00
  • e8ddd415a8 enhanced the new link structure graph Michael Peter Christen 2014-04-04 14:43:54 +02:00
  • 926d28dd3f fixed a bug which prevented crawl starts after a network switch Michael Peter Christen 2014-04-04 14:43:35 +02:00
  • 8443255e18 better link structure limit calibration Michael Peter Christen 2014-04-04 12:48:55 +02:00
  • 7f5733638b fix for linkstructure computation: now also detecting dead links Michael Peter Christen 2014-04-04 12:47:29 +02:00
  • 3ce8eff21b another fix for inbound/outbound detection Michael Peter Christen 2014-04-04 12:41:59 +02:00
  • d4b5c457e4 NPE fix Michael Peter Christen 2014-04-04 12:34:34 +02:00
  • 36a66b0704 fix for parsing of numeric value in case that boolean values are given Michael Peter Christen 2014-04-04 11:59:51 +02:00
  • 41730c8048 better logging in template engine: shows filename of servlets where errors in templates occur orbiter 2014-04-04 10:55:46 +02:00
  • 3c1274057d fixed thread dump in case of wrong seeds orbiter 2014-04-04 10:54:56 +02:00
  • 18f9c40302 moved Edge class out of linkstructure servlet as this does not work on non-eclipse driven environments (all non-dev cases) orbiter 2014-04-04 10:54:11 +02:00
  • de95e5e524 reduced search activity corona strength in network image orbiter 2014-04-04 10:08:44 +02:00
  • da413af664 move baseurl after parsing orig source in urlproxyservlet to calculate absolute href links for rewrite from unmodified source. reger 2014-04-04 03:11:16 +02:00
  • af6ad20728 fix: remove obsolete ref to yacy.home (use Switchboard instead) reger 2014-04-04 02:45:04 +02:00
  • a6bb9be97e - added d3.js for visualizations using embedded svg - added a servlet api/linkstructure.json which generates a link graph information in json - added a javascript link graph renderer hypertree.js using d3 and the new servlet linkstructure.json - embedded the new link graph in the crawler monitor and the host browser Michael Peter Christen 2014-04-03 14:51:19 +02:00
  • 74ab094587 fix for solr query size; too many documents had been retrieved in case that less than _pagesize_ had been requested. Michael Peter Christen 2014-04-03 13:42:10 +02:00
  • c64c10ef00 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-03 01:58:06 +02:00
  • 48fbfa60c1 bugfix to inbound/outbound identification Michael Peter Christen 2014-04-03 01:21:43 +02:00
  • 227c42bc96 eliminate obsolete URIMetaDataRow class by joining it with/into URIMetaDataNode. reger 2014-04-03 00:35:15 +02:00
  • cca851a417 introduced new solr field crawldepth_i which records the crawl depth of a document. This is the upper limit for the clickdepth_i value which may be shorter in case that the crawler did not take the shortest path to the document. Michael Peter Christen 2014-04-02 23:37:01 +02:00
  • d321b0314e added missing servlet html Michael Peter Christen 2014-04-02 17:37:21 +02:00
  • b1ba764d81 fix for first start options and added german translation for popup texts orbiter 2014-04-02 17:10:59 +02:00
  • 043d274af5 fixed crawl start path for cloned crawls orbiter 2014-04-02 16:06:29 +02:00
  • 429a874222 - added COLS field in GSA response (non-gsa standard by customer request) - updated document link in GSA response writer orbiter 2014-04-02 16:05:44 +02:00
  • 1b9ec9a1c5 - added popover to p2p/stealth mode button to explain the peer mode and privacy issues. - added popover to first-time use case to explain that specific servlets are only visible after customization and/or crawl starts Michael Peter Christen 2014-04-02 13:33:43 +02:00
  • 8d35fcb1c7 transition.js is also included in bootstrap.js Michael Peter Christen 2014-04-02 12:19:26 +02:00
  • 3abc3c4c4c removed alert.js, modal.js and tooltip.js as these libraries are all included in bootstrap.js Michael Peter Christen 2014-04-02 12:18:33 +02:00
  • 898f78258e fix for naming bug Michael Peter Christen 2014-04-02 04:06:35 +02:00
  • 62a36fa584 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-02 03:27:08 +02:00
  • c9f92abddc fix: application link count (URIMetadataNode) reger 2014-04-02 03:21:51 +02:00
  • a267c46e1a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-02 02:35:58 +02:00
  • 5b83887da8 npe fix Michael Peter Christen 2014-04-02 02:34:55 +02:00
  • 63c9fcf3e0 free configuration of postprocessing clickdepth maximum depth and time Michael Peter Christen 2014-04-02 02:34:39 +02:00
  • 39b641d6cd added tutorial mode - some menu items will only appear if you 'qualify' for them. Thus, the first-time user will only see four menu items. The other items will unfold as the user interacts. Michael Peter Christen 2014-04-02 02:33:17 +02:00
  • f06775850f fix receiving DHT / parse multipart + another close to fix possible resource leak warning sixcooler 2014-04-02 01:24:15 +02:00
  • 7a49f72480 fix for crawler column width Michael Peter Christen 2014-04-02 01:16:34 +02:00
  • ac5340ba1d Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1 Michael Peter Christen 2014-04-01 21:37:30 +02:00
  • 344e7d4d1c Update russian translation malykhin.dmitry 2014-04-01 22:01:22 +04:00
  • 46a1a15441 added more bootstrap libraries Michael Peter Christen 2014-04-01 17:18:26 +02:00
  • 5ccbfeb803 show host list by default in host browser Michael Peter Christen 2014-04-01 16:55:22 +02:00
  • 8a77f318f4 fix for paths in locale files Michael Peter Christen 2014-04-01 15:55:20 +02:00
  • aac70fea2b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-04-01 13:20:03 +02:00
  • 49e76a1c55 make use of detected charset in htmlParser if none is given. reger 2014-04-01 04:02:34 +02:00
  • 71649bf22d add test case htmlParser.parse - getCharset (which fails) reger 2014-04-01 02:55:22 +02:00
  • ba0e3fb0dc fixed crawl start links after renaming them in latest commit Michael Peter Christen 2014-04-01 00:35:58 +02:00
  • d29b6db270 made crawl start pages public since they do not reveal individual information and they are also not used as servlet to actually start the crawl (which is Crawler_p.html). orbiter 2014-03-31 20:42:39 +02:00
  • e41db47cac added (again) underline to a tags Michael Peter Christen 2014-03-31 18:25:11 +02:00
  • ff82a80eb3 Integrated HostBrowser back to administration interface; it can appear with and without navigation bar. Michael Peter Christen 2014-03-31 18:19:24 +02:00
  • 94366ba2e5 added template for latest commit Michael Peter Christen 2014-03-31 16:00:13 +02:00
  • 701df02ead Complete redesign of administration top-level menu. This follows two principles: - provide an easy tutorial-like "what should I do first" menu - provide all elements which are subject to most first questions to YaCy exhibition people on top level: Resource limitation, Parser and Ranking settings I apologize to everyone who is used to the old style and needs to find the menu items (again) after this change. I hope that this will make the interface more usable for new users who see a web indexer/crawler the first time. Michael Peter Christen 2014-03-31 15:47:58 +02:00
  • a3b7366aee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-03-31 03:21:02 +02:00
  • 6b66bb7109 redesign of search page integration menu structure Michael Peter Christen 2014-03-31 03:18:38 +02:00
  • 92811d7850 fix: 3 more links pointing to old /xml path reger 2014-03-31 02:58:43 +02:00
  • c183d66d40 fix: blacklist xml export path to xml template reger 2014-03-31 02:48:28 +02:00
  • 656e2ce62a replacing direct html table cellspacing with css set-up for cellspacing Michael Peter Christen 2014-03-31 01:15:35 +02:00
  • e11504309f adding a hint to javascript browser short cut on Url-Proxy page (AugmentedBrowsing_p.html) reger 2014-03-30 05:11:42 +02:00
  • b12200cafe alternative UrlProxyServlet (for /proxy.html) using different url rewrite rules - use JSoup parser for selective rewrite of html body <a href= links only, instead of regex which rewrites also header href/src links - this improves display of pages which use header <base> tag - tags with src attribute are taken from original location (like css) improving display and are not routed through the indexer Disadvantage: scripting links will drop out of proxy reger 2014-03-30 04:04:02 +02:00
  • 7f29eee9ac fix: cut-off button in WatchWebStructure_p.html (by header css dd height/line-height) reger 2014-03-30 00:23:54 +01:00
  • 2953ebe701 fix: port in local target address & button style reger 2014-03-29 00:34:01 +01:00
  • fda591695c fixed visibility of custom icon Michael Peter Christen 2014-03-28 17:25:39 +01:00
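
The queue layout described in commit da86f150ab (one set of queues per host, split by crawl depth, always serving the smallest depth first) can be illustrated with a small in-memory sketch. This is not YaCy's HostBalancer or HostQueues code: the class name HostQueuesSketch, the methods push and pop, and the tie-breaking rotation between hosts are assumptions made for illustration, and the file-backed storage (including OnDemandOpenFileIndex) is left out entirely.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only; not the YaCy implementation. */
final class HostQueuesSketch {

    // host key (host name plus port, e.g. "example.org:80")
    //   -> crawl depth -> FIFO queue of urls at that depth.
    // Invariant: no empty TreeMap and no empty Deque is ever stored.
    private final Map<String, TreeMap<Integer, Deque<String>>> hosts = new LinkedHashMap<>();

    /** Enqueue a url for the given host at the given crawl depth. */
    void push(String hostKey, int depth, String url) {
        hosts.computeIfAbsent(hostKey, k -> new TreeMap<>())
             .computeIfAbsent(depth, d -> new ArrayDeque<>())
             .addLast(url);
    }

    /** Take the next url with minimal crawl depth, or null if all queues are empty. */
    String pop() {
        String bestHost = null;
        int bestDepth = Integer.MAX_VALUE;
        // Find the host whose smallest pending depth is minimal overall.
        // On ties the host that has waited longest (earliest in iteration order) wins.
        for (Map.Entry<String, TreeMap<Integer, Deque<String>>> e : hosts.entrySet()) {
            int depth = e.getValue().firstKey();
            if (depth < bestDepth) {
                bestDepth = depth;
                bestHost = e.getKey();
            }
        }
        if (bestHost == null) return null;

        // Remove the host and re-insert it at the end (assumed fairness rule),
        // so that hosts with equal minimal depth are served in rotation.
        TreeMap<Integer, Deque<String>> depths = hosts.remove(bestHost);
        Deque<String> queue = depths.get(bestDepth);
        String url = queue.pollFirst();
        if (queue.isEmpty()) depths.remove(bestDepth);
        if (!depths.isEmpty()) hosts.put(bestHost, depths);
        return url;
    }
}
```

Because a url at depth n+1 is only handed out once no url at depth n or less is pending, the order of loading is breadth-first, which is why the commit messages above treat the recorded crawl depth as equal to the click depth.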