Commit Graph

  • 172aefaeeb adjust YaCySecurityHandler to Jetty 9 conventions - mainly adjust prepareConstraintInfo to use the RoleInfo.setChecked as in Jetty Source distribution - use constraint check behavior as in ConstraintSecurityHandler see http://git.eclipse.org/c/jetty/org.eclipse.jetty.project.git/tree/jetty-security/src/main/java/org/eclipse/jetty/security/ConstraintSecurityHandler.java?id=jetty-9.0.5.v20130813 reger 2013-10-03 19:38:03 +02:00
  • ba3c173077 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git orbiter 2013-10-03 18:19:02 +02:00
  • 6f9ed439d3 - expand localHostName check of AbstractRemoteHandler to prevent the request from being handled as a proxy request - make domain handler not rely on included path in resolved .yacy address reger 2013-10-01 03:04:32 +02:00
  • 561ea135af fix : forgot adding security handler reger 2013-09-30 04:35:17 +02:00
  • f46771bdf5 upd build script from rc1/master reger 2013-09-30 03:47:55 +02:00
  • c7c706fd9f merge with rc1/master reger 2013-09-30 03:46:39 +02:00
  • 272b196d05 update Jetty server init() to activate yacy-domain and transparent proxy handler - adding domain & proxy handler to a context (as it was in initial design) (context required for dispatcher) - make handler context and servlet context parallel available (to allow use of YaCyDefaultServlet to handle legacyServlets) - set transparent proxy request handled after dispatch.forward to skip further handling for .yacy domain requests reger 2013-09-30 03:12:52 +02:00
  • fd119deb00 fix NPE on modified since check ( Response.requestHeader allowed to be null) reger 2013-09-30 02:50:53 +02:00
  • 66145a0410 - add welcome file (index.html) support to YaCyDefaultServlet - change SolrServlet default search field (&df) to text_t reger 2013-09-29 03:34:00 +02:00
  • a3b5d84c81 Merge remote-tracking branch 'origin/master' orbiter 2013-09-28 15:46:59 +02:00
  • adfae074cf added classpath for debugging orbiter 2013-09-28 15:45:33 +02:00
  • b28d43decc added two more fields source_cr_host_norm_i,target_cr_host_norm_i in webgraph and an addition to postprocessing to copy all cr ranking attributes to the link edges associated to the postprocessing documents Michael Peter Christen 2013-09-27 16:57:05 +02:00
  • a52f3a597e fix for canonical-from-http-header feature Michael Peter Christen 2013-09-27 15:09:04 +02:00
  • 2dd7c5be44 added parsing of http-canonical tags (untested, could not find an example page) Michael Peter Christen 2013-09-27 13:17:50 +02:00
  • 4476dea5ba do not fail if a wrong boost key is used; instead, print only a warning See also: http://bugs.yacy.net/view.php?id=293 Michael Peter Christen 2013-09-27 12:28:09 +02:00
  • ab9583d429 add default field (&df) to SolrServlet query if missing reger 2013-09-26 22:20:35 +02:00
  • 3bf0104199 fix for crawl domain counter limitation (limit was reached too early) Michael Peter Christen 2013-09-26 13:41:52 +02:00
  • 82bfd9e00a - crawl profiles shall be deleted from active and passive stacks if they are deleted to terminate the crawl because otherwise the crawl will go on after the load-from-passive stack policy. - better check if a crawl is terminated using the loader queue. Michael Peter Christen 2013-09-26 10:22:31 +02:00
  • 1b3d26dd23 hack to remove most of the warning: deprecated messages (but not all, one is left) Michael Peter Christen 2013-09-25 21:14:52 +02:00
  • a496313248 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-09-25 20:41:02 +02:00
  • 3c48fc65fd reverted RemoteInstance to deprecated methods of httpClient-4.2 this should work with current remote-Solr-Instances sixcooler 2013-09-25 18:45:16 +02:00
  • 91a875dff5 self-healing of mistakenly deactivated crawl profiles. This fixes a bug which can happen in rare cases when a crawl start and a cleanup process happen at the same time. Michael Peter Christen 2013-09-25 18:27:54 +02:00
  • 095053a9b4 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-09-25 17:32:52 +02:00
  • 0cae420d8e some dns-timing changes: since httpclient uses the domain-cache it is useful not to clean the domain cache until crawling is running (domains are filled into this cache) On huge crawl-starts (eg. from file) my DNS did not follow the high rates - so I reduced the rate and give some more time(-out) sixcooler 2013-09-25 15:01:28 +02:00
  • 15b1bb2513 bump to httpClient-4.3 sixcooler 2013-09-25 14:48:37 +02:00
  • 4f83d5f18c added the new field harvestkey_s to the collection index and the webgraph index which is temporarily filled with the crawl profile key. This is used to select a set of documents for post-processing as soon as a crawl is finished. Now the postprocessing for a specific crawl is started when that specific crawl is finished and not at the end of all post-processing steps. Michael Peter Christen 2013-09-25 14:38:24 +02:00
  • 14442efa6d when profiles are cleaned, there shall be first a callback showing which profiles are cleaned. This shall enable a profile-termination-driven postprocessing. To do this, index writings must carry the profile key which will be implemented in another (next) step. orbiter 2013-09-25 11:04:12 +02:00
  • 0013d0d0bb removed superfluous class orbiter 2013-09-24 21:18:37 +02:00
  • f90d5296cb Added new data structure to be used by the balancer (not used yet). These data structures will enable the balancer to store the crawl queue into individual queues, one each for a single host. orbiter 2013-09-24 21:08:40 +02:00
  • 0e8d752462 refactoring orbiter 2013-09-24 19:55:59 +02:00
  • 8ac2e8c8c9 added location navigator which causes that the image to the map search is visible whenever a location is available in the search result. To activate this, the search.navigation property in yacy.conf must be modified to the new default values. orbiter 2013-09-24 11:26:51 +02:00
  • d86d2be5c3 automatically removed Places autotagging if no location library is wanted orbiter 2013-09-24 11:23:45 +02:00
  • 214a087cdf Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git orbiter 2013-09-23 20:59:03 +02:00
  • 96ed0c980e - added hosthash to all documents (also fail documents which is needed there for deletion), this fixes a problem for the deletion of old documents for new crawl starts - added clickdepth and citation computation for fail documents Michael Peter Christen 2013-09-23 18:09:42 +02:00
  • 179ad281f9 close include byte buffer after usage Michael Peter Christen 2013-09-23 12:19:51 +02:00
  • 52dd491c04 fix not necessary use of DigestURL reger 2013-09-23 03:05:09 +02:00
  • 6b9a624808 remove double declaration of TLD_any_zone_filter reger 2013-09-23 03:01:08 +02:00
  • 5111841e5b - reduce Jetty debug logging - fix Context path initialization reger 2013-09-23 01:30:45 +02:00
  • bc6ebb3c06 adjust to DigestURI changes from master to DigestURL reger 2013-09-22 20:57:50 +02:00
  • 561cbc7ee2 use more YaCy HeaderFramework constants (instead of Jetty's) reger 2013-09-22 04:23:42 +02:00
  • 5c4ba9b5db merge rc1 master reger 2013-09-22 02:21:24 +02:00
  • 70c51775ae Merge remote-tracking branch 'origin/master' into jetty reger 2013-09-22 02:09:02 +02:00
  • 4b77733e59 implement a YaCyDefaultServlet to handle YaCy-servlets within Jetty server - the implementation is inspired by Jetty's DefaultServlet - handles static html content and YaCy servlets - translates between standard servlet request/response and YaCy request/response specification With the implementation of YaCy-servlets as servlet instead via a jetty handler it's closer to servlet standard and carries less jetty specific dependencies. reger 2013-09-22 01:57:32 +02:00
  • d2effd21db fix for npe during location search orbiter 2013-09-21 21:03:58 +02:00
  • 828603e4f1 fix for 100%CPU problem in error cache cleaning process orbiter 2013-09-21 10:20:13 +02:00
  • c64b51134e hack to add all tokens from the url to text_t. This was working for the RWI index (and still is working) but not for solr-only search indexes. Maybe we should find a solution using a separate search field instead. orbiter 2013-09-21 08:57:43 +02:00
  • 6e8377b8ad do not check all words with synonym library if the library is empty orbiter 2013-09-21 08:56:24 +02:00
  • 70ba74b23a disabled ipv4 preference to enable ipv6-only networks like freifunk orbiter 2013-09-20 16:52:37 +02:00
  • f3be1930cb CPU problem when pushing to the error cache; wrong class, ConcurrentHashMap needed for concurrency orbiter 2013-09-20 16:51:50 +02:00
  • e40671ddb7 better and consistent deletions for error urls Michael Peter Christen 2013-09-17 15:52:57 +02:00
  • 2602be8d1e - removed ZURL data structure; removed also the ZURL data file - replaced load failure logging by information which is stored in Solr - fixed a bug with crawling of feeds: added must-match pattern application to feed urls to filter out such urls which shall not be in a wanted domain - delegatedURLs, which also used ZURLs are now temporary objects in memory Michael Peter Christen 2013-09-17 15:27:02 +02:00
  • 31920385f7 set anchor rel attribute of all links to "nofollow" if the html meta contains a robots:nofollow or if the http header contains a "X-Robots-Tag: nofollow" Michael Peter Christen 2013-09-16 16:14:56 +02:00
  • 9619b8743c add Solr Servlet reger 2013-09-16 03:01:18 +02:00
  • 57e00baf26 fix for parsing of image links inside of anchor links (image-links) Michael Peter Christen 2013-09-15 23:54:46 +02:00
  • 61c5e40687 - replaced the properties object in AnchorURL with distinct variables for anchor attributes. - as a result, large portions of the parser code had to be adapted as well - added a counter target_order_i for anchor links in webgraph computation Michael Peter Christen 2013-09-15 23:27:04 +02:00
  • 3ea9bb4427 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-09-15 00:30:41 +02:00
  • 5e31bad711 - the webgraph shall store all links which appear on a web page and not all unique links! This made it necessary to adapt a large portion of the parser and link processing classes to carry a different type of link collection, which carries a property attribute attached to web anchors. - introduction of a new URL class, AnchorURL - the other url classes, DigestURI and MultiProtocolURI, have been renamed and refactored to fit into a new document package schema, document.id - cleanup of net.yacy.cora.document package and refactoring Michael Peter Christen 2013-09-15 00:30:23 +02:00
  • 13fc86c960 Merge remote-tracking branch 'origin/master' into jetty reger 2013-09-14 21:10:24 +02:00
  • 850609937f update Info.plist for Jetty 9 jars reger 2013-09-14 20:56:46 +02:00
  • f7f86d8a5d update to Jetty 9 jars - include javax.servlet 3.0 reger 2013-09-14 20:49:05 +02:00
  • 603368fc3e remove redundant declaration of USER_AGENT reger 2013-09-14 18:29:44 +02:00
  • bd71b14d25 add mandatory p2p parameter to templatePattern reger 2013-09-12 22:49:09 +02:00
  • b8da176c5d adjust setHandled to request of call parameter reger 2013-09-12 22:04:10 +02:00
  • 127adbf5cf remove references to 10_http thread (legacy http server) and add needed get/set function to jetty http server wrapper reger 2013-09-12 22:02:11 +02:00
  • 1a8c64117f decreased the responseHeaderDB database which is now flushed more frequently. This will preserve more documents in the cache in case of a crash. Michael Peter Christen 2013-09-11 13:03:58 +02:00
  • 3e22d05290 added option for daterange properties in GSA interface to use a left- or right-open date range; i.e. using daterange=..2013-09-09 or daterange=2013-09-02.. in addition to daterange=2013-09-02..2013-09-09 Michael Peter Christen 2013-09-11 12:52:18 +02:00
  • 36b7159282 - remove double initialization of jetty - refactor some var assignments reger 2013-09-11 02:24:47 +02:00
  • 8e52271491 - delete not needed old jetty jars from libt - add jetty to Info.plist reger 2013-09-10 20:55:03 +02:00
  • 63ed04260a Merge remote-tracking branch 'origin/master' into jetty reger 2013-09-10 20:42:38 +02:00
  • fe87fb638a adjust test/ParserTest to dc_description data type reger 2013-09-10 20:05:10 +02:00
  • 35ab2cef7b added parsing of 'date', 'dc:date', 'dc.date' and 'last-modified' in html meta fields to get a correct (or: better) date timestamp. The http:last-modified mostly does not work because it is set to the current date from most CMS. Michael Peter Christen 2013-09-10 10:31:57 +02:00
  • 2ee68f76f6 added read parameter from multi-part form fields (to nasty quick-fix) reger 2013-09-10 01:42:08 +02:00
  • 9cc8468b30 added tools to visualize image generation (i.e. during testing) Michael Peter Christen 2013-09-09 12:58:26 +02:00
  • 105cf8f593 changes to adjust jetty to recent code changes reger 2013-09-09 02:37:29 +02:00
  • aafef72a8a merged current rc1/master into jetty branch to allow further development with latest version ServerSideIncludes and servlet return values need further work (for working jetty integration) - TODO: added nasty quickfix to allow SSI - needs further work - TODO: YaCy servlet return values/parameters are not handled reger 2013-09-09 02:36:06 +02:00
  • dbef8ccfcb forced deletion of ZURL entries for a specific host for each host that appears in the crawl url list Michael Peter Christen 2013-09-05 13:22:16 +02:00
  • e137ff4171 refactoring (in preparation for new removeHost method) Michael Peter Christen 2013-09-05 09:59:41 +02:00
  • 7a5574cd51 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-09-04 23:12:04 +02:00
  • 85456f46b2 added two new fields, exact_signature_copycount_i and fuzzy_signature_copycount_i, which count the number of copies of non-unique documents and assign this count to each document. Thus, each document is assigned a number which shows how many copies of this document exist. These fields are disabled by default. Michael Peter Christen 2013-09-04 23:11:53 +02:00
  • 26366596d9 fix for a problem which occurs when a site is crawled where the start url is redirected. orbiter 2013-09-04 16:00:47 +02:00
  • a2511b5600 turned images_alt_txt back to images_alt_sxt because it is not necessary to index the alt text. Indexed image Text is in images_text_t Michael Peter Christen 2013-09-04 10:47:18 +02:00
  • 85b1922244 activated image type navigation for image search Michael Peter Christen 2013-09-03 13:34:01 +02:00
  • 9e12fdff23 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-09-03 12:22:57 +02:00
  • ab1201fdfd fixed wrong facet count Michael Peter Christen 2013-09-03 12:22:29 +02:00
  • 049c3b3f2e added an option to exclude image search results from text search. This is on by default. Michael Peter Christen 2013-09-03 11:14:23 +02:00
  • 69f85265e1 added an option to put image links to the crawl queue and handle these like normal documents. Using this option (by default on at this moment; this might change soon) it is possible to get the exif data into the search index to be used in image search. Michael Peter Christen 2013-09-03 11:13:45 +02:00
  • e8e558a9b7 fix for content domain classification in URIMetadataNode Michael Peter Christen 2013-09-03 10:49:09 +02:00
  • a8c5bfcf58 avoid to create unnecessary objects Michael Peter Christen 2013-09-03 09:48:05 +02:00
  • 5a0de1b77d moving image description text to image text field Michael Peter Christen 2013-09-03 09:47:27 +02:00
  • dc179bd61f fix for catchall query goal for image search Michael Peter Christen 2013-09-03 07:55:21 +02:00
  • 5d71a4c8bc fix for dc:description field Michael Peter Christen 2013-09-03 07:54:49 +02:00
  • 392174de8c remove all_words, all_strings lists from QueryGoal - only used for text highlighting in parser text (ViewFile.html) which can be done with include_strings only reger 2013-09-02 23:09:43 +02:00
  • 169ef8963d one more fix for image search Michael Peter Christen 2013-09-02 20:02:26 +02:00
  • cb85b22725 redesign of the image search process (with much better results, unfortunately the index schema has changed and p2p image search will not be much better until many people update) Michael Peter Christen 2013-09-02 18:55:38 +02:00
  • 6184fd9d9a fix for solr/gsa result logging Michael Peter Christen 2013-09-02 08:05:42 +02:00
  • 29967102a2 optimized QueryGoal (reducing mem and computation by removing all_hashes) - all_hashes used for text highlighting and word distance computation which can be done with include_hashes only reger 2013-09-02 04:19:53 +02:00
  • f106345eef link strings should not be tokenized orbiter 2013-09-01 14:35:36 +02:00
  • deadeb406e image alt tag strings should be tokenized orbiter 2013-09-01 13:48:10 +02:00
  • 5b14bdfffd npe fix orbiter 2013-09-01 13:28:37 +02:00
  • 3e5f8e29e2 next development release step to reflect the extension of the solr api with javabin format capability orbiter 2013-09-01 13:12:36 +02:00