Commit Graph

  • 0dfeee154a adjustments for Bookmark icon to act on BookmarkDB, it acts on YMarks but YMark interface seems not maintained, for future features (e.g. query memory) BookmarkDB is the likely choice to expand, besides the crawlstart bookmark also the result bookmark icon now adds to BookmarkDB. The YMark related code is (for now) left untouched so both tables are updated. reger 2015-01-01 02:41:20 +01:00
  • 513e9259f5 Merge branch 'master' of git@gitorious.org:yacy/rc1.git Michael Peter Christen 2014-12-30 02:36:17 +01:00
  • e177d69387 remove obsolete config footer option (ConfigPortal user.login) no footer or footer-option in use reger 2014-12-29 03:50:00 +01:00
  • 5d4167f977 reactivated clear stacks code for termination of all crawls because this did not work without that part of the code Michael Peter Christen 2014-12-28 15:52:43 +01:00
  • 3cd7deb3b8 do not flush non-errors to stdout because this is a concurrency issue. The flush-call appeared very often in thread dumps with high load, so this hopefully improves performance Michael Peter Christen 2014-12-28 15:48:37 +01:00
  • 4e3e2acc69 Merge branch 'master' of gitorious.org:yacy/rc1-fixed_percent-encoding Michael Peter Christen 2014-12-28 15:01:40 +01:00
  • ecb6a59e9e do not translate gif images into png images for thumbnails. Instead, stream the original to the search result thumb viewer. This has two reasons: - animated gifs cause 100% cpu and deadlocks in the jvm gif parser; a known bug which is obviously not yet fixed - animated gifs now appear in the search result also as animation Michael Peter Christen 2014-12-28 14:53:55 +01:00
  • d9603039ff automatically set the Q flag for smb/ftp start urls (split pdf support) Michael Peter Christen 2014-12-28 14:36:43 +01:00
  • 8600ea01dd automatically switch on query option in case intranet protocols (smb/ftp) are used. This supports the new split-pdf option. Michael Peter Christen 2014-12-28 14:27:42 +01:00
  • 3e9871291f Applied URL-decoding prior to HTML-encoding. This removes percent-encoding from text shown in HTML arucard21 2014-12-27 09:52:34 +01:00
  • 3144313974 Postprocessing progress bar fix (Make it work as [probably] actually intended) Ryszard Goń 2014-12-27 03:02:18 +01:00
  • 6a04563578 Init Jetty using setDefaultDescriptor (web.xml) to defaults/web.xml so web.xml in defaults dir is applied first and optional DATA/SETTINGS/web.xml loaded on top. By using this Jetty feature (default web.xml) we assure that changes to the default are applied to existing installations and individual addition/changes are still respected. reger 2014-12-27 00:10:14 +01:00
  • 51ec9c1f44 fix "null" title in response writer for documents with multivalued title reger 2014-12-26 18:23:26 +01:00
  • 73ba5d8ef7 adjust fieldtype and description of field httpstatus_redirect_s in CollectionSchema - the field is not used (delete candidate) reger 2014-12-26 18:21:35 +01:00
  • 1f9389396a fix NPE related 500 (Bad Request) response of UrlProxy on blacklisted urls, by adding parameter HTTPDeamon and removing unused hostAddress lookup code in sendRespondError reger 2014-12-25 02:21:45 +01:00
  • 7e4e9f7e32 improve yacysearchitem, prevent allocation of String (modifyURL) if feature not used reger 2014-12-25 02:16:19 +01:00
  • 61f75d6019 add xmpcore as direct dependency to pom (otherwise it's looked up at pdfbox archive path and not found there) reger 2014-12-25 02:13:44 +01:00
  • 8ef56eda90 Merge branch 'master' of git@gitorious.org:yacy/rc1.git Michael Peter Christen 2014-12-24 12:24:15 +01:00
  • 9fce8bf2a5 crawling of multi-page pdfs with artificial post part on smb or ftp shares is not possible with the disabled setting; this is now temporarily disabled until a better solution is at hand. Michael Peter Christen 2014-12-24 12:23:59 +01:00
  • 682dd94925 fix div by 0 in hello Caused by: java.lang.ArithmeticException: / by zero at hello.respond(hello.java:159) reger 2014-12-24 00:04:35 +01:00
  • 17808898c6 update to SLF4J 1.7.9 reger 2014-12-23 19:11:21 +01:00
  • f856edecb6 fix proxy redirect (http status 302) response fixes http://mantis.tokeek.de/view.php?id=517 reger 2014-12-23 02:01:03 +01:00
  • cc090bcb01 enhanced initialization of autotagging Michael Peter Christen 2014-12-23 00:37:51 +01:00
  • 003ec43bee Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-12-23 00:33:20 +01:00
  • bef689d0a2 NPE fix Michael Peter Christen 2014-12-23 00:30:34 +01:00
  • 1de33c6a53 add hint to Heuristics Config on "Greedy Learning Mode" in portal config, to point to an option to make this setting permanent. reger 2014-12-22 20:36:29 +01:00
  • 5332c9df21 update to commons-fileupload-1.3.1.jar (includes a security fix) reger 2014-12-22 20:34:13 +01:00
  • a0576ec737 fix for pdf sub-page result preparation Michael Peter Christen 2014-12-22 14:32:09 +01:00
  • 6ad43c4a8b removed debug code Michael Peter Christen 2014-12-22 14:24:09 +01:00
  • 407cfff010 fix to wkhtmltopdf usage Michael Peter Christen 2014-12-22 02:01:55 +01:00
  • 5d321d3dc5 fixes to wkhtmltopdf call Michael Peter Christen 2014-12-21 20:11:39 +01:00
  • eb78388a98 changed prefer strategy for http unique in such a way that http is preferred over https. While this is a bad idea from the standpoint of security, it is more commonly applicable for environments where http and https mix and for some domains https is not available. Then the double-check is possible even if no postprocessing is performed. Michael Peter Christen 2014-12-21 19:17:06 +01:00
  • 84e2cccab4 fix to prevent assertion error in ranking servlet if no vocabularies are present that could be evaluated Michael Peter Christen 2014-12-21 19:08:28 +01:00
  • 9e588944fa prevent NPE during initialization of very large vocabularies Michael Peter Christen 2014-12-21 19:02:36 +01:00
  • aaf7d4775a Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-12-21 18:10:25 +01:00
  • 8c3e5b7b6d added experimental pdf splitting which enables YaCy to split pdfs during parsing into individual pages and add them all using different URLs. These constructed urls are generated from the source url with an appended page=<pagenumber> attribute in the url get/post properties. This distinguishes the different page entries. The search result list will then replace the post parameter with a url anchor (# mark), so that the original url is presented in the search result. These URLs can be opened directly on the correct page using pdf.js, which is now built into Firefox. That means: if you find a search hit on page 5 and click on the search result, Firefox will open the pdf viewer and show page 5. Michael Peter Christen 2014-12-21 18:10:15 +01:00
  • 85773ebd4f removed debug lines Michael Peter Christen 2014-12-21 17:53:06 +01:00
  • d14114697c the miss cache does not seem to work, it sometimes contains urlhashes from documents which actually are inside the index. This can be reproduced using the crawl result table at http://localhost:8090/CrawlResults.html?process=5 The cache is temporarily disabled to remove the bad behaviour, however a later reactivation of that feature may be possible. Michael Peter Christen 2014-12-21 17:31:51 +01:00
  • deb75a1dbe fix refactored size() -> filesize() in YMarkMetadata reger 2014-12-21 14:02:06 +01:00
  • 198102304b refactor size() -> filesize() of URIMetadataNode (harmonize with ResultEntry and to not get confused with Collection.size()) reger 2014-12-21 06:05:35 +01:00
  • c6f634a4f2 remove redundant caching of urlhash in URIMetadataNode (is already cached in underlying DigestURL .url) reger 2014-12-21 03:45:54 +01:00
  • 445fafeb7c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-12-20 15:38:15 +01:00
  • 0d69089c61 fix for division by zero Michael Peter Christen 2014-12-20 15:11:06 +01:00
  • ac61a39828 use peeraddress for link in remote crawl list to make link work without enabled proxy reger 2014-12-20 01:59:00 +01:00
  • fe5d4e6c7b update to Jetty 9.2.6 reger 2014-12-19 21:54:17 +01:00
  • 5516819354 preventing the use of no-cache and expires in case that images are generated dynamically which will stay static in the future. This applies mainly to the search result favicon in front of search hits. These icons will now be generated once, but then cached in the browser. There is also a YaCy-internal cache for these icons which had prevented the re-generation of the icons in YaCy, but this cache is now superfluous since the browser should not call the servlet ViewImage again. Michael Peter Christen 2014-12-19 17:41:38 +01:00
  • d3e71ed070 fixes for searches when initialization of large autotagging libraries has not finished Michael Peter Christen 2014-12-19 17:38:58 +01:00
  • 28683530cd fixes to usage of no-cache: use and recognize also the no-store directive Michael Peter Christen 2014-12-19 17:37:58 +01:00
  • c9c700b510 reduction of http requests to YaCy using the correct cache-control, expires and last-modified headers in http response. Michael Peter Christen 2014-12-19 11:51:14 +01:00
  • eca578a5fa update to PDFBox 1.8.8 reger 2014-12-19 02:54:38 +01:00
  • 13cca2b114 fix missing AppPath upd Maven plugin versionid reger 2014-12-19 01:58:37 +01:00
  • d7e2f08a89 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-12-18 14:56:18 +01:00
  • 0f7d4c42e9 include xmpcore.jar in classpath used by metadata-extractor reger 2014-12-16 21:12:37 +01:00
  • bd39e009ac Update russian translation malykhin.dmitry 2014-12-16 23:10:53 +03:00
  • 65125439fe added query modifier 'on'. This makes it possible to search for date occurrences within the (web) page documents (not the document last-modified!). This works only if the solr field dates_in_content_sxt is enabled. A search request may then have the form "term on:<date>", like gift on:24.12.2014 gift on:2014/12/24 * on:2014/12/31 For the date format you may use any kind of human-readable date representation(!yes!) - the on:<date> parser tries to identify language and also knows event names, like: bunny on:eastern .. as long as the date term has no spaces inside (use a dot). Further enhancement will be made to accept also strings encapsulated with quotes. Michael Peter Christen 2014-12-16 13:53:12 +01:00
  • 1cfddea578 added (very experimental) Solr response writer for snapshot image results Michael Peter Christen 2014-12-16 13:18:49 +01:00
  • 7287dd764e added url, date, time and page number on pdf snapshot footer Michael Peter Christen 2014-12-16 12:39:10 +01:00
  • 8b5d074715 fix for image parser (there is a class missing!) Michael Peter Christen 2014-12-16 12:10:15 +01:00
  • 932faafffe reactivated on-demand snapshot loading Michael Peter Christen 2014-12-16 12:09:57 +01:00
  • 2362ad7c34 fix for a count issue in snapshot api Michael Peter Christen 2014-12-16 11:33:30 +01:00
  • 3354cd63be Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-12-15 23:32:57 +01:00
  • 9971e197e0 Added a transaction interface to the snapshots: all documents in the snapshots can now be processed with transactions using commit and rollback commands. Furthermore, a large number of monitoring methods have been added to check the success of transactions. Michael Peter Christen 2014-12-15 23:32:46 +01:00
  • 63846ddb89 add final SolrQueryRequest.close to SolrServlet reger 2014-12-15 22:54:49 +01:00
  • 9edc7308aa update to metadata-extractor-2.7.0.jar add 2 simple JUnit test cases for jpeg and tif parsing reger 2014-12-15 20:45:05 +01:00
  • 578ae29f1e added a note that the servlet is linked using web.xml Michael Peter Christen 2014-12-15 05:56:12 +01:00
  • 6c3f36def1 - fix path to default heuristic.cfg - deprecate unused ProxyServlet reger 2014-12-14 21:27:45 +01:00
  • 00113dcfbd add chardet.jar to Maven dependencies reger 2014-12-14 19:17:13 +01:00
  • 446f374ba9 fix yacy.init comment http://mantis.tokeek.de/view.php?id=513 reger 2014-12-14 19:12:18 +01:00
  • bbf0ac40c3 add the actual DateDetection class... (missed in latest commit) Michael Peter Christen 2014-12-14 13:43:30 +01:00
  • 66b5a56976 Added and integrated new date detection class which can identify date notions within the fulltext of a document. This class attempts to identify also dates given abbreviated or with missing year or described with names for special days, like 'Halloween'. In case that a date has no year given, the current year and following years are considered. Michael Peter Christen 2014-12-14 13:40:45 +01:00
  • c3c2b6999b fixes on wkhtmltopdf Michael Peter Christen 2014-12-14 04:03:20 +01:00
  • 114f0afc1e enable sku as anchor in html response writer Michael Peter Christen 2014-12-14 04:02:13 +01:00
  • aa80cb1159 enhanced tagging preparation speed which reduces initialization time for very large vocabularies Michael Peter Christen 2014-12-13 09:54:41 +01:00
  • 6a1865f507 refactoring date -> lastModified Michael Peter Christen 2014-12-11 23:37:41 +01:00
  • ab6cc3c88c added concurrent generation of snapshot pdfs Michael Peter Christen 2014-12-10 14:10:05 +01:00
  • ff035a20e7 fix for vocabulary import (double term detection) Michael Peter Christen 2014-12-10 14:09:34 +01:00
  • e6650050fe fix for Is Facet checkbox Michael Peter Christen 2014-12-10 13:14:39 +01:00
  • bd3ed5cae5 added charset detection to vocabulary reader Michael Peter Christen 2014-12-10 13:11:51 +01:00
  • 413eeefed4 added character set detection library from http://www-archive.mozilla.org/projects/intl/chardet.html Michael Peter Christen 2014-12-10 13:08:29 +01:00
  • 7bfc5b80cb added new options to vocabulary editor: - new switch 'isFacet' which enables or disables the usage of the vocabulary for search facets. This shall be used for large vocabularies since searches in solr are extremely slow if facets for a large set of alternative terms are generated - new option to disable auto-enrichment from synonyms - new option to add synonyms from another column when importing from csv - automatically recognize double-occurrences in synonyms and bundling terms for such synonyms Michael Peter Christen 2014-12-10 12:20:27 +01:00
  • 87b53b3572 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-12-09 16:20:44 +01:00
  • 8df8ffbb6d enhanced the snapshot functionality: Michael Peter Christen 2014-12-09 16:20:34 +01:00
  • 5d67e165d9 remove redundant null check in ResponseHeader.lastModified added a JUnit testcase for ResponseHeader dates (using age()), adjusted age() to pass all tests reger 2014-12-09 00:58:08 +01:00
  • 4111d42c81 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-12-08 12:40:12 +01:00
  • 793ce6d13b added confirmation dialogs for row deletion Michael Peter Christen 2014-12-08 11:41:28 +01:00
  • cdc21d43b1 more robustness for broken table data in Table_API_p.html -- see bug report http://mantis.tokeek.de/view.php?id=495 Michael Peter Christen 2014-12-08 11:35:40 +01:00
  • 1d3ea35d69 prevent NPE on host link for too short HeuristicCfg.OpenSearchURL reger 2014-12-08 01:35:37 +01:00
  • a95af11050 enhancement for clearing the crawl queue Michael Peter Christen 2014-12-07 23:43:38 +01:00
  • 5f0bb1214f modified FieldReIndex to reindex queries with low number of documents first by internally using a score map with number of documents as score and working through the list from low to high. reger 2014-12-07 04:31:09 +01:00
  • 8055ed5b2a update to commons-logging-1.2 reger 2014-12-06 22:32:24 +01:00
  • e52370728a fix startup stop on missing HTCACHE/SNAPSHOT directory reger 2014-12-06 02:25:24 +01:00
  • e5236aa7ca Merge origin/master reger 2014-12-06 01:44:03 +01:00
  • 70cf7060a4 coding fixes suggested in http://mantis.tokeek.de/view.php?id=509 http://mantis.tokeek.de/view.php?id=510 reger 2014-12-06 01:42:24 +01:00
  • d97deb5555 npe fix Michael Peter Christen 2014-12-06 00:43:12 +01:00
  • 4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a list of latest/oldest entries in the snapshot database. This is an example: http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100 Michael Peter Christen 2014-12-06 00:25:05 +01:00
  • 8b522687e0 added toString() methods to feed classes which makes it possible to export full rss feed files out of the RSSFeed class Michael Peter Christen 2014-12-06 00:18:14 +01:00
  • 568c991405 remove the unused Request variable (fix of prev. commit) reger 2014-12-05 03:03:28 +01:00
  • d6539ba597 Merge origin/master reger 2014-12-05 01:15:41 +01:00
  • ff18129def ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar - to achieve it, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded, parsed anyway, so it's not an expensive operation) reger 2014-12-05 01:13:37 +01:00
  • a304058840 added Image Events as another option to generate images with a mac if no Ghostscript is available or does not work... Michael Peter Christen 2014-12-04 01:21:24 +01:00
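
The URL scheme described in commit 8c3e5b7b6d (experimental pdf splitting) can be illustrated with a small sketch. This is not YaCy code: the function names page_url and display_url are hypothetical, the language is Python rather than YaCy's Java, and the exact anchor form "#page=<n>" is an assumption based on the commit text and on how pdf.js addresses pages.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Parameter name taken from the commit message ("page=<pagenumber>").
PAGE_PARAM = "page"


def page_url(source_url: str, page: int) -> str:
    """Hypothetical helper: build a per-page URL by appending page=<n>
    to the query string of the source URL."""
    parts = urlsplit(source_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((PAGE_PARAM, str(page)))
    return urlunsplit(parts._replace(query=urlencode(query)))


def display_url(indexed_url: str) -> str:
    """Hypothetical helper: replace the page parameter with a URL anchor
    so the original document URL is what the result list shows."""
    parts = urlsplit(indexed_url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    pages = [value for key, value in query if key == PAGE_PARAM]
    if not pages:
        # Not a split-pdf entry: leave the URL untouched.
        return indexed_url
    rest = [(key, value) for key, value in query if key != PAGE_PARAM]
    return urlunsplit(
        parts._replace(query=urlencode(rest), fragment=f"{PAGE_PARAM}={pages[-1]}")
    )


if __name__ == "__main__":
    indexed = page_url("http://example.org/doc.pdf", 5)
    print(indexed)
    print(display_url(indexed))
```

Each page gets its own index entry because the query string differs, while the anchor form shown to the user still points at the single original file.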