Commit Graph

  • 5d67e165d9 remove redundant null check in ResponseHeader.lastModified; added a JUnit test case for ResponseHeader dates (using age()); adjusted age() to pass all tests reger 2014-12-09 00:58:08 +01:00
  • 4111d42c81 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-12-08 12:40:12 +01:00
  • 793ce6d13b added confirmation dialogs for row deletion Michael Peter Christen 2014-12-08 11:41:28 +01:00
  • cdc21d43b1 more robustness for broken table data in Table_API_p.html -- see bug report http://mantis.tokeek.de/view.php?id=495 Michael Peter Christen 2014-12-08 11:35:40 +01:00
  • 1d3ea35d69 prevent NPE on host link for too short HeuristicCfg.OpenSearchURL reger 2014-12-08 01:35:37 +01:00
  • a95af11050 enhancement for clearing the crawl queue Michael Peter Christen 2014-12-07 23:43:38 +01:00
  • 5f0bb1214f modified FieldReIndex to reindex queries with a low number of documents first, by internally using a score map with the number of documents as score and working through the list from low to high. reger 2014-12-07 04:31:09 +01:00
  • 8055ed5b2a update to commons-logging-1.2 reger 2014-12-06 22:32:24 +01:00
  • e52370728a fix startup stop on missing HTCACHE/SNAPSHOT directory reger 2014-12-06 02:25:24 +01:00
  • e5236aa7ca Merge origin/master reger 2014-12-06 01:44:03 +01:00
  • 70cf7060a4 coding fixes suggested in http://mantis.tokeek.de/view.php?id=509 http://mantis.tokeek.de/view.php?id=510 reger 2014-12-06 01:42:24 +01:00
  • d97deb5555 npe fix Michael Peter Christen 2014-12-06 00:43:12 +01:00
  • 4fe4bf29ad added rss feed output to snapshot servlet which can be used to get a list of latest/oldest entries in the snapshot database. This is an example: http://localhost:8090/api/snapshot.rss?depth=2&order=LATESTFIRST&host=yacy.net&maxcount=100 Michael Peter Christen 2014-12-06 00:25:05 +01:00
  • 8b522687e0 added toString() methods to feed classes which makes it possible to export full rss feed files out of the RSSFeed class Michael Peter Christen 2014-12-06 00:18:14 +01:00
  • 568c991405 remove the unused Request variable (fix of prev. commit) reger 2014-12-05 03:03:28 +01:00
  • d6539ba597 Merge origin/master reger 2014-12-05 01:15:41 +01:00
  • ff18129def ViewFile servlet: update index if newer, so viewed text and metadata (stored) info is similar. To achieve this, use request with profile to allow indexing (defaultglobaltext) and update index (the resource is loaded and parsed anyway, so it's not an expensive operation) reger 2014-12-05 01:13:37 +01:00
  • a304058840 added Image Events as another option to generate images with a mac if no Ghostscript is available or does not work... Michael Peter Christen 2014-12-04 01:21:24 +01:00
  • d83de9ecf5 added another path for the convert command because on older Macs ImageMagick has a different installation location Michael Peter Christen 2014-12-03 18:07:05 +01:00
  • 226aea5914 added a servlet which can create preview images, preview thumbnails and preview pdfs from web pages, e.g.: http://localhost:8090/api/snapshot.png?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.jpg?url=http://yacy.net/en/&width=128&height=128 http://localhost:8090/api/snapshot.pdf?url=http://yacy.net/en/ Michael Peter Christen 2014-12-03 11:45:48 +01:00
  • 28456dfc09 skip creation of unused Bluelist contenttransformer reger 2014-12-02 21:03:00 +01:00
  • 321840fde3 Replaced all fixed thread pools with cached thread pools. The cached thread pools flush their cached (dead) threads after 60 seconds. As a result, YaCy now runs constantly with about 50 threads, about 100 at peak times. Previously, about 400 threads had been cached and kept in a hibernation state, so the numproc counter in /proc/user_beancounters (exists only in VM-hosted linux) was as high as the number of cached threads. This caused VM supervisors to terminate whole VM sessions when a limit was reached. Many VM providers have limits of numproc=96, which made it virtually impossible to run YaCy on such machines. With this change, it will be possible to run many YaCy instances even on VM hosts. Michael Peter Christen 2014-12-02 16:26:07 +01:00
  • 181911376c showing list of all threads in threaddump using the ThreadMXBean counter (this obviously shows more threads than before?) Michael Peter Christen 2014-12-02 16:21:06 +01:00
  • 7bfab5eb9d set Busy- and Blocking-Threads to daemon mode (they will now not prevent YaCy from termination if still running) Michael Peter Christen 2014-12-02 16:05:00 +01:00
  • 64887f6b21 show number of threads on status page Michael Peter Christen 2014-12-02 16:04:11 +01:00
  • e586e423aa in case that loading from the cache fails, load from wkhtmltopdf without cache using the user agent string given in the crawl profile Michael Peter Christen 2014-12-02 13:35:19 +01:00
  • d5bac64421 recognize more html file types for snapshots Michael Peter Christen 2014-12-02 12:52:36 +01:00
  • 6f0167fac1 get cloned crawl start parameter for snapshots Michael Peter Christen 2014-12-02 12:52:05 +01:00
  • a1ee101079 recognize more html file extensions Michael Peter Christen 2014-12-02 12:10:44 +01:00
  • 8480641f2d fix to xvfb-run usage (quotes did not parse in xvfb-run, default values are appropriate) Michael Peter Christen 2014-12-02 11:51:12 +01:00
  • 68b040e31e added fail-over for missing http proxy service (i.e. overload) and quiet mode Michael Peter Christen 2014-12-01 18:21:52 +01:00
  • 25a64c51b3 moved snapshot generation out of the html handler so that existing cache entries do not prevent the handler from being executed Michael Peter Christen 2014-12-01 17:37:25 +01:00
  • c35170a305 more logging Michael Peter Christen 2014-12-01 16:50:37 +01:00
  • e8be07ec78 grr Michael Peter Christen 2014-12-01 16:38:07 +01:00
  • 6f81bb756c wrap wkhtmltopdf with xvfb if necessary Michael Peter Christen 2014-12-01 16:26:28 +01:00
  • 0119f8665d more logging when failing to create pdf snapshot Michael Peter Christen 2014-12-01 16:00:45 +01:00
  • 416fe886e3 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-12-01 15:20:24 +01:00
  • 60f27bdf49 added the property timeoutrequests to configuration to disable TimeoutRequests. The purpose is to test if YaCy runs better on VMs where there is a limitation of concurrent processes; see /proc/user_beancounters in row numproc; this value is limited and should be low. Try to set timeoutrequests to keep this low. (works only after restart) Michael Peter Christen 2014-12-01 15:20:10 +01:00
  • 97f6089a41 YaCy can now create web page snapshots as pdf documents which can later be transcoded into jpg for image previews. To create such pdfs you must do: Michael Peter Christen 2014-12-01 15:03:09 +01:00
  • 41d00350e4 moved network configuration to Use Case submenu; this is necessary because the definition of portal peers within the YaCy freeworld network is otherwise split into two different main menus. Michael Peter Christen 2014-12-01 01:12:51 +01:00
  • ff80700aff replace deprecated Solr DateField.formatExternal with recommended TrieDateField.formatExternal reger 2014-12-01 00:21:30 +01:00
  • 9ea120dbe5 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-11-30 22:02:25 +01:00
  • aa7122f079 update to guava.18.0.jar and jsch.0.1.51.jar reger 2014-11-30 19:43:53 +01:00
  • 0c97cc2440 skip unused call parameter for hashSentence() reger 2014-11-30 19:42:33 +01:00
  • 221f86dd5e position api icon (ViewFile.html) reger 2014-11-30 01:58:14 +01:00
  • 4c14a8b44d update to poi-3.10.1.jar reger 2014-11-29 22:36:02 +01:00
  • ea633a794c including small junit test case for WordTokenizer reger 2014-11-29 22:13:24 +01:00
  • 5790c7242e skip tokenizing punctuation as word in WordTokenizer; remove unused variables in condenser related to Tokenizer reger 2014-11-29 17:16:05 +01:00
  • f07392ff17 add. use host port parameter in YaCyApp reger 2014-11-29 15:27:16 +01:00
  • 09d2867050 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-11-29 12:05:19 +01:00
  • ad0da5f246 added new web page snapshot infrastructure which will lead to the ability to have web page previews in the search results. (This is a stub, no function available with this yet...) Michael Peter Christen 2014-11-29 11:56:32 +01:00
  • aa0faeabc5 adjust translation text of error msg on empty query (ru: needs correction) reger 2014-11-29 03:09:55 +01:00
  • c475be2937 fix (enable) error msg on empty query reger 2014-11-28 22:44:33 +01:00
  • ef5c5b4489 update to Jetty 9.2.4 reger 2014-11-28 20:24:39 +01:00
  • f709132961 remove obsolete alternate link; fix api link reger 2014-11-28 01:40:46 +01:00
  • 5f5c7d69d1 added image screenshot generator Michael Peter Christen 2014-11-28 01:25:52 +01:00
  • 3c71e1c872 show vocabularies in search result (in case of debugging) Michael Peter Christen 2014-11-28 01:19:31 +01:00
  • 1d45d9405a security bugfix Michael Peter Christen 2014-11-28 01:19:01 +01:00
  • ff728b4aa5 ignore url errors during search Michael Peter Christen 2014-11-27 20:50:55 +01:00
  • c94c24638f disabled postprocessing by default. If you read this: please disable postprocessing in your peer as well: open /IndexSchema_p.html, then deselect field process_sxt Michael Peter Christen 2014-11-27 12:13:20 +01:00
  • 2fce2e2697 larger boost fields for ranking Michael Peter Christen 2014-11-27 12:11:54 +01:00
  • 6c03ff8355 bold words in snippets should not be coloured black in the base style because there are styles with dark backgrounds which make the bold word invisible Michael Peter Christen 2014-11-27 08:08:05 +01:00
  • 8317914ce3 changed vocabulary navigator object type to TreeMap to get a specific order into the vocabularies. This is now lexicographic, which is less random than a hashed order Michael Peter Christen 2014-11-27 07:44:41 +01:00
  • d5c1b07768 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-11-26 18:07:17 +01:00
  • c0f9f6ac66 added option to change the navbar-default, i.e. usable for dark skins Michael Peter Christen 2014-11-26 18:01:35 +01:00
  • 10794e8efd trying facet.method fc instead of fcs to handle large facets Michael Peter Christen 2014-11-25 23:11:42 +01:00
  • 041b605cfe Merge branch 'master' of git@gitorious.org:yacy/rc1.git Michael Peter Christen 2014-11-25 09:48:48 +01:00
  • f1f74e8626 toString fix Michael Peter Christen 2014-11-24 20:53:40 +01:00
  • 30276a2b48 prevent a local Solr search and a local RWI search from running concurrently. When an RWI search result is flushed into the result set, it does Solr queries (which replaced the old-style Metadata queries) and they may run concurrently with a previously started Solr search. Both methods may block each other with IO. To enhance the speed, they are now serialized. Because the Solr search may produce better results using the more advanced and configurable ranking methods, this result is preferred over the RWI search result. However, remote RWI search results are still fed concurrently into the search result as well. Michael Peter Christen 2014-11-24 20:53:19 +01:00
  • 84763126e0 added option to make the YaCy proxy act as if the cache is never stale. If set to 'Always Fresh', the cache is always used if the entry in the cache exists. This is a good way to archive web content and access it without going online again in case the documents exist. To do so, open /Settings_p.html?page=ProxyAccess and check the "Always Fresh" checkbox. This is set to false by default, which behaves as before. If you set this to true, then you have your web archive in DATA/HTCACHE. Copy this to carry around your private copy of the internet! Michael Peter Christen 2014-11-24 20:28:52 +01:00
  • 1e7ee72240 fix path lookup to ./defaults/yacy.badwords (fix of commit ee277b9b3e) reger 2014-11-23 23:29:20 +01:00
  • 7d863d6254 fix empty text facet entry (noticed on Author facet) reger 2014-11-23 23:12:01 +01:00
  • a39419f2ef more stacks shall be considered for on-demand loading, not only deep-depth stacks to prevent "too many open files" problem Michael Peter Christen 2014-11-23 20:11:23 +01:00
  • 5bb52f79be reduce number of calls to queue.size() because that may be a bottleneck during crawling Michael Peter Christen 2014-11-23 20:09:32 +01:00
  • 4920ab7b76 optimize usage of size() cache Michael Peter Christen 2014-11-23 20:07:32 +01:00
  • ee277b9b3e allow for local yacy.stopwords and yacy.badwords lists (in DATA/SETTINGS/): if the file exists in DATA/SETTINGS it is loaded, otherwise the file in ./defaults is loaded (if the locale file ./defaults/stopwords.xx doesn't exist, take solr/lang/stopwords_xx.txt as default) reger 2014-11-23 05:22:23 +01:00
  • de56266bcb remove redundant toLower for topwords reger 2014-11-22 22:49:23 +01:00
  • a34f837592 better delete all files in path when removing host crawl stack Michael Peter Christen 2014-11-22 12:09:07 +01:00
  • 10b1db430a if we have many hosts, use on-demand earlier Michael Peter Christen 2014-11-22 12:04:04 +01:00
  • 1324927e66 prevent division by zero Michael Peter Christen 2014-11-22 12:01:00 +01:00
  • 2beb6abeb6 disabled crazy sleep loop Michael Peter Christen 2014-11-21 14:38:54 +01:00
  • 092d97d7ac when importing vocabulary csv files, accept also files without semicolon and truncate quotes from literals Michael Peter Christen 2014-11-21 12:42:29 +01:00
  • ee9ec40048 added hints to ranking to make ranking boosts using vocabularies easier Michael Peter Christen 2014-11-20 18:46:06 +01:00
  • 70f03f7c8e do not cache search requests to Solr if the result is used for doublechecking. If a double-check comes from cached results the doublecheck fails. Michael Peter Christen 2014-11-20 18:45:27 +01:00
  • a0b84e4def use a LinkedHashMap for facets to maintain facet order as given by solr Michael Peter Christen 2014-11-20 18:44:29 +01:00
  • ef5dc68313 include domtype in searcheventcache id to differentiate between local / global events for reuse of cached events; fix for http://mantis.tokeek.de/view.php?id=493 reger 2014-11-20 02:04:43 +01:00
  • 0dc6e0a5f2 added option to enrich vocabularies with synonyms from synonym database Michael Peter Christen 2014-11-19 18:12:43 +01:00
  • 6a2a669db4 added loading of the synonyms file from addon/synonyms into the knowledge loader Michael Peter Christen 2014-11-19 17:36:56 +01:00
  • c67c5c0709 added new solr schema fields which record the occurrences of vocabulary matchings. These matches can be used for result boosting, e.g. if a document contains words from a specific vocabulary, boost it. Michael Peter Christen 2014-11-18 15:02:34 +01:00
  • a67a465415 fix field counter for multi-fields in html writer for the solr servlet Michael Peter Christen 2014-11-18 12:11:18 +01:00
  • fdba8e2fa0 fix for 2-day network stats table: showing 48 instead of 24 hours from peer history Michael Peter Christen 2014-11-17 14:23:21 +01:00
  • ec9d021568 added option in vocabulary editor to import CSV files with different encodings (preselected windows-type character encoding which is typical for CSV files). Fixed also other problems with character encoding in dictionary files. Automatically generated vocabularies are now also noted in the API steering. Michael Peter Christen 2014-11-17 14:22:40 +01:00
  • b558433211 adjust tag cloud font size calculation to limit max font size to ~ TOPWORDS_MAXSIZE reger 2014-11-17 01:24:30 +01:00
  • 3c818fc912 add a check of java version string >=1.7 to startup class, stopping start with error msg on version < 1.7 reger 2014-11-16 01:26:07 +01:00
  • 0550b54d56 added fix to postprocessing: avoid caching of postprocessing collection to always get fresh lists of documents. This is necessary since the postprocessing changes the same documents which the postprocessing-collection query selects. Michael Peter Christen 2014-11-14 16:34:55 +01:00
  • 68e8039fd1 added high-precision scheduler for API processes. This also allows execution to depend on available RAM or CPU load. The default value for CPU load is 4.0 and the check runs once a minute. Michael Peter Christen 2014-11-14 10:02:50 +01:00
  • 8aee7f940e added missing class for latest changes Michael Peter Christen 2014-11-13 01:30:12 +01:00
  • 97039049e4 fix in key enumeration methods for cases where the enumeration is done in reverse order. Michael Peter Christen 2014-11-13 01:15:31 +01:00
  • 7e1b0b6712 fix for wildcard patch in search queries Michael Peter Christen 2014-11-13 00:59:30 +01:00
  • 0a879c98e7 added new 'firstSeen' database table and necessary data structures which hold a date for each URL to record when the URL was first seen. This is then used to overwrite the modification date for URLs upon recrawl in case the first-seen date is before the latest document date. This behaviour is necessary because content management systems commonly attach the current date to all documents. Using the firstSeen database it is possible to approximate a real first document creation date if the crawler starts frequently for the same domain. As a result, search results ordered by date have a much better quality, which improves the usage of YaCy as search agent for latest news. Michael Peter Christen 2014-11-13 00:58:58 +01:00
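
The first-seen rule described in commit 0a879c98e7 can be illustrated with a minimal sketch. This is not YaCy's actual code: the class name `FirstSeenSketch`, the method `effectiveDate`, and the in-memory map standing in for the 'firstSeen' database table are all hypothetical, and the sketch assumes the rule is simply "use the first-seen date whenever it is earlier than the date the document reports".

```java
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class FirstSeenSketch {
    // Hypothetical in-memory stand-in for the 'firstSeen' table:
    // maps a URL to the time the crawler first encountered it.
    private final Map<String, Instant> firstSeen = new HashMap<>();

    /**
     * Returns the date to index for a document.
     * The first-seen date is recorded on the first crawl only; on later
     * crawls it replaces the reported modification date if it is earlier.
     */
    public Instant effectiveDate(String url, Instant reportedModified, Instant crawlTime) {
        // putIfAbsent returns the previous value, or null if none existed.
        Instant seen = firstSeen.putIfAbsent(url, crawlTime);
        if (seen == null) {
            seen = crawlTime;
        }
        return seen.isBefore(reportedModified) ? seen : reportedModified;
    }

    public static void main(String[] args) {
        FirstSeenSketch sketch = new FirstSeenSketch();
        String url = "http://example.org/article"; // placeholder URL

        Instant firstCrawl = Instant.parse("2014-11-01T00:00:00Z");
        Instant recrawl = Instant.parse("2014-11-13T00:00:00Z");

        // First crawl: the CMS stamps the page with the current date.
        System.out.println(sketch.effectiveDate(url, firstCrawl, firstCrawl));
        // Recrawl: the CMS again reports "now", but the first-seen date wins.
        System.out.println(sketch.effectiveDate(url, recrawl, recrawl));
    }
}
```

In this sketch a document that reports a date older than its first crawl keeps its reported date, since the first-seen date only overrides later dates. How YaCy persists the table and handles missing dates is not covered by the commit message and is left out here.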