Commit Graph

  • 864abcd33d removed Latency update after URL selection because that causes a completely wrong behaviour when cache fresh cases appear. Makes re-crawling MUCH faster! Michael Peter Christen 2012-12-07 15:35:44 +01:00
  • 4491072256 - clear the search cache when altering the solr boosts - better positions for submit buttons Michael Peter Christen 2012-12-07 14:56:34 +01:00
  • 2b7d46bc1f using a filter query for the site parameter in GSA api Michael Peter Christen 2012-12-07 14:54:49 +01:00
  • dd241d03bb latency fix: only set last-visit time if access was actually by the robot Michael Peter Christen 2012-12-07 02:00:12 +01:00
  • 118233a7e6 fix for bad xml in gsa result when doing a query with quotes Michael Peter Christen 2012-12-07 01:35:02 +01:00
  • 1e002ab18e added another blacklist-cleaner into balancer Michael Peter Christen 2012-12-07 01:27:24 +01:00
  • 10527e28ae fix for wrong display of error urls in HostBrowser Michael Peter Christen 2012-12-07 00:31:10 +01:00
  • 756772fbd3 fix for waitingtime computation for intranet configuration Michael Peter Christen 2012-12-06 17:40:52 +01:00
  • fa27e5820f - check blacklist (again) when taking urls from the crawl stack because the blacklist may get extended during crawling - removed debug output Michael Peter Christen 2012-12-06 00:12:16 +01:00
  • 5f5d66921e patch for funny symbols in url paths (like tilde) Michael Peter Christen 2012-12-05 22:05:49 +01:00
  • adfecc6ba8 more robustness during shutdown Michael Peter Christen 2012-12-05 18:20:43 +01:00
  • d4bfe9339e Brute-force attempt to start solr in case of a memory problem. I don't actually know if this is correct. It is a desperate try to get YaCy running on production servers which must get alive even with strange hacks like this. This is also related to a forum posting in http://forum.yacy-websuche.de/viewtopic.php?t=4528&p=27135#p27135 Michael Peter Christen 2012-12-05 18:16:06 +01:00
  • 8aa08261a7 update to Solr Boost handling Michael Peter Christen 2012-12-05 12:26:42 +01:00
  • 908ad2f174 Added a new servlet to configure the solr ranking using field boosts Michael Peter Christen 2012-12-03 17:01:19 +01:00
  • a598fb6227 renamed Ranking_p.html to RankingRWI_p.html because there will be another Ranking servlet as well at next Michael Peter Christen 2012-12-03 00:01:41 +01:00
  • a01e47b992 enhanced exists()-method for solr; should reduce a lot of IO during DHT target selection Michael Peter Christen 2012-12-02 17:29:37 +01:00
  • 72f165d58b added a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done. Michael Peter Christen 2012-12-02 16:54:29 +01:00
  • ea033f8f8e added number of characters in url to default index to be able to use this field for ranking Michael Peter Christen 2012-12-02 16:53:02 +01:00
  • b5ee88c6af added more logging to get info which url causes performance problems Michael Peter Christen 2012-12-02 16:52:12 +01:00
  • 1faa045dc1 fix: prevent regex pattern compile error for blacklist import for path '*' (extend it to '.*') reger 2012-12-01 22:41:21 +01:00
  • bb20691d4f fix: respect config setting of "show Nav Top-Menu" in HostBrowser.html for public users (as hostbrowser is now available in search results) reger 2012-12-01 01:14:29 +01:00
  • 6cf33f899c prevent Solr "version conflict" on update by set Solr "_version_" field to 0 (=no version check) reger 2012-11-28 00:09:53 +01:00
  • acd98bebb7 improvements in GSA result writer Michael Peter Christen 2012-11-26 15:18:51 +01:00
  • 3de784c8dd replaced more split and replaceAll missing pattern pre-compilation with pre-compiled pattern Michael Peter Christen 2012-11-26 13:40:53 +01:00
  • 8fc3679c66 using more pre-compile pattern for split methods Michael Peter Christen 2012-11-26 13:11:55 +01:00
  • d48e9788d2 enhanced search result processing behavior - query less at one time; query more often - in between the small queries, evaluate results - remove fields from search results which are not needed Michael Peter Christen 2012-11-26 12:24:35 +01:00
  • bf512e6350 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 Michael Peter Christen 2012-11-26 00:14:57 +01:00
  • 469efcdb9d fix: display and calculate authors and namespace search navigator if configured (otherwise skip overhead) (leave hosts, topics and not in ConfigPortal included filetype, protocol navigator untouched) reger 2012-11-25 22:49:26 +01:00
  • eca68fa197 added debug code to crawler monitor Michael Peter Christen 2012-11-25 15:43:42 +01:00
  • 205f8b222b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-11-25 14:41:49 +01:00
  • c54cb85422 added link to http://docs.oracle.com/javase/6/docs/api/java/util/regex/Pattern.html to the /RegexTest.html servlet orbiter 2012-11-25 12:20:41 +01:00
  • ee612e8b93 start the local search only if this peer is doing a remote search or when it is doing a local search and the peer is old orbiter 2012-11-25 11:58:57 +01:00
  • d465773a37 - removed multi-add of documents (not used) - inserted specialized code for size request Michael Peter Christen 2012-11-25 01:34:39 +01:00
  • a1a4d9aa94 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-11-24 22:31:46 +01:00
  • b7004043ea - added a field cache for solr queries which call only for a single value - fixed a version conflict exception within a solr add request Michael Peter Christen 2012-11-24 22:30:05 +01:00
  • 5aa5202adf fixes for filesystem indexing orbiter 2012-11-24 10:27:29 +01:00
  • bf42179982 introduced more structure in HostBrowser, table view, better counting, distinguishing of error cases (fail/excluded) Michael Peter Christen 2012-11-23 14:09:48 +01:00
  • efd2c4622d added a new fail type attribute for the index to distinguish two separate fail types: network fail and forced exclusion (i.e. by robots or forwarding rules). Michael Peter Christen 2012-11-23 14:00:30 +01:00
  • 5e182a566f - added another enumeration method in kelondro data structure to get a more random access to data for the balancer - added random access inside the balancer Michael Peter Christen 2012-11-23 13:58:39 +01:00
  • 4eab3aae60 removed overhead by preventing generation of full search results when only the url is requested Michael Peter Christen 2012-11-23 01:35:28 +01:00
  • a114bb23bb - using edismax in gsa interface - generating less field data for gsa search results - using a boost query in gsa interface to move double content to the end of the result list Michael Peter Christen 2012-11-22 13:03:33 +01:00
  • d6b82840f8 added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally for each time when a signature is written, it is checked if the signature exists already in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because here the first time a long number is used as value in the Solr store, also the value naming in the YaCySchema had to be adapted and normalized. This caused that many files had to be changed. Michael Peter Christen 2012-11-21 18:46:49 +01:00
  • f5ca5cea44 - added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during acquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned Michael Peter Christen 2012-11-19 17:24:34 +01:00
  • 46be4af5b9 Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890' Michael Peter Christen 2012-11-18 22:11:04 +01:00
  • c73a9bc654 Merge remote-tracking branch 'reger/master' Michael Peter Christen 2012-11-18 22:04:34 +01:00
  • 832eead998 Merge remote-tracking branch 'regerdev/master' Michael Peter Christen 2012-11-18 22:04:11 +01:00
  • 952e143580 FINALLY YaCy can now search for full strings using double- or singlequoted strings in the search query line!!! Michael Peter Christen 2012-11-18 16:03:34 +01:00
  • 5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal. orbiter 2012-11-18 01:22:41 +01:00
  • 2bb8f045cc content control: use up-to-date definitions cominch 2012-11-13 17:32:19 +01:00
  • 5fd3b93661 added deletion of hosts during crawl start if deleteold option was given Michael Peter Christen 2012-11-13 16:54:28 +01:00
  • d64445c3cb because we have the inurl:<term> - searchmodifier, we don't actually need regular expressions as search attributes. They had now been removed from the advanced search page while they are still created internally. The filter is then expressed against solr as regular expression filter query. If the expression points out a selection of a specific protocol, host or filetype this is then translated into a faceted query. Michael Peter Christen 2012-11-13 11:45:56 +01:00
  • b55ea2197f - redesign of crawl start servlet - for domain-limited crawls, the domain is deleted now by default before the crawl is started orbiter 2012-11-13 10:54:21 +01:00
  • 1c66de4bd4 - removed scheduled crawling options in crawl start because it is superfluous there; it can be changed in the scheduler servlet. It's also confusing in the presence of the delete-option, which will be implemented next. - removed unused crawl start servlet - some refactoring to make the time parser reusable orbiter 2012-11-12 11:19:39 +01:00
  • a67ff1c8ac SMW Import: replaced JSON import routines with stable ones cominch 2012-11-12 11:17:50 +01:00
  • 328ce0b297 fix: remove fixed individual testing IP (85.25.151.30 = server4you.de) from default/yacy.network.freeworld.unit reger 2012-11-11 21:19:18 +01:00
  • 2e7219f9fd removed highlighting of search results within collections in GSA interface Michael Peter Christen 2012-11-09 16:25:24 +01:00
  • 074dfd297b added icons and a selection for hosts with urls pending for crawler or with errors Michael Peter Christen 2012-11-09 16:24:56 +01:00
  • d2a94cc55e refactor package cominch 2012-11-09 16:22:24 +01:00
  • 05742b4562 remove old SMW importer which was part of the ymarks package cominch 2012-11-09 15:44:59 +01:00
  • 21df1ad9e0 update and generalization of the SMW import and content control routines cominch 2012-11-09 13:48:40 +01:00
  • f07e5fb553 release 1.2 Release_1.2 Michael Peter Christen 2012-11-07 23:14:45 +01:00
  • 4c4e0eece2 added new submenu 'Target Analysis' with three servlets which are useful to analyse the target servers: robots.txt table, mass target analysis and a regex tester Michael Peter Christen 2012-11-07 21:26:01 +01:00
  • 61995d508e do the commit anyway before calling a search interface Michael Peter Christen 2012-11-07 17:27:50 +01:00
  • 842faf96a2 fixed media search Michael Peter Christen 2012-11-07 17:27:13 +01:00
  • 86ec199126 using a better file name Michael Peter Christen 2012-11-07 16:39:49 +01:00
  • 93001586a0 removed warnings, removed too-fast pausing of crawls Michael Peter Christen 2012-11-07 15:37:14 +01:00
  • 8041742e48 added matching of path to query pattern Michael Peter Christen 2012-11-07 15:06:13 +01:00
  • 8b1c9cba3d fixed a problem with non-terminating crawls Michael Peter Christen 2012-11-07 15:05:44 +01:00
  • 61a1d32356 fix to ftp client Michael Peter Christen 2012-11-07 14:58:28 +01:00
  • 5105256927 update to search result logging (this was a remaining issue from the solr 4.0.0 migration) Michael Peter Christen 2012-11-07 14:15:27 +01:00
  • 570e42c4e3 fix for filetype navigator Michael Peter Christen 2012-11-07 13:53:29 +01:00
  • 71ed8e5e07 bugfixes for crawler Michael Peter Christen 2012-11-07 12:52:19 +01:00
  • 29fbbb49dc better colors for host browser and corrected document count Michael Peter Christen 2012-11-07 12:23:21 +01:00
  • 12c0db20e5 fixed npe for surrogate import Michael Peter Christen 2012-11-07 02:46:51 +01:00
  • 6244b084cd fixed wrong order of result count values Michael Peter Christen 2012-11-07 02:29:33 +01:00
  • 631b08e7e2 update to HostBrowser Michael Peter Christen 2012-11-07 02:17:24 +01:00
  • 51f420e4f5 removed location search because it is only working in special cases Michael Peter Christen 2012-11-07 02:04:41 +01:00
  • 52df6ee369 more logging Michael Peter Christen 2012-11-07 02:04:08 +01:00
  • 158732af37 automatically delete entries from the crawl profile list if crawl is terminated. Michael Peter Christen 2012-11-07 02:03:44 +01:00
  • 15d1460b40 added information about the reason of pausing of crawls Michael Peter Christen 2012-11-06 15:21:56 +01:00
  • 2371ef031c added solr faceted search support to YaCy search results added solr highlighting / YaCy snippets to YaCy search results - facets are now much more complete - facets are computed and searched much faster - snippet computation is done by solr if solr knows the snippet Michael Peter Christen 2012-11-06 14:32:08 +01:00
  • b30a7162fa added more thread-renaming for search processes Michael Peter Christen 2012-11-06 12:31:23 +01:00
  • 900445d8e9 set the thread name during solr queries to the solr query to get better debugging options Michael Peter Christen 2012-11-06 11:48:04 +01:00
  • d481abd087 added the visualization of error-urls to host browser - only visible for admins - a faceted search generates a huge list for all hosts in the host list - the faceted search algorithms had to be modified for that - within the browsing of the directory path, the error cause is written to the url which is presented as error-url - the errors are also accumulated for directory sums Michael Peter Christen 2012-11-06 00:29:37 +01:00
  • a15819fbec fix for some interface problems Michael Peter Christen 2012-11-05 22:14:52 +01:00
  • 791e1dcfdf when a new crawl is started, delete all entries about error-urls for crawl-start domains Michael Peter Christen 2012-11-05 22:14:27 +01:00
  • c6a6f4c4e6 added a hack which makes the HostBrowser more performant when the given host has a lot of urls. If the number of urls is > 1000, then the list of documents is restricted to such which have no subpath, if the root path is selected. However, this can cause a problem if no documents on the root path exist but only on paths below that root path. Michael Peter Christen 2012-11-05 18:57:21 +01:00
  • 619bf7e875 fixed filetype modifier for media types in text search Michael Peter Christen 2012-11-05 18:08:00 +01:00
  • 97f82994a6 automatically pause the crawler if there is a problem with solr Michael Peter Christen 2012-11-05 16:34:42 +01:00
  • 64ac2b7b7d new submenu template Michael Peter Christen 2012-11-05 15:36:42 +01:00
  • 5e77801aac update to web interface structure Michael Peter Christen 2012-11-05 15:23:03 +01:00
  • 8fb370d9f8 renovated the way search results are counted. should be correct now... Michael Peter Christen 2012-11-05 03:19:28 +01:00
  • 7bec253bb0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-11-04 09:21:58 +01:00
  • d88eb657fd Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 Michael Peter Christen 2012-11-04 09:21:21 +01:00
  • 354ef8000d - added 'deleteold' option to crawler which causes that documents are deleted which are selected by a crawl filter (host or subpath) - site crawl uses this option by default now - made option to deleteDomain() concurrency orbiter 2012-11-04 02:58:26 +01:00
  • 633fbe9188 Fix Metadata handling - language default on missing lang property to "uk" (fix set to nothing) - language set to TLD (added call to existing language calculation from TLD) - coordinate number exception on possible lat/lon content of "NaN,NaN" reger 2012-11-04 02:07:59 +01:00
  • 19d1f474ce host browser now shows also number of pending files per subdirectory + bugfixes Michael Peter Christen 2012-11-02 14:40:02 +01:00
  • 75dd706e1b update to HostBrowser: - time-out after 3 seconds to speed up display (may be incomplete) - showing also all links from the balancer queue in the host list (after the '/') and in the result browser view with tag 'loading' Michael Peter Christen 2012-11-02 13:57:43 +01:00
  • e2c4c3c7d3 migration to solr 4.0.0 Michael Peter Christen 2012-11-02 12:29:48 +01:00
  • b764de424a code cleanup Michael Peter Christen 2012-11-02 10:28:32 +01:00