Commit Graph

  • ee612e8b93 start the local search only if this peer is doing a remote search or when it is doing a local search and the peer is old orbiter 2012-11-25 11:58:57 +01:00
  • d465773a37 - removed multi-add of documents (not used) - inserted specialized code for size request Michael Peter Christen 2012-11-25 01:34:39 +01:00
  • a1a4d9aa94 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-11-24 22:31:46 +01:00
  • b7004043ea - added a field cache for solr queries which call only for a single value - fixed a version conflict exception within a solr add request Michael Peter Christen 2012-11-24 22:30:05 +01:00
  • 5aa5202adf fixes for filesystem indexing orbiter 2012-11-24 10:27:29 +01:00
  • bf42179982 introduced more structure in HostBrowser, table view, better counting, distinguishing of error cases (fail/excluded) Michael Peter Christen 2012-11-23 14:09:48 +01:00
  • efd2c4622d added a new fail type attribute for the index to distinguish two separate fail types: network fail and forced exclusion (i.e. by robots or forwarding rules). Michael Peter Christen 2012-11-23 14:00:30 +01:00
  • 5e182a566f - added another enumeration method in kelondro data structure to get a more random access to data for the balancer - added random access inside the balancer Michael Peter Christen 2012-11-23 13:58:39 +01:00
  • 4eab3aae60 removed overhead by preventing generation of full search results when only the url is requested Michael Peter Christen 2012-11-23 01:35:28 +01:00
  • a114bb23bb - using edismax in gsa interface - generating less field data for gsa search results - using a boost query in gsa interface to move double content to the end of the result list Michael Peter Christen 2012-11-22 13:03:33 +01:00
  • d6b82840f8 added a feature to find similarities in documents. This uses an enhanced version of the Nutch/Solr TextProfileSignature. As a result, a signature of the document is written to the solr search index. Additionally, each time a signature is written, it is checked whether the signature already exists in the index. If the signature does not exist, the document is marked as unique. The unique attribute can now be used to sort document lists and bring duplicates to the end of a result list. To enable this, a large portion of the search api to Solr had to be changed. This affected mainly caching of 'exists' searches to enhance the check for existing signatures and do this without actually doing a solr query. Because this is the first time a long number is used as a value in the Solr store, the value naming in the YaCySchema also had to be adapted and normalized. As a consequence, many files had to be changed. Michael Peter Christen 2012-11-21 18:46:49 +01:00
  • f5ca5cea44 - added field options to all solr queries. This can be used to restrict the actual data which is fetched from solr. - used the new field options to reduce generic options like getting the load date or the count of search results. should increase overall speed - used the new field options to reduce overhead in the host browser during acquisition of links. - used the field options to make checking of links in crawler faster - if the crawler is paused, the crawl queue is not cleaned Michael Peter Christen 2012-11-19 17:24:34 +01:00
  • 46be4af5b9 Merge commit '2bb8f045cc92f31fc7e720cc30b38af417563890' Michael Peter Christen 2012-11-18 22:11:04 +01:00
  • c73a9bc654 Merge remote-tracking branch 'reger/master' Michael Peter Christen 2012-11-18 22:04:34 +01:00
  • 832eead998 Merge remote-tracking branch 'regerdev/master' Michael Peter Christen 2012-11-18 22:04:11 +01:00
  • 952e143580 FINALLY YaCy can now search for full strings using double- or singlequoted strings in the search query line!!! Michael Peter Christen 2012-11-18 16:03:34 +01:00
  • 5dfd6359cb redesign of the QueryParams class: introduced QueryGoal which holds the query string parser. This shall be used to create a proper full-string matching which is handled then by QueryGoal. orbiter 2012-11-18 01:22:41 +01:00
  • 2bb8f045cc content control: use up-to-date definitions cominch 2012-11-13 17:32:19 +01:00
  • 5fd3b93661 added deletion of hosts during crawl start if deleteold option was given Michael Peter Christen 2012-11-13 16:54:28 +01:00
  • d64445c3cb because we have the inurl:<term> search modifier, we don't actually need regular expressions as search attributes. They have now been removed from the advanced search page while they are still created internally. The filter is then expressed against solr as a regular expression filter query. If the expression selects a specific protocol, host or filetype, this is then translated into a faceted query. Michael Peter Christen 2012-11-13 11:45:56 +01:00
  • b55ea2197f - redesign of crawl start servlet - for domain-limited crawls, the domain is deleted now by default before the crawl is started orbiter 2012-11-13 10:54:21 +01:00
  • 1c66de4bd4 - removed scheduled crawling options in crawl start because it is superfluous there; it can be changed in the scheduler servlet. It's also confusing in the presence of the delete-option, which will be implemented next. - removed unused crawl start servlet - some refactoring to make the time parser reusable orbiter 2012-11-12 11:19:39 +01:00
  • a67ff1c8ac SMW Import: replaced JSON import routines with stable ones cominch 2012-11-12 11:17:50 +01:00
  • 328ce0b297 fix: remove fixed individual testing IP (85.25.151.30 = server4you.de) from default/yacy.network.freeworld.unit reger 2012-11-11 21:19:18 +01:00
  • 2e7219f9fd removed highlighting of search results within collections in GSA interface Michael Peter Christen 2012-11-09 16:25:24 +01:00
  • 074dfd297b added icons and a selection for hosts with urls pending for crawler or with errors Michael Peter Christen 2012-11-09 16:24:56 +01:00
  • d2a94cc55e refactor package cominch 2012-11-09 16:22:24 +01:00
  • 05742b4562 remove old SMW importer which was part of the ymarks package cominch 2012-11-09 15:44:59 +01:00
  • 21df1ad9e0 update and generalization of the SMW import and content control routines cominch 2012-11-09 13:48:40 +01:00
  • f07e5fb553 release 1.2 Release_1.2 Michael Peter Christen 2012-11-07 23:14:45 +01:00
  • 4c4e0eece2 added new submenu 'Target Analysis' with three servlets which are useful to analyse the target servers: robots.txt table, mass target analysis and a regex tester Michael Peter Christen 2012-11-07 21:26:01 +01:00
  • 61995d508e do the commit anyway before calling a search interface Michael Peter Christen 2012-11-07 17:27:50 +01:00
  • 842faf96a2 fixed media search Michael Peter Christen 2012-11-07 17:27:13 +01:00
  • 86ec199126 using a better file name Michael Peter Christen 2012-11-07 16:39:49 +01:00
  • 93001586a0 removed warnings, removed too-fast pausing of crawls Michael Peter Christen 2012-11-07 15:37:14 +01:00
  • 8041742e48 added matching of path to query pattern Michael Peter Christen 2012-11-07 15:06:13 +01:00
  • 8b1c9cba3d fixed a problem with non-terminating crawls Michael Peter Christen 2012-11-07 15:05:44 +01:00
  • 61a1d32356 fix to ftp client Michael Peter Christen 2012-11-07 14:58:28 +01:00
  • 5105256927 update to search result logging (this was a remaining issue from the solr 4.0.0 migration) Michael Peter Christen 2012-11-07 14:15:27 +01:00
  • 570e42c4e3 fix for filetype navigator Michael Peter Christen 2012-11-07 13:53:29 +01:00
  • 71ed8e5e07 bugfixes for crawler Michael Peter Christen 2012-11-07 12:52:19 +01:00
  • 29fbbb49dc better colors for host browser and corrected document count Michael Peter Christen 2012-11-07 12:23:21 +01:00
  • 12c0db20e5 fixed npe for surrogate import Michael Peter Christen 2012-11-07 02:46:51 +01:00
  • 6244b084cd fixed wrong order of result count values Michael Peter Christen 2012-11-07 02:29:33 +01:00
  • 631b08e7e2 update to HostBrowser Michael Peter Christen 2012-11-07 02:17:24 +01:00
  • 51f420e4f5 removed location search because it is only working in special cases Michael Peter Christen 2012-11-07 02:04:41 +01:00
  • 52df6ee369 more logging Michael Peter Christen 2012-11-07 02:04:08 +01:00
  • 158732af37 automatically delete entries from the crawl profile list if crawl is terminated. Michael Peter Christen 2012-11-07 02:03:44 +01:00
  • 15d1460b40 added information about the reason of pausing of crawls Michael Peter Christen 2012-11-06 15:21:56 +01:00
  • 2371ef031c added solr faceted search support to YaCy search results added solr highlighting / YaCy snippets to YaCy search results - facets are now much more complete - facets are computed and searched much faster - snippet computation is done by solr if solr knows the snippet Michael Peter Christen 2012-11-06 14:32:08 +01:00
  • b30a7162fa added more thread-renaming for search processes Michael Peter Christen 2012-11-06 12:31:23 +01:00
  • 900445d8e9 set the thread name during solr queries to the solr query to get better debugging options Michael Peter Christen 2012-11-06 11:48:04 +01:00
  • d481abd087 added the visualization of error-urls to host browser - only visible for admins - a faceted search generates a huge list for all hosts in the host list - the faceted search algorithms had to be modified for that - within the browsing of the directory path, the error cause is written to the url which is presented as error-url - the errors are also accumulated for directory sums Michael Peter Christen 2012-11-06 00:29:37 +01:00
  • a15819fbec fix for some interface problems Michael Peter Christen 2012-11-05 22:14:52 +01:00
  • 791e1dcfdf when a new crawl is started, delete all entries about error-urls for crawl-start domains Michael Peter Christen 2012-11-05 22:14:27 +01:00
  • c6a6f4c4e6 added a hack which makes the HostBrowser more performant when the given host has a lot of urls. If the number of urls is > 1000 and the root path is selected, the list of documents is restricted to those which have no subpath. However, this can cause a problem if no documents exist on the root path but only on paths below that root path. Michael Peter Christen 2012-11-05 18:57:21 +01:00
  • 619bf7e875 fixed filetype modified for media types in text search Michael Peter Christen 2012-11-05 18:08:00 +01:00
  • 97f82994a6 automatically pause the crawler if there is a problem with solr Michael Peter Christen 2012-11-05 16:34:42 +01:00
  • 64ac2b7b7d new submenu template Michael Peter Christen 2012-11-05 15:36:42 +01:00
  • 5e77801aac update to web interface structure Michael Peter Christen 2012-11-05 15:23:03 +01:00
  • 8fb370d9f8 renovated the way search results are counted. should be correct now... Michael Peter Christen 2012-11-05 03:19:28 +01:00
  • 7bec253bb0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-11-04 09:21:58 +01:00
  • d88eb657fd Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 Michael Peter Christen 2012-11-04 09:21:21 +01:00
  • 354ef8000d - added 'deleteold' option to crawler which causes documents selected by a crawl filter (host or subpath) to be deleted - site crawl uses this option by default now - made option to deleteDomain() concurrency orbiter 2012-11-04 02:58:26 +01:00
  • 633fbe9188 Fix Metadata handling - language default on missing lang property to "uk" (fix set to nothing) - language set to TLD (added call to existing language calculation from TLD) - coordinate number exception on possible lat/lon content of "NaN,NaN" reger 2012-11-04 02:07:59 +01:00
  • 19d1f474ce host browser now shows also number of pending files per subdirectory + bugfixes Michael Peter Christen 2012-11-02 14:40:02 +01:00
  • 75dd706e1b update to HostBrowser: - time-out after 3 seconds to speed up display (may be incomplete) - showing also all links from the balancer queue in the host list (after the '/') and in the result browser view with tag 'loading' Michael Peter Christen 2012-11-02 13:57:43 +01:00
  • e2c4c3c7d3 migration to solr 4.0.0 Michael Peter Christen 2012-11-02 12:29:48 +01:00
  • b764de424a code cleanup Michael Peter Christen 2012-11-02 10:28:32 +01:00
  • 69aa39d664 update to libraries required by solr 4.0.0 Michael Peter Christen 2012-11-02 10:27:44 +01:00
  • 9330ad4838 - fixed the delete option in host browser - added a delete method which can be used to delete a full subpath in solr. Michael Peter Christen 2012-11-02 01:22:31 +01:00
  • a63179f3f9 added the MIME attribute for the R tag in GSA search result writer Michael Peter Christen 2012-11-02 00:14:29 +01:00
  • 40df2fd193 added the host browser as link to search results. that means you can select a browsing position after a search is done on the search results. Michael Peter Christen 2012-11-01 21:38:05 +01:00
  • 1168d09de8 more refactoring - integrated the code of SnippetProcess into SearchEvent Michael Peter Christen 2012-11-01 17:40:06 +01:00
  • 6629e37685 tried to clean up the search process mess Michael Peter Christen 2012-11-01 17:16:43 +01:00
  • c5f67a5d6d fixed a problem with local search from solr results: now all results from solr are shown (again) Michael Peter Christen 2012-11-01 10:22:22 +01:00
  • 02957d5982 missing license-files (sorry I didn't commit these files by mistake) sixcooler 2012-10-31 23:47:08 +01:00
  • 16216c2344 added missing libraries Michael Peter Christen 2012-10-31 23:29:47 +01:00
  • 9d062873d2 bump to httpclient-4.2.2 sixcooler 2012-10-31 19:09:48 +01:00
  • f8f05ecba7 - added a delete button in host browser to delete a complete subpath - removed storage of default collection name - default is now "user" - made stacking of crawl start points concurrently Michael Peter Christen 2012-10-31 17:44:45 +01:00
  • 0716a24737 added more / all new crawl profile fields into crawl profile editor Michael Peter Christen 2012-10-31 15:13:05 +01:00
  • 4a14122ba7 in case that a crawl profile has a collection assigned, use the collection to show a name in the web interface. This should prevent that much too long names make the interface unusable. Michael Peter Christen 2012-10-31 14:08:33 +01:00
  • 0fe8be7981 enhanced data structures for balancer and latency computation which should produce a bit better prognosis about forced waiting times. Michael Peter Christen 2012-10-30 17:30:24 +01:00
  • ac9540dfb6 removed options for stopwords which are not used Michael Peter Christen 2012-10-30 12:36:36 +01:00
  • ce3fed8882 added the Google Search Appliance (GSA) api interface to the main menu. See: https://developers.google.com/search-appliance/documentation/68/xml_reference#request_overview Michael Peter Christen 2012-10-30 12:27:22 +01:00
  • b2ffd49817 less latency Michael Peter Christen 2012-10-30 12:26:32 +01:00
  • 0833937c1c better balancing and duetime-computation also for no-delay intranet hosts Michael Peter Christen 2012-10-30 11:28:49 +01:00
  • c326aa8f67 disabled writing new entries to crawl stacks to prevent that a domain with many documents block refreshing of the crawl queue Michael Peter Christen 2012-10-29 22:26:52 +01:00
  • 6905182d41 - fix for number of words log message - adding meta:refresh also to crawler stack Michael Peter Christen 2012-10-29 21:42:31 +01:00
  • c25d7bcb80 - added concurrency for robots.txt loading - changed data model for domain counter Michael Peter Christen 2012-10-29 21:08:45 +01:00
  • a94c537afc fixed getSize() which can use the cache size while the crawl is running Michael Peter Christen 2012-10-29 11:56:07 +01:00
  • 96912c9471 enhancement to solr caching: consider that during a get() the document is not in solr but the cache points out that a commit is needed to get the document. Michael Peter Christen 2012-10-29 11:35:24 +01:00
  • a87811bc38 more auto-commit calls when a search interface is opened, but not when a search is done there to prevent blocking during search-time. Michael Peter Christen 2012-10-29 11:27:13 +01:00
  • 3d3d654e88 if a network configuration is chosen which does not allow DHT and P2P communication (i.e. the peer is in robinson mode), then some menu entries are disabled which have no use in this mode. Michael Peter Christen 2012-10-29 01:51:19 +01:00
  • 2d9e577ad0 replaced the custom robots.txt loader by the standard http loader Michael Peter Christen 2012-10-28 22:48:11 +01:00
  • 799d71bc67 enhanced solr caching: - increased cache size which is needed for longer solr commit time - speed hacks on cache write code Michael Peter Christen 2012-10-28 20:31:29 +01:00
  • a33e2742cb - removed unnecessary synchronized and deadlock in crawler - removed problem with monitoring object on Balancer.wait - added missing user agent settings Michael Peter Christen 2012-10-28 19:56:02 +01:00
  • 8952153ecf update to Balancer algorithm: - create a load list from the current list of known hosts - do not create this list for each Balancer.pop access - create the list from those hosts which have a zero-waiting time - select 1/3 from that list which have the most urls waiting - get hosts from the waiting list in random order - fixes for some delta-time computations - always load all urls from hosts which have never been loaded before orbiter 2012-10-28 13:24:49 +01:00
  • 354f0d9acd moved static method from ClusteredScoreMap to MapDataMining because it was not used in the ClusteredScoreMap class but only in MapDataMining orbiter 2012-10-28 11:29:53 +01:00
  • 722a447b0d - optimize code of augmented parsing to enhance document tags - commented out augmentedparser.analyse (function not implemented yet) - adjust init of document title list to always use same list type reger 2012-10-26 18:50:45 +02:00
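
Commit 8952153ecf above describes the updated Balancer host selection only in prose. The following is a minimal Python sketch of those steps, not YaCy's actual Java code. The names `HostState` and `build_load_list` and all field names are hypothetical. Treating never-loaded hosts as always included, ahead of the sampled third, is one assumed reading of the last bullet of that commit message.

```python
import random
from dataclasses import dataclass


@dataclass
class HostState:
    """Hypothetical per-host bookkeeping; not a YaCy class."""
    name: str
    pending_urls: int    # urls waiting in the crawl queue for this host
    wait_ms: int         # remaining forced waiting time; 0 means ready now
    loaded_before: bool  # False if nothing was ever fetched from this host


def build_load_list(hosts, rng=None):
    """Build a load list once; callers pop from it until it is empty
    instead of rebuilding it on every pop."""
    rng = rng or random.Random()

    # Assumption: hosts that were never loaded are always included.
    never_loaded = [h for h in hosts
                    if not h.loaded_before and h.pending_urls > 0]

    # Only hosts with zero waiting time are candidates.
    ready = [h for h in hosts
             if h.loaded_before and h.wait_ms == 0 and h.pending_urls > 0]

    # Keep the third of the candidates with the most urls waiting
    # (at least one host if there is any candidate at all).
    ready.sort(key=lambda h: h.pending_urls, reverse=True)
    keep = max(1, len(ready) // 3) if ready else 0
    top = ready[:keep]

    # Hand the selected hosts out in random order.
    rng.shuffle(top)
    return never_loaded + top
```

A caller would keep the returned list around and take hosts from it one by one, rebuilding it only when it runs empty; that mirrors the "do not create this list for each Balancer.pop access" item of the commit message.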