Commit Graph

  • 12fb9d7cd1 log postprocessing constraints in case that postprocessing is not performed Michael Peter Christen 2014-08-04 14:19:37 +02:00
  • 3c23b89823 less logging Michael Peter Christen 2014-08-04 13:37:34 +02:00
  • a0c53174c5 better solr query logging to detect unnecessary sort requests for more performance profiling Michael Peter Christen 2014-08-04 13:00:45 +02:00
  • 338f574bdc no sorting if http/www unique fields are not demanded (makes query faster) and some code restructuring Michael Peter Christen 2014-08-04 12:59:38 +02:00
  • 1609763be5 toString fix Michael Peter Christen 2014-08-04 12:58:39 +02:00
  • b983e68254 more retries, less sleep Michael Peter Christen 2014-08-04 08:29:35 +02:00
  • 1503ba7794 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-08-04 08:24:31 +02:00
  • 8f77719091 fix "Ljava.lang.String" in crawl queue anchor name (e.g. IndexCreateQueues_p.html?stack=LOCAL with images in queue) reger 2014-08-04 02:38:58 +02:00
  • 0ceeceb35e more logic on Solr queries; usage of the query terms in postprocessing, saving one query for double document detection now per document Michael Peter Christen 2014-08-04 02:35:38 +02:00
  • 3963bca3b6 catch IndexControlRWIs_p error if RWI not connected reger 2014-08-04 00:03:42 +02:00
  • 38864ae004 Merge branch 'master' of git@gitorious.org:yacy/rc1.git orbiter 2014-08-03 22:44:49 +02:00
  • 4099296b45 added new classes which shall reduce call overhead to Solr (stub) orbiter 2014-08-03 22:44:22 +02:00
  • d0c02e1de7 adjust rss lat/lon to double (common format across other classes) reger 2014-08-03 20:09:23 +02:00
  • 3491ab4c38 removed unused images from webgraph edge computation orbiter 2014-08-01 13:21:16 +02:00
  • 2371d6b8db target linktexts must be string to enable search facets on these fields orbiter 2014-08-01 13:20:25 +02:00
  • 001e05bb80 do not store failure of loading of robots.txt into the index as a fail document Michael Peter Christen 2014-08-01 12:15:14 +02:00
  • 05d58e4df0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-08-01 12:04:25 +02:00
  • 98f45c9032 fix for image alt attachment to AnchorURLs in html parser. Michael Peter Christen 2014-08-01 12:04:15 +02:00
  • 22ce4fb4dd better error handling for remote solr queries and exists-checks orbiter 2014-08-01 11:00:10 +02:00
  • b510b182d8 - update Maven pom - add ppt parser test case reger 2014-08-01 01:47:53 +02:00
  • 3dcfc717eb This hopefully fixes http://mantis.tokeek.de/view.php?id=424 Marc Nause 2014-07-29 22:02:11 +02:00
  • 9df14fc126 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Marc Nause 2014-07-29 21:26:43 +02:00
  • 477be17c51 Replaced old UPNP library with Weupnp. UPNP should work now, at least it does on my network. UPNP code in YaCy can still be improved though (see TODO comment: make port on gateway configurable or find free one). Marc Nause 2014-07-29 21:26:27 +02:00
  • 738989aab7 reverted commit f94c91315b because the webgraph has not enough performance for that orbiter 2014-07-29 18:49:42 +02:00
  • e9163e7e10 fix for malformed hostpath names in crawl balancer orbiter 2014-07-29 11:18:45 +02:00
  • 161a11070c yacystats is gone :( orbiter 2014-07-29 11:12:01 +02:00
  • c115f3869c enhanced snippet computation and test method in ViewFile Michael Peter Christen 2014-07-28 15:42:57 +02:00
  • 6c10b59f3e move bootstrap peers test systems to its test class var assignment not needed elsewhere. reger 2014-07-27 04:13:07 +02:00
  • 7328c2883b fix type in .init description http://mantis.tokeek.de/view.php?id=430 reger 2014-07-26 00:38:53 +02:00
  • 94819f0797 set .ini default boost fields to same as assigned by button "reset to default" (in RankingSolr_p) - fix typo http://mantis.tokeek.de/view.php?id=430 reger 2014-07-26 00:17:41 +02:00
  • b4b937a046 update to pdfbox 1.8.6 reger 2014-07-25 23:55:10 +02:00
  • 1027f3d04a fix for the usage of ready-prepared solr queries, some queries are formulated as edismax query but this was not set as query attribute. The defType=edismax property needs a qf-field, so this was added as well. Do not remove that field again! This fixes also a problem with title-unique computation. orbiter 2014-07-25 18:53:13 +02:00
  • f94c91315b if the webgraph is used, then use it also for reference computation to avoid contradictions with references_i in the collection index. Michael Peter Christen 2014-07-24 15:35:53 +02:00
  • 6e1dc444c3 added a snippet test function in ViewFile: you can now search for a specific word on the document; the servlet returns the snippet in the same way as it would be shown in a search result. Michael Peter Christen 2014-07-24 14:59:37 +02:00
  • c63e93df46 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-07-24 00:04:56 +02:00
  • 1bf605b6d1 toString() fix Michael Peter Christen 2014-07-24 00:04:46 +02:00
  • 4b06adb751 fix for file urls orbiter 2014-07-23 17:54:31 +02:00
  • 08409ec680 no idea why the words max was an ordered one. This change increases speed during document processing a bit orbiter 2014-07-23 17:54:16 +02:00
  • dd311ddac9 Merge origin/master reger 2014-07-22 22:01:01 +02:00
  • e5854a5cdb fix localhost link to opensearchdescription.xml reger 2014-07-22 21:57:38 +02:00
  • 29d1945c16 fix double &query parameter (index.html) ?query=word&query= reger 2014-07-22 21:54:46 +02:00
  • 172d7e68da Updated commandline reconfiguration tool. Marc Nause 2014-07-22 21:52:53 +02:00
  • b44626e55b fixed target_alt_t in webgraph Michael Peter Christen 2014-07-22 18:24:10 +02:00
  • 504327b15c fix for condition for writing the webgraph Michael Peter Christen 2014-07-22 00:59:08 +02:00
  • 542c20a597 changed handling of crawl profile field crawlingIfOlder: this should be filled with the date, when the url is recognized as to be outdated. That field was partly misinterpreted and the time interval was filled in. In case that all the urls which are in the index shall be treated as outdated, the field is filled now with Long.MAX_VALUE because then all crawl dates are before that date and therefore outdated. Michael Peter Christen 2014-07-22 00:23:17 +02:00
  • 4eec1a7452 refactoring (change Metadata name of load time data structure to avoid confusion with Node data which is also called metadata) Michael Peter Christen 2014-07-21 23:54:23 +02:00
  • c95ba52cf0 improve logexception info - log a message or class name instead of msgtxt "null" reger 2014-07-21 22:13:34 +02:00
  • 7f0e757bb5 fix bookmark.rss - channel end tag position - link with html entity reger 2014-07-21 19:26:12 +02:00
  • e441831a24 reverted toString() change in AnchorURL to prevent mistakenly used toString(). This fixes also the update link bug. orbiter 2014-07-21 15:58:29 +02:00
  • 697b9743e7 Add link to RemoteCrawl_p suggestion http://mantis.tokeek.de/view.php?id=277 reger 2014-07-21 02:00:05 +02:00
  • 47f201a6b8 Add Solr default query fields (&qf) to select servlet according to the ranking profiles boost fields defined by the peer (if df/qf is not specified in query). This allows for pretty simple queries ( q=word) without the need to know about the specific index configuration. Making sure all relevant fields (as determined by the index owner) are searched, still maintaining the option to query specific fields and does not rely on the duplication of text to text_t. - add author to reset-default boost fields (support results for author nav) reger 2014-07-21 00:47:14 +02:00
  • f96cfdc84d prevent array out of bound exception on getRankingProfile(x) on faulty &profileNr= query parameter reger 2014-07-21 00:04:54 +02:00
  • 970368359b Merge branch 'master' of ssh://gitorious.org/yacy/rc1 Michael Peter Christen 2014-07-20 22:35:40 +02:00
  • c4608469bf Merge branch 'master' of gitorious.org:yacy/icewindxs-rc1 Michael Peter Christen 2014-07-20 22:35:19 +02:00
  • 8004cfc961 fix input boostfield factor of 0.0 in RankingSolr - input was accepted and stored but not editable (added check factor >0.0 during edit) - make use of some more predefined solr constants reger 2014-07-20 12:28:59 +02:00
  • 5f5fb4ecdc remove unused static (RSS)search from protocol reger 2014-07-20 02:49:49 +02:00
  • 7c1706d83a use CRLF in generated bat command scripts for windows - for easier viewing with standard viewers reger 2014-07-20 00:06:22 +02:00
  • a2cb366b25 Combine /heuristic search modifier with opensearch configured targets - with search modifier /heuristic a request is sent to all configured opensearch target systems (old /heuristic/blekko modifier no longer valid) - this allows to use opensearch heuristic on individual search request (in contrast to configuration HEURISTIC_OPENSEARCH=true which sends an osd request on all global searches) - the index.html searchoption text adjusted to be displayed only if option configured - add Archive-It to predefined systems reger 2014-07-20 00:00:43 +02:00
  • 2de159719b added an option to set 'obey nofollow' for links with rel="nofollow" attribute in the <a> tag for each crawl. This introduces a lot of changes because it extends the usage of the AnchorURL Object type which now also has a different toString method than the underlying DigestURL.toString. It is therefore not advised to use .toString at all for urls, just use toNormalform(false) instead. Michael Peter Christen 2014-07-18 12:43:01 +02:00
  • bf1b6b93e7 do not write CR values to webgraph if no CR values are computed Michael Peter Christen 2014-07-16 18:13:29 +02:00
  • e039e78210 small bugfixes Michael Peter Christen 2014-07-16 16:04:38 +02:00
  • 87f8118108 added option to delete documents from the webgraph Michael Peter Christen 2014-07-16 16:04:19 +02:00
  • 32a2ff925c Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-07-16 14:58:27 +02:00
  • d07cdd8c3b added SolrCloud access mode and configuration Michael Peter Christen 2014-07-16 14:57:51 +02:00
  • 8514bffc22 enhanced postprocessing status report Michael Peter Christen 2014-07-16 14:57:25 +02:00
  • 53ecd54b45 Update russian translation malykhin.dmitry 2014-07-13 02:50:16 +04:00
  • f99f3d5cf2 fix button (clear list) text color in CrawlResults reger 2014-07-13 00:48:50 +02:00
  • b24572f304 fix GSA filter query assignment - use more parameter constants reger 2014-07-13 00:11:17 +02:00
  • b5fc2b63ea removed exist() retrieval functions from error cache and replaced it with metadata retrieval from connectors directly. This should cause better usage of the cache. Automatically increase the metadata cache if more memory is available. Michael Peter Christen 2014-07-11 19:52:25 +02:00
  • 62c72360ee cleanup of checkAcceptanceInitially in CrawlStacker, should avoid double-calling of solr Michael Peter Christen 2014-07-11 18:36:04 +02:00
  • dd5cdfe212 reverted filter query hack, it did not work Michael Peter Christen 2014-07-11 18:15:35 +02:00
  • b5d78ba156 reduced number of solr queries during crawling Michael Peter Christen 2014-07-11 18:05:11 +02:00
  • 5326970d6c enhanced solr queries for single document extraction Michael Peter Christen 2014-07-11 18:04:55 +02:00
  • 525575bd97 added debugging of filter queries in thread dump thread names Michael Peter Christen 2014-07-11 17:34:41 +02:00
  • f319ef268f testing filter queries instead of queries to retrieve documents by id Michael Peter Christen 2014-07-11 17:09:46 +02:00
  • fd87fa1613 removed more unnecessary exist-checks in ErrorCache Michael Peter Christen 2014-07-11 16:48:08 +02:00
  • f2b476e08b don't do a double check to solr for failed documents if they are not written to solr Michael Peter Christen 2014-07-11 16:26:52 +02:00
  • 06ab72d1af enhanced crawler host round-robin strategy Michael Peter Christen 2014-07-11 16:01:42 +02:00
  • dab9a0786a Merge branch 'master' of git@gitorious.org:yacy/rc1.git orbiter 2014-07-11 04:04:34 +02:00
  • 51bf5c85b0 Renamed the transmission cloud to buffer in dispatcher since the name 'cloud' was a bad idea. Changed also the accumulation process for peer targets so that every dht chunk is not assigned the set of redundant targets but they are assigned to redundant targets individually. This enhances the granularity of the target accumulation and should enhance the efficiency of the process. Finally the dht protocol client was enriched with the ability to remove the 'accept remote index' flag from peers or remove peers completely if they do not answer at all. orbiter 2014-07-11 04:04:09 +02:00
  • 7057e0b3e2 catch input file not found in Mediawiki import reger 2014-07-10 23:58:47 +02:00
  • a694b6a8fc another fix for unique field computation Michael Peter Christen 2014-07-10 17:25:33 +02:00
  • fb3dd56b02 fix for processing of noindex flag in http header Michael Peter Christen 2014-07-10 17:13:35 +02:00
  • b0d941626f fixed bugs in canonical, robots and title/description unique calculation Michael Peter Christen 2014-07-10 15:40:38 +02:00
  • d9472d043a cleanup older unused classes reger 2014-07-10 02:20:01 +02:00
  • 665e12f88e move startup time from old serverCore to switchboard (most used here) to make servercore eventually obsolete. reger 2014-07-10 02:17:56 +02:00
  • 336425912a remove unused localSearchThread from SearchEvent reger 2014-07-10 02:14:03 +02:00
  • 32bd2a61c1 add local ip to AbstractRemoteHandler local hostname cache reger 2014-07-10 02:09:26 +02:00
  • f3a6b6e21e fix for bad URL decoding Michael Peter Christen 2014-07-10 01:59:29 +02:00
  • 1092e798a5 fixed double content postprocessing Michael Peter Christen 2014-07-07 19:15:11 +02:00
  • aee5b108e5 added linkScraperParser, a parser which ignores the text like the generic parser but extracts links like the htmlParser. This should be used for ASCII documents without known text format annotation like source code files or json documents. Probably also good for xml files without known schema. Michael Peter Christen 2014-07-07 13:37:17 +02:00
  • f384fd624b Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2014-07-07 11:11:50 +02:00
  • 2b8cc5832c fix seek error for 0 file size records file by add extra check for file size = 0 in cleanlast() - (http://mantis.tokeek.de/view.php?id=411) reger 2014-07-06 20:49:01 +02:00
  • 1f2eba977d add test case for Records (used in HostBalancer) - simulating seek error (http://mantis.tokeek.de/view.php?id=411) reger 2014-07-06 20:41:26 +02:00
  • 2ba394333f fix Crawler HostQueue release of stackfile - close stackfile inputstream at end of ChunkIterator This should solve startup delay while unfinished crawl jobs exist (maybe also too many open file situation) reger 2014-07-06 16:04:30 +02:00
  • 40133ba2d0 fix NPE in Condenser, discovered by calling IndexControlRWI, "Word Deletion" with "for every resolvable and deleted URL reference" reger 2014-07-06 13:24:36 +02:00
  • e94efd4d7c update to JUnit 4.11 - fix build.xml -> parserTest error on Windows due to javac encoding reger 2014-07-06 05:38:32 +02:00
  • 3b77e41f1a adding test for HostQueue crawl stack - simulating problem with zero length stack file (but not fixing it) - adding test data clean to maven pom reger 2014-07-06 00:38:16 +02:00
  • ba5a59a28d make search result also avail. as atom feed via /yacysearch.atom - fix logo in rss feed reger 2014-07-03 22:01:13 +02:00
  • 59160984cc timeline performance update orbiter 2014-07-03 13:06:29 +02:00