Commit Graph

  • 1ca4b9612c added special handling of the BinaryResponseWriter in the solr interface which makes it possible to use solrj with the javabin format which is much better (compressed, no xml overhead, java object streams) and faster. Furthermore, this enables the 'shards' option in the solr interface which connects one solr (YaCy) to another solr (YaCy) ad-hoc. orbiter 2013-09-01 13:11:40 +02:00
  • d0e78082d1 return field names in index instead of in schema for SolrServerConnector.getFields reger 2013-08-31 06:25:12 +02:00
  • 1a3e42eca4 index migration to lucene 4.4 Michael Peter Christen 2013-08-26 12:49:39 +02:00
  • a88a62f7aa added a feature to set a collection for a crawl result based on a regular expression on the url: the collection attribute for a crawl start may now be either a token or a list of tokens, separated by ',', where a token is either a string or a pair <string,pattern>, where the string is separated from the pattern by a ':' and the string is assigned to the document as collection only if the pattern matches the url. Michael Peter Christen 2013-08-25 00:13:48 +02:00
  • 3c5abedabf NPE during shutdown fix Michael Peter Christen 2013-08-24 23:36:50 +02:00
  • e4cbe9232d fixed a crawler bug where a double-occurring url was not re-crawled because the double-check error was written to the error-db and never deleted. Now the error-db is cleared on every start and these double-messages are no longer written to the error-db. Michael Peter Christen 2013-08-22 15:56:09 +02:00
  • 765943a4b7 Redesign of crawler identification and robots steering. A non-p2p user in intranets and the internet can now choose to appear as Googlebot. This is an essential necessity to be able to compete in the field of commercial search appliances, since most web pages these days are optimized only for Google and no other search platform any more. All commercial search engine providers have a built-in fake-Google User Agent to be able to get the same search index as Google can do. Without the resistance against obeying robots.txt in this case, no competition is possible any more. YaCy will always obey the robots.txt when it is used for crawling the web in a peer-to-peer network, but to establish a Search Appliance (like a Google Search Appliance, GSA) it is necessary to be able to behave exactly like a Google crawler. With this change, you will be able to switch the user agent when portal or intranet mode is selected on a per-crawl-start basis. Every crawl start can have a different user agent. Michael Peter Christen 2013-08-22 14:23:47 +02:00
  • 0f3d8890db removed an assert which causes a shortcut call circuit Michael Peter Christen 2013-08-22 10:12:25 +02:00
  • 6d5fefe060 added missing files :( Michael Peter Christen 2013-08-20 16:31:34 +02:00
  • 554c0351dd fix for http://bugs.yacy.net/view.php?id=286 Michael Peter Christen 2013-08-20 16:10:26 +02:00
  • 47b1c81d08 - refactoring - generalized writing of url attributes to solr documents - added more url attributes to error documents Michael Peter Christen 2013-08-20 15:46:04 +02:00
  • e6b423c4d9 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-08-19 22:02:41 +02:00
  • 94bec24d14 add back menu to Surftips page (currently no menu is displayed) reger 2013-08-19 17:53:37 +02:00
  • 1f299b0d42 removed link.gif as link button because this image is now shown automatically for external links Michael Peter Christen 2013-08-19 10:54:23 +02:00
  • 1c62fa7698 fix for bad snippets in gsa api Michael Peter Christen 2013-08-18 10:37:25 +02:00
  • 48ddd50a6c html fix Michael Peter Christen 2013-08-17 09:32:24 +02:00
  • 697613170d less logging for postprocessing (this was a debugging logging with high CPU load) Michael Peter Christen 2013-08-17 09:25:32 +02:00
  • 96ae332427 revert del _blank (last commit) in template reger 2013-08-15 00:15:01 +02:00
  • 43348a98a9 add some href target=_blank to ext. links with external icon reger 2013-08-15 00:05:32 +02:00
  • b4016ff324 - remove possible double initialization of rdfa parser - use ordered list to use preferred parser for mime/extension first (relates to html, rdfa, argument parser) - harmonize xhtml extension config for the 3 html base parsers reger 2013-08-14 21:12:10 +02:00
  • 82d81a57bd info msg if no embedded Solr http://bugs.yacy.net/view.php?id=279 reger 2013-08-14 20:59:46 +02:00
  • f0575bd44b FieldReIndex: omit active vocabulary fields from reindex detection reger 2013-08-14 00:00:30 +02:00
  • a5019bc470 make Vocabulary Navigator tags a hard result entry filter by checking vocabulary tags also for rwi results (currently a filter is applied to the solr query) reger 2013-08-13 03:07:25 +02:00
  • a67a4b7d86 improve tld: query modifier filter pattern (to prevent tld:net accepting www.abcinet.org) reger 2013-08-12 21:20:23 +02:00
  • 02fe8b43ba Field Re-Indexing: display list of fields in reindex queue; change servlet to display statistic on 1st click (instead of after refresh) reger 2013-08-11 04:51:29 +02:00
  • 7f501b7c38 clear some caches before reporting low memory; do not break lines in Network-table-rows sixcooler 2013-08-08 14:38:26 +02:00
  • b355dd52c6 Index Administration - Field Re-Indexing: exclude internal Solr _version_ field from obsolete field check reger 2013-08-08 00:55:21 +02:00
  • 1bc6003057 raise autoCommit maxTime to 3 minutes to reduce IO; lower mergeFactor again (5) for fewer segments sixcooler 2013-08-06 03:58:53 +02:00
  • 5189620026 add branch to packet-name if not build from master sixcooler 2013-08-06 03:48:29 +02:00
  • 070bf85b33 css fix for IE10 showing border on all img within <a /> tag since introduction of external link icon (commit 112836dcc9) reger 2013-08-04 05:37:20 +02:00
  • 8a96140f92 fix / workaround for http://forum.yacy-websuche.de/viewtopic.php?f=5&t=4750 + Seed.hash should be final sixcooler 2013-08-01 16:40:58 +02:00
  • c0ff91b9a8 bugfix release 1.62 Michael Peter Christen 2013-08-01 12:36:59 +02:00
  • 2674d28ef4 protection against self-ping (may be caused by fraud attempts) Michael Peter Christen 2013-08-01 12:35:44 +02:00
  • 944ae5686c added donation plea to the about box as default (you can replace this in your peer!) orbiter 2013-08-01 12:11:56 +02:00
  • f3d001c7ab more space in the about section orbiter 2013-08-01 11:49:07 +02:00
  • e879b97b0a added line to enhance debugging Michael Peter Christen 2013-07-31 13:33:05 +02:00
  • 2857499467 fix to collection schema; bug appeared for _txt fields with empty String as content Michael Peter Christen 2013-07-31 13:32:05 +02:00
  • dbfa865700 added a stub of a class for crawler redesign Michael Peter Christen 2013-07-31 13:16:32 +02:00
  • 76afcccaaf fix for default boolean post values: the default value MUST NOT be TRUE, because it's normal that a boolean value is missing in the post argument if a checkbox is not selected. Added also some style enhancements to IndexFederated, removed the Solr attachment manual and replaced it with a link to the wiki which explains this in more detail. Michael Peter Christen 2013-07-31 10:49:26 +02:00
  • 252c525709 fixed feed api servlet and enhanced RSSReader class orbiter 2013-07-31 06:18:30 +02:00
  • d38c3c14d8 fix for CGI test orbiter 2013-07-31 05:43:58 +02:00
  • 112836dcc9 Improved external links. Marc Nause 2013-07-30 21:40:37 +02:00
  • d64a094f0e External links in HTML interface are marked as external with small icon. Marc Nause 2013-07-30 20:46:51 +02:00
  • 31902f54df fix for NPE which happens within solr code at MultiMapSolrParams.java, line 52 in case that the array arr.length == 0 Michael Peter Christen 2013-07-30 14:32:59 +02:00
  • 5b7c0d0745 update to pdfbox 1.8.2 Michael Peter Christen 2013-07-30 14:14:16 +02:00
  • f13df9dbb6 migration to solr 4.4.0 Michael Peter Christen 2013-07-30 14:01:16 +02:00
  • dc1002e511 cleaned sourcepaths from eclipse classpath Michael Peter Christen 2013-07-30 13:05:32 +02:00
  • 1b09362949 next development cycle Michael Peter Christen 2013-07-30 12:51:00 +02:00
  • 58fe986cca Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-07-30 12:49:14 +02:00
  • cf12835f20 replaced the single-text description solr field with a multi-value description_txt text field Michael Peter Christen 2013-07-30 12:48:57 +02:00
  • 7d53ac86a3 fix for Blacklist (-Administration) sixcooler 2013-07-29 19:09:28 +02:00
  • f2d99053ed Field Re-Indexing: prevent endless error loop in ReindexSolrBusyThread on Solr exception (by skipping query causing the exception) (occurred during testing while working on q=store:[* TO *]) reger 2013-07-29 01:32:02 +02:00
  • 92d3f71b16 htmlParser: closes input stream -> changed it to leave it open for a reset (used by AugmentParser - even if this is practically not used), note: stream.close is done by caller (Textparser.parseSource) - removed unnecessary reset in AugmentParser - added stream.mark in tdfatripleimpl. to make stream.reset work here reger 2013-07-28 03:41:09 +02:00
  • f117ea0492 reverted start script options - yacy on windows did not start with the given values orbiter 2013-07-27 15:36:46 +02:00
  • 87cfeaa4f3 fix for npe orbiter 2013-07-27 15:20:09 +02:00
  • 268a36aaff emergency fix for crawler: this will otherwise cause loss of complete crawl queue if latency of remote system is too low orbiter 2013-07-27 11:59:07 +02:00
  • 743e4878a8 Release 1.6 Release_1.6 orbiter 2013-07-27 11:26:14 +02:00
  • e7fcb81cea we should not do too much greedylearning at this time as we don't have enough experience with it. set greedylearning.limit.doccount to a much lower limit. orbiter 2013-07-27 11:22:40 +02:00
  • d05e0c5368 wait a bit longer before doing the first peer ping orbiter 2013-07-27 11:00:35 +02:00
  • f425b2c61c re-try to fetch url after a soft commit orbiter 2013-07-27 10:56:02 +02:00
  • b8f57f7703 don't be noisy when doing background tasks that may be allowed to fail orbiter 2013-07-27 10:51:58 +02:00
  • bf0ad04e1b apply load limitation also to dht-in orbiter 2013-07-27 10:42:38 +02:00
  • 0343f0668c Fix for NPE: E 2013/07/26 20:29:29 BUSYTHREAD Runtime Error in serverInstantThread.job, thread 'net.yacy.search.Switchboard.cleanupJob': null; target exception: null java.lang.NullPointerException at net.yacy.search.schema.CollectionConfiguration.convergenceStep(CollectionConfiguration.java:1116) at net.yacy.search.schema.CollectionConfiguration.postprocessing(CollectionConfiguration.java:897) at net.yacy.search.Switchboard.cleanupJob(Switchboard.java:2296) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:606) at net.yacy.kelondro.workflow.InstantBusyThread.job(InstantBusyThread.java:107) at net.yacy.kelondro.workflow.AbstractBusyThread.run(AbstractBusyThread.java:165) Roland Haeder 2013-07-27 10:19:46 +02:00
  • b58ca8622d Some cleanups: - added SKINS_PATH_DEFAULT in the same way as LISTS_PATH_DEFAULT was added - Added 'final' keyword to a string Roland Haeder 2013-07-26 19:51:34 +02:00
  • e2ee412160 Use SwitchboardConstants.LISTS_PATH_DEFAULT instead of 'DATA/LISTS' Roland Haeder 2013-07-27 10:12:58 +02:00
  • ae19401af0 Removed another duplicate occurrence of Blacklist.BLACKLIST_FILENAME_FILTER Roland Haeder 2013-07-26 19:17:31 +02:00
  • 59225487ea Fix for blacklist export, also applied the filename filter here Roland Haeder 2013-07-26 19:09:38 +02:00
  • 952fc0e7bd Removed superfluous check for files ending '.black' as the previous commit already excluded all other files (e.g. .ser dumps), added logging in catch-all block Roland Haeder 2013-07-26 19:09:00 +02:00
  • 060fec1577 Reuse Blacklist.BLACKLIST_FILENAME_FILTER Roland Haeder 2013-07-26 19:07:42 +02:00
  • 29049c71f5 Possible fix for ticket http://bugs.yacy.net/view.php?id=270, the filter for only including *.black must be applied Roland Haeder 2013-07-26 18:32:04 +02:00
  • 7263bb82fb Fix for NPE on shutdown: java.lang.NullPointerException at net.yacy.search.Switchboard.storeDocumentIndex(Switchboard.java:2732) at net.yacy.search.Switchboard.access00(Switchboard.java:207) at net.yacy.search.Switchboard.run(Switchboard.java:3049) Roland Haeder 2013-07-25 12:02:11 +02:00
  • 13433d41a1 Log this exception better Roland Haeder 2013-07-27 09:54:51 +02:00
  • 080d80c9de do not write an empty failreason in case that there is no fail. Because of the lazy instantiation rule this value was not actually written, but if lazy instantiation is switched on, then this causes all crawl starts to delete all crawl-start-hosts completely because the deletion looks for filled error reasons. orbiter 2013-07-26 17:53:28 +02:00
  • 4c242f9af9 always use a default value for boolean options to have transparency for the outcome if the attribute is missing in servlets Michael Peter Christen 2013-07-25 12:17:29 +02:00
  • 61e015268b fix in forced deletion: forced commit needed Michael Peter Christen 2013-07-25 09:53:19 +02:00
  • 83e2921b39 new test case for http://bugs.yacy.net/view.php?id=141 Michael Peter Christen 2013-07-25 09:31:48 +02:00
  • 304aacb2cc fix for http://bugs.yacy.net/view.php?id=267 Michael Peter Christen 2013-07-25 09:26:24 +02:00
  • c3b2301b2f fix for http://bugs.yacy.net/view.php?id=268 Michael Peter Christen 2013-07-25 09:21:37 +02:00
  • aa1a1f1d2c - small adjustment to make sure genericParser is tried last -- for some documents genericParser grabs document instead of specific available parser due to unordered pick of 1st to try parser (like .ps .rdf files and other) - remove redundant file extension registration reger 2013-07-23 20:24:13 +02:00
  • 3e901dcb06 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git orbiter 2013-07-23 19:33:07 +02:00
  • f50b596e0b do not run dht distribution if system load is over 2.5 orbiter 2013-07-23 19:32:32 +02:00
  • 9c681cc00d added segment sizes, postprocessing status and cpu load to crawler monitor orbiter 2013-07-23 19:10:11 +02:00
  • 86b514cf46 added load info to status_p.xml orbiter 2013-07-23 18:20:07 +02:00
  • 056b42f5aa - added information about segment count to status_p.xml - also moved this information from the old index structure, which is still in use for the RWI/DHT index to that front-end orbiter 2013-07-23 18:03:33 +02:00
  • 6fb2811e68 fixes for problems with remote solr and non-activated webgraph index orbiter 2013-07-23 16:46:44 +02:00
  • af740f3058 changed optimization to a segment-size of index-size/5.000.000 + one if not idle + one (and force) if postprocessing sixcooler 2013-07-23 14:21:12 +02:00
  • 336f86394c replaced StringBuffer with StringBuilder Michael Peter Christen 2013-07-23 12:21:27 +02:00
  • aeac2fb763 replaced more containsKey() -> get() usages by a simple get(), followed by a test for NULL. This should increase the application speed and reduce the lookup time for the affected methods by 50% Michael Peter Christen 2013-07-23 12:16:51 +02:00
  • 5364c4dcc9 delayed first peer-ping to send the first ping out after the http got up; if the ping comes before the http is up, it cannot be recognized as senior peer (if at all). See also: http://bugs.yacy.net/view.php?id=266 orbiter 2013-07-22 18:21:37 +02:00
  • e24016e30a added the property federated.service.solr.indexing.timeout to yacy.init to provide a configurable time-out for solr; see also: http://bugs.yacy.net/view.php?id=254 orbiter 2013-07-22 17:45:12 +02:00
  • c124037f19 removed forced non-soft commits to prevent index fragmentation orbiter 2013-07-22 17:28:20 +02:00
  • 31483c47e1 fixed problem with remote luke requests Michael Peter Christen 2013-07-22 15:55:20 +02:00
  • c15aa758dc removed failreason_t removal patch because that causes too much confusion using an external solr. to clean up the index after a schema change, use the index cleaner function from the online servlet Michael Peter Christen 2013-07-22 14:17:38 +02:00
  • 2b7a38640a extend content type detection on file extension for .tif .tiff .htm reger 2013-07-21 22:57:21 +02:00
  • ac1aad5064 added a getSegmentCount method and use it to disable optimize if wanted current segment count is below optimization level Michael Peter Christen 2013-07-18 14:31:42 +02:00
  • 36035e0a0a - used reger's LukeRequest to generalize the index info in SolrServerConnector - used the LukeRequest in SolrServerConnector to replace the index size method by a getNumDocs request to a LukeRequest result Michael Peter Christen 2013-07-18 13:26:07 +02:00
  • 39fceb5ccf fix for NPE & bug #264 Michael Peter Christen 2013-07-18 12:37:32 +02:00
  • 735a66eff3 enhancements to crawler Michael Peter Christen 2013-07-18 12:29:04 +02:00
  • 232100301c removed double-occurring value assignments orbiter 2013-07-17 19:09:25 +02:00
  • be0ff6018f Removed trailing spaces + some more final Roland Haeder 2013-07-15 18:22:35 +02:00
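
Note on commit a88a62f7aa: the collection attribute format described there (a ','-separated list of tokens, each either a plain string or a string:pattern pair) can be sketched roughly as below. This is a minimal illustration only, not YaCy's actual code. The class name CollectionRules and the method collectionsFor are hypothetical, and it is an assumption here that the pattern must match the whole url (Pattern.matches) rather than just a part of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

// Hypothetical helper for illustration; not a class from the YaCy code base.
class CollectionRules {

    /**
     * Returns the collection names that apply to the given url.
     * A token without ':' is always assigned; a token "name:pattern"
     * is assigned only if the pattern matches the whole url (assumption).
     */
    static List<String> collectionsFor(String attribute, String url) {
        List<String> result = new ArrayList<>();
        for (String raw : attribute.split(",")) {
            String token = raw.trim();
            if (token.isEmpty()) continue;
            // Split at the first ':' so that patterns may contain ':' themselves.
            int sep = token.indexOf(':');
            if (sep < 0) {
                result.add(token); // plain string: unconditional collection
                continue;
            }
            String name = token.substring(0, sep).trim();
            String regex = token.substring(sep + 1).trim();
            if (name.isEmpty()) continue;
            try {
                if (Pattern.matches(regex, url)) result.add(name);
            } catch (PatternSyntaxException e) {
                // Invalid pattern: skip this token instead of failing the whole list.
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String attribute = "user,docs:.*/docs/.*,wiki:.*wiki.*";
        System.out.println(collectionsFor(attribute, "http://example.org/docs/intro.html"));
    }
}
```

Because the list is split on ',' first, a pattern that itself contains a comma (for example a quantifier like {1,2}) would be cut apart in this sketch; how the real implementation treats that case is not stated in the commit message.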