Commit Graph

  • 54024958ac added url_file_name_s in query for live-search of urls Michael Peter Christen 2013-06-25 16:36:05 +02:00
  • 16d1d744fa added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which no longer has the file name as the last part of the path list. Michael Peter Christen 2013-06-25 16:27:20 +02:00
  • 8d1c4c423d make imageparser fileextension detection case insensitive (extensions are often upper case) reger 2013-06-23 00:39:15 +02:00
  • f542cf7d9c fix for daterange: the to-date is inclusive Michael Peter Christen 2013-06-21 15:47:12 +02:00
  • f9d859f5dc now writing image alt texts and (camelcase-)parsed urls into a text search field for a better image retrieval Michael Peter Christen 2013-06-18 16:51:56 +02:00
  • c36720d45f added daterange option to gsa api Michael Peter Christen 2013-06-18 16:25:00 +02:00
  • e441a9d4c8 to avoid confusion, the gsa api is available at /search? and /searchresult? Michael Peter Christen 2013-06-18 16:22:06 +02:00
  • 8792e6c6e9 stub for better image indexing orbiter 2013-06-18 13:28:30 +02:00
  • 97f2ac9091 added hint to gsa response writer that the result comes from a yacy peer orbiter 2013-06-17 13:29:03 +02:00
  • d62464f129 start of next development cycle with small version number 0.01 (as in the past) orbiter 2013-06-17 13:28:28 +02:00
  • 363e955a0c Release 1.5 Release_1.5 Michael Peter Christen 2013-06-13 23:50:00 +02:00
  • 14186e815e npe fix Michael Peter Christen 2013-06-13 22:42:21 +02:00
  • 4e3007f4a0 typo Michael Peter Christen 2013-06-13 22:40:46 +02:00
  • bdf306e0a7 increased time-out for loading of seed-lists Michael Peter Christen 2013-06-13 22:32:06 +02:00
  • 2cb6b6bc21 added target="_blank" to shutdown links Michael Peter Christen 2013-06-13 22:31:39 +02:00
  • c8e94ad7c7 fix for citation search in case that the citation is very fresh orbiter 2013-06-13 18:27:57 +02:00
  • 57dcf68665 added a feed-back message inside the shutdown page orbiter 2013-06-13 14:44:47 +02:00
  • 0600d510e1 show the citation report also in ViewFile Michael Peter Christen 2013-06-13 13:22:43 +02:00
  • 1a92b61d69 fixed usage of ViewFile which needs a commit before showing latest crawl result pages. Michael Peter Christen 2013-06-13 13:08:24 +02:00
  • 374d2e2a52 removed warning message during crawling Michael Peter Christen 2013-06-13 13:03:56 +02:00
  • 570511f3c8 removed fields references_internal_id_sxt and references_internal_url_sxt because they had been shown to be superfluous. The citation of referrer in the host browser is possible without them. Therefore now the host browser does not only show internal, but also external referrer to each link. Michael Peter Christen 2013-06-13 13:01:28 +02:00
  • fd1776a3b0 added a new 'Citations' function: each search result item can now be explored for citations within other documents. A click on the 'Citations' link shows an analysis with all text lines in the document each with a complete list of documents which contain the same line. A second section shows the linking documents in ascending order of number of citations from the original document. Because documents from different hosts are most interesting here, they are listed at the top of the page as possible 'copypasta' source. Michael Peter Christen 2013-06-12 15:02:49 +02:00
  • fc3ff92c69 npe fix Michael Peter Christen 2013-06-12 13:23:58 +02:00
  • 7754a1263b switching back to the merge factor 10; the solr default. Michael Peter Christen 2013-06-12 11:29:35 +02:00
  • 1762911f57 added synchronizations and timeouts in solr api; missing synchronizations in index modification methods causes deadlocks inside solr. Michael Peter Christen 2013-06-12 02:13:18 +02:00
  • 3e1e358fdc calling pdf cache flush on class initialization because calling of the methods during runtime can conflict with dynamic solr class loader and cause a deadlock (seriously!) Michael Peter Christen 2013-06-12 00:17:44 +02:00
  • 291912ee52 removed misleading http accessGranted message (this is only for debugging) Michael Peter Christen 2013-06-12 00:16:28 +02:00
  • 2fd7bbb450 reduced load on solr; no seed update in Status and no exists-check in HTTPLoader in case of redirects, that can be done using the htcache. Michael Peter Christen 2013-06-12 00:14:55 +02:00
  • 7ee71c2354 changed administration page headline to 'administration' Michael Peter Christen 2013-06-12 00:12:04 +02:00
  • 898e14471b changed windows icon again Michael Peter Christen 2013-06-12 00:10:25 +02:00
  • 959ccc4675 increased the solr merge factor because 4 was too much IO load for frequent index receiving and re-indexing after clickdepth/cr calculation. Michael Peter Christen 2013-06-11 16:51:40 +02:00
  • efd973d29d changed p2p/stealth mode text and links a bit Michael Peter Christen 2013-06-11 16:50:34 +02:00
  • 2648b42b27 added fixed clear method as public method Michael Peter Christen 2013-06-11 16:22:43 +02:00
  • 20fab1feb6 allip net has greedy learning disabled Michael Peter Christen 2013-06-11 14:52:46 +02:00
  • ffc570f95f removed forced soft commit since this may be the cause for a performance problem Michael Peter Christen 2013-06-11 14:51:26 +02:00
  • 6115bef335 added a 'greedy learning' mechanism which causes a 'fresh' yacy to load linked web pages from search results until the total number of web pages reaches 15000. This gives fresh peers a 'boost' to get a personalized search index faster. Michael Peter Christen 2013-06-11 14:42:30 +02:00
  • a5e328d7c5 new icons Michael Peter Christen 2013-06-11 13:16:46 +02:00
  • f24574b3da use a greeting line which does not sound so beta Michael Peter Christen 2013-06-11 13:12:59 +02:00
  • b85db72a73 added another response writer which can present search result with texts, separated by sentences. Then, these sentences can be used to search again in the index for the same sentence. This can be used to provide a tool for plagiarism-search. (not finished yet). Try the following: http://localhost:8090/solr/select?q=text_t:flut&grep=wasser&defType=edismax&start=0&rows=3&core=collection1&wt=grephtml .. to search for 'flut' and show only sentences in the result documents which contain the word 'wasser'. Consider this like using a grep-tool on documents: you select the documents by a search query and you grep sentences inside the found documents with the 'grep' attribute. Michael Peter Christen 2013-06-10 18:41:00 +02:00
  • 856e5c42ae the line "Web Search by the People, for the People" is more generic for P2P and portal search as default search string. Otherwise, if people switch to Portal mode, the "P2P Web Search" does not make sense. Michael Peter Christen 2013-06-10 18:36:06 +02:00
  • 8e965ffd16 fix for host compare in case that the host is null. This happens when doing a search in the intranet for file resources (they don't have a host). Michael Peter Christen 2013-06-10 16:23:58 +02:00
  • 5132bf719c added new buttons to search result page in p2p mode which show the switch between p2p search and the 'stealth mode' which is simply a non-p2p search within the p2p network. The functionality was there all the time, but the switch to this was not very visible. Michael Peter Christen 2013-06-10 16:22:00 +02:00
  • 2b320313d9 replaced yacydoc servlet usage by a solr result output using an html output writer. This made the creation of an html result writer necessary, which is included in this commit. The yacydoc servlet was used to present all metadata of a document, but the solr interface can serve this purpose in a much better way. All usages (except one) of yacydoc were replaced by a solr call. This also affects the 'metadata' link attached to search results. orbiter 2013-06-09 12:12:34 +02:00
  • 200769d0c6 show the cache link in search results only if there is actually a cache entry stored in HTCACHE orbiter 2013-06-09 08:15:23 +02:00
  • 713a6199ef activated citation ranking by default Michael Peter Christen 2013-06-07 14:26:14 +02:00
  • f7a4377812 usage of the new normalized link popularity CRn as default ranking function. This replaces the previous formula, which was bad. Before you update to this version, please check if you changed the ranking function yourself before, since it will be overwritten. Michael Peter Christen 2013-06-07 13:22:22 +02:00
  • f7e77a21bf Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, a backlink structure can also be discovered and written to the index. The host browser has been extended to show such backlinks for each presented link. The host browser can therefore now show where a document is linked from. The new citation reference is computed as the likelihood of a random click path, with recursive usage of the previously computed likelihood. This process is repeated until the likelihood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures. Michael Peter Christen 2013-06-07 13:20:57 +02:00
  • e20450e798 patch in HTCache and CitationIndex loading in case that a file is broken: do not crash; instead ignore the file and delete it. Michael Peter Christen 2013-06-07 12:52:03 +02:00
  • fdcd4e6a6f fixes to index deletion: quoting of host name (a '-' may be part of the url) and disabling the engage button when changing the url field at 'Delete by URL matching' Michael Peter Christen 2013-06-07 08:52:07 +02:00
  • d367b1f4d9 add null pointer check to stopword fix reger 2013-06-07 00:13:45 +02:00
  • 7480e87386 - fix stopword handling for RWI see example http://bugs.yacy.net/view.php?id=247 - append language setting specific stopword list reger 2013-06-06 22:07:54 +02:00
  • 5c7ddc67fe in GSA api enable usage of solr fq-attribute together with GSA site-attribute orbiter 2013-06-06 13:36:58 +02:00
  • 9fc0c4df98 fix for bad exists 'enhancement'; see bug: http://bugs.yacy.net/view.php?id=245 Michael Peter Christen 2013-06-02 13:50:12 +02:00
  • 9ef1fd9bac fix: enable use of solrcore.properties for property substitution of solrconfig.xml reger 2013-06-01 05:50:03 +02:00
  • 8a7fcb391d enable use of solrcore.properties for property substitution of solrconfig.xml - move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties - add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties reger 2013-06-01 05:43:08 +02:00
  • f7e887bf49 added missing class Michael Peter Christen 2013-05-30 16:39:48 +02:00
  • eb9d0ba5b1 ranking and boost function update, small bugfixes, better default search field for solr Michael Peter Christen 2013-05-30 16:30:35 +02:00
  • 5f92c68f1f removed block rank ranking and all YBR files in /ranking Michael Peter Christen 2013-05-30 13:01:22 +02:00
  • 164603b946 cleanup Michael Peter Christen 2013-05-30 12:47:22 +02:00
  • ba793a32c0 added timeout for remote searches of 10 seconds Michael Peter Christen 2013-05-30 12:39:28 +02:00
  • 1c4c1c0345 try to commit in case of failure which hopefully frees up some RAM Michael Peter Christen 2013-05-30 12:38:54 +02:00
  • 409d6edf53 Store node/solr search threads to be able to send them an interrupt signal in case that a cleanup process wants to remove the search process. Added also a new cleanup process which can reduce the number of stored searches to a specific number which can be higher or lower according to the remaining RAM. The cleanup process is called every time a search is started. Michael Peter Christen 2013-05-30 12:38:15 +02:00
  • 2a8b99ea82 remove text_t in search result after snippet has been computed to save space in search result cache Michael Peter Christen 2013-05-30 12:35:47 +02:00
  • a1644ca0fd new workflow processor in Segment to enqueue indexing documents to solr Michael Peter Christen 2013-05-30 12:34:53 +02:00
  • a8dc4346e8 default configuration of MMapDirectoryFactory for solr, increased lock timeout, less documents from remote searches (too many results had easily blocked a peer) Michael Peter Christen 2013-05-30 12:31:28 +02:00
  • 0c1a018bbd removed 'later' tactic because it used too much RAM, reduced number of soft commits, reduced caching size of search events, ensured that solr results are processed before connection is closed to keep that stuff not too long in RAM Michael Peter Christen 2013-05-29 18:27:27 +02:00
  • 5344a1c5f7 getting the trash out Michael Peter Christen 2013-05-29 16:09:05 +02:00
  • 709e9b8ce7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-05-29 13:49:42 +02:00
  • 9e07447d47 added new link for SMW Michael Peter Christen 2013-05-29 13:45:22 +02:00
  • 3c04dd11de removed dead link Michael Peter Christen 2013-05-29 13:42:38 +02:00
  • 1eb9626cca less logging Michael Peter Christen 2013-05-29 13:30:32 +02:00
  • 536fd1450e added new keys for update locations Michael Peter Christen 2013-05-29 13:10:32 +02:00
  • 281959a2d7 added option to re-boot the embedded solr during run-time. Added also API recording for this method so it can be repeated automatically. The index dump generation is now also available for API recording. Added some synchronization in backend which was necessary for this. Michael Peter Christen 2013-05-29 13:09:34 +02:00
  • 80a7989e8c fixed ClassCastException: [Ljava.lang.Object; cannot be cast to [Ljava.util.List; in robots.txt servlet Michael Peter Christen 2013-05-29 12:02:19 +02:00
  • da621e827e prevent NPE in case RWI is disabled orbiter 2013-05-28 16:26:38 +02:00
  • c2bcfd8afb Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-05-28 11:39:10 +02:00
  • 67757b425a use a retry handler with retryCount=0 because we usually expect requests to fail if we access non-permanently available resources (peers, web pages) and want to fail fast without repeating the same request which is doomed to fail. The previous appearance of http client connection had a 1-2-4-8-second timeout scheme, which caused that connection attempts lasted for 16 seconds. Michael Peter Christen 2013-05-28 11:38:45 +02:00
  • 7300d81f40 include API Table deletion requests to the API recorder Michael Peter Christen 2013-05-28 11:35:56 +02:00
  • c2b1075dcf activating pollImmediately in case that DHT receive is off. This will cause a much faster search result when running in public robinson mode. Michael Peter Christen 2013-05-28 10:36:49 +02:00
  • d2ade87b49 fixed missing thisaddress in yacysearch.html which caused that the opensearch link was not working Michael Peter Christen 2013-05-28 10:33:41 +02:00
  • 179d032181 added a (badly formatted) delete button for process scheduler entries Michael Peter Christen 2013-05-27 16:15:58 +02:00
  • 888a985dc6 set a higher limit for table copy usage orbiter 2013-05-27 15:23:12 +02:00
  • 2b563debbf javadoc of new multiple-exist test Michael Peter Christen 2013-05-27 13:45:09 +02:00
  • c03f75ebc3 fix DHT url receive see http://bugs.yacy.net/view.php?id=242 reger 2013-05-26 03:24:32 +02:00
  • 8fb1b1e290 *) simplified banner creation code Marc Nause 2013-05-25 12:56:43 +02:00
  • cd0b5f31b4 *) updated links to description of regex Marc Nause 2013-05-25 11:08:06 +02:00
  • 8f2d3ce2f9 reduced locking situation in crawler: shifted synchronized location and reduced time-out of robots.txt load limit Michael Peter Christen 2013-05-20 22:05:28 +02:00
  • f93501e6e0 nice crawl name if crawl is started with file:// (was: null) Michael Peter Christen 2013-05-20 11:25:26 +02:00
  • b4f0cac102 added the reindexing job servlet to the submenu structure Michael Peter Christen 2013-05-20 11:02:21 +02:00
  • 97ab5b90e8 - odt & ooxml (office document) parser correction to add content to fulltext index - adjust Junit yacyVersionTest & ParserTest - update yacyVersion.combined2prettyVersion to the default 4-digit minor ver. reger 2013-05-20 01:50:09 +02:00
  • b68fbe7d21 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-05-17 14:13:07 +02:00
  • 06d3063dc9 - no downcase when using collection modifier - removed warnings Michael Peter Christen 2013-05-17 14:11:10 +02:00
  • 8dbc80da70 redesign of index.exist-test: this shall now not be done using a single id to be tested, but with a collection of ids. This will cause only a single call to solr instead of many. The result is a much better performance when testing the existence of many urls. The effect should be much less IO during index transmission, both on sender and receiver side. Michael Peter Christen 2013-05-17 13:59:37 +02:00
  • 7f63d3747d more generic field selection for reindex option of documents with disabled fields using Luke request to compare config with actual fields in index reger 2013-05-15 23:16:32 +02:00
  • c91c67c3cd reject bad solr requests Michael Peter Christen 2013-05-15 22:42:05 +02:00
  • 44e363f37f refactoring of WorkflowProcessor, added process counter, update of process counter if a blocking thread dies. Added also a new column in PerformanceConcurrency_p servlet to show the actual number of concurrent processes. Michael Peter Christen 2013-05-13 13:28:07 +02:00
  • 4058369288 fixed query expressions for collection selection (added quotes) Michael Peter Christen 2013-05-13 13:27:01 +02:00
  • f2e36fbd06 enhanced deletion process for very large number of documents Michael Peter Christen 2013-05-13 13:26:24 +02:00
  • 79401cb938 added reindex option for documents with disabled or obsolete fields to Solr Schema Editor page (IndexSchema_p.html) this allows to remove obsolete fields from the index (according to current schema config) by selecting all documents containing disabled fields. reger 2013-05-13 04:06:57 +02:00
  • cf36c1614f prevent that concurrent deletion process causes wrong double-check in crawl start orbiter 2013-05-12 21:37:45 +02:00
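
Note on commit f7e77a21bf (citation reference CRn): the commit message describes the computation only in words, so the following is a rough sketch of one way such a value could be computed, not the code from that commit. It assumes a PageRank-style iteration without damping over the links of a single host, spreads the weight of pages without outgoing links evenly (an assumption), and normalizes by the largest value so that 0<=CRn<=1. The class name, method name and parameters are hypothetical.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not YaCy code: iterate a random-click likelihood
// over intra-host links until it settles, then scale it into [0, 1].
final class CitationRankSketch {

    /**
     * links.get(i) holds the indices of the pages that page i links to.
     * Returns one value per page; the most popular page gets 1.0.
     */
    static double[] normalizedRank(List<List<Integer>> links, double epsilon, int maxRounds) {
        int n = links.size();
        double[] cr = new double[n];
        Arrays.fill(cr, 1.0 / n); // start with equal likelihood for every page
        for (int round = 0; round < maxRounds; round++) {
            double[] next = new double[n];
            for (int from = 0; from < n; from++) {
                List<Integer> out = links.get(from);
                if (out.isEmpty()) {
                    // assumption: a page without links passes its weight to all pages
                    for (int to = 0; to < n; to++) next[to] += cr[from] / n;
                } else {
                    // reuse the previously computed likelihood of the linking page
                    for (int to : out) next[to] += cr[from] / out.size();
                }
            }
            double delta = 0.0;
            for (int i = 0; i < n; i++) delta += Math.abs(next[i] - cr[i]);
            cr = next;
            if (delta < epsilon) break; // values no longer change noticeably
        }
        double max = 0.0;
        for (double v : cr) max = Math.max(max, v);
        double[] crn = new double[n];
        for (int i = 0; i < n; i++) crn[i] = max == 0.0 ? 0.0 : cr[i] / max;
        return crn;
    }

    public static void main(String[] args) {
        // three pages of one host: 0 -> 1, 2; 1 -> 2; 2 -> 0
        List<List<Integer>> links = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(2), Arrays.asList(0));
        System.out.println(Arrays.toString(normalizedRank(links, 1e-9, 1000)));
    }
}
```

Without a damping factor the iteration is not guaranteed to settle on every link graph, which is why the sketch also takes a maximum number of rounds.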
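
Note on commit 8dbc80da70 (exist-test with a collection of ids): the idea of one solr call for many ids can be illustrated by building a single query string that matches any of the given ids. This is only a sketch under assumptions: the class and method names are hypothetical, the field name "id" in the usage is assumed, and the actual commit may work differently. Each id is quoted, in line with the quoting fixes mentioned in commits 4058369288 and fdcd4e6a6f.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.StringJoiner;

// Hypothetical sketch, not YaCy code: one query for many ids
// instead of one query per id.
final class ExistsQuerySketch {

    /** Builds a query like id:("a" OR "b") for the given field and ids. */
    static String buildExistsQuery(String field, Collection<String> ids) {
        if (ids.isEmpty()) {
            throw new IllegalArgumentException("at least one id is required");
        }
        StringJoiner terms = new StringJoiner(" OR ", field + ":(", ")");
        for (String id : ids) {
            // quote each id and escape backslashes and quotes inside it
            String escaped = id.replace("\\", "\\\\").replace("\"", "\\\"");
            terms.add("\"" + escaped + "\"");
        }
        return terms.toString();
    }

    public static void main(String[] args) {
        System.out.println(buildExistsQuery("id", Arrays.asList("abc", "d-e")));
    }
}
```

The ids returned by such a query are the ones that exist; every requested id missing from the result does not, so a single round trip answers the test for the whole collection.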