Commit Graph

  • fd90fcc4e0 Fixes #196. Felix Ableitner 2013-07-02 20:45:41 +02:00
  • 5a5d411ec0 new robots_i attribute fields Michael Peter Christen 2013-07-02 14:29:13 +02:00
  • fa08bd9d5a hack to prevent long waiting times in crawler Michael Peter Christen 2013-07-01 13:24:52 +02:00
  • f1c5338210 preparation for greedy crawl profiles and refactoring Michael Peter Christen 2013-07-01 13:10:09 +02:00
  • e6f361f474 adding the canonical tag to crawl queues Michael Peter Christen 2013-07-01 13:09:41 +02:00
  • 40c5ee47c1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git orbiter 2013-06-30 12:07:25 +02:00
  • ae23a0badb updated copyright message; included LGPL for 'cora' and a warranty warning. orbiter 2013-06-30 11:30:39 +02:00
  • a6bf44212e bugfix: location (lat/lon) meta data retrieval (Double.NaN check) reger 2013-06-30 03:50:07 +02:00
  • 203921006a redesign of citation index storage Michael Peter Christen 2013-06-30 02:11:46 +02:00
  • 7c6ccc426c set crawlingQ to true by default because most webpages are dynamic and crawlingQ should only be switched off in case of crawler traps orbiter 2013-06-29 20:28:14 +02:00
  • 5de4267a9d windows installer: update to latest jre Lotus 2013-06-29 18:54:30 +02:00
  • 83763ee4a4 jpeg parser: extract GPS location from meta data reger 2013-06-29 00:35:43 +02:00
  • e92b9275ce Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-06-28 15:33:29 +02:00
  • 56cdcfa2fa fixed greedy learning mode - global is not a search attribute in searchitems Michael Peter Christen 2013-06-28 15:33:19 +02:00
  • 32aa1d4569 removed unused option for queries Michael Peter Christen 2013-06-28 15:32:36 +02:00
  • 0c5bed7e2c added configuration option for greedy learning function to ConfigPortal servlet Michael Peter Christen 2013-06-28 15:31:36 +02:00
  • 5d1f619f07 possible helpful closing of solr-requests sixcooler 2013-06-28 15:19:50 +02:00
  • 9d291764d1 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-06-28 15:03:25 +02:00
  • e5abccdfe4 added optimize-option sixcooler 2013-06-28 14:51:37 +02:00
  • 8ea6ddf636 removed attributes from ConfigPortal.html which are redundant to ConfigSearchPage_p.html Michael Peter Christen 2013-06-28 14:17:14 +02:00
  • 64140f35cd fix for solr requests if no query part is given (prevent npe) Michael Peter Christen 2013-06-28 13:16:25 +02:00
  • 8caaf6203a fixed false multiple-generation of remote facet search which caused high cpu usage on remote side. Michael Peter Christen 2013-06-28 12:39:36 +02:00
  • 23fb458963 fix to gsa searchresult answer in case that no query part is given; fix to gsa default number of results (is 'num') Michael Peter Christen 2013-06-28 12:22:33 +02:00
  • 823ae4d6a7 added url_protocol_s to error documents Michael Peter Christen 2013-06-26 16:51:36 +02:00
  • 660a196989 refactoring Michael Peter Christen 2013-06-26 09:27:22 +02:00
  • c4538d8d91 added metadata-extractor-2.6.2.jar to eclipse classpath, removed old lib Michael Peter Christen 2013-06-26 09:26:34 +02:00
  • 3760e2616b bump up lib/metadata-extractor-2.6.2.jar (used for image parser) with needed code adjustments reger 2013-06-25 23:24:02 +02:00
  • 9a6fcdf597 npe fix Michael Peter Christen 2013-06-25 16:36:16 +02:00
  • 54024958ac added url_file_name_s in query for live-search of urls Michael Peter Christen 2013-06-25 16:36:05 +02:00
  • 16d1d744fa added url_file_name_s in default collection schema for the file name without the file extension. This part of the file path is removed from the multi-field url_paths_sxt, which no longer has the file name as the last part of the path list. Michael Peter Christen 2013-06-25 16:27:20 +02:00
  • 8d1c4c423d make imageparser fileextension detection case insensitive (extensions are often upper case) reger 2013-06-23 00:39:15 +02:00
  • f542cf7d9c fix for daterange: the to-date is inclusive Michael Peter Christen 2013-06-21 15:47:12 +02:00
  • f9d859f5dc now writing image alt texts and (camelcase-)parsed urls into a text search field for a better image retrieval Michael Peter Christen 2013-06-18 16:51:56 +02:00
  • c36720d45f added daterange option to gsa api Michael Peter Christen 2013-06-18 16:25:00 +02:00
  • e441a9d4c8 to avoid confusion, the gsa api is available at /search? and /searchresult? Michael Peter Christen 2013-06-18 16:22:06 +02:00
  • 8792e6c6e9 stub for better image indexing orbiter 2013-06-18 13:28:30 +02:00
  • 97f2ac9091 added hint to gsa response writer that the result comes from a yacy peer orbiter 2013-06-17 13:29:03 +02:00
  • d62464f129 start of next development cycle with small version number 0.01 (as in the past) orbiter 2013-06-17 13:28:28 +02:00
  • 363e955a0c Release 1.5 Release_1.5 Michael Peter Christen 2013-06-13 23:50:00 +02:00
  • 14186e815e npe fix Michael Peter Christen 2013-06-13 22:42:21 +02:00
  • 4e3007f4a0 typo Michael Peter Christen 2013-06-13 22:40:46 +02:00
  • bdf306e0a7 increased time-out for loading of seed-lists Michael Peter Christen 2013-06-13 22:32:06 +02:00
  • 2cb6b6bc21 added target="_blank" to shutdown links Michael Peter Christen 2013-06-13 22:31:39 +02:00
  • c8e94ad7c7 fix for citation search in case that the citation is very fresh orbiter 2013-06-13 18:27:57 +02:00
  • 57dcf68665 added a feed-back message inside the shutdown page orbiter 2013-06-13 14:44:47 +02:00
  • 0600d510e1 show the citation report also in ViewFile Michael Peter Christen 2013-06-13 13:22:43 +02:00
  • 1a92b61d69 fixed usage of ViewFile which needs a commit before showing latest crawl result pages. Michael Peter Christen 2013-06-13 13:08:24 +02:00
  • 374d2e2a52 removed warning message during crawling Michael Peter Christen 2013-06-13 13:03:56 +02:00
  • 570511f3c8 removed fields references_internal_id_sxt and references_internal_url_sxt because they had been shown to be superfluous. The citation of referrer in the host browser is possible without them. Therefore now the host browser does not only show internal, but also external referrer to each link. Michael Peter Christen 2013-06-13 13:01:28 +02:00
  • fd1776a3b0 added a new 'Citations' function: each search result item can now be explored for citations within other documents. A click on the 'Citations' link shows an analysis with all text lines in the document each with a complete list of documents which contain the same line. A second section shows the linking documents in ascending order of number of citations from the original document. Because documents from different hosts are most interesting here, they are listed at the top of the page as possible 'copypasta' source. Michael Peter Christen 2013-06-12 15:02:49 +02:00
  • fc3ff92c69 npe fix Michael Peter Christen 2013-06-12 13:23:58 +02:00
  • 7754a1263b switching back to the merge factor 10; the solr default. Michael Peter Christen 2013-06-12 11:29:35 +02:00
  • 1762911f57 added synchronizations and timeouts in solr api; missing synchronizations in index modification methods cause deadlocks inside solr. Michael Peter Christen 2013-06-12 02:13:18 +02:00
  • 3e1e358fdc calling pdf cache flush on class initialization because calling of the methods during runtime can conflict with dynamic solr class loader and cause a deadlock (seriously!) Michael Peter Christen 2013-06-12 00:17:44 +02:00
  • 291912ee52 removed misleading http accessGranted message (this is only for debugging) Michael Peter Christen 2013-06-12 00:16:28 +02:00
  • 2fd7bbb450 reduced load on solr; no seed update in Status and no exists-check in HTTPLoader in case of redirects, that can be done using the htcache. Michael Peter Christen 2013-06-12 00:14:55 +02:00
  • 7ee71c2354 changed administration page headline to 'administration' Michael Peter Christen 2013-06-12 00:12:04 +02:00
  • 898e14471b changed windows icon again Michael Peter Christen 2013-06-12 00:10:25 +02:00
  • 959ccc4675 increased the solr merge factor because 4 was too much IO load for frequent index receiving and re-indexing after clickdepth/cr calculation. Michael Peter Christen 2013-06-11 16:51:40 +02:00
  • efd973d29d changed p2p/stealth mode text and links a bit Michael Peter Christen 2013-06-11 16:50:34 +02:00
  • 2648b42b27 added fixed clear method as public method Michael Peter Christen 2013-06-11 16:22:43 +02:00
  • 20fab1feb6 allip net has greedy learning disabled Michael Peter Christen 2013-06-11 14:52:46 +02:00
  • ffc570f95f removed forced soft commit since this may be the cause for a performance problem Michael Peter Christen 2013-06-11 14:51:26 +02:00
  • 6115bef335 added a 'greedy learning' mechanism which causes a 'fresh' yacy to load linked web pages from search results until the total number of web pages reaches 15000. This shall give fresh peers a 'boost' to get a personalized search index faster. Michael Peter Christen 2013-06-11 14:42:30 +02:00
  • a5e328d7c5 new icons Michael Peter Christen 2013-06-11 13:16:46 +02:00
  • f24574b3da use a greeting line which does not sound so beta Michael Peter Christen 2013-06-11 13:12:59 +02:00
  • b85db72a73 added another response writer which can present search result with texts, separated by sentences. Then, these sentences can be used to search again in the index for the same sentence. This can be used to provide a tool for plagiarism-search. (not finished yet). Try the following: http://localhost:8090/solr/select?q=text_t:flut&grep=wasser&defType=edismax&start=0&rows=3&core=collection1&wt=grephtml .. to search for 'flut' and show only sentences in the result documents which contain the word 'wasser'. Consider this like using a grep-tool on documents: you select the documents by a search query and you grep sentences inside the found documents with the 'grep' attribute. Michael Peter Christen 2013-06-10 18:41:00 +02:00
  • 856e5c42ae the line "Web Search by the People, for the People" is more generic for P2P and portal search as default search string. Otherwise, if people switch to Portal mode, the "P2P Web Search" does not make sense. Michael Peter Christen 2013-06-10 18:36:06 +02:00
  • 8e965ffd16 fix for host compare in case that the host is null. This happens when doing a search in the intranet for file resources (they don't have a host). Michael Peter Christen 2013-06-10 16:23:58 +02:00
  • 5132bf719c added new buttons to search result page in p2p mode which show the switch between p2p search and the 'stealth mode' which is simply a non-p2p search within the p2p network. The functionality was there all the time, but the switch to this was not very visible. Michael Peter Christen 2013-06-10 16:22:00 +02:00
  • 2b320313d9 replaced yacydoc servlet usage by a solr result output using an html output writer. This made the creation of an html result writer necessary, which is included in this commit. The yacydoc servlet was used to present all metadata of a document, but the solr interface can serve this purpose in a much better way. All usages (except one) of yacydoc were replaced by a solr call. This also affects the 'metadata' link attached to search results. orbiter 2013-06-09 12:12:34 +02:00
  • 200769d0c6 show the cache link in search results only if there is actually a cache entry stored in HTCACHE orbiter 2013-06-09 08:15:23 +02:00
  • 713a6199ef activated citation ranking by default Michael Peter Christen 2013-06-07 14:26:14 +02:00
  • f7a4377812 usage of the new normalized link popularity CRn as default ranking function. This replaces the previous formula, which was bad. Before you update to this version, please check if you changed the ranking function yourself before, since it will be overwritten. Michael Peter Christen 2013-06-07 13:22:22 +02:00
  • f7e77a21bf Added a citation reference computation for intra-domain link structures. While the values for the reference evaluation are computed, a backlink structure can also be discovered and written to the index. The host browser has been extended to show such backlinks for each presented link. The host browser can therefore now show where a document is linked from. The new citation reference is computed as the likelihood for a random click path with recursive usage of the previously computed likelihood. This process is repeated until the likelihood converges to a specific number. This number is then normalized to a ranking value CRn, 0<=CRn<=1. The value CRn can therefore be used to rank popularity within intra-domain link structures. Michael Peter Christen 2013-06-07 13:20:57 +02:00
  • e20450e798 patch in HTCache and CitationIndex loading in case that a file is broken: do not crash; instead ignore the file and delete it. Michael Peter Christen 2013-06-07 12:52:03 +02:00
  • fdcd4e6a6f fixes to index deletion: quoting of host name (a '-' may be part of the url) and disabling the engage button when changing the url field at 'Delete by URL matching' Michael Peter Christen 2013-06-07 08:52:07 +02:00
  • d367b1f4d9 add null pointer check to stopword fix reger 2013-06-07 00:13:45 +02:00
  • 7480e87386 fix stopword handling for RWI (see example http://bugs.yacy.net/view.php?id=247); append language setting specific stopword list reger 2013-06-06 22:07:54 +02:00
  • 5c7ddc67fe in GSA api enable usage of solr fq-attribute together with GSA site-attribute orbiter 2013-06-06 13:36:58 +02:00
  • 9fc0c4df98 fix for bad exists 'enhancement'; see bug: http://bugs.yacy.net/view.php?id=245 Michael Peter Christen 2013-06-02 13:50:12 +02:00
  • 9ef1fd9bac fix: enable use of solrcore.properties for property substitution of solrconfig.xml reger 2013-06-01 05:50:03 +02:00
  • 8a7fcb391d enable use of solrcore.properties for property substitution of solrconfig.xml; move setting of system property solr.directoryFactory=solr.MMapDirectoryFactory to solrcore.properties; add check of os.arch for 64bit system, if it fails use default/solrcore.x86.properties (if exists) as solrcore.properties reger 2013-06-01 05:43:08 +02:00
  • f7e887bf49 added missing class Michael Peter Christen 2013-05-30 16:39:48 +02:00
  • eb9d0ba5b1 ranking and boost function update, small bugfixes, better default search field for solr Michael Peter Christen 2013-05-30 16:30:35 +02:00
  • 5f92c68f1f removed block rank ranking and all YBR files in /ranking Michael Peter Christen 2013-05-30 13:01:22 +02:00
  • 164603b946 cleanup Michael Peter Christen 2013-05-30 12:47:22 +02:00
  • ba793a32c0 added timeout for remote searches of 10 seconds Michael Peter Christen 2013-05-30 12:39:28 +02:00
  • 1c4c1c0345 try to commit in case of failure which hopefully frees up some RAM Michael Peter Christen 2013-05-30 12:38:54 +02:00
  • 409d6edf53 Store node/solr search threads to be able to send them an interrupt signal in case that a cleanup process wants to remove the search process. Also added a new cleanup process which can reduce the number of stored searches to a specific number which can be higher or lower according to the remaining RAM. The cleanup process is called every time a search is started. Michael Peter Christen 2013-05-30 12:38:15 +02:00
  • 2a8b99ea82 remove text_t in search result after snippet has been computed to save space in search result cache Michael Peter Christen 2013-05-30 12:35:47 +02:00
  • a1644ca0fd new workflow processor in Segment to enqueue indexing documents to solr Michael Peter Christen 2013-05-30 12:34:53 +02:00
  • a8dc4346e8 default configuration of MMapDirectoryFactory for solr, increased lock timeout, less documents from remote searches (too many results had easily blocked a peer) Michael Peter Christen 2013-05-30 12:31:28 +02:00
  • 0c1a018bbd removed 'later' tactic because it used too much RAM, reduced number of soft commits, reduced caching size of search events, ensured that solr results are processed before connection is closed to keep that stuff not too long in RAM Michael Peter Christen 2013-05-29 18:27:27 +02:00
  • 5344a1c5f7 getting the trash out Michael Peter Christen 2013-05-29 16:09:05 +02:00
  • 709e9b8ce7 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2013-05-29 13:49:42 +02:00
  • 9e07447d47 added new link for SMW Michael Peter Christen 2013-05-29 13:45:22 +02:00
  • 3c04dd11de removed dead link Michael Peter Christen 2013-05-29 13:42:38 +02:00
  • 1eb9626cca less logging Michael Peter Christen 2013-05-29 13:30:32 +02:00
  • 536fd1450e added new keys for update locations Michael Peter Christen 2013-05-29 13:10:32 +02:00
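
Note on commit f7e77a21bf: the message describes the citation reference as a likelihood for a random click path, recomputed from the previous values until it converges and then normalized to CRn with 0<=CRn<=1. The following is only a minimal sketch of that idea, not the YaCy implementation. The function name `citation_rank`, the damping factor, the even spreading of rank from pages without outgoing links, and the normalization by the largest value are all assumptions made for illustration; the commit message does not specify any of them.

```python
def citation_rank(links, damping=0.85, tol=1e-9, max_iter=1000):
    """Iterate a random-click likelihood over an intra-host link graph.

    links: dict mapping a page to the list of pages it links to.
    Returns a dict page -> value in [0, 1] (hypothetical stand-in for CRn).
    The damping factor is an assumption; it keeps the iteration convergent.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    if n == 0:
        return {}
    # Start with a uniform likelihood for every page.
    rank = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        # Likelihood held by pages without outgoing links is spread evenly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        nxt = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        # Each page passes its previous likelihood on to its link targets.
        for src in pages:
            targets = links.get(src) or []
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    nxt[t] += share
        delta = sum(abs(nxt[p] - rank[p]) for p in pages)
        rank = nxt
        if delta < tol:  # converged
            break
    # Normalize so the most cited page gets 1 (assumed normalization).
    top = max(rank.values())
    return {p: r / top for p, r in rank.items()}
```

Called with a small graph such as `{"index": ["a", "b"], "a": ["index"], "b": ["index"], "c": ["index"]}`, the page that every other page links to receives the highest value, and a page with no incoming links receives the lowest.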