3897bb4409added (manual) urldb migration (link on: Index Administration -> Federated Solr Index) - migrates all entries in old urldb
reger
2013-01-14 03:06:24 +01:00
3b6e08b49fprevent checking of urldb if empty - disconnect urlIndexFile if empty - add missing lock class in submenuSearchConfiguration
reger
2013-01-12 15:20:23 +01:00
1fb452174aread defaults from yacy.init for "Set to Defaults" button
reger
2013-01-05 20:47:18 +01:00
f143804382fix configuration for search page navigators - added additional config page (ConfigSearchPage_p) for easy setup of search page layout (to not overload ConfigPortal page) - currently redundant setting with part of ConfigPortal page - added missing config for filetype and protocol navigator - adjusted init of SearchEvent to check navigation config setting - renamed RankingProcess.getTopicNavigator to getTopics (to distinguish it from the added SearchEvent.getTopicNavigator)
reger
2013-01-05 19:00:54 +01:00
24db2fcd9dfix for Network info
Michael Peter Christen
2013-01-05 11:52:35 +01:00
22c694f906activated the clickdepth_i attribute for solr again because the calculation of that value is not as extensive as expected and furthermore the value is very useful for ranking
Michael Peter Christen
2013-01-05 01:00:18 +01:00
becd52a984added also a re-calculation of reference counts during the post-processing of clickcount calculations. This is a really nice thing to have because the reference count affects ranking.
Michael Peter Christen
2013-01-05 00:58:27 +01:00
fc47109608added 'Last Hour' to network statistics
Michael Peter Christen
2013-01-05 00:37:52 +01:00
38d3feae65added separate delete commands for the local+remote solr index, the old metadata and old rwi and for the citation index. The important advancement is the separation of the citation index deletion because that index is responsible for the linkdepth calculation. Now a search index can be deleted without the citation index, so fewer clickdepths should need post-processing.
Michael Peter Christen
2013-01-04 16:39:34 +01:00
6f0baaa309added the clickdepth post-processing: some links may have 'shortcuts' to already calculated click depths. These are then calculated if the crawl buffer is empty and therefore no new 'shortcuts' can be discovered. The status of the clickdepth stack (to-be-processed) can be seen using a solr search command like this: http://localhost:8090/solr/select?q=process_sxt:[*%20TO%20*]&start=0&rows=30&fl=sku,clickdepth_i,process_sxt
Michael Peter Christen
2013-01-04 16:37:39 +01:00
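The status query quoted in the entry above can also be assembled programmatically. The following is a minimal sketch, assuming a peer on localhost:8090 as in the commit message; the function name `clickdepth_status_url` is hypothetical and not part of YaCy. Note that `urlencode` escapes the range query differently from the hand-written URL (for example `+` instead of `%20`), but it decodes to the same parameters.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def clickdepth_status_url(base="http://localhost:8090/solr/select", rows=30):
    """Build a Solr select URL listing documents that still carry a process tag."""
    params = {
        "q": "process_sxt:[* TO *]",           # any document with a process_sxt value
        "start": 0,
        "rows": rows,
        "fl": "sku,clickdepth_i,process_sxt",  # fields named in the commit message
    }
    return base + "?" + urlencode(params)

url = clickdepth_status_url()
# Decode the query string again to check what Solr would receive.
decoded = parse_qs(urlsplit(url).query)
print(decoded["q"][0])
```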
0f5b6f38c1enhanced root-url detection
Michael Peter Christen
2013-01-03 19:21:21 +01:00
5a0eb1b268clickpath should not be active by default because it needs extensive computation - partly to be implemented
Michael Peter Christen
2013-01-03 01:30:05 +01:00
8ae08a2cacmoved HTCache, Heuristics and Parser servlet to a more appropriate menu location
Michael Peter Christen
2013-01-03 01:27:16 +01:00
5c0c56cfe1Preparations to produce a click depth attribute in the search index. This attribute can be used for ranking and for other purposes (demand by customer). The click depth is computed in two steps: - during indexing the current fill-state of the reverse link index is used to backtrack the current page to the root page. The length of that backtrack is the clickdepth. But this does not discover the shortest click depth. To get this, a second process to check again is needed - added a process tag that can be used to do operations on the existing index after a crawl; i.e. calculating the shortest clickpath. Added a field to control this operation but not a method to operate on this. - added a visualization of the clickpath length in the host browser
Michael Peter Christen
2013-01-02 20:55:43 +01:00
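The entry above distinguishes a first-pass click depth (found by backtracking over whatever links are known at indexing time) from the shortest click depth. As an illustration of the latter only, here is a minimal sketch that runs a breadth-first search over a complete forward link graph. This is not YaCy's implementation, which works on the reverse link index; the function name and the sample graph are assumptions made for the example.

```python
from collections import deque

def shortest_click_depths(links, root):
    """Breadth-first search: depth = minimal number of clicks from root."""
    depths = {root: 0}
    queue = deque([root])
    while queue:
        page = queue.popleft()
        for target in links.get(page, ()):
            if target not in depths:  # the first visit is the shortest path
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths

# Hypothetical site: "/d" is reachable via "/c" (3 clicks) and via "/b" (2 clicks).
links = {
    "/": ["/a", "/b"],
    "/a": ["/c"],
    "/b": ["/c", "/d"],
    "/c": ["/d"],
}
depths = shortest_click_depths(links, "/")
print(depths)
```

A page discovered early through a long path would get too large a depth in a single indexing-time pass, which is why a second pass over the finished link graph is needed.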
6861af87e2removed warnings
Michael Peter Christen
2013-01-02 19:05:48 +01:00
295884fd54- Merge commit '168b1d130d9d67b5e8855a0b50c4ba7ad4a416f8' - fixed conflict in htroot/yacysearch.java - removed nedres check because that causes that the remote server is not called at all in most cases (local index has already results but we want more) - fixed a regex bug (a '=' too much)
Michael Peter Christen
2013-01-02 15:08:07 +01:00
276e63401esmall sanitary fixes - exclude unix shell scripts in NSIS windows install archive - replace link to env/grafics/yacy.gif to yacy.png (build.nsi) - remove unused code lines (Blacklist_p, Response, WordReferenceVars) - type & xhtml (RankingSolr_p.html)
reger
2013-01-02 01:59:47 +01:00
f301336adffix: no results with configuration citation reference index switched off - urlcitationindex != null check added to ResultEntry.referencesCount - plus other places where conflicting procedure was used (and urlcitationindex not already checked != null)
reger
2012-12-30 02:13:48 +01:00
fe50702eb0added a filterscannerfail attribute to QueryParams which causes that a check to the network scanner fail/success status can be used/suppressed for search results. This is a feature that comes with the port scanner.
orbiter
2012-12-29 17:47:34 +01:00
168b1d130dAdding heuristic to get search results from configured systems which support opensearch specification - any system supporting opensearch specification can be configured - search query is only forwarded to remote system if not enough results available on local peer - discover function provided, checking the local Solr index for links to opensearchdescription files, to add to the config - sample config file with some general search engines with opensearch support
reger
2012-12-29 08:24:48 +01:00
eb90d38cd7added missing extension 'mkv' for navigation
Michael Peter Christen
2012-12-27 13:56:13 +01:00
e9e0d63897Add config option to show HostBrowser link in search result - ConfigPortal: added checkbox Host Browser - yacy.init: added search.result.show.hostbrowser as default = on (true) - fix HostBrowser: broken link to protected WebStructurePicture for public user
reger
2012-12-27 10:01:10 +01:00
9dfc9c95d8updated slf4j and log4j
Michael Peter Christen
2012-12-27 04:37:21 +01:00
95712fdc8bupdate to pdf parser
Michael Peter Christen
2012-12-27 04:16:31 +01:00
4a9182ae16use the search configuration to default the cacheStrategy to the value as given in the search configuration
Michael Peter Christen
2012-12-27 03:19:21 +01:00
98819ec3d9use solr boost configuration to select search fields. At this time it is possible to enter a negative boost value to switch that value off. This might be different in the future with a better input interface.
Michael Peter Christen
2012-12-27 03:17:45 +01:00
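The rule in the entry above (a negative boost switches a field off) can be sketched as follows. This is a minimal illustration, not YaCy's Boost class: the function name `build_qf`, the sample field names, and the idea that the result is passed as a `field^boost` list are all assumptions.

```python
def build_qf(boosts):
    """Return a Solr-style 'field^boost' list, skipping negative boosts."""
    return " ".join(
        f"{field}^{boost}" for field, boost in boosts.items() if boost >= 0
    )

# Hypothetical configuration: the author field is switched off with -1.0.
boosts = {"title": 5.0, "text_t": 1.0, "author": -1.0}
qf = build_qf(boosts)
print(qf)
```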
a6ad1d6fd1update to search tests (use yacy interface and a bugfix)
Michael Peter Christen
2012-12-27 03:15:50 +01:00
e1f89efd0d- made image search in interactive search using the ViewImage servlet - that enables viewing of images for intranet SMB servers. - added a filter search for protocol, tld and ext again; otherwise p2p search produces a lot of rubbish
Michael Peter Christen
2012-12-26 21:25:27 +01:00
8f3bd0c387fix for smb crawl situation (lost too many urls)
Michael Peter Christen
2012-12-26 19:15:11 +01:00
d456f69381SeedUpload url : check to reject localhost url included in saveSeedList (same check as in / copied from Seed.isProper() ), to prevent identity change on next startup (due to rejected seeduploadurl).
reger
2012-12-24 23:29:02 +01:00
fbf84e9ff3fix SeedUpload setting property name for include template file
reger
2012-12-24 04:13:38 +01:00
4987caf1c9- apply fix for localhost handling (from yacy2solr) also to metadata2solr
reger
2012-12-23 01:30:52 +01:00
0148f1bb8cfix: exception if default work files don't exist
reger
2012-12-22 23:03:39 +01:00
9e4033f229fix for event starter: delete start time when event is removed
Michael Peter Christen
2012-12-22 21:16:22 +01:00
99271ffd13copy work tables from defaults/data/work if exist there and not in DATA/WORK This can be used to create start-up behavior work scripts in the api.bheap table
Michael Peter Christen
2012-12-22 20:54:05 +01:00
99edbf6f14fix for config basic: do not accept empty peer names
Michael Peter Christen
2012-12-22 20:52:52 +01:00
24c9bb35f7extended the Scheduler: introduced scheduled events - an event type (once, regular) can be selected - for this event type, a fixed time can be selected. This may be either directly after startup or at one of the full hours of a day (==25 options). The main point about this feature is the opportunity to start an action directly after startup. That makes it possible to create YaCy distributions which, when started for the first time, begin to index parts of the intranet/internet by themselves.
Michael Peter Christen
2012-12-22 16:27:14 +01:00
433143ba40removed protocol, tld, ext from the urlmask and created specific navigation field for these
Michael Peter Christen
2012-12-19 12:45:40 +01:00
84f82541e8search process enhancements
Michael Peter Christen
2012-12-19 10:41:22 +01:00
02020b590b- removed all extension types from extension navigation which are not proper/known - automatically show the protocol navigation if there is more than http and https - automatically show the extension navigation if there is some media content
Michael Peter Christen
2012-12-19 02:38:05 +01:00
01200f06ccusing the author field as solr-native facet. this makes it necessary to introduce a copy-field for the author field to be copied to a string field. This field is then used to generate facets. Without this field, the facet would consist only of the words of the author names, not of the full author string.
Michael Peter Christen
2012-12-19 01:56:33 +01:00
2a4c064c89using the publisher information for the author field if no author is given. This applies to cases where only the copyright field in the html header is filled but not the author field
Michael Peter Christen
2012-12-19 01:54:35 +01:00
bab573361f- using a filter query for facet restriction - calculating the whole search result in at most two sub-queries from solr
Michael Peter Christen
2012-12-19 01:00:57 +01:00
7ad5457db0using the solr facets as navigation in yacyinteractive.html instead of counting locally result types
Michael Peter Christen
2012-12-19 00:59:40 +01:00
eac9650b31added another solr field clickdepth_i which reflects the number of clicks which are necessary to get from the portal of a host to a specific document. At this time, only the start document is flagged with clickdepth '0', all other with '-1'. To get the actual clickdepth, a process must use crawled information to collect the actual number of clicks. This will be added in another/next step.
Michael Peter Christen
2012-12-18 17:20:42 +01:00
1052263af3- added a new solr field references_i which stores the number of INCOMING links to the corresponding web page. This information is taken from the reverse link index (a 'little sister' of the RWI index). - this field can be of use to enhance the ranking because a web page with more incoming links can be more important than others. But this is not true for typical link pages like menus. Therefore the number of outgoing links is needed. - added a new solr attribute 'bf' to solr queries which is a boost function extension. This field can contain a formula which computes the boost according to given field values. After some experiments the following formula is now default: div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))^0.4 This takes the number of references and the inbound links. Further experiments are needed to enhance that formula.
Michael Peter Christen
2012-12-18 14:42:35 +01:00
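To see how the default boost function in the entry above behaves, the arithmetic can be reproduced outside Solr. This is a minimal sketch: the function name `link_boost` is hypothetical, and the trailing `^0.4` is read here as Solr's weight on the function query rather than as an exponent, which is an assumption; only the function value itself is computed.

```python
def link_boost(references, inboundlinkscount):
    """Value of div(add(1,references_i),pow(add(1,inboundlinkscount_i),1.6))."""
    return (1 + references) / (1 + inboundlinkscount) ** 1.6

# More incoming references raise the value; a page with many links of its own
# (for example a menu page) is damped by the 1.6 exponent in the denominator.
for refs, links_on_page in [(0, 0), (9, 0), (9, 20), (50, 200)]:
    print(refs, links_on_page, round(link_boost(refs, links_on_page), 4))
```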
7c3de8b4cd- fix for localhost detection - added IPv6 patterns for localhost detection
Michael Peter Christen
2012-12-18 12:52:20 +01:00
34f8786508removed dependency of vocabulary navigation from Jena and its triplestore; the vocabulary search is now done using generic solr fields which are created on-the-fly during runtime.
Michael Peter Christen
2012-12-18 02:29:03 +01:00
ad71747525fix: set default language to "en"
reger
2012-12-16 20:53:45 +01:00
9319b90d8a- fixes for host navigation - fixes for filetype navigation - removed unused code
Michael Peter Christen
2012-12-15 09:14:49 +01:00
cb5cbec14ddistinguishing modified query string and original query string
Michael Peter Christen
2012-12-15 00:05:46 +01:00
fb0fa9a102- fixed 'delete from subpath' during crawl start which deleted nothing; now works; - changed some crawl start html design details
Michael Peter Christen
2012-12-11 13:38:28 +01:00
23d4a62345fixes in the Russian translation, chmod a-x cn.lng
Aleksej
2012-12-11 13:44:25 +04:00
899fd8b62dMerge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
orbiter
2012-12-10 21:18:56 +01:00
712cc37c40if maxFileSize < 0 then there is no file size limit.
orbiter
2012-12-10 21:17:45 +01:00
54e193a2b8you can now search for '*' to get just ALL entries in the search index as result list. This makes sense if you intend to search just by using the navigation tools to cut the data set into navigation 'slices'.
orbiter
2012-12-10 21:00:30 +01:00
1228a5798dyou can now search for '*' to get just ALL entries in the search index as result list. This makes sense if you intend to search just by using the navigation tools to cut the data set into navigation 'slices'.
orbiter
2012-12-10 20:55:11 +01:00
1f33c30d7bre-integrating useForHost method (lost sometime?) to get the noProxy pattern working again. Without using this method all remote urls including the localhost had been accessed through the configured proxy
orbiter
2012-12-10 20:44:29 +01:00
f1a9c2e604fix Servlet template on conditional file include with use of conditional template pattern in included template file (example IndexCreateQueues_p.html) see bug http://bugs.yacy.net/view.php?id=215
reger
2012-12-10 20:02:35 +01:00
a4a780b871- fix for bad url conversion in bookmarks when using smb urls - fix for localhost hosts in solr schema host handling
orbiter
2012-12-10 07:22:42 +01:00
e80dfeca23- making blacklist path part case insensitive (solving http://bugs.yacy.net/view.php?id=171) - blacklist test adding explicit response text "not blocked" if no blacklist match
reger
2012-12-08 06:34:48 +01:00
e2d499be9eremove NOT NEEDED reference to solr.YaCySchema from ConfigurationSet to be able to use ConfigurationSet for other conf files (than solr.keys.default.list).
reger
2012-12-08 00:19:20 +01:00
a3cd3852abintroduced a better place to update the lastacc time value in latency
Michael Peter Christen
2012-12-07 15:49:23 +01:00
864abcd33dremoved Latency update after URL selection because that causes a completely wrong behaviour when cache fresh cases appear. Makes re-crawling MUCH faster!
Michael Peter Christen
2012-12-07 15:35:44 +01:00
4491072256- clear the search cache when altering the solr boosts - better positions for submit buttons
Michael Peter Christen
2012-12-07 14:56:34 +01:00
2b7d46bc1fusing a filter query for the site parameter in GSA api
Michael Peter Christen
2012-12-07 14:54:49 +01:00
dd241d03bblatency fix: only set last-visit time if access was actually by the robot
Michael Peter Christen
2012-12-07 02:00:12 +01:00
118233a7e6fix for bad xml in gsa result when doing a query with quotes
Michael Peter Christen
2012-12-07 01:35:02 +01:00
1e002ab18eadded another blacklist-cleaner into balancer
Michael Peter Christen
2012-12-07 01:27:24 +01:00
10527e28aefix for wrong display of error urls in HostBrowser
Michael Peter Christen
2012-12-07 00:31:10 +01:00
756772fbd3fix for waitingtime computation for intranet configuration
Michael Peter Christen
2012-12-06 17:40:52 +01:00
fa27e5820f- check blacklist (again) when taking urls from the crawl stack because the blacklist may get extended during crawling - removed debug output
Michael Peter Christen
2012-12-06 00:12:16 +01:00
5f5d66921epatch for funny symbols in url paths (like tilde)
Michael Peter Christen
2012-12-05 22:05:49 +01:00
adfecc6ba8more robustness during shutdown
Michael Peter Christen
2012-12-05 18:20:43 +01:00
d4bfe9339eBrute-force attempt to start solr in case of a memory problem. I don't actually know if this is correct. It is a desperate try to get YaCy running on production servers which must get alive even with strange hacks like this. This is also related to a forum posting in http://forum.yacy-websuche.de/viewtopic.php?t=4528&p=27135#p27135
Michael Peter Christen
2012-12-05 18:16:06 +01:00
8aa08261a7update to Solr Boost handling
Michael Peter Christen
2012-12-05 12:26:42 +01:00
908ad2f174Added a new servlet to configure the solr ranking using field boosts
Michael Peter Christen
2012-12-03 17:01:19 +01:00
a598fb6227renamed Ranking_p.html to RankingRWI_p.html because there will be another Ranking servlet as well at next
Michael Peter Christen
2012-12-03 00:01:41 +01:00
a01e47b992enhanced exists()-method for solr; should reduce a lot of IO during DHT target selection
Michael Peter Christen
2012-12-02 17:29:37 +01:00
72f165d58badded a Boost class which stores solr query boost values. The class can be configured using the yacy.init file. The boost information is taken from the configuration each time when a query to solr is done.
Michael Peter Christen
2012-12-02 16:54:29 +01:00
ea033f8f8eadded number of characters in url to default index to be able to use this field for ranking
Michael Peter Christen
2012-12-02 16:53:02 +01:00
b5ee88c6afadded more logging to get info which url causes performance problems
Michael Peter Christen
2012-12-02 16:52:12 +01:00
1faa045dc1fix: prevent regex pattern compile error for blacklist import for path '*' (extend it to '.*')
reger
2012-12-01 22:41:21 +01:00
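The compile error addressed in the entry above comes from treating a bare `*` as a regular expression: a quantifier with nothing in front of it is invalid, while `.*` is valid and matches any path. YaCy itself is written in Java; the following is a minimal Python sketch of the same idea, and the helper name `blacklist_path_to_regex` is hypothetical.

```python
import re

def blacklist_path_to_regex(path):
    """Hypothetical helper: widen a bare '*' to '.*' before compiling."""
    if path == "*":
        path = ".*"
    return re.compile(path)

# A bare '*' is not a valid pattern on its own.
try:
    re.compile("*")
    bare_star_compiles = True
except re.error:
    bare_star_compiles = False

pattern = blacklist_path_to_regex("*")
print(bare_star_compiles, bool(pattern.fullmatch("/any/path.html")))
```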
bb20691d4ffix: respect config setting of "show Nav Top-Menu" in HostBrowser.html for public users (as hostbrowser is now available in search results)
reger
2012-12-01 01:14:29 +01:00
6cf33f899cprevent Solr "version conflict" on update by setting the Solr "_version_" field to 0 (=no version check)
reger
2012-11-28 00:09:53 +01:00
acd98bebb7improvements in GSA result writer
Michael Peter Christen
2012-11-26 15:18:51 +01:00
3de784c8ddreplaced more split and replaceAll calls that lacked pattern pre-compilation with pre-compiled patterns
Michael Peter Christen
2012-11-26 13:40:53 +01:00
8fc3679c66using more pre-compiled patterns for split methods
Michael Peter Christen
2012-11-26 13:11:55 +01:00
d48e9788d2enhanced search result processing behavior - query less at one time; query more often - in between the small queries, evaluate results - remove fields from search results which are not needed
Michael Peter Christen
2012-11-26 12:24:35 +01:00
bf512e6350Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1
Michael Peter Christen
2012-11-26 00:14:57 +01:00
469efcdb9dfix: display and calculate authors and namespace search navigator only if configured (otherwise skip overhead) (leave hosts, topics, and the filetype and protocol navigators, which are not included in ConfigPortal, untouched)
reger
2012-11-25 22:49:26 +01:00
eca68fa197added debug code to crawler monitor
Michael Peter Christen
2012-11-25 15:43:42 +01:00
205f8b222bMerge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git
Michael Peter Christen
2012-11-25 14:41:49 +01:00