mirror of https://github.com/yacy/yacy_search_server.git synced 2025-07-23 09:24:39 -04:00

Commit Graph

  • c6a6f4c4e6 added a hack which makes the HostBrowser more performant when the given host has a lot of urls. If the number of urls is > 1000 and the root path is selected, the list of documents is restricted to those which have no subpath. However, this can cause a problem if no documents exist on the root path but only on paths below it. Michael Peter Christen 2012-11-05 18:57:21 +01:00
  • 619bf7e875 fixed filetype modifier for media types in text search Michael Peter Christen 2012-11-05 18:08:00 +01:00
  • 97f82994a6 automatically pause the crawler if there is a problem with solr Michael Peter Christen 2012-11-05 16:34:42 +01:00
  • 64ac2b7b7d new submenu template Michael Peter Christen 2012-11-05 15:36:42 +01:00
  • 5e77801aac update to web interface structure Michael Peter Christen 2012-11-05 15:23:03 +01:00
  • 8fb370d9f8 renovated the way search results are counted. should be correct now... Michael Peter Christen 2012-11-05 03:19:28 +01:00
  • 7bec253bb0 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-11-04 09:21:58 +01:00
  • d88eb657fd Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 Michael Peter Christen 2012-11-04 09:21:21 +01:00
  • 354ef8000d - added 'deleteold' option to crawler which causes documents selected by a crawl filter (host or subpath) to be deleted - site crawl uses this option by default now - added a concurrency option to deleteDomain() orbiter 2012-11-04 02:58:26 +01:00
  • 633fbe9188 Fix Metadata handling - language default on missing lang property to "uk" (fix set to nothing) - language set to TLD (added call to existing language calculation from TLD) - coordinate number exception on possible lat/lon content of "NaN,NaN" reger 2012-11-04 02:07:59 +01:00
  • 19d1f474ce host browser now shows also number of pending files per subdirectory + bugfixes Michael Peter Christen 2012-11-02 14:40:02 +01:00
  • 75dd706e1b update to HostBrowser: - time-out after 3 seconds to speed up display (may be incomplete) - showing also all links from the balancer queue in the host list (after the '/') and in the result browser view with tag 'loading' Michael Peter Christen 2012-11-02 13:57:43 +01:00
  • e2c4c3c7d3 migration to solr 4.0.0 Michael Peter Christen 2012-11-02 12:29:48 +01:00
  • b764de424a code cleanup Michael Peter Christen 2012-11-02 10:28:32 +01:00
  • 69aa39d664 update to libraries required by solr 4.0.0 Michael Peter Christen 2012-11-02 10:27:44 +01:00
  • 9330ad4838 - fixed the delete option in host browser - added a delete method which can be used to delete a full subpath in solr. Michael Peter Christen 2012-11-02 01:22:31 +01:00
  • a63179f3f9 added the MIME attribute for the R tag in GSA search result writer Michael Peter Christen 2012-11-02 00:14:29 +01:00
  • 40df2fd193 added the host browser as link to search results. that means you can select a browsing position after a search is done on the search results. Michael Peter Christen 2012-11-01 21:38:05 +01:00
  • 1168d09de8 more refactoring - integrated the code of SnippetProcess into SearchEvent Michael Peter Christen 2012-11-01 17:40:06 +01:00
  • 6629e37685 tried to clean up the search process mess Michael Peter Christen 2012-11-01 17:16:43 +01:00
  • c5f67a5d6d fixed a problem with local search from solr results: now all results from solr are shown (again) Michael Peter Christen 2012-11-01 10:22:22 +01:00
  • 02957d5982 missing license-files (sorry I didn't commit these files by mistake) sixcooler 2012-10-31 23:47:08 +01:00
  • 16216c2344 added missing libraries Michael Peter Christen 2012-10-31 23:29:47 +01:00
  • 9d062873d2 bump to httpclient-4.2.2 sixcooler 2012-10-31 19:09:48 +01:00
  • f8f05ecba7 - added a delete button in host browser to delete a complete subpath - removed storage of default collection name - default is now "user" - made stacking of crawl start points concurrently Michael Peter Christen 2012-10-31 17:44:45 +01:00
  • 0716a24737 added more / all new crawl profile fields into crawl profile editor Michael Peter Christen 2012-10-31 15:13:05 +01:00
  • 4a14122ba7 in case that a crawl profile has a collection assigned, use the collection to show a name in the web interface. This should prevent overly long names from making the interface unusable. Michael Peter Christen 2012-10-31 14:08:33 +01:00
  • 0fe8be7981 enhanced data structures for balancer and latency computation which should produce a bit better prognosis about forced waiting times. Michael Peter Christen 2012-10-30 17:30:24 +01:00
  • ac9540dfb6 removed options for stopwords which are not used Michael Peter Christen 2012-10-30 12:36:36 +01:00
  • ce3fed8882 added the Google Search Appliance (GSA) api interface to the main menu. See: https://developers.google.com/search-appliance/documentation/68/xml_reference#request_overview Michael Peter Christen 2012-10-30 12:27:22 +01:00
  • b2ffd49817 less latency Michael Peter Christen 2012-10-30 12:26:32 +01:00
  • 0833937c1c better balancing and duetime-computation also for no-delay intranet hosts Michael Peter Christen 2012-10-30 11:28:49 +01:00
  • c326aa8f67 disabled writing new entries to crawl stacks to prevent a domain with many documents from blocking the refresh of the crawl queue Michael Peter Christen 2012-10-29 22:26:52 +01:00
  • 6905182d41 - fix for number of words log message - adding meta:refresh also to crawler stack Michael Peter Christen 2012-10-29 21:42:31 +01:00
  • c25d7bcb80 - added concurrency for robots.txt loading - changed data model for domain counter Michael Peter Christen 2012-10-29 21:08:45 +01:00
  • a94c537afc fixed getSize() which can use the cache size while the crawl is running Michael Peter Christen 2012-10-29 11:56:07 +01:00
  • 96912c9471 enhancement to solr caching: consider that during a get() the document is not in solr but the cache points out that a commit is needed to get the document. Michael Peter Christen 2012-10-29 11:35:24 +01:00
  • a87811bc38 more auto-commit calls when a search interface is opened, but not when a search is done there to prevent blocking during search-time. Michael Peter Christen 2012-10-29 11:27:13 +01:00
  • 3d3d654e88 if a network configuration is chosen which allows neither DHT nor P2P communication (robinson mode), then some menu entries which have no use in this mode are disabled. Michael Peter Christen 2012-10-29 01:51:19 +01:00
  • 2d9e577ad0 replaced the custom robots.txt loader by the standard http loader Michael Peter Christen 2012-10-28 22:48:11 +01:00
  • 799d71bc67 enhanced solr caching: - increased cache size which is needed for longer solr commit time - speed hacks on cache write code Michael Peter Christen 2012-10-28 20:31:29 +01:00
  • a33e2742cb - removed unnecessary synchronized and deadlock in crawler - removed problem with monitoring object on Balancer.wait - added missing user agent settings Michael Peter Christen 2012-10-28 19:56:02 +01:00
  • 8952153ecf update to Balancer algorithm: - create a load list from the current list of known hosts - do not create this list for each Balancer.pop access - create the list from those hosts which have a zero-waiting time - select 1/3 from that list which have the most urls waiting - get hosts from the waiting list in random order - fixes for some delta-time computations - always load all urls from hosts which have never been loaded before orbiter 2012-10-28 13:24:49 +01:00
  • 354f0d9acd moved static method from ClusteredScoreMap to MapDataMining because it was not used in the ClusteredScoreMap class but only in MapDataMining orbiter 2012-10-28 11:29:53 +01:00
  • 722a447b0d - optimize code of augmented parsing to enhance document tags - commented out augmentedparser.analyse (function not implemented yet) - adjust init of document title list to always use same list type reger 2012-10-26 18:50:45 +02:00
  • 8e1248ffe3 force a commit in advance of a search for the administrator to get most recent results even if commit time is high and an indexing is ongoing. Michael Peter Christen 2012-10-26 15:35:42 +02:00
  • 3b48c78190 added an option to force a commit to solr. may be used by a search front-end in case that the commitWithinMs time is too short to get recently indexed documents. Michael Peter Christen 2012-10-26 07:39:07 +02:00
  • 2d972f289a raise commitWithinMs to the default value from SwitchBoard (results in lower hd-io) sixcooler 2012-10-26 02:12:45 +02:00
  • 8fde1dd3b6 another performance and memory hack to graphics: this makes it possible to produce a 100-Megapixel png network graphic image on my 6 year old laptop in standard configuration in 10 seconds. orbiter 2012-10-25 21:40:27 +02:00
  • 1baf498d59 - show more lines in online log - reverse order is default now Michael Peter Christen 2012-10-25 18:38:39 +02:00
  • 55bdafbaf1 more image processing hacks Michael Peter Christen 2012-10-25 18:20:05 +02:00
  • f2d0418218 because the new PngEncoder had a problem with the PixelGrabber which is caused by a JRE bug, the PixelGrabber had to be circumvented using an own frame buffer which can be read without a PixelGrabber. This resulted in ultra-fast and much less memory-consuming transformation. YaCy images are now generated really fast! Michael Peter Christen 2012-10-25 17:59:20 +02:00
  • d5d64019e5 - added a method for the RasterPlotter to draw arrow endings to lines - replaced the dot in the NetworkGraph with arrows - enhanced the image drawing speed using pre-computed color values - added more attention for OOM cases during very large image painting Michael Peter Christen 2012-10-25 16:05:04 +02:00
  • 342543a6c4 fix for host browser Michael Peter Christen 2012-10-25 10:23:43 +02:00
  • 85ca07b90e when a new crawl is started, an equal crawl, if still running, is terminated and the corresponding crawl profile is deleted (this also clears the crawl queue entries for that crawl profile) Michael Peter Christen 2012-10-25 10:20:55 +02:00
  • 906e51214a the web structure image shows the pivot dot in a different color Michael Peter Christen 2012-10-25 10:18:28 +02:00
  • b3ffcde0c7 - prepared PngEncoder for concurrency: PixelGrabber.grabPixels is the main time-consuming process. This shall be done in concurrency. - added concurrent processes to call the PixelGrabber and framework to do that (queues) Michael Peter Christen 2012-10-24 02:08:51 +02:00
  • e9c6f4ce2e - new order of data computation: first compute the size of compressed deflater output, then assign an exact-sized byte[] which makes resizing afterwards superfluous - after all enhancements all class objects were removed; result is just one short static method - made objects final where possible Michael Peter Christen 2012-10-24 00:41:09 +02:00
  • c6a1b21399 added a 9-year-old png encoder from David Eisenberg which I rewrote quite a bit to remove all code that handles transparency. With this highly specialized png writer it is possible to write png images much faster than with the JRE built-in png writer. In a second step it can be possible to add concurrency to increase computation speed further. orbiter 2012-10-23 23:27:41 +02:00
  • 276dd6452b removed warnings orbiter 2012-10-23 19:08:44 +02:00
  • 59bf4677b6 added option to view the complete directory structure in host browser orbiter 2012-10-23 19:02:55 +02:00
  • b991685782 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 Michael Peter Christen 2012-10-23 18:14:58 +02:00
  • 7602fce0b9 Merge branch 'master' of ssh://git@gitorious.org/yacy/rc1.git Michael Peter Christen 2012-10-23 18:12:48 +02:00
  • ea11a1efea fix for highlighting in gsa search Michael Peter Christen 2012-10-23 18:11:49 +02:00
  • 9eaede50e7 enhanced web structure images Michael Peter Christen 2012-10-23 18:11:19 +02:00
  • b7ac1da6a3 gsa results shall have only one title in metadata and that should be the visible title in the <title>-tag Michael Peter Christen 2012-10-23 18:03:12 +02:00
  • 206e7bcf94 whitelist yacyportalsearch aka search.yacy.net sixcooler 2012-10-23 03:49:27 +02:00
  • ae6feb5610 showing the web structure graph as animation in the crawl monitor Michael Peter Christen 2012-10-23 02:50:26 +02:00
  • 87aab9aa7c - fix: with augmented parsing = on; missing metadata in index (like title) due to overwriting metadata by adding multiple result docs from augmentparser with same url - fix Document.addsubdocuments: sections might be initialized via Arrays.asList, which does not provide the .addAll method that is used; see e.g. http://kamleshkr.wordpress.com/2010/02/17/inside-java-arrays-aslistt-a/ reger 2012-10-22 22:48:35 +02:00
  • 39317a6c66 enhanced webstructure image: introduced - multiple hosts can be listed (comma-separated) as host argument - new 'bf' attribute (branch factor): the maximum number of edges per node - the bf-value is computed automatically - ordering of nodes when the graphic is drawn: the drawing mostly ends at a limitation, e.g. the number of nodes. When this happens, it should be ensured that more 'interesting' nodes are painted first. This is now done by sorting all nodes by the number of links they have in the distant sub-graph. Michael Peter Christen 2012-10-22 16:23:39 +02:00
  • 47ae7e322e smaller dhtDispatcher.cloudSize @Orbiter: we talked about this some time ago - please revert if I'm wrong sixcooler 2012-10-21 20:05:28 +02:00
  • 57ddd63888 do not hold an expensive cache of references for DHT-out, but load them on demand; see: http://forum.yacy-websuche.de/viewtopic.php?f=8&t=4530 sixcooler 2012-10-21 20:00:36 +02:00
  • 1dc6482feb format crawler timeout output string in seconds (was days) reger 2012-10-21 03:00:05 +02:00
  • ef937af35d more custom field usage in gsa search result Michael Peter Christen 2012-10-18 15:26:55 +02:00
  • ea27d2e5f6 fixed more getSolrFieldName usages Michael Peter Christen 2012-10-18 15:21:05 +02:00
  • ce0e5b1e17 - more refactoring / private methods - fix for usage of custom solr field names Michael Peter Christen 2012-10-18 15:09:04 +02:00
  • ccc3760a47 Refactoring and redesign of data architecture to make URIMetadataRow superfluous. The target is to make a solr document the core of YaCy documents, which would allow many conversions to be removed. On the way to this target the equivalence of URIMetadataRow and URIMetadataNode had to be removed to expose the usage of the old URIMetadataRow data structure. This refactoring already removes unnecessary conversions and should make memory usage during indexing lower. Michael Peter Christen 2012-10-18 14:29:11 +02:00
  • 7f71dfab03 added a HostBrowser.xml api file and changed a bit of attribute naming Michael Peter Christen 2012-10-18 11:42:13 +02:00
  • b400fc7b4d fix for file parser problem Michael Peter Christen 2012-10-17 18:06:44 +02:00
  • e5b3c172ff removed hack which translated Solr documents to virtual RWI entries which were then mixed with remote RWIs. Now these Solr documents are fed into the result set as they appear during local and remote search. That makes the search much faster. Michael Peter Christen 2012-10-17 17:45:41 +02:00
  • 6017691522 added an exception catch Michael Peter Christen 2012-10-17 13:56:11 +02:00
  • 68c7ed5ce9 added a shell script which can be used to delete the api action steering table. This may be necessary if the api is called by remote command and the recordings are not used. Then they can be deleted frequently by calling this clear command using a cron job Michael Peter Christen 2012-10-17 00:44:16 +02:00
  • ed803708ab added a shell script which can be used to add a rss feed to the index. All pages linked in the rss feed are added. The process is not repeated automatically. If you want to repeat this, add the command to a cron job. Michael Peter Christen 2012-10-17 00:31:59 +02:00
  • 5d16c23a1f specified more URIMetadata as URIMetadataNode Michael Peter Christen 2012-10-16 18:26:21 +02:00
  • 43f3345c90 - removed dependencies from URIMetadataRow and made direct access to URIMetadataNode which creates the opportunity to access Solr objects directly and use their information richness - lazy initialization of the URIMetadataNode object - should cause less computation and memory usage during search. - removed dead code Michael Peter Christen 2012-10-16 18:11:57 +02:00
  • cc98496ff3 enhanced the HostBrowser: - showing also outbound links to other domains if there are any - the outbound links browser shows also the link structure image - showing even inbound links if the web structure graph has information about that - removed the left menu and made the HostBrowser a part of the top menu for search - moved the file search also to the top menu - added hover information in the HostBrowser to explain what the click means - because the HostBrowser also links to the Metadata viewer ViewFile, there should be a button to switch back to the HostBrowser: added that also. Michael Peter Christen 2012-10-16 17:13:18 +02:00
  • 21fe8339b4 - enhanced generation of url objects - enhanced computation of link structure graphics - enhanced collection of data for link structures Michael Peter Christen 2012-10-15 13:17:13 +02:00
  • 4023d88b0b added date info in parser errors Michael Peter Christen 2012-10-15 10:57:36 +02:00
  • 1b02408936 use less cache Michael Peter Christen 2012-10-11 14:32:37 +02:00
  • e45a3235e0 default cache size was much too high; decreased solr cache size Michael Peter Christen 2012-10-11 12:03:48 +02:00
  • 613cf7da7f enhancement to post argument parsing - possible fix to zero-filled parameter values Michael Peter Christen 2012-10-11 10:46:06 +02:00
  • 36c13ed15b less solr prefetch Michael Peter Christen 2012-10-11 10:17:05 +02:00
  • f3fc8eac80 fixed clear scripts Michael Peter Christen 2012-10-11 10:16:37 +02:00
  • 5f0ab25382 removed the option to prevent removal of &amp; parts inside of the MultiProtocolURI during normalform computation because that should always be done and also be done during initialization of the MultiProtocolURI Object. The new normalform method takes only one argument which should be 'true' unless you know exactly what you are doing. Michael Peter Christen 2012-10-10 11:46:22 +02:00
  • 53789555b9 fix for crawl start filter Michael Peter Christen 2012-10-10 10:40:32 +02:00
  • abebb3b124 added a crawl start checker which makes a simple analysis on the list of all given urls: shows if the url can be loaded and if there is a robots and/or a sitemap. Michael Peter Christen 2012-10-10 02:02:17 +02:00
  • 941873fba4 moved the index deletion functions from IndexControlRWIs to IndexControlURLs where it appears more naturally. Because the RWI administration is less important in the presence of Solr, the IndexControlURL is now the default servlet when the Index Administration button on the main menu is selected. Michael Peter Christen 2012-10-10 00:09:27 +02:00
  • ae246c30c3 fixed interpretation of directDocByURL attribute during crawl start orbiter 2012-10-09 23:11:31 +02:00
  • 68d0f8de03 Merge branch 'master' of git://gitorious.org/~reger/yacy/bbyacy-rc1 orbiter 2012-10-09 20:36:32 +02:00
  • bfb0d4c69b - add language detection from <html lang="xx"> tag - add jaudiotagger jar to Netbeans-IDE project classpath reger 2012-10-09 20:02:58 +02:00
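
Commit 87aab9aa7c in the list above refers to the fact that a list obtained from Arrays.asList is fixed-size and does not support addAll. The following is a minimal sketch of that standard-library behaviour only; the class name AsListDemo, the helper tryAddAll and the sample strings are hypothetical and are not taken from the YaCy code base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class AsListDemo {

    /**
     * Tries to append all elements of extra to target.
     * Returns false if target rejects the operation (fixed-size list).
     */
    static boolean tryAddAll(List<String> target, List<String> extra) {
        try {
            target.addAll(extra);
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Fixed-size view backed by an array: cannot grow.
        List<String> fixed = Arrays.asList("intro", "body");
        // Copying into an ArrayList gives a growable list.
        List<String> growable = new ArrayList<>(Arrays.asList("intro", "body"));
        List<String> extra = Arrays.asList("appendix");

        System.out.println(tryAddAll(fixed, extra));    // false
        System.out.println(tryAddAll(growable, extra)); // true
        System.out.println(growable);                   // [intro, body, appendix]
    }
}
```

Copying into an ArrayList, as in the second case, is one common way to avoid the exception; whether the commit used this approach is not stated in its message.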