Commit Graph

  • daea87d436 do not accept dht from bad versions; delete bad hashes on receive lotus 2009-04-22 12:13:37 +00:00
  • 16baa7ad24 To translate a MediaWiki dump into the YaCy surrogate format, do the following: - download a Wikipedia dump, e.g. dewiki-20090311-pages-articles.xml.bz2 from http://download.wikimedia.org/dewiki/20090311/ - move dewiki-20090311-pages-articles.xml.bz2 to DATA/HTCACHE/ - start the conversion: open a command shell, move to the YaCy home directory and execute java -Xmx2000m -cp classes:lib/bzip2.jar de.anomic.tools.mediawikiIndex -convert DATA/HTCACHE/dewiki-20090311-pages-articles.xml.bz2 DATA/SURROGATES/in/ http://de.wikipedia.org/wiki/ orbiter 2009-04-21 22:12:19 +00:00
  • 0b2c98edc9 some more work on the wikipedia-dump exporter (not finished yet) orbiter 2009-04-21 15:19:32 +00:00
  • 5195c94838 two patches for performance enhancements of the index handover process from documents to the index cache: - one word prototype is generated for each document, that is re-used when a specific word is stored. - the index cache now uses ByteArray objects to reference the RWI instead of byte[]. This enhances access to the map that stores the cache. To dump the cache to the FS, the content must be sorted, but sorting takes less time than maintenance of a sorted map during caching. orbiter 2009-04-21 14:23:04 +00:00
  • 06c878ed11 moved update_key to correct position in file lulabad 2009-04-21 10:15:12 +00:00
  • 9416f5c26f more speed test cases: kelondro provides map functions that are more than 20% faster than standard java classes and use less than half the memory of java classes: just start IndexTest (here with 1000000 test objects) orbiter 2009-04-21 09:29:08 +00:00
  • b53790abb1 more performance hacks: 10% more speed for Base64.compare(), which is used very often in YaCy code orbiter 2009-04-21 07:39:21 +00:00
  • 8ffb9889e1 some fixes and performance hacks orbiter 2009-04-20 23:01:44 +00:00
  • dfb96ecb72 more fixes orbiter 2009-04-20 22:08:38 +00:00
  • 1b8d346b4c fixes in connection with transition to byte[] hashes orbiter 2009-04-20 21:54:00 +00:00
  • 0b0a46d35a * fix transferRWI as suggested by celle (thanks!) see http://forum.yacy-websuche.de/viewtopic.php?f=5&t=2000#p14023 f1ori 2009-04-20 19:51:20 +00:00
  • 996572de95 quickfix orbiter 2009-04-20 16:11:35 +00:00
  • 380ed2dac0 performance and debugging additions orbiter 2009-04-20 15:01:43 +00:00
  • 635b0a9da7 code-split; allow cgi indexing lotus 2009-04-20 13:28:28 +00:00
  • e7559f3234 fix for http://forum.yacy-websuche.de/viewtopic.php?p=13977#p13977 orbiter 2009-04-20 10:06:55 +00:00
  • fa3adbbfc6 added domain checks to surrogate reader and RWI transfer receiver to prevent spamming using surrogates orbiter 2009-04-20 06:38:28 +00:00
  • 76af84d732 * add custom comparator to ScoreCluster for byte[] * fixes http://forum.yacy-websuche.de/viewtopic.php?f=6&t=2010 f1ori 2009-04-19 20:01:46 +00:00
  • 31c6934df2 *) fix for r5832 low012 2009-04-19 07:40:23 +00:00
  • ab0030d7a7 allow dht-out for remote-crawl processing peers on default settings lotus 2009-04-18 20:04:01 +00:00
  • 616a4d724f high-end favicon with 2 versions: * true color + alpha channel for modern browsers * 256 colors and non-transparent background for others lotus 2009-04-18 18:37:26 +00:00
  • d1116c049f *) added new method "contains()" to Blacklist interface *) implemented contains() in class AbstractBlacklist *) used new method in Blacklist_p to prevent double entries in blacklists low012 2009-04-18 16:27:17 +00:00
  • 08445e42f0 * don't throw exception, in case of bad charset in http-header f1ori 2009-04-18 15:38:29 +00:00
  • 2f860a2564 * convert byte[] hashes to string for log output f1ori 2009-04-18 14:35:18 +00:00
  • 94a6c83256 * rewrite code without using java 1.6 features f1ori 2009-04-17 15:54:44 +00:00
  • d93a2a6552 * ignore whitespace so you can copy&paste signatures better f1ori 2009-04-17 14:52:42 +00:00
  • fadf311b97 added sign key for yacystats updates lulabad 2009-04-17 14:32:08 +00:00
  • fbcbcc5bdb export of yacy document objects as dublin core record in xml orbiter 2009-04-17 14:20:12 +00:00
  • d7cbf4cdd4 more performance hacks: less overhead in word hash computation orbiter 2009-04-17 13:47:06 +00:00
  • 29e96c1a60 bugfixes and performance hacks orbiter 2009-04-17 13:04:56 +00:00
  • 4e97a31009 corrections in dublin core syntax orbiter 2009-04-17 12:23:00 +00:00
  • 44daec7936 * introduce signatures to autoupdate; as long as no public keys are set for the update locations, no signatures are checked * wiki-article follows... f1ori 2009-04-17 09:58:06 +00:00
  • 538e375901 replaced old caching method for computed word hashes with a better method. The word hash computation is a new performance bottleneck (after the IO bottleneck was removed with the IndexCell data structure) and a better caching for word hashes was necessary. orbiter 2009-04-17 09:26:16 +00:00
  • 9e853e1977 partly reverting SVN 5818: identical comparator required for join operator orbiter 2009-04-17 08:18:01 +00:00
  • e16c25ddf7 (peak-) performance hacks orbiter 2009-04-16 22:45:39 +00:00
  • 63cd152969 fixes orbiter 2009-04-16 22:18:35 +00:00
  • 7dfe7e7cc6 fixed some problems with surrogate reader. This is now ready for testing. orbiter 2009-04-16 21:29:41 +00:00
  • 3a1364ed5c removed example lines from SurrogateReader sources; added additional example file orbiter 2009-04-16 21:05:34 +00:00
  • 9050a3c4c5 alpha version of surrogate reading and indexing. see the example file for an explanation. orbiter 2009-04-16 20:47:55 +00:00
  • 870066ab35 another fix orbiter 2009-04-16 20:23:20 +00:00
  • b15b059c0d fix for latest commit orbiter 2009-04-16 19:53:21 +00:00
  • c8624903c6 full redesign of index access data model: terms (words) are no longer retrieved by their word hash string, but by a byte[] containing the word hash. This has strong advantages when RWIs are sorted in the ReferenceContainer Cache and compared with the sun.java TreeMap method, which needed getBytes() and new String() transformations before. Many thousands of such conversions are now omitted every second, which increases the indexing speed by a factor of two. orbiter 2009-04-16 15:29:00 +00:00
  • dd6b5005ff * fix missing charset handling in getpageinfo_p f1ori 2009-04-16 12:31:28 +00:00
  • bd5f4c78d8 - added default profile for surrogate indexing - integrated surrogate indexing into indexing queue process orbiter 2009-04-16 08:01:38 +00:00
  • ad78e3a59f - less lines in rssTerminal - crawl more documents: if remote crawling is enabled, a remote crawl list is also loaded while a local crawl is running, provided that the indexer is idle orbiter 2009-04-15 23:07:51 +00:00
  • bc80dc913a added new surrogate reader (surrogates are parsed documents in batches); this will open a new way to insert indexes into YaCy (instead of crawling) orbiter 2009-04-15 15:30:25 +00:00
  • 12d81e98eb - fixed bad search results when searching for empty string - simplified result handling and page composition in case that nothing was searched orbiter 2009-04-15 11:22:43 +00:00
  • 8a24350036 - fix for join method with new generalized RWI data structure (caused by latest commit) - added more functions to mediawiki parser orbiter 2009-04-15 10:26:24 +00:00
  • e58320a507 added more info in log for debugging orbiter 2009-04-15 07:37:36 +00:00
  • 89ec3acb3e - full abstraction of index content type: the kelondro full text index may now also contain indexes about other content than text, e.g. navigation indexes or reverse linking indexes. - during index joins all word positions are maintained: better ranking for word distance possible; exact phrase match can be implemented soundly orbiter 2009-04-15 06:34:27 +00:00
  • 7a48090fcf - fix for "uk" language - svn attributes added borg-0300 2009-04-14 11:40:44 +00:00
  • dc2af61bc9 allow up to 50 results from remote peers orbiter 2009-04-13 21:47:57 +00:00
  • c0e8ed5461 fixed problem with not http client orbiter 2009-04-13 21:21:47 +00:00
  • 6504b21cea *) fix for http://forum.yacy-websuche.de/viewtopic.php?t=1976 low012 2009-04-13 09:22:11 +00:00
  • 8862a2fed0 oops orbiter 2009-04-12 10:22:21 +00:00
  • de68948bc5 better handling of free memory computation and emergency cache flush for index cell orbiter 2009-04-12 09:24:32 +00:00
  • 601d63ef48 removed comment tag (no use at this point) orbiter 2009-04-11 12:24:45 +00:00
  • c2d85b039e *) added language statistics files low012 2009-04-11 10:18:42 +00:00
  • 0c8fd811dc *) first and very limited version of XML import, does not use benefits provided by XML yet low012 2009-04-11 09:55:46 +00:00
  • fcb77c3140 * added .im (Isle of Man) to TLD-list f1ori 2009-04-10 23:06:48 +00:00
  • 8ce5bb4f31 added shell scripts that list host addresses orbiter 2009-04-10 09:45:22 +00:00
  • 51ea865569 small fix for localsearch shell script orbiter 2009-04-10 09:44:03 +00:00
  • b81c7467d8 protection against too many files in RICELL in case of massive emergency dumps caused by low memory orbiter 2009-04-09 23:55:47 +00:00
  • d4d87d90c4 - extended experimental wikipedia dump parser - removed historic, possibly unused code from wiki parser that was in conflict with current wikipedia wiki code orbiter 2009-04-09 14:55:20 +00:00
  • c3aff2521e fix for NPE orbiter 2009-04-09 13:32:56 +00:00
  • 57c00dd8c9 fix for bad filtering of common http error orbiter 2009-04-09 13:20:09 +00:00
  • 14361f1ca4 added log message for index generation in HeapReader orbiter 2009-04-09 10:34:22 +00:00
  • 43bcd192cd oops orbiter 2009-04-08 15:56:38 +00:00
  • c08f9b36a4 refactoring of wiki parser. This was done to prepare the wiki parser as parser for wikipedia dumps, which will be used for performance tests (to omit crawling) orbiter 2009-04-08 15:28:45 +00:00
  • faeff21012 - fix for display of automatic ReCrawls in surftips apfelmaennchen 2009-04-08 07:35:08 +00:00
  • 44e01afa5b - refactoring - a little bit more abstraction - new interfaces for index abstraction orbiter 2009-04-07 09:34:41 +00:00
  • 82fb60a720 increased memory limit for emergency cache flush orbiter 2009-04-06 15:54:19 +00:00
  • 4905a17f6a moved xerces.jar from libx to lib orbiter 2009-04-06 14:45:33 +00:00
  • 9180617dd9 *) Classes to handle import of lists (especially blacklists) from XML files, not used yet, but will be used soon. low012 2009-04-05 13:36:44 +00:00
  • 596e6215dc fix in case of white space in path name lotus 2009-04-03 16:07:24 +00:00
  • b887f4a116 keep more free mem orbiter 2009-04-03 14:27:04 +00:00
  • c2359f20dd refactoring: better abstraction of reference and metadata prototypes. This is a preparation to introduce index tables other than the reverse text indexes, which are the only ones used so far. Next application of the reverse index is a citation index. Moved to version 0.74 orbiter 2009-04-03 13:23:45 +00:00
  • ab656687d7 more strict BLOB initialization .. may also help to save some ram orbiter 2009-04-03 12:42:24 +00:00
  • 5b138ada16 fixes to web structure reference collection and url construction orbiter 2009-04-03 08:29:40 +00:00
  • a29a11e526 added evaluation of incoming links in webstructure api the api hash changed, new XML schema. orbiter 2009-04-03 07:59:49 +00:00
  • f6691411b5 - migration of files from SplitTable (which are used for the URL-DB) to a different file name format. - the file generation logic is slightly different: files may now have only a maximum size of one gigabyte and a maximum age of one month. orbiter 2009-04-02 22:15:33 +00:00
  • 1f37cc6107 Robots.txt is now reused after one day. See forum-topic: http://forum.yacy-websuche.de/viewtopic.php?f=5&t=1669&p=13565#p13565 shostakovich 2009-04-02 15:29:36 +00:00
  • f21a8c9e9c a different naming scheme for BLOBArray files. This may be necessary if blobs are written more often than once in a second. orbiter 2009-04-02 15:08:56 +00:00
  • 7ba078daa1 - added fast site-operator - refactoring merge into BLOBArray orbiter 2009-04-02 13:26:47 +00:00
  • b4126432bc hardening of index dump write process orbiter 2009-04-02 12:24:15 +00:00
  • 9bfb2641db - removed deprecated threads - added automatic http client reset. this was necessary because excessive intranet crawling caused deadlocks. this hack solved the problem. orbiter 2009-04-01 20:13:57 +00:00
  • 293290c317 fix for bad assert in last commit orbiter 2009-04-01 15:17:14 +00:00
  • bd409fb7ba added web structure analysis for a special domain that can be requested from the api. orbiter 2009-04-01 14:53:23 +00:00
  • b6c2167143 - patch for bad web structure dumps - added automatic slow down of accesses to specific domains when access to a web page fails orbiter 2009-04-01 13:21:47 +00:00
  • 0139988c04 - added writing of temporary file names and renaming to final file name when index dump/merge are done. Interrupted merges can be cleaned up. - added clean-up of unfinished merges and unused idx/gap files - enhanced merge file selection method orbiter 2009-04-01 12:39:11 +00:00
  • 3621aa96ab - added a memory protection for the IndexCell migration - fix for bad cell file selection orbiter 2009-03-31 19:17:45 +00:00
  • 568e8f1741 fix in unmountBLOB orbiter 2009-03-31 17:03:13 +00:00
  • 9da69d6b68 - better selection of files to be merged - fix for getChannel().close(), which works on windows but not on macs and linux orbiter 2009-03-31 16:49:02 +00:00
  • d39a5b42ca more care about open file handles. Now files also close on windows and can be deleted afterwards. orbiter 2009-03-31 12:42:12 +00:00
  • 029495e64d fixed bug introduced in SVN 5756 in EcoTable.put() orbiter 2009-03-31 07:51:32 +00:00
  • 587838bd09 git-svn-id: https://svn.berlios.de/svnroot/repos/yacy/trunk@5758 6c8d7289-2bf4-0310-a012-ef5d649a1542 orbiter 2009-03-30 21:13:53 +00:00
  • d2e2420a68 - added another file selection method for index cell merge - more hacks to check that files are closed properly and file handles do not exist after files are closed. orbiter 2009-03-30 19:05:08 +00:00
  • 96eaecda3e - added migration class to go from index collections to the index cell data structure. - added better control over file deletion, because this sometimes fails, especially on windows orbiter 2009-03-30 15:31:25 +00:00
  • 9ab009b16b fix for http://forum.yacy-websuche.de/viewtopic.php?f=6&t=1890#p13476 apfelmaennchen 2009-03-30 07:33:43 +00:00
  • 0f0b4aec75 better index cell merge logic orbiter 2009-03-30 06:22:27 +00:00
  • 832fef670f migration of urls-files into subdirectory METADATA orbiter 2009-03-30 04:41:06 +00:00