Commit Graph

  • 32d7f5cb97 better warc injection load balancing Matt 2015-09-15 15:04:26 -0600
  • aada7d5e51 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-14 22:14:01 -0600
  • 28ed7f66af Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-14 19:16:20 -0600
  • f9c4f8fc9a test with 2 first Matt 2015-09-14 19:16:07 -0600
  • fcd4fe3ff3 Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-14 19:04:15 -0600
  • 16d90e60d1 allow up to 10 filter threads now Matt 2015-09-14 19:03:52 -0600
  • 4754e26dc5 Merge 4a7dc8d328 into 5caa219c71 Ivan Skytte Jørgensen 2015-09-14 19:01:31 +0000
  • 5caa219c71 Reduce false positives by not counting \0 as a non-ascii char in the url. Zak Betz 2015-09-14 12:24:50 -0600
  • 5d724cdcc3 Check for spaces before non-ascii chars to reduce false positives. Also print the position of non-ascii char to aid debugging. We still need to handle utf8 chars in path. Zak Betz 2015-09-14 11:11:56 -0600
  • 519b2c4f42 Fix repeating xn--xn-- when there are spaces in the domain. Make gb unittest take a name of the unit test to run. Zak Betz 2015-09-14 10:24:22 -0600
  • 519017828c Enable punycode domains for testing. We still need to display them as utf8 on the front end. Zak Betz 2015-09-14 09:32:25 -0600
  • 78125c809b Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing Zak Betz 2015-09-14 00:53:40 -0600
  • 68a0d08820 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-14 00:35:22 -0600
  • 5622ca47ee Work on non-ascii domain names. It works on correct inputs, but will crash on some incorrect inputs, so it is forced to be disabled. Zak Betz 2015-09-14 00:34:44 -0600
  • b4bac67fdf Merge branch 'diffbot-testing' into testing Matt 2015-09-13 19:40:57 -0600
  • 710661a0f3 Merge branch 'ia-zak' into ia Matt 2015-09-13 19:40:21 -0600
  • ffa465b942 Merge branch 'ia' of github.com:gigablast/open-source-search-engine into ia Matt 2015-09-13 19:40:17 -0600
  • f1db3aca94 Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-13 19:39:50 -0600
  • b8e4046d61 remove unnecessary line Matt Wells 2015-09-13 17:54:38 -0700
  • 65613feb4c fix bug of not using part files when generating map Matt 2015-09-13 17:52:40 -0700
  • cb6ca24c26 Allow nospider and noquery on the same host. Fix punycoding of non-ascii domains. Zak Betz 2015-09-13 17:15:31 -0600
  • 3444c67851 exit faster. Matt 2015-09-13 14:28:35 -0700
  • 6370054bd5 fix problem of adding too many collections and not wrapping the collnum_t id Matt Wells 2015-09-13 14:21:52 -0700
  • cb03dec928 log debug tcp buf switch Matt 2015-09-13 14:20:16 -0700
  • 3b7814dc7c Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-09-13 13:21:53 -0700
  • b1f97a28e2 fix make dmozparse Matt Wells 2015-09-13 13:21:36 -0700
  • 880e626495 fix core from realtime profiler Matt Wells 2015-09-13 12:56:56 -0700
  • 630608c360 Revert "allow utf8 urls. but for domain lookups will have to convert" Matt 2015-09-12 22:35:08 -0600
  • 2a4947ea54 fix compiler bug Matt 2015-09-12 22:27:30 -0600
  • 9b2b101bb5 allow utf8 urls. but for domain lookups will have to convert the hostname to punycode. it is better to display topbeskæring.dk as the url than xn--topbeskring-g9a.dk i would think. Matt 2015-09-12 22:18:16 -0600
  • eb75a5ffc9 Merge branch 'testing' of github.com:gigablast/open-source-search-engine into testing Matt 2015-09-12 22:16:01 -0600
  • 815fd4cc48 Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-12 20:40:12 -0600
  • fe4ee503ac try to fix msg7 based core Matt 2015-09-12 20:39:51 -0600
  • 911b2837ca Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing Zak Betz 2015-09-12 15:51:59 -0600
  • 77bd8dcff9 Start to detect non-ascii urls and encode them to ascii. (Work In Progress) Zak Betz 2015-09-12 15:47:33 -0600
  • cc1fcdd8a1 fix spider proxy load table clean out again Matt Wells 2015-09-12 13:58:08 -0700
  • 583974093e clean out proxy load table more often to keep things fast. Matt Wells 2015-09-12 13:33:42 -0700
  • d8eb35b4fe Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-12 14:08:54 -0600
  • e50c57db90 Warc injector script Zak Betz 2015-09-12 14:08:42 -0600
  • fb332a23c4 another fix for infinite loop in spider proxy code Matt Wells 2015-09-12 11:58:46 -0700
  • da441e3daa Merge branch 'ia' into ia-zak Matt 2015-09-12 12:29:22 -0600
  • 92622d3799 Merge branch 'diffbot-testing' into testing Matt 2015-09-12 12:29:04 -0600
  • a738fe993f fix core Matt 2015-09-12 12:28:47 -0600
  • 5b0787d483 fix spider slots getting clogged Matt 2015-09-12 12:16:26 -0600
  • 783a6dd8c1 fix spider slots getting clogged up with same firstip urls Matt 2015-09-12 12:15:39 -0600
  • f817bae34d Merge branch 'ia' into ia-zak Matt 2015-09-12 11:54:57 -0600
  • 7d3b781c61 Merge branch 'diffbot-testing' into testing Matt 2015-09-12 11:54:37 -0600
  • f8d6a7f49b fix warc file read thread stuff Matt 2015-09-12 11:53:24 -0600
  • fd6875b94c make warc reading use a thread in xmldoc.cpp Matt 2015-09-12 11:42:27 -0600
  • 9c560dd415 Merge branch 'ia' into ia-zak Matt 2015-09-12 09:12:18 -0600
  • 52e3d63e0c Merge branch 'diffbot-testing' into testing Matt 2015-09-12 09:11:59 -0600
  • a29cf8c787 fix resuming a killed merge Matt Wells 2015-09-12 08:01:40 -0700
  • 37c3cc5d8d Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-09-12 07:11:56 -0700
  • 782bc9cdee inf loop fix Matt Wells 2015-09-12 07:11:39 -0700
  • e2c61c7a78 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-11 14:22:27 -0600
  • e9724df2e9 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-11 14:22:08 -0600
  • 0977c669b1 Merge branch 'diffbot-testing' into testing Matt 2015-09-11 14:17:11 -0600
  • f01db79e5f show inject requests in the spider queue table now Matt 2015-09-11 14:16:26 -0600
  • 20da5753f4 added switch to just return error if not all shards returned results because a shard was dead! the 'all or nothing' switch. Matt 2015-09-11 13:43:49 -0600
  • fcca380cbb added spiderdb disk cache Matt 2015-09-11 12:52:31 -0600
  • b85d55a097 Fix core. Zak Betz 2015-09-11 12:47:14 -0600
  • 02ca065aa4 Merge branch 'diffbot-testing' into testing Matt 2015-09-11 12:41:22 -0600
  • d0055cda6b do not hit file cache when merging files on disk. 1. it prevents corruption. 2. it drains our cache of good stuff. Matt Wells 2015-09-11 11:09:15 -0700
  • a270e163de Fix coring on udp timeout when clustering search results. Add ability to force update a list of items in warc injector. Zak Betz 2015-09-11 11:05:57 -0600
  • 1b9e509a55 more streaming socket fixes and debug stuff Matt Wells 2015-09-11 09:17:41 -0700
  • 710f9bcd8b Merge branch 'diffbot-testing' into testing Matt 2015-09-11 00:00:50 -0600
  • 1d49661fab fix note Matt Wells 2015-09-10 23:01:06 -0700
  • 3c8c415d04 fix file cache. fix core when file cache max mem is 0. Matt Wells 2015-09-10 22:21:55 -0700
  • 8611a06dc8 fix log spam Matt 2015-09-10 23:07:06 -0600
  • 877f6c7b9b turn off log statement Matt 2015-09-10 22:43:07 -0600
  • f3b2883268 fix core from new file caching logic Matt Wells 2015-09-10 20:42:45 -0700
  • aefd8772cf Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-10 21:32:36 -0600
  • ed9c5c580a some debug statement to track down the socket snafu on host 0 Matt Wells 2015-09-10 19:18:48 -0700
  • 054bb0e11a fix validate cache algo mem leak Matt 2015-09-10 14:10:55 -0600
  • 947baf6241 added cache validation logic Matt 2015-09-10 13:56:38 -0600
  • 8ac561e1ab turn off profiler automatically after 60 seconds. print red box if profiler is running. Matt 2015-09-10 13:37:14 -0600
  • ad68738b1d fix compiler warnings Matt 2015-09-10 13:24:59 -0600
  • 09de59f026 do not store cblock, etc. tags into tagdb to save disk space. added tagdb file cache for better performance, less disk accesses. will help reduce disk load. put file cache sizes in master controls and if they change then update the cache size dynamically. Matt 2015-09-10 12:46:00 -0600
  • 953259f360 re-disable page cache. wtf? Matt Wells 2015-09-09 22:06:00 -0700
  • 41b8128bfd detect bogus saved hashtables Matt Wells 2015-09-09 21:49:12 -0700
  • b2f3c44650 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-09-09 19:26:43 -0600
  • d7b62bb068 retry non-tmp errors once per 90 days Matt 2015-09-09 19:26:26 -0600
  • 6c175c9cee if a seed has a bad ip lookup one round make sure we try it the next round! Matt Wells 2015-09-09 18:00:03 -0700
  • ddd5cc711e show hosts that say a collection has urls ready to spider on page crawlbot. Matt Wells 2015-09-09 16:56:01 -0700
  • 2483b73cbd fix infinite scanning loops caused by corrupt spiderdb records. Matt Wells 2015-09-09 14:28:02 -0700
  • 9d9da8fd38 fix bug of collection never completing rebuild of its waiting tree and doledb stuff. Matt Wells 2015-09-09 10:29:40 -0700
  • 647d004c04 fix core from sending a url alert, then customer deleting collection before email alert reply comes back. then it comes back to a delete collrec and cores. Matt Wells 2015-09-08 15:57:46 -0700
  • 4715718005 Get rid of warning. Zak Betz 2015-09-08 15:51:53 -0600
  • 156802c478 Try to fix stat axis scaling again. Zak Betz 2015-09-08 11:34:51 -0600
  • 7b24cdcddb Merge pull request #52 from isj-privacore/master Gigablast 2015-09-08 07:26:50 -0600
  • 4a7dc8d328 Use memset() instead of hand-made unrolled-loop Ivan Skytte Jørgensen 2015-09-08 14:08:30 +0200
  • ab7e4a18bb added Travis CI support Stewart Henderson 2015-09-07 16:25:09 -0500
  • 47795f4c70 fix core Matt 2015-09-07 12:14:10 -0600
  • d5bb632eb4 Optimize is_ascii3() by using logic for bytes <128 instead of table lookup (memory fetch) Ivan Skytte Jørgensen 2015-09-07 13:37:54 +0200
  • 620a367c3c Optimize UTF-8 handling in getUtf8CharSize() by using logic instead of table lookup (memory fetch) for bytes<128 Ivan Skytte Jørgensen 2015-09-07 13:32:36 +0200
  • dab602b7de limit ebadengineer disk read errors to 100 retries Matt Wells 2015-09-06 09:07:29 -0700
  • 11334ea8d3 nothing Matt Wells 2015-09-06 09:04:42 -0700
  • aaaf3c7fb9 fight log spam some more Matt 2015-09-06 09:11:36 -0600
  • 0bbb493199 limit crawl delay to 60 seconds Matt Wells 2015-09-05 10:38:44 -0700
  • fa2ecf8681 allow for niceness 1 'merge' threads to have their own queue. so merge operations take precedence in that cpu-based thread queue over regular spider operation rdblist merges. Matt 2015-09-05 08:44:52 -0600