Commit Graph

  • cb03dec928 log debug tcp buf switch Matt 2015-09-13 14:20:16 -07:00
  • 3b7814dc7c Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-09-13 13:21:53 -07:00
  • b1f97a28e2 fix make dmozparse Matt Wells 2015-09-13 13:21:36 -07:00
  • 880e626495 fix core from realtime profiler Matt Wells 2015-09-13 12:56:56 -07:00
  • 630608c360 Revert "allow utf8 urls. but for domain lookups will have to convert" Matt 2015-09-12 22:35:08 -06:00
  • 2a4947ea54 fix compiler bug Matt 2015-09-12 22:27:30 -06:00
  • 9b2b101bb5 allow utf8 urls. but for domain lookups will have to convert the hostname to punycode. it is better to display topbeskæring.dk as the url than xn--topbeskring-g9a.dk i would think. Matt 2015-09-12 22:18:16 -06:00
  • eb75a5ffc9 Merge branch 'testing' of github.com:gigablast/open-source-search-engine into testing Matt 2015-09-12 22:16:01 -06:00
  • 815fd4cc48 Merge branch 'diffbot-testing' into ia-zak Matt 2015-09-12 20:40:12 -06:00
  • fe4ee503ac try to fix msg7 based core Matt 2015-09-12 20:39:51 -06:00
  • 911b2837ca Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing Zak Betz 2015-09-12 15:51:59 -06:00
  • 77bd8dcff9 Start to detect non-ascii urls and encode them to ascii. (Work In Progress) Zak Betz 2015-09-12 15:47:33 -06:00
  • cc1fcdd8a1 fix spider proxy load table clean out again Matt Wells 2015-09-12 13:58:08 -07:00
  • 583974093e clean out proxy load table more often to keep things fast. Matt Wells 2015-09-12 13:33:42 -07:00
  • d8eb35b4fe Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-12 14:08:54 -06:00
  • e50c57db90 Warc injector script Zak Betz 2015-09-12 14:08:42 -06:00
  • fb332a23c4 another fix for infinite loop in spider proxy code Matt Wells 2015-09-12 11:58:46 -07:00
  • da441e3daa Merge branch 'ia' into ia-zak Matt 2015-09-12 12:29:22 -06:00
  • 92622d3799 Merge branch 'diffbot-testing' into testing Matt 2015-09-12 12:29:04 -06:00
  • a738fe993f fix core Matt 2015-09-12 12:28:47 -06:00
  • 5b0787d483 fix spider slots getting clogged Matt 2015-09-12 12:16:26 -06:00
  • 783a6dd8c1 fix spider slots getting clogged up with same firstip urls Matt 2015-09-12 12:15:39 -06:00
  • f817bae34d Merge branch 'ia' into ia-zak Matt 2015-09-12 11:54:57 -06:00
  • 7d3b781c61 Merge branch 'diffbot-testing' into testing Matt 2015-09-12 11:54:37 -06:00
  • f8d6a7f49b fix warc file read thread stuff Matt 2015-09-12 11:53:24 -06:00
  • fd6875b94c make warc reading use a thread in xmldoc.cpp Matt 2015-09-12 11:42:27 -06:00
  • 9c560dd415 Merge branch 'ia' into ia-zak Matt 2015-09-12 09:12:18 -06:00
  • 52e3d63e0c Merge branch 'diffbot-testing' into testing Matt 2015-09-12 09:11:59 -06:00
  • a29cf8c787 fix resuming a killed merge Matt Wells 2015-09-12 08:01:40 -07:00
  • 37c3cc5d8d Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-09-12 07:11:56 -07:00
  • 782bc9cdee inf loop fix Matt Wells 2015-09-12 07:11:39 -07:00
  • e2c61c7a78 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-11 14:22:27 -06:00
  • e9724df2e9 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-11 14:22:08 -06:00
  • 0977c669b1 Merge branch 'diffbot-testing' into testing Matt 2015-09-11 14:17:11 -06:00
  • f01db79e5f show inject requests in the spider queue table now Matt 2015-09-11 14:16:26 -06:00
  • 20da5753f4 added switch to just return error if not all shards returned results because a shard was dead! the 'all or nothing' switch. Matt 2015-09-11 13:43:49 -06:00
  • fcca380cbb added spiderdb disk cache Matt 2015-09-11 12:52:31 -06:00
  • b85d55a097 Fix core. Zak Betz 2015-09-11 12:47:14 -06:00
  • 02ca065aa4 Merge branch 'diffbot-testing' into testing Matt 2015-09-11 12:41:22 -06:00
  • d0055cda6b do not hit file cache when merging files on disk. 1. it prevents corruption. 2. it drains our cache of good stuff. Matt Wells 2015-09-11 11:09:15 -07:00
  • a270e163de Fix coring on udp timeout when clustering search results. Add ability to force update a list of items in warc injector. Zak Betz 2015-09-11 11:05:57 -06:00
  • 1b9e509a55 more streaming socket fixes and debug stuff Matt Wells 2015-09-11 09:17:41 -07:00
  • 710f9bcd8b Merge branch 'diffbot-testing' into testing Matt 2015-09-11 00:00:50 -06:00
  • 1d49661fab fix note Matt Wells 2015-09-10 23:01:06 -07:00
  • 3c8c415d04 fix file cache. fix core when file cache max mem is 0. Matt Wells 2015-09-10 22:21:55 -07:00
  • 8611a06dc8 fix log spam Matt 2015-09-10 23:07:06 -06:00
  • 877f6c7b9b turn off log statement Matt 2015-09-10 22:43:07 -06:00
  • f3b2883268 fix core from new file caching logic Matt Wells 2015-09-10 20:42:45 -07:00
  • aefd8772cf Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-10 21:32:36 -06:00
  • ed9c5c580a some debug statement to track down the socket snafu on host 0 Matt Wells 2015-09-10 19:18:48 -07:00
  • 054bb0e11a fix validate cache algo mem leak Matt 2015-09-10 14:10:55 -06:00
  • 947baf6241 added cache validation logic Matt 2015-09-10 13:56:38 -06:00
  • 8ac561e1ab turn off profiler automatically after 60 seconds. print red box if profiler is running. Matt 2015-09-10 13:37:14 -06:00
  • ad68738b1d fix compiler warnings Matt 2015-09-10 13:24:59 -06:00
  • 09de59f026 do not store cblock, etc. tags into tagdb to save disk space. added tagdb file cache for better performance, less disk accesses. will help reduce disk load. put file cache sizes in master controls and if they change then update the cache size dynamically. Matt 2015-09-10 12:46:00 -06:00
  • 953259f360 re-disable page cache. wtf? Matt Wells 2015-09-09 22:06:00 -07:00
  • 41b8128bfd detect bogus saved hashtables Matt Wells 2015-09-09 21:49:12 -07:00
  • b2f3c44650 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-09-09 19:26:43 -06:00
  • d7b62bb068 retry non-tmp errors once per 90 days Matt 2015-09-09 19:26:26 -06:00
  • 6c175c9cee if a seed has a bad ip lookup one round make sure we try it the next round! Matt Wells 2015-09-09 18:00:03 -07:00
  • ddd5cc711e show hosts that say a collection has urls ready to spider on page crawlbot. Matt Wells 2015-09-09 16:56:01 -07:00
  • 2483b73cbd fix infinite scanning loops caused by corrupt spiderdb records. Matt Wells 2015-09-09 14:28:02 -07:00
  • 9d9da8fd38 fix bug of collection never completing rebuild of its waiting tree and doledb stuff. Matt Wells 2015-09-09 10:29:40 -07:00
  • 647d004c04 fix core from sending a url alert, then customer deleting collection before email alert reply comes back. then it comes back to a delete collrec and cores. Matt Wells 2015-09-08 15:57:46 -07:00
  • 4715718005 Get rid of warning. Zak Betz 2015-09-08 15:51:53 -06:00
  • 156802c478 Try to fix stat axis scaling again. Zak Betz 2015-09-08 11:34:51 -06:00
  • 7b24cdcddb Merge pull request #52 from isj-privacore/master Gigablast 2015-09-08 07:26:50 -06:00
  • 4a7dc8d328 Use memset() instead of hand-made unrolled-loop Ivan Skytte Jørgensen 2015-09-08 14:08:30 +02:00
  • 47795f4c70 fix core Matt 2015-09-07 12:14:10 -06:00
  • d5bb632eb4 Optimize is_ascii3() by using logic for bytes <128 instead of table lookup (memory fetch) Ivan Skytte Jørgensen 2015-09-07 13:37:54 +02:00
  • 620a367c3c Optimize UTF-8 handling in getUtf8CharSize() by using logic instead of table lookup (memory fetch) for bytes<128 Ivan Skytte Jørgensen 2015-09-07 13:32:36 +02:00
  • dab602b7de limit ebadengineer disk read errors to 100 retries Matt Wells 2015-09-06 09:07:29 -07:00
  • 11334ea8d3 nothing Matt Wells 2015-09-06 09:04:42 -07:00
  • aaaf3c7fb9 fight log spam some more Matt 2015-09-06 09:11:36 -06:00
  • 0bbb493199 limit crawl delay to 60 seconds Matt Wells 2015-09-05 10:38:44 -07:00
  • fa2ecf8681 allow for niceness 1 'merge' threads to have their own queue. so merge operations take precedence in that cpu-based thread queue over regular spider operation rdblist merges. Matt 2015-09-05 08:44:52 -06:00
  • 2a214fe2e7 try relaxing msg13 packet drop constraint a little to increase downloads Matt Wells 2015-09-04 18:33:04 -07:00
  • 90f79a31e1 prevent log spam Matt Wells 2015-09-04 16:49:45 -07:00
  • 3a522f52d3 no longer use read size based thread queues. re-added merge disk read thread queue. fixed attemptMergeAll(). Matt Wells 2015-09-04 16:14:18 -07:00
  • 38e83d9eb0 thread max tuning Matt Wells 2015-09-04 14:34:23 -07:00
  • 81affde47d fix corrupt spider request bug some more Matt Wells 2015-09-04 14:14:14 -07:00
  • 8e40e41aa7 make a note if this future time bug happens again Matt Wells 2015-09-04 13:25:18 -07:00
  • 790d6a3b5f revert removal of pausing spiders off if too many udp slots in use. added new spider request corruption detection. Matt Wells 2015-09-04 13:19:03 -07:00
  • c653b0989c undo some possible adverse changes Matt Wells 2015-09-04 11:31:43 -07:00
  • 4942112748 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-03 22:17:48 -06:00
  • e7f1c75855 Add logic to limit number of msg7s to 100 per host, then we drop the requests. Zak Betz 2015-09-03 22:17:16 -06:00
  • adac0033d9 remove a couple unused parms Matt Wells 2015-09-03 20:41:40 -07:00
  • f157be1742 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-09-03 20:28:47 -07:00
  • 611eaeca8e added 'separate disk reads' parm so we can allow spider disk threads to compete with query disk threads. that way ppl constantly doing queries won't slow the spiders but their queries might be slower. Matt Wells 2015-09-03 20:27:54 -07:00
  • 26404443a8 if disk thread took 0 ms then put *>*XMB/s in thread table Matt 2015-09-03 14:50:56 -06:00
  • d9a8a16751 unit fix Matt 2015-09-03 14:30:39 -06:00
  • 1fa7dc7cea Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-09-03 14:28:17 -06:00
  • 6cf69399e5 handle bad host sending us a c1 request so we dont core Matt 2015-09-03 14:27:33 -06:00
  • 7ca4e753aa little core fix Matt Wells 2015-09-03 12:52:42 -07:00
  • ad95058fd5 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-09-03 12:01:15 -06:00
  • b50f36cd73 a whole new threads stack Matt 2015-09-03 11:59:43 -06:00
  • 254eba2a37 fix core Matt Wells 2015-09-02 15:26:43 -07:00
  • a38baaac0c spider time speed ups for crawlbot jobs Matt Wells 2015-09-02 15:03:46 -07:00
  • e92aefebcf only add docs indexed stats on host #0 statsdb Matt Wells 2015-09-02 13:56:25 -07:00
  • 8d69b6a867 profiler fix Matt Wells 2015-09-02 13:45:13 -07:00