Commit Graph

  • a1ed368d82 bring back max mem control into master controls. it's useful to limit per process mem usage to prevent oom killer because we can't save if we get killed. overhaul diskpagecache to just use rdbcache. much simpler and faster, but disabled for now until debugged more. reduce min files to merge for crawlbot collections so they stay more tightly merged to conserve fds and mem. improved logDebugDisk msgs. overhauled File.cpp fd pool. now it is way faster and doesn't use any extra mem. much simpler too. although could be sped up a little by using a linked list, but probably is not significant enough to warrant doing right now. increase mem ptr table from 3M to 8M slots. should really make dynamic though. fix core from null msg20s[0]->m_r. only call attemptMergeAll once every 60 seconds really. do not attempt merge if already merging. Matt 2015-08-14 12:58:54 -06:00
  • f09a94fc4e Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-08-13 23:31:17 -06:00
  • 36b8d384bd Fixes to injector script. New colors and metrics on performance graph. Zak Betz 2015-08-13 23:29:20 -06:00
  • 5c67cbe65d undo Matt 2015-08-12 08:43:44 -07:00
  • 444ebeeb65 one scp install per host Matt 2015-08-12 08:39:01 -07:00
  • 5c2a2ce496 fix core Matt 2015-08-12 08:36:23 -07:00
  • adc9d3bc89 Merge branch 'testing' into diffbot-testing Matt 2015-08-08 19:22:50 -06:00
  • 3477d39608 fix cores Matt 2015-08-08 19:22:01 -06:00
  • 840ca3fea1 fix rdbmap reduce mem thing Matt Wells 2015-08-08 15:43:09 -07:00
  • f8047ac5ef speed up Rdb::attemptMergeAll() because it is a problem according to the profiler when we got tens of thousands of collections. Matt Wells 2015-08-08 12:27:18 -07:00
  • c2bf461d27 call reduceMemFootprint() after writing rdb map to save mem immediately rather than on restart of gb Matt Wells 2015-08-08 11:23:14 -07:00
  • 890170aa90 fix core from archive.org yml file checking. show site ip in inlinker table for easier spam removal. Matt 2015-08-02 12:50:29 -06:00
  • c1ec4dedbb fix for bad query formation. text:""foo bar"" Matt 2015-08-02 11:34:55 -06:00
  • 37591be421 Merge branch 'diffbot-testing' of https://github.com/gigablast/open-source-search-engine Kevin Truong 2015-07-31 18:12:56 -07:00
  • b6207ec344 Fixes #3012. Allow facet ranges to work on negative numbers. Kevin Truong 2015-07-31 18:11:37 -07:00
  • 18d1a787bb fix core dump from meta data in title rec that was just a \0 from injecting content that way Matt 2015-07-31 18:42:21 -06:00
  • e18fca88f4 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-07-31 08:56:47 -07:00
  • 85c7fbae70 fix infinite loop bug from EBADRDBID Matt Wells 2015-07-31 08:56:26 -07:00
  • afc1b43619 Merge branch 'ia' into ia-zak Matt 2015-07-30 10:22:23 -06:00
  • 5af61ff59a fix core from boolean queries Matt 2015-07-30 10:21:30 -06:00
  • 72768c093d Merge branch 'diffbot-sam' of github.com:gigablast/open-source-search-engine into diffbot-sam Matt 2015-07-23 17:24:41 -06:00
  • 86946392d0 reverted stepping. Useless sam 2015-07-23 10:53:59 -07:00
  • da41d53575 Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-07-23 09:27:00 -06:00
  • e165b5d668 speed up bool queries Matt Wells 2015-07-22 13:00:45 -07:00
  • e9f86f362e Merge branch 'ia' into ia-zak Matt 2015-07-22 12:02:19 -06:00
  • dead58329e Add a script for interacting with hosts.conf files. Zak Betz 2015-07-21 10:17:01 -06:00
  • 090e1b35d5 fix score info reporting for new bool query min score based on # of query terms contained. Matt 2015-07-20 14:37:37 -06:00
  • 69c791e5aa for now at least do not use siterank for ranking boolean search results. Matt 2015-07-20 11:50:31 -06:00
  • 1c93a88d82 use the # of matched terms as the score of a doc when doing a boolean query. later: use proximity scoring for non-field query terms. Matt 2015-07-20 11:09:56 -06:00
  • ff7639e323 do not get synonyms for boolean operators. just skip synonyms if ignoreWord is set at all. Matt 2015-07-19 13:07:05 -06:00
  • 646bc91c59 fix more possible unicode errors Matt 2015-07-19 12:05:09 -06:00
  • b9fc583cae fix core Matt 2015-07-18 18:01:11 -06:00
  • 16fd428887 fix more cores from the dynamic query size changes. add how many query terms we truncated in the json/xml replies. document those fields as well. Matt 2015-07-18 14:15:47 -06:00
  • dab0726fac typo fix Matt Wells 2015-07-17 10:43:38 -06:00
  • 5e7a06229c print special message if no seeds were able to be crawled. Matt 2015-07-17 08:42:01 -06:00
  • 3ffa651b63 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-07-16 12:39:33 -06:00
  • 15eb7f659d Fix some malformed html on hosts page. Fix core when no collection record in injection request. Add a script to test disk speed. Zak Betz 2015-07-16 12:02:14 -06:00
  • 7e526863d7 do not include 'diffbot uri' in urls.csv. should not have been there. Matt 2015-07-16 10:11:04 -06:00
  • 0d3cfc2796 single words in quotes - keep them in quotes so we do not get synonym forms Matt 2015-07-15 09:58:25 -06:00
  • f1b0bd0149 quick fix for tree sanity checker Matt 2015-07-15 09:46:27 -06:00
  • 0d1acb09bc try to fix tree if corruption detected when dumping to disk Matt Wells 2015-07-14 22:27:43 -06:00
  • b0a6e590d6 treat estimatedDate like date sam 2015-07-14 17:18:16 -07:00
  • 8048517463 gbss fix Matt 2015-07-14 18:17:52 -06:00
  • 016fa88b29 treat estimatedDate like date sam 2015-07-14 17:17:21 -07:00
  • 9946b4b4be add gbssDiffbotType and gbssIsSeedUrl:1 to spider status docs. Matt 2015-07-14 17:59:50 -06:00
  • a697b3d5a5 Fix Bad File Descriptor loop bug when downloading a static file on a slow disk. Zak Betz 2015-07-14 17:00:09 -06:00
  • fa38d97ec4 Merge branch 'diffbot' into diffbot-testing Matt 2015-07-14 11:45:05 -06:00
  • f173b41e92 additional log info Matt 2015-07-14 11:44:14 -06:00
  • baff94875d fix another core from dynamic query sizing Matt 2015-07-14 09:23:44 -06:00
  • c8cf0e5440 fix some mem leaks from allowing really big queries. added a max query term control to search controls to limit users doing really big queries. but default it very high to 1M. Matt 2015-07-13 23:17:53 -06:00
  • f3d35b557f should solve defect #3002 sam 2015-07-13 18:08:25 -07:00
  • fc4b4db425 fix core related to increasing max query length Matt 2015-07-13 19:00:47 -06:00
  • c3a0f21600 nomenclature changes Matt 2015-07-13 18:42:13 -06:00
  • acf389debd Merge branch 'ia' into ia-zak Matt Wells 2015-07-13 18:39:00 -06:00
  • 1ba57f9278 fix pesky memory leak finally Matt Wells 2015-07-13 17:47:34 -06:00
  • c03594034d bump up some limits for extraordinarily long queries Matt Wells 2015-07-13 17:43:28 -06:00
  • 34ec49e804 get mike's super long query working Matt 2015-07-13 14:59:44 -06:00
  • 0e009fa6bc fix cores from dynamic # query terms fix Matt 2015-07-10 20:49:40 -06:00
  • f088e734f6 allow up to 3000 query terms. really we can allow much more since we are mostly dynamically allocating, only a few smaller arrays use the 3000 on the stack. Matt 2015-07-10 19:02:30 -06:00
  • 5d57862046 do not core on gigabits overflow issue Matt 2015-07-10 11:00:16 -06:00
  • 795bdf2a78 Merge branch 'ia-zak' of github.com:gigablast/open-source-search-engine into ia-zak Matt 2015-07-08 21:36:58 -06:00
  • 46af0e1bce if url too long return the EURLTOOBIG error code. it prints 'Too many chars in url' as the official error msg. Matt 2015-07-08 21:36:18 -06:00
  • f4effaecb8 Fix memory leak. Zak Betz 2015-07-08 21:27:57 -06:00
  • 8f19b1a0e2 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-07-08 14:03:53 -06:00
  • 6e21bc7d7c Injection script fixes. Temporary fix for core when injecting large warc. Zak Betz 2015-07-08 14:03:39 -06:00
  • 6cd91ae32c Merge branch 'ia' into ia-zak Matt 2015-07-08 13:49:25 -06:00
  • a15d2470f5 Merge branch 'testing' into diffbot-testing Matt 2015-07-08 13:48:58 -06:00
  • 581f287113 api doc update for facets Matt 2015-07-08 13:48:16 -06:00
  • 3395ee8111 fix core in sections Matt 2015-07-08 08:15:30 -06:00
  • a7ae510e31 Fix string faceting display for json metadata. Add unit test for faceted metadata. Zak Betz 2015-07-06 23:05:18 -06:00
  • 97f2052d63 remove the debug log sam 2015-07-06 18:05:35 -07:00
  • bcd53016e9 Fixes #2947. Fixed a bug with counting facet 'totalDocsWithField' and 'totalDocsWithFieldAndValue' Kevin Truong 2015-07-06 17:42:47 -07:00
  • 6745c72232 implemented stepping sam 2015-07-06 17:28:17 -07:00
  • 87fcda0f93 Fix atotime5 to parse ISO8601. Fix qa test for warcs and arcs. Fix inject script. Zak Betz 2015-07-06 00:51:18 -06:00
  • a844f618cc Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-07-05 17:07:40 -06:00
  • fa78164a30 Injection script now keeps track of injection date and won't reinject something that hasn't changed. Zak Betz 2015-07-05 17:07:21 -06:00
  • adae6689e6 fix add url from root page. fix core from corruption Matt 2015-07-04 21:52:11 -06:00
  • 815bd7ce0a quite a few bug fixes. Matt 2015-07-02 17:42:05 -06:00
  • f664f1788d Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-07-02 12:18:08 -06:00
  • 6de4199ee8 Fix linkdb core. Make file and line number the label for StackBuf. Zak Betz 2015-07-02 12:17:10 -06:00
  • 1966f36c00 fix clock candidate bug Matt 2015-07-01 20:34:39 -06:00
  • 0cb3a5d44e fix ptr_metadata issue Matt 2015-07-01 20:01:57 -06:00
  • 1327301a8d Merge branch 'testing' into ia Matt 2015-07-01 19:03:56 -06:00
  • 4e987f71d2 use gigabits json array, and update the json results output documentation to describe each thing. Matt 2015-07-01 18:59:52 -06:00
  • 5c61495aea Merge branch 'diffbot-testing' into testing Matt 2015-07-01 18:56:56 -06:00
  • b88079a2d4 Fix warc injector script. Zak Betz 2015-06-30 22:19:13 -06:00
  • 7b507a70ef Set value length to 0 for something that does not return a string value in Json.cpp. Fix the '-' -> '_' when indexing generic fields. Add a StackBuf macro which is a Safebuf initialized with a small stack buffer for use in a local scope. Zak Betz 2015-06-30 14:09:57 -06:00
  • f615ac9331 try to fix infinite loop bug again Matt 2015-06-25 17:30:27 -07:00
  • 8c8c7eddf6 Merge branch 'diffbot-testing' into testing Matt 2015-06-23 17:45:35 -06:00
  • 57020dcb24 added micro.html Matt 2015-06-23 17:45:25 -06:00
  • 1e3c52a0ef fix infinite loop bug from performance enhancement using active list for spidering i put in a few days back. Matt 2015-06-23 13:52:02 -07:00
  • 5aa8e6ba2a Merge branch 'diffbot-testing' into diffbot Matt 2015-06-22 14:31:39 -07:00
  • dfdea910ad fix fix Matt 2015-06-19 08:50:59 -07:00
  • 7a0ae294a2 reduce log spam when rebalancing Matt 2015-06-19 08:47:06 -07:00
  • 5b104f1bb8 diffbot max ips from 1 back to 7 Matt Wells 2015-06-18 15:29:03 -07:00
  • bdebd79f4f spiderloop active list bug fix. change diffbot ip max from 1 to 7 again. Matt Wells 2015-06-18 15:05:16 -07:00
  • 5f5ce7d12c Merge branch 'diffbot-testing' into testing Matt 2015-06-18 11:02:21 -06:00
  • e1aab778e9 fix errno miscount bug. fix infinite loop in active list logic. Matt 2015-06-18 10:55:07 -06:00
  • 902a8fc61d fix errno mismatch bug Matt 2015-06-18 10:33:27 -06:00
  • 0493e7a899 use linked lists for closing least used fds for speed. right now just log if it differs from current algo. Matt Wells 2015-06-18 09:19:13 -07:00