Commit Graph

  • 8a6be3b4ac fix contention of one collection starving others for urls from the same ip. Matt Wells 2015-08-10 11:04:49 -07:00
  • adc9d3bc89 Merge branch 'testing' into diffbot-testing Matt 2015-08-08 19:22:50 -06:00
  • 3477d39608 fix cores Matt 2015-08-08 19:22:01 -06:00
  • 840ca3fea1 fix rdbmap reduce mem thing Matt Wells 2015-08-08 15:43:09 -07:00
  • f8047ac5ef speed up Rdb::attemptMergeAll() because it is a problem according to the profiler when we got tens of thousands of collections. Matt Wells 2015-08-08 12:27:18 -07:00
  • c2bf461d27 call reduceMemFootprint() after writing rdb map to save mem immediately rather than on restart of gb Matt Wells 2015-08-08 11:23:14 -07:00
  • 890170aa90 fix core from archive.org yml file checking. show site ip in inlinker table for easier spam removal. Matt 2015-08-02 12:50:29 -06:00
  • c1ec4dedbb fix for bad query formation. text:""foo bar"" Matt 2015-08-02 11:34:55 -06:00
  • 37591be421 Merge branch 'diffbot-testing' of https://github.com/gigablast/open-source-search-engine Kevin Truong 2015-07-31 18:12:56 -07:00
  • b6207ec344 Fixes #3012. Allow facet ranges to work on negative numbers. Kevin Truong 2015-07-31 18:11:37 -07:00
  • 18d1a787bb fix core dump from meta data in title rec that was just a \0 from injecting content that way Matt 2015-07-31 18:42:21 -06:00
  • e18fca88f4 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-07-31 08:56:47 -07:00
  • 85c7fbae70 fix infinite loop bug from EBADRDBID Matt Wells 2015-07-31 08:56:26 -07:00
  • afc1b43619 Merge branch 'ia' into ia-zak Matt 2015-07-30 10:22:23 -06:00
  • 5af61ff59a fix core from boolean queries Matt 2015-07-30 10:21:30 -06:00
  • 72768c093d Merge branch 'diffbot-sam' of github.com:gigablast/open-source-search-engine into diffbot-sam Matt 2015-07-23 17:24:41 -06:00
  • 86946392d0 reverted stepping. Useless sam 2015-07-23 10:53:59 -07:00
  • da41d53575 Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-07-23 09:27:00 -06:00
  • e165b5d668 speed up bool queries Matt Wells 2015-07-22 13:00:45 -07:00
  • e9f86f362e Merge branch 'ia' into ia-zak Matt 2015-07-22 12:02:19 -06:00
  • dead58329e Add a script for interacting with hosts.conf files. Zak Betz 2015-07-21 10:17:01 -06:00
  • 090e1b35d5 fix score info reporting for new bool query min score based on # of query terms contained. Matt 2015-07-20 14:37:37 -06:00
  • 69c791e5aa for now at least do not use siterank for ranking boolean search results. Matt 2015-07-20 11:50:31 -06:00
  • 1c93a88d82 use the # of matched terms as the score of a doc when doing a boolean query. later: use proximity scoring for non-field query terms. Matt 2015-07-20 11:09:56 -06:00
  • ff7639e323 do not get synonyms for boolean operators. just skip synonyms if ignoreWord is set at all. Matt 2015-07-19 13:07:05 -06:00
  • 646bc91c59 fix more possible unicode errors Matt 2015-07-19 12:05:09 -06:00
  • b9fc583cae fix core Matt 2015-07-18 18:01:11 -06:00
  • 16fd428887 fix more cores from the dynamic query size changes. add how many query terms we truncated in the json/xml replies. document those fields as well. Matt 2015-07-18 14:15:47 -06:00
  • dab0726fac typo fix Matt Wells 2015-07-17 10:43:38 -06:00
  • 5e7a06229c print special message if no seeds were able to be crawled. Matt 2015-07-17 08:42:01 -06:00
  • 3ffa651b63 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-07-16 12:39:33 -06:00
  • 15eb7f659d Fix some malformed html on hosts page. Fix core when no collection record in injection request. Add a script to test disk speed. Zak Betz 2015-07-16 12:02:14 -06:00
  • 7e526863d7 do not include 'diffbot uri' in urls.csv. should not have been there. Matt 2015-07-16 10:11:04 -06:00
  • 0d3cfc2796 single words in quotes - keep them in quotes so we do not get synonym forms Matt 2015-07-15 09:58:25 -06:00
  • f1b0bd0149 quick fix for tree sanity checker Matt 2015-07-15 09:46:27 -06:00
  • 0d1acb09bc try to fix tree if corruption detected when dumping to disk Matt Wells 2015-07-14 22:27:43 -06:00
  • b0a6e590d6 treat estimatedDate like date sam 2015-07-14 17:18:16 -07:00
  • 8048517463 gbss fix Matt 2015-07-14 18:17:52 -06:00
  • 016fa88b29 treat estimatedDate like date sam 2015-07-14 17:17:21 -07:00
  • 9946b4b4be add gbssDiffbotType and gbssIsSeedUrl:1 to spider status docs. Matt 2015-07-14 17:59:50 -06:00
  • a697b3d5a5 Fix Bad File Descriptor loop bug when downloading a static file on a slow disk. Zak Betz 2015-07-14 17:00:09 -06:00
  • fa38d97ec4 Merge branch 'diffbot' into diffbot-testing Matt 2015-07-14 11:45:05 -06:00
  • f173b41e92 additional log info Matt 2015-07-14 11:44:14 -06:00
  • baff94875d fix another core from dynamic query sizing Matt 2015-07-14 09:23:44 -06:00
  • c8cf0e5440 fix some mem leaks from allowing really big queries. added a max query term control to search controls to limit users doing really big queries. but default it very high to 1M. Matt 2015-07-13 23:17:53 -06:00
  • f3d35b557f should solve defect #3002 sam 2015-07-13 18:08:25 -07:00
  • fc4b4db425 fix core related to increasing max query length Matt 2015-07-13 19:00:47 -06:00
  • c3a0f21600 nomenclature changes Matt 2015-07-13 18:42:13 -06:00
  • acf389debd Merge branch 'ia' into ia-zak Matt Wells 2015-07-13 18:39:00 -06:00
  • 1ba57f9278 fix pesky memory leak finally Matt Wells 2015-07-13 17:47:34 -06:00
  • c03594034d bump up some limits for extraordinarily long queries Matt Wells 2015-07-13 17:43:28 -06:00
  • 34ec49e804 get mike's super long query working Matt 2015-07-13 14:59:44 -06:00
  • 0e009fa6bc fix cores from dynamic # query terms fix Matt 2015-07-10 20:49:40 -06:00
  • f088e734f6 allow up to 3000 query terms. really we can allow much more since we are mostly dynamically allocating, only a few smaller arrays use the 3000 on the stack. Matt 2015-07-10 19:02:30 -06:00
  • 5d57862046 do not core on gigabits overflow issue Matt 2015-07-10 11:00:16 -06:00
  • 795bdf2a78 Merge branch 'ia-zak' of github.com:gigablast/open-source-search-engine into ia-zak Matt 2015-07-08 21:36:58 -06:00
  • 46af0e1bce if url too long return the EURLTOOBIG error code. it prints 'Too many chars in url' as the official error msg. Matt 2015-07-08 21:36:18 -06:00
  • f4effaecb8 Fix memory leak. Zak Betz 2015-07-08 21:27:57 -06:00
  • 8f19b1a0e2 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-07-08 14:03:53 -06:00
  • 6e21bc7d7c Injection script fixes. Temporary fix for core when injecting large warc. Zak Betz 2015-07-08 14:03:39 -06:00
  • 6cd91ae32c Merge branch 'ia' into ia-zak Matt 2015-07-08 13:49:25 -06:00
  • a15d2470f5 Merge branch 'testing' into diffbot-testing Matt 2015-07-08 13:48:58 -06:00
  • 581f287113 api doc update for facets Matt 2015-07-08 13:48:16 -06:00
  • 3395ee8111 fix core in sections Matt 2015-07-08 08:15:30 -06:00
  • a7ae510e31 Fix string faceting display for json metadata. Add unit test for faceted metadata. Zak Betz 2015-07-06 23:05:18 -06:00
  • 97f2052d63 remove the debug log sam 2015-07-06 18:05:35 -07:00
  • bcd53016e9 Fixes #2947. Fixed a bug with counting facet 'totalDocsWithField' and 'totalDocsWithFieldAndValue' Kevin Truong 2015-07-06 17:42:47 -07:00
  • 6745c72232 implemented stepping sam 2015-07-06 17:28:17 -07:00
  • 87fcda0f93 Fix atotime5 to parse ISO8601. Fix qa test for warcs and arcs. Fix inject script. Zak Betz 2015-07-06 00:51:18 -06:00
  • a844f618cc Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-07-05 17:07:40 -06:00
  • fa78164a30 Injection script now keeps track of injection date and won't reinject something that hasn't changed. Zak Betz 2015-07-05 17:07:21 -06:00
  • adae6689e6 fix add url from root page. fix core from corruption Matt 2015-07-04 21:52:11 -06:00
  • 815bd7ce0a quite a few bug fixes. Matt 2015-07-02 17:42:05 -06:00
  • f664f1788d Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-07-02 12:18:08 -06:00
  • 6de4199ee8 Fix linkdb core. Make file and line number the label for StackBuf. Zak Betz 2015-07-02 12:17:10 -06:00
  • 1966f36c00 fix clock candidate bug Matt 2015-07-01 20:34:39 -06:00
  • 0cb3a5d44e fix ptr_metadata issue Matt 2015-07-01 20:01:57 -06:00
  • 1327301a8d Merge branch 'testing' into ia Matt 2015-07-01 19:03:56 -06:00
  • 4e987f71d2 use gigabits json array, and update the json results output documentation to describe each thing. Matt 2015-07-01 18:59:52 -06:00
  • 5c61495aea Merge branch 'diffbot-testing' into testing Matt 2015-07-01 18:56:56 -06:00
  • b88079a2d4 Fix warc injector script. Zak Betz 2015-06-30 22:19:13 -06:00
  • 7b507a70ef Set value length to 0 for something that does not return a string value in Json.cpp. Fix the '-' -> '_' when indexing generic fields. Add a StackBuf macro which is a Safebuf initialized with a small stack buffer for use in a local scope. Zak Betz 2015-06-30 14:09:57 -06:00
  • f615ac9331 try to fix infinite loop bug again Matt 2015-06-25 17:30:27 -07:00
  • 8c8c7eddf6 Merge branch 'diffbot-testing' into testing Matt 2015-06-23 17:45:35 -06:00
  • 57020dcb24 added micro.html Matt 2015-06-23 17:45:25 -06:00
  • 1e3c52a0ef fix infinite loop bug from performance enhancement using active list for spidering i put in a few days back. Matt 2015-06-23 13:52:02 -07:00
  • 5aa8e6ba2a Merge branch 'diffbot-testing' into diffbot Matt 2015-06-22 14:31:39 -07:00
  • dfdea910ad fix fix Matt 2015-06-19 08:50:59 -07:00
  • 7a0ae294a2 reduce log spam when rebalancing Matt 2015-06-19 08:47:06 -07:00
  • 5b104f1bb8 diffbot max ips from 1 back to 7 Matt Wells 2015-06-18 15:29:03 -07:00
  • bdebd79f4f spiderloop active list bug fix. change diffbot ip max from 1 to 7 again. Matt Wells 2015-06-18 15:05:16 -07:00
  • 5f5ce7d12c Merge branch 'diffbot-testing' into testing Matt 2015-06-18 11:02:21 -06:00
  • e1aab778e9 fix errno miscount bug. fix infinite loop in active list logic. Matt 2015-06-18 10:55:07 -06:00
  • 902a8fc61d fix errno mismatch bug Matt 2015-06-18 10:33:27 -06:00
  • 0493e7a899 use linked lists for closing least used fds for speed. right now just log if it differs from current algo. Matt Wells 2015-06-18 09:19:13 -07:00
  • 18dbaf89c9 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-06-18 08:40:53 -07:00
  • e9f1ab1150 make donesleepingwrapper in spider.cpp faster using the active list of colls to save time. Matt Wells 2015-06-18 08:38:46 -07:00
  • f490847eb2 Fix injector build script. Add IA's lib for getting metadata. Zak Betz 2015-06-18 01:23:13 -06:00
  • 9f61636881 Change collection on inject script. Zak Betz 2015-06-18 00:24:36 -06:00
  • 9ca0223cf1 Translate metadata field names with dashes to _. Add unit tests for searching for certain types of metadata. Zak Betz 2015-06-17 23:36:31 -06:00