Commit Graph

  • 2c750b2c22 Merge branch 'diffbot-testing' into diffbot-matt mwells 2014-06-04 13:56:44 -07:00
  • d98cf4b2b0 try to prevent slamming diffbot backend with bulk jobs consisting of hundreds of different domains/ips. Matt Wells 2014-06-04 12:37:49 -07:00
  • 4298e4e752 sanity checks for debugging duplicate titledb file bug. Matt Wells 2014-06-04 12:15:12 -07:00
  • b7d9002a05 fix log bug Matt Wells 2014-06-04 10:57:25 -07:00
  • 8b74bd855b Merge branch 'master' into diffbot-testing Matt Wells 2014-06-04 09:37:55 -07:00
  • fcc8bc85cc update bulk job restart Matt Wells 2014-06-04 09:36:26 -07:00
  • e2ca303fe2 doc updates Matt Wells 2014-06-04 07:38:40 -07:00
  • a734240474 minor date change in documentation. mwells 2014-06-04 07:26:46 -07:00
  • beba94013e remove clustermaintenance documentation. seemed pretty obsolete. mwells 2014-06-04 07:26:10 -07:00
  • 3fd973a53e documentation updates for scaling the cluster mwells 2014-06-04 07:17:34 -07:00
  • ec1b66aff5 Merge branch 'master' into diffbot-testing Matt Wells 2014-06-03 20:50:59 -07:00
  • 585e6a357f parm documentation update for url filters Matt Wells 2014-06-03 20:50:22 -07:00
  • b534ac5812 do not print completed time if spidering is going on Matt Wells 2014-06-03 20:30:10 -07:00
  • 694b19e053 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-06-03 18:07:51 -07:00
  • 07cf2f1129 fix core Matt Wells 2014-06-03 18:07:35 -07:00
  • f8073b5adc Merge branch 'diffbot-testing' into diffbot-matt Matt Wells 2014-06-03 14:59:31 -07:00
  • 50468293e7 fix bool expressions with only one operand. i.e. double parens bug. Matt Wells 2014-06-03 14:46:28 -07:00
  • bf70823260 take out <moreResultsFollow> for &stream=1 for now. maybe add back in later but would be at end of the reply. Matt Wells 2014-06-03 14:09:24 -07:00
  • d23032241d fix mem leak when downloading images is turned on. mwells 2014-06-03 13:26:56 -07:00
  • da677eb8a4 fix for searching for query pipe operator in quotes. Matt Wells 2014-06-03 13:08:35 -07:00
  • 91c7115c73 nothing mwells 2014-06-03 11:49:21 -07:00
  • a1f1daad16 Merge branch 'master' into diffbot-matt mwells 2014-06-03 11:41:46 -07:00
  • 6dcbc10e92 spider proxy updates. mwells 2014-06-03 11:38:44 -07:00
  • ba2329808b fix siteListIsEmpty bug causing spider to spider the whole internet when it shouldn't mwells 2014-06-03 11:37:31 -07:00
  • c3a823c99d fix relative url bug when relative url starts with ? Matt Wells 2014-06-03 10:54:50 -07:00
  • 536b43e19f Merge branch 'master' into diffbot-testing Matt Wells 2014-06-03 10:17:00 -07:00
  • a772e21db6 only show proxy stuff in logs when debugging is on for it mwells 2014-06-02 17:37:43 -07:00
  • 937e275134 zero out the proxy ips mwells 2014-06-02 17:32:13 -07:00
  • 8d55000952 formatting fixes mwells 2014-06-02 17:31:26 -07:00
  • abbc116442 show more spider proxy stats in table mwells 2014-06-02 17:12:25 -07:00
  • 2582a487a5 more spider proxy fixes mwells 2014-06-02 16:53:06 -07:00
  • 38854e44f3 added load points in table display of spider proxies mwells 2014-06-02 16:25:56 -07:00
  • 29c1c83967 select the proxy later down the pipeline to allow for cache hits, etc. mwells 2014-06-02 15:33:25 -07:00
  • 1ba445ae41 update times used mwells 2014-06-02 15:20:46 -07:00
  • 5377a7543c more spider proxy bug fixes mwells 2014-06-02 15:17:43 -07:00
  • ee5af6b30e more spider proxy fixes mwells 2014-06-02 14:59:15 -07:00
  • ca450e6bbd using msg55 when done downloading through a proxy to record stats for load balancing on host #0 mwells 2014-06-02 13:48:33 -07:00
  • 806cf79b73 spider proxy updates mwells 2014-06-02 13:18:18 -07:00
  • 0b9b77ea46 call buildProxyTable when ips updated mwells 2014-06-02 10:20:29 -07:00
  • 51bb653bb3 fix stack smash core. mwells 2014-06-01 10:42:49 -07:00
  • 918f43f80e still searched for stripped words even if has a synset. fixed query lang detector in SearchInput.cpp. mwells 2014-06-01 10:18:24 -07:00
  • d15f5d3ce7 when user searches for a word without the accent marks, we now also search for the same word but with the proper accent marks. Matt Wells 2014-06-01 09:37:00 -07:00
  • 6f704d3d6a fix wiktionary-based generation code so we can map a word with accents stripped to the word with the accents in place. Matt Wells 2014-06-01 06:33:16 -07:00
  • f16414b774 fix stripAccentMarks() to use libiconv stuff so all languages are now supported. mwells 2014-05-31 08:14:39 -07:00
  • 5f16013a9e add support for stripping accent marks from greek letters. mwells 2014-05-30 20:09:37 -07:00
  • 9767ec4c84 fix a few cores in spider proxies mwells 2014-05-30 15:13:19 -07:00
  • a811462d5f spider proxy stuff compiles now mwells 2014-05-30 15:05:00 -07:00
  • 509ae2fed8 remove limitations on # of search results requested. we are more of a back-end service so that can be handled by a middle or front layer. Matt Wells 2014-05-29 21:32:24 -07:00
  • 8fb8669da1 more spider proxy updates. mwells 2014-05-29 21:17:51 -06:00
  • 132aabf589 Merge branch 'diffbot-dan' into diffbot-testing Matt Wells 2014-05-29 10:28:38 -07:00
  • 79b2d4859b printCrawlDetailsInJson signature without version Daniel Steinberg 2014-05-28 10:41:32 -07:00
  • 1fae88b739 check version less than 99 Daniel Steinberg 2014-05-28 10:30:26 -07:00
  • a970c12f65 Merge branch 'diffbot' into diffbot-testing Matt Wells 2014-05-28 09:59:39 -07:00
  • bc5b126f2a Merge branch 'diffbot' Matt Wells 2014-05-28 09:15:48 -07:00
  • 662a8a33d0 emergency core fix Matt Wells 2014-05-28 09:29:54 -07:00
  • b3dcca6356 added make master-rpm mwells 2014-05-28 07:48:02 -07:00
  • d8aa79c90d Merge branch 'master' into diffbot-testing Matt Wells 2014-05-28 07:41:45 -07:00
  • 9b985fc233 Merge branch 'testing' mwells 2014-05-28 07:36:45 -07:00
  • 17e1fbc16c fix getPitPosLL() error causing lang detection to screw up. mwells 2014-05-28 07:35:05 -07:00
  • c06f9fde36 gigablast now has a notion of version based on the request Daniel Steinberg 2014-05-27 20:11:12 -07:00
  • d928a16211 Merge branch 'diffbot-testing' into diffbot-matt Matt Wells 2014-05-27 15:22:38 -07:00
  • f341dba0c8 got the general framework for load-balanced/reliable floaters in place for the distributed spider network. need to fill in the blanks now. Matt Wells 2014-05-27 15:21:12 -07:00
  • 7448e8a1ff don't use "expand" for mode= requests or non-analyze requests Daniel Steinberg 2014-05-26 20:38:44 -07:00
  • da328a8d2f turn off spider reply indexing by default until we stop indexing simple words in url mwells 2014-05-26 13:22:43 -06:00
  • 068a299339 update documentation mwells 2014-05-26 13:09:57 -06:00
  • 2d4fb483b2 disambiguate error msg Matt Wells 2014-05-26 10:46:10 -07:00
  • 8149e99965 added developer.html warning msg. mwells 2014-05-26 11:41:12 -06:00
  • 58a2c04e30 more admin.html updates. mwells 2014-05-26 11:39:26 -06:00
  • cea69c35cb make sure all subsections of admin.html have a last updated time or a warning if documentation is old. mwells 2014-05-26 11:31:54 -06:00
  • c89c1f1471 Merge branch 'master' into testing mwells 2014-05-26 11:25:33 -06:00
  • db1703d500 admin.html updates mwells 2014-05-26 11:24:18 -06:00
  • f54c5192f2 updated admin.html to remove some stuff not needed. mwells 2014-05-26 10:57:39 -06:00
  • 8946517b7c minor admin.html update Matt Wells 2014-05-26 10:30:48 -04:00
  • fe536cf31f minor updates to admin.html mwells 2014-05-26 07:14:32 -07:00
  • 5ecd486f48 update admin.html Matt Wells 2014-05-25 22:28:05 -04:00
  • b201333549 Merge branch 'master' into testing Matt Wells 2014-05-25 22:13:45 -04:00
  • 8ad18d2cd3 make it so we don't need --nodeps with rpm -ivh (rpm install) to install pkg. Matt Wells 2014-05-25 22:08:46 -04:00
  • 2e7f32b01a fix getcwd2() so it works on red hat. defaults to /var/gigablast/data0/gb if cmd is "gb" and the "gb" binary is not in the current working directory. Matt Wells 2014-05-25 20:53:49 -04:00
  • d0df3da508 added dotemacs mwells 2014-05-25 07:54:30 -07:00
  • b0f9227bbc path fixes for gb startup Matt Wells 2014-05-25 10:28:13 -04:00
  • 98c2e7a8b6 redhat build updates on fedora Matt Wells 2014-05-25 09:58:07 -04:00
  • 3fe1d3f184 updates to compile cleanly on redhat. Matt Wells 2014-05-24 23:58:12 -04:00
  • b33959191b rpmbuild updates mwells 2014-05-24 07:16:17 -07:00
  • 8234aaed23 put lastspidertimeutc back in because we need it for debugging. Matt Wells 2014-05-23 09:43:46 -07:00
  • e3b6f6b74e a second fix for crawls saying they're done and then resuming. it seems to happen when we turn spiders off then back on again. so hack that. Matt Wells 2014-05-23 07:29:18 -07:00
  • 562b3eafda more spec file fixes. use relative symlinks mwells 2014-05-22 21:57:46 -07:00
  • 5c55517fe6 more rpm build fixes mwells 2014-05-22 21:01:30 -07:00
  • ddec6353ed rpm updates mwells 2014-05-22 19:24:33 -07:00
  • a783c9155b add spec file to build rpm. mwells 2014-05-22 19:06:09 -07:00
  • b2e9cfcc1b minor make install changes mwells 2014-05-22 18:46:38 -07:00
  • 1f4dc2df97 fix bug in spider scan of spiderdb for unique firstips Matt Wells 2014-05-22 13:08:01 -07:00
  • 68fcffb2da speed up scan of spiderdb to repopulate waiting tree by jumping over last firstip. Matt Wells 2014-05-22 12:20:03 -07:00
  • e9c4c9bb9a fix possible loss of data when doing reads on especially doledb. Matt Wells 2014-05-22 11:06:56 -07:00
  • 1660805f66 more useful logging for debugging Matt Wells 2014-05-22 10:36:44 -07:00
  • 32735677d2 wait 45 seconds before ending round, not 30 to try to fix some issues... Matt Wells 2014-05-22 08:32:19 -07:00
  • 935cc72e19 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2014-05-21 13:55:29 -07:00
  • b8886c399c show start/end job times on pagecrawlbot. Matt Wells 2014-05-21 13:55:01 -07:00
  • 61fc015014 fix potential diffbot injection bug Matt Wells 2014-05-21 12:21:29 -07:00
  • b0c87b355c log update Matt Wells 2014-05-21 10:09:50 -07:00
  • 45df139ccb update logging Matt Wells 2014-05-21 10:05:49 -07:00