Commit Graph

  • 0fe0147913 fix invisible columns in url filters table. mwells 2013-09-25 12:24:13 -0600
  • 1d92004e06 fix spider flow debug msgs mwells 2013-09-25 12:07:11 -0600
  • 40192249f9 spider speedups and fixes. mwells 2013-09-25 11:58:03 -0600
  • e34afd21ea fix bug of possibly not removing some locks Matt Wells 2013-09-25 09:28:35 -0700
  • a687380aeb fix a bug of not reading enough spiderdb records for a given "ip" because short reads were causing us to bail out early. still not sure as to the cause of the short reads. Matt Wells 2013-09-24 20:48:48 -0700
  • fbd853fdf7 fix long-standing spider bug causing some ip queues to not get fully spidered. Matt Wells 2013-09-24 20:44:55 -0700
  • b16d8519fc more spider fixes. still need more speedups when spidering multiple spiders on same ip. mwells 2013-09-24 16:40:14 -0600
  • e594af898a seems like we can spider multiple urls from same ip at same time now. mwells 2013-09-24 09:32:26 -0600
  • 8461e33b53 fixed more spider bugs. mwells 2013-09-23 21:26:27 -0700
  • b90ef3de0d more spider fixes. right after getting lock, use msg12 to remove rec from doledb/doleiptable and add 0 entry to waiting table so doledb is again immediately repopulated with that firstIp so we can spider multiple urls from the same ip at the same time. mwells 2013-09-23 20:25:28 -0600
  • 7c31ecff4a fixed fakedb key support. mwells 2013-09-23 15:16:23 -0600
  • 4d33737ac1 fakedb fixes mwells 2013-09-23 08:19:54 -0700
  • 83e87fc755 fixed ability to spider multiple urls from the same IP at the same time. Also respects sameIpWait constraints. mwells 2013-09-20 15:42:48 -0700
  • 05400a0c25 updated spider code documentation. mwells 2013-09-20 11:19:24 -0700
  • fbd62cecba updated compilation instructions. need to apt-get install gcc-multilib. Matt Wells 2013-09-20 10:06:01 -0700
  • bcc55dc46b fixed a couple bugs. Added more documentation into Spider.h. Matt Wells 2013-09-19 18:21:52 -0700
  • 47465f6d90 more fixes. trying to fix spiders to spider multiple urls from same ip... Matt Wells 2013-09-19 11:13:40 -0700
  • a3ea867305 update crawlbot api. Matt Wells 2013-09-18 17:13:36 -0700
  • 022caeec04 use -diffbotxyz%li as a more unique appendage. show token on crawlbot page. Matt Wells 2013-09-18 17:05:41 -0700
  • 29f5c5d644 added isonsamesubdomain and isonsamedomain Matt Wells 2013-09-18 16:45:37 -0700
  • 8de246d9c4 only show urls being spidered from your coll Matt Wells 2013-09-18 16:29:47 -0700
  • 3bdd28ab1d fix spider bug Matt Wells 2013-09-18 16:17:08 -0700
  • 7fdbd0f66a delete spider coll when deleting coll Matt Wells 2013-09-18 15:36:30 -0700
  • f90d20f4dd diffbot api integration updates Matt Wells 2013-09-18 15:07:47 -0700
  • 70ff54ce03 hide the parms that might scare users away in the url filters. Matt Wells 2013-09-18 14:27:59 -0700
  • 6af02119a1 use cookies to display url filters table. Matt Wells 2013-09-18 13:50:55 -0700
  • 04b0a08ef9 propagate showtable=1 when submitting url filters table Matt Wells 2013-09-18 12:38:05 -0700
  • 924d1320a2 fix bugs inserting and deleting rows using TYPE_SAFEBUF parms. Matt Wells 2013-09-18 12:35:01 -0700
  • c1bcebb7bb url filter documentation update. Matt Wells 2013-09-18 12:00:29 -0700
  • 459a7e98fb add diffbot dropdown to url filters table Matt Wells 2013-09-18 11:24:16 -0700
  • 487d3f0a0e fix url filters bugs. Matt Wells 2013-09-18 11:02:09 -0700
  • 39d9760e5d added ismedia url filter to cover all the jpg,gif,mpeg,css rules. Matt Wells 2013-09-18 09:40:59 -0700
  • c77453348f Merge branch 'master' into diffbot Matt Wells 2013-09-18 09:23:48 -0700
  • d6815f2c9d if family filter enabled (&ff=1) then prepend "gbadult:0 |" to the query to restrict to non-adult pages. mwells 2013-09-18 00:11:55 -0600
  • a0032e0eb7 added another log statement for when debugging the adult content detector. we err on the side of caution for the most part. mwells 2013-09-18 00:06:21 -0600
  • 119a4c0c22 fix adult content detector mwells 2013-09-17 23:53:17 -0600
  • 5ec3803312 fix core in hashing gbisadult:[0|1] term. mwells 2013-09-17 23:27:31 -0600
  • 3005f904c7 index gbisadult:1 if adult content gbisadult:0 if not. Matt Wells 2013-09-17 22:05:47 -0700
  • 10fcfb6987 minor updates Matt Wells 2013-09-17 17:32:49 -0700
  • b8590d7df9 do not show json pages if searching pages. Matt Wells 2013-09-17 17:23:58 -0700
  • 7fa4138d1c fix Next 10 link Matt Wells 2013-09-17 17:19:41 -0700
  • 98caa3225a fix query prepend logic for json searches Matt Wells 2013-09-17 17:16:39 -0700
  • 017a0febef fix api dropdown selection. Matt Wells 2013-09-17 16:38:56 -0700
  • 5e3b727eb5 crawlbot api fixes. Matt Wells 2013-09-17 16:30:57 -0700
  • b38d54cef9 save crawlinfo as binary so it's easier to not miss anything. Matt Wells 2013-09-17 16:07:59 -0700
  • 2beff7f7d8 crawlbot api updates Matt Wells 2013-09-17 15:59:50 -0700
  • e50da4d012 crawlbot api fixes Matt Wells 2013-09-17 15:47:44 -0700
  • c16fe8601b more crawlbot api fixes Matt Wells 2013-09-17 15:32:28 -0700
  • e7151e6cc6 fix bug with spiders not coming on. Matt Wells 2013-09-17 14:35:48 -0700
  • c81f700bf0 get reset collection kinda working. Matt Wells 2013-09-17 14:13:44 -0700
  • 4321f02e4e trying to get reset collection working Matt Wells 2013-09-17 12:21:09 -0700
  • fff8b80969 get collection delete working Matt Wells 2013-09-17 11:27:31 -0700
  • 63973cf9c0 get "add new collection" working. Matt Wells 2013-09-17 10:43:23 -0700
  • 02bf6ab3cc new crawlbot api. not backwards compatible any more. Matt Wells 2013-09-17 10:25:54 -0700
  • f34a7f44ab compiler flag fix for xmldoc.o mwells 2013-09-16 22:35:16 -0600
  • afd1b3a9a2 added Diffbot.h mwells 2013-09-16 21:42:48 -0600
  • fc692202ba fix integration of url filters into crawlbot page Matt Wells 2013-09-16 16:27:48 -0700
  • e7ed9254d4 formatting... Matt Wells 2013-09-16 15:33:45 -0700
  • 1a780d1f4a pretty up a little Matt Wells 2013-09-16 15:18:55 -0700
  • a034604cef clean up to remove g_conf.m_useDiffbot Matt Wells 2013-09-16 15:00:43 -0700
  • cb9969ad22 fix token bug Matt Wells 2013-09-16 14:38:29 -0700
  • 3dfba4de69 doc updates Matt Wells 2013-09-16 14:29:01 -0700
  • 4c11265a98 more updates to crawlbot api Matt Wells 2013-09-16 13:59:11 -0700
  • 676437c3c4 more universal api updates. Matt Wells 2013-09-16 11:42:04 -0700
  • 04f7774543 lower spider crawl info stats threshold Matt Wells 2013-09-16 11:27:09 -0700
  • df96f81e78 fix spidering and other things. Matt Wells 2013-09-16 11:22:07 -0700
  • f974d6a47b fixes for crawlbot universal api. Matt Wells 2013-09-16 10:49:37 -0700
  • a50898649b various fixes. Matt Wells 2013-09-16 10:16:49 -0700
  • 9db501d91c resolve merge conflict for nullTerm() Matt Wells 2013-09-16 09:06:33 -0700
  • 78a334198b Merge branch 'master' into diffbot Matt Wells 2013-09-16 09:05:37 -0700
  • 3ac79de92e fix typo adurl -> addurl. Matt Wells 2013-09-16 08:11:06 -0700
  • e6f87f5049 do not send email alerts to sysadmin@gigablast. Matt Wells 2013-09-16 08:10:18 -0700
  • 5deda56ede minor documentation updates. Matt Wells 2013-09-15 22:16:14 -0700
  • 3fdbae4b05 admin.html documentation update. Matt Wells 2013-09-15 22:05:01 -0700
  • 68db2e6cc6 fix bug when checking the delete checkbox on the injection page. Matt Wells 2013-09-15 21:47:42 -0700
  • 965e23f192 fix core from hashtablex::set() not getting enough buf space. now we force it to allocate a minimum of 32 slots to fix another bug where it was re-allocating immediately upon adding a key because growTable() is ALWAYS called if there are less than 20 slots! Matt Wells 2013-09-15 21:15:58 -0700
  • 991e2f30f7 speed up whitelist hashtable like 20x using hashtable key magic. Matt Wells 2013-09-15 21:10:53 -0700
  • 928dc36a03 get "&site=abc.com+xyz.com"... working to restrict search results to specified sites. tested a little. Matt Wells 2013-09-15 20:16:48 -0700
  • 2211881e59 take apt-get install ssl stuff out of admin.html installation instructions since we supply the ssl headers now. mwells 2013-09-15 18:27:47 -0600
  • 01c2a6d381 we already include our own 32-bit libssl.a and libcrypto.a so we can ensure stability. so we have to include the header files as well really. mwells 2013-09-15 18:25:49 -0600
  • 107037c6a2 new &sites=xyz.com+abc.com+... functionality compiles ok. mwells 2013-09-15 18:14:32 -0600
  • b684414e16 almost done adding support for whitelists. i.e. list of sites to restrict search results to, for instance. mwells 2013-09-15 15:15:56 -0600
  • 7ecffec40f universal api updates Matt Wells 2013-09-13 18:10:03 -0700
  • d982997b0c streamline crawl stats. Matt Wells 2013-09-13 17:34:39 -0700
  • 93ce424d99 start working on the main gui for crawlbot which is /crawlbot Matt Wells 2013-09-13 16:22:07 -0700
  • 6b330da240 cleanup warnings in log. Matt Wells 2013-09-13 14:37:35 -0700
  • eb65b9265d call diffbot /api/analyze if classify is true or api = "all" now. it will return "type": in the json to indicate page type. basically, it classifies the page. Matt Wells 2013-09-13 14:13:56 -0700
  • 19056fc3f2 show "processed" instead of "matched". other fixes for spider stats. add new crawl stats. attempts and successes. Matt Wells 2013-09-13 11:51:55 -0700
  • e3e6551e23 fix diffbot bugs. Matt Wells 2013-09-13 11:34:40 -0700
  • 7dd647c222 trying to fix nukeJSON code. Matt Wells 2013-09-13 10:38:34 -0700
  • ef3990da98 now when re-indexing an existing xml doc we first call nukeJSONObjects() to delete any pages that it indexed from the json objects it had in its diffbot reply. in this way it can then re-add them if its new diffbot reply has them again. Matt Wells 2013-09-13 10:00:21 -0700
  • a412c798bf Merge branch 'master' into diffbot Matt Wells 2013-09-13 09:24:28 -0700
  • 5dc7bd2ab4 integrate diffbot from svn back into git. Matt Wells 2013-09-13 09:23:18 -0700
  • e152205765 make depend update mwells 2013-09-09 02:37:47 -0600
  • 1d63aa936c remove plotter.h includes causing compiler errors on some machines. Matt Wells 2013-09-09 01:25:00 -0700
  • 76b390aea2 fix typo Matt Wells 2013-09-08 19:51:57 -0700
  • d930a833cc try to fix compiler error related to bad delete function override. added "throw()" before the first "{" in the function body. mwells 2013-09-08 20:15:39 -0600
  • 828345a4c7 fix compiler warning in types.h. mwells 2013-09-08 20:00:52 -0600
  • e1968b2237 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-09-08 18:43:00 -0700
  • cecc655eac minor ifdef fix. Matt Wells 2013-09-08 18:42:44 -0700