Commit Graph

  • 83e87fc755 fixed ability to spider multiple urls from the same IP at the same time. Also respects sameIpWait constraints. mwells 2013-09-20 15:42:48 -07:00
  • 05400a0c25 updated spider code documentation. mwells 2013-09-20 11:19:24 -07:00
  • fbd62cecba updated compilation instructions. need to apt-get install gcc-multilib. Matt Wells 2013-09-20 10:06:01 -07:00
  • bcc55dc46b fixed a couple bugs. Added more documentation into Spider.h. Matt Wells 2013-09-19 18:21:52 -07:00
  • 47465f6d90 more fixes. trying to fix spiders to spider multiple urls from same ip... Matt Wells 2013-09-19 11:13:40 -07:00
  • a3ea867305 update crawlbot api. Matt Wells 2013-09-18 17:13:36 -07:00
  • 022caeec04 use -diffbotxyz%li as a more unique appendage. show token on crawlbot page. Matt Wells 2013-09-18 17:05:41 -07:00
  • 29f5c5d644 added isonsamesubdomain and isonsamedomain Matt Wells 2013-09-18 16:45:37 -07:00
  • 8de246d9c4 only show urls being spidered from your coll Matt Wells 2013-09-18 16:29:47 -07:00
  • 3bdd28ab1d fix spider bug Matt Wells 2013-09-18 16:17:08 -07:00
  • 7fdbd0f66a delete spider coll when deleting coll Matt Wells 2013-09-18 15:36:30 -07:00
  • f90d20f4dd diffbot api integration updates Matt Wells 2013-09-18 15:07:47 -07:00
  • 70ff54ce03 hide the parms that might scare users away in the url filters. Matt Wells 2013-09-18 14:27:59 -07:00
  • 6af02119a1 use cookies to display url filters table. Matt Wells 2013-09-18 13:50:55 -07:00
  • 04b0a08ef9 propagate showtable=1 when submitting url filters table Matt Wells 2013-09-18 12:38:05 -07:00
  • 924d1320a2 fix bugs inserting and deleting rows using TYPE_SAFEBUF parms. Matt Wells 2013-09-18 12:35:01 -07:00
  • c1bcebb7bb url filter documentation update. Matt Wells 2013-09-18 12:00:29 -07:00
  • 459a7e98fb add diffbot dropdown to url filters table Matt Wells 2013-09-18 11:24:16 -07:00
  • 487d3f0a0e fix url filters bugs. Matt Wells 2013-09-18 11:02:09 -07:00
  • 39d9760e5d added ismedia url filter to cover all the jpg,gif,mpeg,css rules. Matt Wells 2013-09-18 09:40:59 -07:00
  • c77453348f Merge branch 'master' into diffbot Matt Wells 2013-09-18 09:23:48 -07:00
  • d6815f2c9d if family filter enabled (&ff=1) then prepend "gbadult:0 |" to the query to restrict to non-adult pages. mwells 2013-09-18 00:11:55 -06:00
  • a0032e0eb7 added another log statement for when debugging the adult content detector. we err on the side of caution for the most part. mwells 2013-09-18 00:06:21 -06:00
  • 119a4c0c22 fix adult content detector mwells 2013-09-17 23:53:17 -06:00
  • 5ec3803312 fix core in hashing gbisadult:[0|1] term. mwells 2013-09-17 23:27:31 -06:00
  • 3005f904c7 index gbisadult:1 if adult content gbisadult:0 if not. Matt Wells 2013-09-17 22:05:47 -07:00
  • 10fcfb6987 minor updates Matt Wells 2013-09-17 17:32:49 -07:00
  • b8590d7df9 do not show json pages if searching pages. Matt Wells 2013-09-17 17:23:58 -07:00
  • 7fa4138d1c fix Next 10 link Matt Wells 2013-09-17 17:19:41 -07:00
  • 98caa3225a fix query prepend logic for json searches Matt Wells 2013-09-17 17:16:39 -07:00
  • 017a0febef fix api dropdown selection. Matt Wells 2013-09-17 16:38:56 -07:00
  • 5e3b727eb5 crawlbot api fixes. Matt Wells 2013-09-17 16:30:57 -07:00
  • b38d54cef9 save crawlinfo as binary so it's easier to not miss anything. Matt Wells 2013-09-17 16:07:59 -07:00
  • 2beff7f7d8 crawlbot api updates Matt Wells 2013-09-17 15:59:50 -07:00
  • e50da4d012 crawlbot api fixes Matt Wells 2013-09-17 15:47:44 -07:00
  • c16fe8601b more crawlbot api fixes Matt Wells 2013-09-17 15:32:28 -07:00
  • e7151e6cc6 fix bug with spiders not coming on. Matt Wells 2013-09-17 14:35:48 -07:00
  • c81f700bf0 get reset collection kinda working. Matt Wells 2013-09-17 14:13:44 -07:00
  • 4321f02e4e trying to get reset collection working Matt Wells 2013-09-17 12:21:09 -07:00
  • fff8b80969 get collection delete working Matt Wells 2013-09-17 11:27:31 -07:00
  • 63973cf9c0 get "add new collection" working. Matt Wells 2013-09-17 10:43:23 -07:00
  • 02bf6ab3cc new crawlbot api. not backwards compatible any more. Matt Wells 2013-09-17 10:25:54 -07:00
  • f34a7f44ab compiler flag fix for xmldoc.o mwells 2013-09-16 22:35:16 -06:00
  • afd1b3a9a2 added Diffbot.h mwells 2013-09-16 21:42:48 -06:00
  • fc692202ba fix integration of urls filters into crawlbot page Matt Wells 2013-09-16 16:27:48 -07:00
  • e7ed9254d4 formatting... Matt Wells 2013-09-16 15:33:45 -07:00
  • 1a780d1f4a pretty up a little Matt Wells 2013-09-16 15:18:55 -07:00
  • a034604cef clean up to remove g_conf.m_useDiffbot Matt Wells 2013-09-16 15:00:43 -07:00
  • cb9969ad22 fix token bug Matt Wells 2013-09-16 14:38:29 -07:00
  • 3dfba4de69 doc updates Matt Wells 2013-09-16 14:29:01 -07:00
  • 4c11265a98 more updates to crawlbot api Matt Wells 2013-09-16 13:59:11 -07:00
  • 676437c3c4 more universal api updates. Matt Wells 2013-09-16 11:42:04 -07:00
  • 04f7774543 lower spider crawl info stats threshold Matt Wells 2013-09-16 11:27:09 -07:00
  • df96f81e78 fix spidering and other things. Matt Wells 2013-09-16 11:22:07 -07:00
  • f974d6a47b fixes for crawlbot universal api. Matt Wells 2013-09-16 10:49:37 -07:00
  • a50898649b various fixes. Matt Wells 2013-09-16 10:16:49 -07:00
  • 9db501d91c resolve merge conflict for nullTerm() Matt Wells 2013-09-16 09:06:33 -07:00
  • 78a334198b Merge branch 'master' into diffbot Matt Wells 2013-09-16 09:05:37 -07:00
  • 3ac79de92e fix typo adurl -> addurl. Matt Wells 2013-09-16 08:11:06 -07:00
  • e6f87f5049 do not send email alerts to sysadmin@gigablast. Matt Wells 2013-09-16 08:10:18 -07:00
  • 5deda56ede minor documentation updates. Matt Wells 2013-09-15 22:16:14 -07:00
  • 3fdbae4b05 admin.html documentation update. Matt Wells 2013-09-15 22:05:01 -07:00
  • 68db2e6cc6 fix bug when checking the delete checkbox on the injection page. Matt Wells 2013-09-15 21:47:42 -07:00
  • 965e23f192 fix core from hashtablex::set() not getting enough buf space. now we force it to allocate a minimum of 32 slots to fix another bug where it was re-allocating immediately upon adding a key because growTable() is ALWAYS called if there are less than 20 slots! Matt Wells 2013-09-15 21:15:58 -07:00
  • 991e2f30f7 speed up whitelist hashtable like 20x using hashtable key magic. Matt Wells 2013-09-15 21:10:53 -07:00
  • 928dc36a03 get "&site=abc.com+xyz.com"... working to restrict search results to specified sites. tested a little. Matt Wells 2013-09-15 20:16:48 -07:00
  • 2211881e59 take apt-get install ssl stuff out of admin.html installation instructions since we supply the ssl headers now. mwells 2013-09-15 18:27:47 -06:00
  • 01c2a6d381 we already include our own 32-bit libssl.a and libcrypto.a so we can ensure stability. so we have to include the header files as well really. mwells 2013-09-15 18:25:49 -06:00
  • 107037c6a2 new &sites=xyz.com+abc.com+... functionality compiles ok. mwells 2013-09-15 18:14:32 -06:00
  • b684414e16 almost done adding support for whitelists. i.e. list of sites to restrict search results to, for instance. mwells 2013-09-15 15:15:56 -06:00
  • 7ecffec40f universal api updates Matt Wells 2013-09-13 18:10:03 -07:00
  • d982997b0c streamline crawl stats. Matt Wells 2013-09-13 17:34:39 -07:00
  • 93ce424d99 start working on the main gui for crawlbot which is /crawlbot Matt Wells 2013-09-13 16:22:07 -07:00
  • 6b330da240 cleanup warnings in log. Matt Wells 2013-09-13 14:37:35 -07:00
  • eb65b9265d call diffbot /api/analyze if classify is true or api = "all" now. it will return "type": in the json to indicate page type. basically, it classifies the page. Matt Wells 2013-09-13 14:13:56 -07:00
  • 19056fc3f2 show "processed" instead of "matched". other fixes for spider stats. add new crawl stats. attempts and successes. Matt Wells 2013-09-13 11:51:55 -07:00
  • e3e6551e23 fix diffbot bugs. Matt Wells 2013-09-13 11:34:40 -07:00
  • 7dd647c222 trying to fix nukeJSON code. Matt Wells 2013-09-13 10:38:34 -07:00
  • ef3990da98 now when re-indexing an existing xml doc we first call nukeJSONObjects() to delete any pages that it indexed from the json objects it had in its diffbot reply. in this way it can then re-add them if its new diffbot reply has them again. Matt Wells 2013-09-13 10:00:21 -07:00
  • a412c798bf Merge branch 'master' into diffbot Matt Wells 2013-09-13 09:24:28 -07:00
  • 5dc7bd2ab4 integrate diffbot from svn back into git. Matt Wells 2013-09-13 09:23:18 -07:00
  • e152205765 make depend update mwells 2013-09-09 02:37:47 -06:00
  • 1d63aa936c remove plotter.h includes causing compiler errors on some machines. Matt Wells 2013-09-09 01:25:00 -07:00
  • 76b390aea2 fix typo Matt Wells 2013-09-08 19:51:57 -07:00
  • d930a833cc try to fix compiler error related to bad delete function override. added "throw()" before the first "{" in the function body. mwells 2013-09-08 20:15:39 -06:00
  • 828345a4c7 fix compiler warning in types.h. mwells 2013-09-08 20:00:52 -06:00
  • e1968b2237 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-09-08 18:43:00 -07:00
  • cecc655eac minor ifdef fix. Matt Wells 2013-09-08 18:42:44 -07:00
  • 657d669ec8 exclude events and seo functionality. most people want this for web search so it should be a non-issue. mwells 2013-09-08 17:07:42 -06:00
  • 34b6d3e74a fixed some cores. brought in fixes from old repo. mwells 2013-09-08 16:16:13 -06:00
  • dcf45dd69d dump out doledb to disk when it has more than 50,000 negative keys to avoid positive/negative key annihilations delays. mwells 2013-09-08 15:09:54 -06:00
  • 03706131fe documentation updates in Spider.h. Matt Wells 2013-09-08 13:42:02 -07:00
  • 54c9353dbd try to fix core from g_inSigHandler being set. it should never be set since we do not use real time signals any more mwells 2013-09-08 12:34:37 -06:00
  • 0581f86265 fix core from calling a gettime related function from a pthread when a signal handler from the main thread was in use and POSSIBLY in the same function when the signal went off. different threads should be able to access that function just fine i'd imagine. mwells 2013-09-06 15:39:53 -06:00
  • 7aa81abf91 use the "onsite" keyword in your url filters instead of this "only spider links from same host" switch to keep things simpler. mwells 2013-09-06 09:37:17 -06:00
  • c58df10155 fix major bug causing spiders not to work. Matt Wells 2013-09-04 11:01:24 -07:00
  • 91c4e768b1 more family filter fixes mwells 2013-09-01 18:28:49 -06:00
  • aaf333c46c try to get family filter (&ff=1) working again to filter out adult search results. mwells 2013-09-01 18:22:38 -06:00
  • afbd1e2b96 fix core from trying to get the time while in a sig handler. getTime() is not async safe. mwells 2013-09-01 12:55:22 -06:00
  • 93dfb0cfd4 fix for the "spiders stuck" fix. mwells 2013-08-31 11:25:26 -06:00