Commit Graph

  • ddbacab12f fix shard mapping of spiderdb. Matt Wells 2013-10-08 16:35:37 -0700
  • a76e8e42c3 fix json parsing oopsy. Matt Wells 2013-10-08 16:28:25 -0700
  • e1b798aa62 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-08 13:30:57 -0600
  • ae57540af1 prevent core during page parser Matt Wells 2013-10-08 11:54:55 -0700
  • ed0fbf2b99 fix core from not decoding json properly. Matt Wells 2013-10-08 11:46:18 -0700
  • a0e9e0df7a fix a little corage. Matt Wells 2013-10-07 16:47:55 -0700
  • 8adb920deb return all fields in diffbot reply. Matt Wells 2013-10-07 15:37:56 -0700
  • 1374fea70b do not run htmlDecode nor do iframe expansion on documents of type JSON. it was decoding " and messing up the json. Matt Wells 2013-10-07 14:50:32 -0700
  • 0b338161e4 fix a couple cores. Matt Wells 2013-10-07 11:59:07 -0700
  • 63c7764cd1 c=dmoz3 to c=dmoz mwells 2013-10-06 17:12:45 -0700
  • 59b491f007 return fake tag recs for links if usefakeips meta tag is given. saves some lookups in tagdb when adding gbdmoz.urls.txt.* files which have tons of links each. like 500,000. mwells 2013-10-06 16:42:32 -0700
  • 2383905c80 start using fakeips flag to stop ip tagrec lookups mwells 2013-10-06 16:40:04 -0600
  • 183b7c372e make sections grow dynamically so we do not OOM when trying to index a gbdmoz.urls.txt.* file which can be 25MB. mwells 2013-10-06 11:04:10 -0600
  • 3780789201 improve winning spider req selection using hop count. Matt Wells 2013-10-06 10:01:10 -0700
  • d8e6ac8748 fixed bug of not putting meta tags in all gbdmoz.urls.txt.* files in dmozparse.cpp mwells 2013-10-06 00:18:59 -0600
  • 000caa5a26 support for usefakeips meta tag mwells 2013-10-06 00:10:07 -0600
  • 2935a143f0 if downloading a url on 127.0.0.1 or other local ip then do not limit download size. should fix downloading of gbdmoz.urls.txt.* files which can be > 25MB big. mwells 2013-10-05 23:43:00 -0600
  • 612f2872f7 use addurl to add the gbdmoz url files to gigablast. it should index just those dmoz urls, and not spider their links. it should ignore external errors like ETCPTIMEDOUT when indexing so it will be identical to dmoz. mwells 2013-10-05 23:22:51 -0600
  • 9f73ba1531 fix core again. Matt Wells 2013-10-04 21:41:38 -0700
  • d464066da4 use catdb/ not cat/ mwells 2013-10-04 22:39:41 -0600
  • f21fb98c16 fix core when getting new spider reply when g_errno was ECORRUPTDATA Matt Wells 2013-10-04 20:44:29 -0700
  • 71d5d05f7c use catdb/ subdir not cat/ for consistency. mwells 2013-10-04 21:35:13 -0600
  • 2ef54f3601 fix core from hashing json fields when not diffbot reply Matt Wells 2013-10-04 20:00:03 -0700
  • f69b1f46c6 fix another bug from shard change. Matt Wells 2013-10-04 16:49:50 -0700
  • 0fea60eaae fix core Matt Wells 2013-10-04 16:35:19 -0700
  • fe97e08281 move from groups to shards. got rid of annoying groupid bit mask thing. Matt Wells 2013-10-04 16:18:56 -0700
  • 3fa0ad5786 fix './gb install' cmd to install the new files. Matt Wells 2013-10-04 14:04:47 -0700
  • c3afed946d Merge branch 'master' into diffbot mwells 2013-10-04 14:59:19 -0600
  • ad209bb403 debug log changes mwells 2013-10-04 14:59:03 -0600
  • e1bde7b7fe fixed bug of getting lock from the wrong group. mwells 2013-10-04 12:42:01 -0600
  • d4aa65c0fe try to fix spiders with m_msg5StartKey logic. mwells 2013-10-04 09:39:05 -0600
  • 78c4bda368 fix dmozparse urldump -s bugs for dumping out urls in dmoz. mwells 2013-10-04 00:00:26 -0600
  • f562e6da9a just ignore all urls with # (hashtag) in them from the dmoz dump. we were truncating http://twitter.com/#!/ronpaul to http://twitter.com/ and when looking up the catids of twitter.com got that ronpaul url. so that's bad. people should respect the hashtag. mwells 2013-10-03 23:33:55 -0600
  • 0176f8d6a7 fix cores in catdb logic. mwells 2013-10-03 22:34:49 -0600
  • 9e1fee2cb9 dmozparse works with latest dmoz files now mwells 2013-10-03 22:08:40 -0600
  • 7075b5ef88 try to fix spider lock/doledb issue some more mwells 2013-10-03 17:58:17 -0600
  • 044d1fb5b4 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-03 17:02:55 -0600
  • 87f624e8d5 try to fix spider bug. mwells 2013-10-03 17:02:43 -0600
  • 48c10bea64 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-03 14:52:31 -0700
  • e09c1760e3 fix core from (broad)casting valueless cgi field. Matt Wells 2013-10-03 14:51:59 -0700
  • 10dad2e6bd fixed bug of not removing spider lock in addSpiderReply() because isAssignedToUs() was there. mwells 2013-10-03 10:45:19 -0600
  • 8c0fd6030c fixed infinite loop in spider. mwells 2013-10-03 10:31:00 -0600
  • a0c79932bb catdb is now generated successfully. mwells 2013-10-02 23:36:49 -0600
  • 942379427e log fixes for debugging. try to stop spammy log msgs. mwells 2013-10-02 22:37:20 -0600
  • 6c2c9f7774 trying to bring back dmoz integration. mwells 2013-10-02 22:34:21 -0600
  • 91b8921b9e have to use different ports if multiple gb instances/processes on same server. Matt Wells 2013-10-02 16:12:17 -0700
  • 43e4c939eb Merge branch 'master' into diffbot mwells 2013-10-02 13:15:07 -0600
  • c03e862b99 use a better version of hosts.conf where we specify the working directory for each host entry. then we can use the exact same hosts.conf file for each gb instance rather than having to change the single "working-dir:" directive for each instance, in the case where they each have a different working directory. mwells 2013-10-02 13:11:58 -0600
  • 44c7b65c21 take out isdead check for now mwells 2013-10-02 12:50:11 -0600
  • 259ec08e09 email hook now works but you have to supply the IP address of your sendmail server and it has to allow email forwarding from host #0's IP. specify the sendmail server's IP in the Master Controls. mwells 2013-10-02 09:36:44 -0600
  • 45941e4b2f fix notification system. mwells 2013-10-01 17:30:06 -0600
  • 3fecb3eb1f got email and url notification code compiling. when crawl hits a limit we do notifications. mwells 2013-10-01 15:14:39 -0600
  • c911a606c9 renamed matches.h and matches.cpp to matches2.h and matches2.cpp to avoid potential confusion with Matches.h and Matches.cpp files. Matt Wells 2013-10-01 07:58:24 -0700
  • 710fedfef1 fix core from data dump mwells 2013-09-30 15:28:25 -0600
  • 923d1becce support &spiderlinks=1 in addition to &spiderLinks=1 for add url in PageAddUrl.cpp. mwells 2013-09-30 14:59:48 -0600
  • 12af05960e fix core mwells 2013-09-30 14:25:33 -0600
  • c947638d24 more fixes for downloads mwells 2013-09-30 14:12:22 -0600
  • e71266e2db fix data downloading for large files mwells 2013-09-30 13:48:37 -0600
  • 76c9f47498 file download api updates. to include collection name in filename being downloaded. mwells 2013-09-30 11:10:43 -0600
  • 68e0fc0dcc download formatting fixes. mwells 2013-09-30 10:46:54 -0600
  • b1809b1a08 just allow user to specify diffbot api as a url string, not a menu item number selection from a drop down. still print a fixed drop down that will set the diffbot api url string directly though. i.e. use &dapi=http://www.diffbot.com/api/article? and then gigablast will append the &token=...&url=... to it before fetching it. mwells 2013-09-30 10:27:28 -0600
  • 20952eedbe customizable api list in url filters mwells 2013-09-30 09:18:22 -0600
  • 0edcbcc7d8 printlocktable() function mwells 2013-09-29 10:20:14 -0600
  • 9bf8bf7712 add spider reply even on g_errno now with an error code of EINTERNAL error in the spider reply. no longer just sit on the lock. this was blocking an entire ip when just lock sitting for 3 hrs. and only do read rate timeouts if there was at least one byte read. this was causing diffbot reply to read rate timeout after just 60 seconds even though its timeout was specified as 90 seconds. mwells 2013-09-29 09:22:20 -0600
  • c216f7b2a7 use 48 bit url hash for lock keys again. query reindex recs can just use their prob docids as fake uh48s. we need it so we can avoid the fakedb record and just use the spider reply to trigger a 5-second lock expiration. a little simpler. added logdebugspiderwait for waiting tree debugging. fixed per ip spider limiting. fixed losing spiders down blackhole from updateCrawlInfo. check UrlLock::m_confirmed when counting outstanding spiders on one ip since may have a lock on one host but not get granted on all! it calls confirmLockAcquisition() when it gets fully granted the lock so it can set UrlLock::confirmed. mwells 2013-09-29 00:09:46 -0600
  • d11e9520bd couple fixes to makefile etc. mwells 2013-09-28 16:37:39 -0600
  • ded9ccff0b Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-09-28 14:27:58 -0600
  • afa2a87542 remove alias parm mwells 2013-09-28 14:17:43 -0600
  • c0f1330d70 Merge branch 'master' into diffbot Matt Wells 2013-09-28 13:13:12 -0700
  • 5a52072888 recommended SSDs for optimal performance in admin.html. mwells 2013-09-28 14:02:02 -0600
  • a80cb52740 minor log msg. mwells 2013-09-28 13:58:59 -0600
  • 737f3eae4d Merge branch 'master' of github.com:gigablast/open-source-search-engine mwells 2013-09-28 13:45:39 -0600
  • 5884951190 only do certain things if running on a machine in matt wells datacenter. like fan switching based on temps, or printing seo links. made seo functions weak overridable placeholder stubs so if seo.o is linked in it will override. include seo.o object if seo.cpp file exists for automatic seo module building and linking. mwells 2013-09-28 13:43:56 -0600
  • 9730e5f3ef fix lost spiders from updating crawl info. fix maxspidersperip limitation not being obeyed. removed fakedb. only add "0" time waiting tree keys to waiting tree. only scanSpiderdb() will change their times to a future time or add them to doledb directly. confirmLockAcquisition() will not add to waitingtree if max spiders per ip limit would be exceeded. an incoming spider reply will trigger the add to waiting tree with a time of "0". mwells 2013-09-28 13:12:33 -0600
  • 00910e36d7 do not send "children" docs (robots.txt roots etc) to diffbot. mwells 2013-09-28 10:10:44 -0600
  • 321f5cf938 quite a few fixes. something still overwrites CollectionRec::m_overflow/m_overflow2... mwells 2013-09-27 21:00:40 -0600
  • 88677e1a15 fix bad engineer error that comes up sometimes when viewing cached pages. mwells 2013-09-27 18:15:59 -0600
  • 7cdb3d6f9c fix infinite loop from json parsing and fix some core dumps. mwells 2013-09-27 17:52:36 -0600
  • 83ec4f6de7 ignore a couple cores until we figure out what happened mwells 2013-09-27 14:25:30 -0600
  • cb1b5f3fe4 fixed diffbot api display bug mwells 2013-09-27 13:39:13 -0600
  • e3bccfa706 use robots txt radio button fixes. mwells 2013-09-27 12:17:22 -0600
  • e7377d72ab fix robots.txt switch. fix collection rec saving. require collname explicitly for injecturl urldata. mwells 2013-09-27 11:39:23 -0600
  • ac72e31f35 fixed a few more things. mwells 2013-09-27 11:04:46 -0600
  • eb3f657411 fixed distributed support for adding/deleting/resetting collections. now need to specify collection name like &addcoll=mycoll when adding a coll. mwells 2013-09-27 10:49:24 -0600
  • f043cc67e4 Merge branch 'master' into diffbot mwells 2013-09-26 22:43:27 -0600
  • fd081478de fix crawlbot to work on a distributed network as far as adding/deleting/resetting colls and updating parms. ideally we'd have a Colldb Rdb where each key was a parm. that would make syncing easier if a host went down, then it would get the negative/positive colldb parm keys later. so it could sync up on all your operations as long as all your operations are in terms of adding and deleting database key/value pairs. mwells 2013-09-26 22:41:05 -0600
  • 3a4e5da997 make "id" equivalent to "c". print out "id" not "name" in the json output collection objects. perhaps should be "collectionId" not just "id". mwells 2013-09-26 15:32:11 -0600
  • 16ead85cfd added support for adding an alias to a collection using &alias=xxxxx mwells 2013-09-26 14:50:34 -0600
  • e3c4ce189a fixed cores. fixed json. mwells 2013-09-26 14:28:04 -0600
  • 8fde0c5343 added support for serialize/deserialize of TYPE_SAFEBUF parms over distributed network. mwells 2013-09-26 08:56:14 -0600
  • f252dd9189 minor crawlbot gui updates mwells 2013-09-25 19:41:20 -0600
  • 65df6dfe52 added some handy links mwells 2013-09-25 18:00:16 -0600
  • 0b5a45e8aa more api updates. added m_avoidSpiderLinks to spider request so urldata=xxxx can turn link spidering off. probably desirable so its default. so &spiderlinks=[0|1] applies to urldata as well as injecturl= mwells 2013-09-25 17:51:43 -0600
  • 01fa9fe383 make it proper json output mwells 2013-09-25 17:12:01 -0600
  • 6fca32e4b5 minor oops fix. mwells 2013-09-25 17:06:01 -0600
  • 5fbf323cb5 json api now shows all collections and their relevant parms and stats for /crawlbot?token=xxx&format=json mwells 2013-09-25 16:59:31 -0600
  • d14832f93e new json api code compiles. need to test now. mwells 2013-09-25 16:04:16 -0600
  • 0039b23064 almost done with json api. mwells 2013-09-25 15:37:20 -0600
  • 50ba93991b minor ui changes mwells 2013-09-25 13:09:02 -0600
  • 9dc9114902 added stat page for all collections. mwells 2013-09-25 12:57:07 -0600