open-source-search-engine

Commit Graph

f562e6da9a just ignore all urls with # (hashtag) in them from the dmoz dump. we were truncating http://twitter.com/#!/ronpaul to http://twitter.com/ and when looking up the catids of twitter.com got that ronpaul url. so that's bad. people should respect the hashtag. mwells 2013-10-03 23:33:55 -06:00
0176f8d6a7 fix cores in catdb logic. mwells 2013-10-03 22:34:49 -06:00
9e1fee2cb9 dmozparse works with latest dmoz files now mwells 2013-10-03 22:08:40 -06:00
7075b5ef88 try to fix spider lock/doledb issue some more mwells 2013-10-03 17:58:17 -06:00
044d1fb5b4 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-10-03 17:02:55 -06:00
87f624e8d5 try to fix spider bug. mwells 2013-10-03 17:02:43 -06:00
48c10bea64 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-10-03 14:52:31 -07:00
e09c1760e3 fix core from (broad)casting valueless cgi field. Matt Wells 2013-10-03 14:51:59 -07:00
10dad2e6bd fixed bug of not removing spider lock in addSpiderReply() because isAssignedToUs() was there. mwells 2013-10-03 10:45:19 -06:00
8c0fd6030c fixed infinite loop in spider. mwells 2013-10-03 10:31:00 -06:00
a0c79932bb catdb is now generated successfully. mwells 2013-10-02 23:36:49 -06:00
942379427e log fixes for debugging. try to stop spammy log msgs. mwells 2013-10-02 22:37:20 -06:00
6c2c9f7774 trying to bring back dmoz integration. mwells 2013-10-02 22:34:21 -06:00
91b8921b9e have to use different ports if multiple gb instances/processes on same server. Matt Wells 2013-10-02 16:12:17 -07:00
43e4c939eb Merge branch 'master' into diffbot mwells 2013-10-02 13:15:07 -06:00
c03e862b99 use a better version of hosts.conf where we specify the working directory for each host entry. then we can use the exact same hosts.conf file for each gb instance rather than having to change the single "working-dir:" directive for each instance, in the case where the each have a different working directory. mwells 2013-10-02 13:11:58 -06:00
44c7b65c21 take out isdead check for now mwells 2013-10-02 12:50:11 -06:00
259ec08e09 email hook now works but you have to supply the IP address of your sendmail server and it has to allow email forwarding from host #0's IP. specify the sendmail server's IP in the Master Controls. mwells 2013-10-02 09:36:44 -06:00
45941e4b2f fix notification system. mwells 2013-10-01 17:30:06 -06:00
3fecb3eb1f got email and url notification code compiling. when crawl hits a limit we do notifications. mwells 2013-10-01 15:14:39 -06:00
c911a606c9 renamed matches.h and matches.cpp to matches2.h and matches2.cpp to avoid potential confusion with Matches.h and Matches.cpp files. Matt Wells 2013-10-01 07:58:24 -07:00
710fedfef1 fix core from data dump mwells 2013-09-30 15:28:25 -06:00
923d1becce support &spiderlinks=1 in addition to &spiderLinks=1 for add url in PageAddUrl.cpp. mwells 2013-09-30 14:59:48 -06:00
12af05960e fix core mwells 2013-09-30 14:25:33 -06:00
c947638d24 more fixes for downloads mwells 2013-09-30 14:12:22 -06:00
e71266e2db fix data downloading for large files mwells 2013-09-30 13:48:37 -06:00
76c9f47498 file download api updates. to include collection name in filename being downloaded. mwells 2013-09-30 11:10:43 -06:00
68e0fc0dcc download formatting fixes. mwells 2013-09-30 10:46:54 -06:00
b1809b1a08 just allow user to specify diffbot api as a url string, not a menu item number selection from a drop down. still print a fixed drop down that will set the diffbot api url string directly though. i.e. use &dapi=http://www.diffbot.com/api/article? and then gigablast will append the &token=...&url=... to it before fetching it. mwells 2013-09-30 10:27:28 -06:00
20952eedbe customizable api list in url filters mwells 2013-09-30 09:18:22 -06:00
0edcbcc7d8 printlocktable() function mwells 2013-09-29 10:20:14 -06:00
9bf8bf7712 add spider reply even on g_errno now with an error code of EINTERNAL error in the spider reply. no longer just sit on the lock. this was blocking an entire ip when just lock sitting for 3 hrs. and only do read rate timeouts if there was at least one byte read. this was causing diffbot reply to read rate timeout after just 60 seconds even though its timeout was specified as 90 seconds. mwells 2013-09-29 09:22:20 -06:00
c216f7b2a7 use 48 bit url hash for lock keys again. query reindex recs can just use their prob docids as fake uh48s. we need it so we can avoid the fakedb record and just use the spider reply to trigger a 5-second lock expiration. a little simpler. added logdebugspiderwait for waiting tree debugging. fixed per ip spider limiting. fixed losing spiders down blackhole from updateCrawlInfo. check UrlLock::m_confirmed when counting outstanding spiders on one ip since may have a lock on one host but not get granted on all! it calls confirmLockAcquisition() when it gets fully granted the lock so it can set UrlLock::confirmed. mwells 2013-09-29 00:09:46 -06:00
d11e9520bd couple fixes to makefile etc. mwells 2013-09-28 16:37:39 -06:00
ded9ccff0b Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-09-28 14:27:58 -06:00
afa2a87542 remove alias parm mwells 2013-09-28 14:17:43 -06:00
c0f1330d70 Merge branch 'master' into diffbot Matt Wells 2013-09-28 13:13:12 -07:00
5a52072888 recommended SSDs for optimal performance in admin.html. mwells 2013-09-28 14:02:02 -06:00
a80cb52740 minor log msg. mwells 2013-09-28 13:58:59 -06:00
737f3eae4d Merge branch 'master' of github.com:gigablast/open-source-search-engine mwells 2013-09-28 13:45:39 -06:00
5884951190 only do certain things if running on a machine in matt wells datacenter. like fan switching based on temps, or printing seo links. made seo functions weak overridable placeholder stubs so if seo.o is linked in it will override. include seo.o object if seo.cpp file exists for automatic seo module building and linking. mwells 2013-09-28 13:43:56 -06:00
9730e5f3ef fix lost spiders from updating crawl info. fix maxspidersperip limitation not being obeyed. removed fakedb. only add "0" time waiting tree keys to waiting tree. only scanSpiderdb() will change their times to a future time or add them to doledb directly. confirmLockAcquisition() will not add to waitingtree if max spiders per ip limit would be exceeded. an incoming spider reply will trigger the add to waiting tree with a time of "0". mwells 2013-09-28 13:12:33 -06:00
00910e36d7 do not send "children" docs (robots.txt roots etc) to diffbot. mwells 2013-09-28 10:10:44 -06:00
321f5cf938 quite a few fixes. something still overwrite CollectionRec::m_overflow/m_overflow2... mwells 2013-09-27 21:00:40 -06:00
88677e1a15 fix bad engineer error that comes up sometimes when viewing cached pages. mwells 2013-09-27 18:15:59 -06:00
7cdb3d6f9c fix infinite loop from json parsing and fix some core dumps. mwells 2013-09-27 17:52:36 -06:00
83ec4f6de7 ignore a couple cores until we figure out what happened mwells 2013-09-27 14:25:30 -06:00
cb1b5f3fe4 fixed diffbot api display bug mwells 2013-09-27 13:39:13 -06:00
e3bccfa706 use robots txt radio button fixes. mwells 2013-09-27 12:17:22 -06:00
e7377d72ab fix robots.txt switch. fix collection rec saving. require collname explicitly for injecturl urldata. mwells 2013-09-27 11:39:23 -06:00
ac72e31f35 fixed a few more things. mwells 2013-09-27 11:04:46 -06:00
eb3f657411 fixed distributed support for adding/deleting/resetting collections. now need to specify collection name like &addcoll=mycoll when adding a coll. mwells 2013-09-27 10:49:24 -06:00
f043cc67e4 Merge branch 'master' into diffbot mwells 2013-09-26 22:43:27 -06:00
fd081478de fix crawlbot to work on a distributed network as far as adding/deleting/resetting colls and updating parms. ideally we'd have a Colldb Rdb where each key was a parm. that would make syncing easier if a host went down, then it would get the negative/positive colldb parm keys later. so it could sync up on all your operations as long as all your operations in terms of adding and deleting database key/value pairs. mwells 2013-09-26 22:41:05 -06:00
3a4e5da997 make "id" equivalent to "c". print out "id" not "name" in the json output collection objects. perhaps should be "collectionId" not just "id". mwells 2013-09-26 15:32:11 -06:00
16ead85cfd added support for adding an alias to a collection using &alias=xxxxx mwells 2013-09-26 14:50:34 -06:00
e3c4ce189a fixed cores. fixed json. mwells 2013-09-26 14:28:04 -06:00
8fde0c5343 added support for serialize/deserialize of TYPE_SAFEBUF parms over distributed network. mwells 2013-09-26 08:56:14 -06:00
f252dd9189 minor crawlbot gui updates mwells 2013-09-25 19:41:20 -06:00
65df6dfe52 added some handy links mwells 2013-09-25 18:00:16 -06:00
0b5a45e8aa more api updates. added m_avoidSpiderLinks to spider request so urldata=xxxx can turn link spidering off. probably desirable so its default. so &spiderlinks=[0|1] applies to urldata as well as injecturl= mwells 2013-09-25 17:51:43 -06:00
01fa9fe383 make it proper json output mwells 2013-09-25 17:12:01 -06:00
6fca32e4b5 minor oops fix. mwells 2013-09-25 17:06:01 -06:00
5fbf323cb5 json api now shows all collections and their relevant parms and stats for /crawlbot?token=xxx&format=json mwells 2013-09-25 16:59:31 -06:00
d14832f93e new json api code compiles. need to test now. mwells 2013-09-25 16:04:16 -06:00
0039b23064 almost done with json api. mwells 2013-09-25 15:37:20 -06:00
50ba93991b minor ui changes mwells 2013-09-25 13:09:02 -06:00
9dc9114902 added stat page for all collections. mwells 2013-09-25 12:57:07 -06:00
0fe0147913 fix invisible columns in url filters table. mwells 2013-09-25 12:24:13 -06:00
1d92004e06 fix spider flow debug msgs mwells 2013-09-25 12:07:11 -06:00
40192249f9 spider speedups and fixes. mwells 2013-09-25 11:58:03 -06:00
e34afd21ea fix bug of possibly not removing some locks Matt Wells 2013-09-25 09:28:35 -07:00
a687380aeb fix a bug of not reading enough spiderdb records for a given "ip" because short reads were causing us to bail out early. still not sure as to the cause of the short reads. Matt Wells 2013-09-24 20:48:48 -07:00
fbd853fdf7 fix long-standing spider bug causing some ip queues to not get fully spidered. Matt Wells 2013-09-24 20:44:55 -07:00
b16d8519fc more spider fixes. still need more speedups when spidering multiple spiders on same ip. mwells 2013-09-24 16:40:14 -06:00
e594af898a seems like we can spider multiple urls from same ip at same time now. mwells 2013-09-24 09:32:26 -06:00
8461e33b53 fixed more spider bugs. mwells 2013-09-23 21:26:27 -07:00
b90ef3de0d more spider fixes. right after getting lock, use msg12 to remove rec from doledb/doleiptable and add 0 entry to waiting table so doledb is again immediately repopulated with that firstIp so we can spider multiple urls from the same ip at the same time. mwells 2013-09-23 20:25:28 -06:00
7c31ecff4a fixed fakedb key support. mwells 2013-09-23 15:16:23 -06:00
4d33737ac1 fakedb fixes mwells 2013-09-23 08:19:54 -07:00
83e87fc755 fixed ability to spider multiple urls from the same IP at the same time. Also respects sameIpWait constraints. mwells 2013-09-20 15:42:48 -07:00
05400a0c25 updated spider code documentation. mwells 2013-09-20 11:19:24 -07:00
fbd62cecba updated compilation instructions. need to apt-get install gcc-multilib. Matt Wells 2013-09-20 10:06:01 -07:00
bcc55dc46b fixed a couple bugs. Added more documentation into Spider.h. Matt Wells 2013-09-19 18:21:52 -07:00
47465f6d90 more fixes. trying to fix spiders to spider multiple urls from same ip... Matt Wells 2013-09-19 11:13:40 -07:00
a3ea867305 update crawlbot api. Matt Wells 2013-09-18 17:13:36 -07:00
022caeec04 use -diffbotxyz%li as a more unique appendage. show token on crawlbot page. Matt Wells 2013-09-18 17:05:41 -07:00
29f5c5d644 added isonsamesubdomain and isonsamedomain Matt Wells 2013-09-18 16:45:37 -07:00
8de246d9c4 only show urls being spidered from your coll Matt Wells 2013-09-18 16:29:47 -07:00
3bdd28ab1d fix spider bug Matt Wells 2013-09-18 16:17:08 -07:00
7fdbd0f66a delete spider coll when deleting coll Matt Wells 2013-09-18 15:36:30 -07:00
f90d20f4dd diffbot api integration updates Matt Wells 2013-09-18 15:07:47 -07:00
70ff54ce03 hide the parms that might scare users away in the url filters. Matt Wells 2013-09-18 14:27:59 -07:00
6af02119a1 use cookies to display url filters table. Matt Wells 2013-09-18 13:50:55 -07:00
04b0a08ef9 propagate showtable=1 when submitting url filters table Matt Wells 2013-09-18 12:38:05 -07:00
924d1320a2 fix bugs inserting and deleting rows using TYPE_SAFEBUF parms. Matt Wells 2013-09-18 12:35:01 -07:00
c1bcebb7bb url filter documentation update. Matt Wells 2013-09-18 12:00:29 -07:00
459a7e98fb add diffbot dropdown to url filters table Matt Wells 2013-09-18 11:24:16 -07:00
487d3f0a0e fix url filters bugs. Matt Wells 2013-09-18 11:02:09 -07:00
39d9760e5d added ismedia url filter to cover all the jpg,gif,mpeg,css rules. Matt Wells 2013-09-18 09:40:59 -07:00